Service | Microsoft Docs article | Related commit history on GitHub | Change details |
---|---|---|---|
active-directory | Fido2 Compatibility | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/fido2-compatibility.md | The following tables show which transports are supported for each platform. Supp |||--|--| | Edge | ❌ | ❌ | ❌ | | Chrome | ✅ | ❌ | ❌ |-| Firefox | ✅ | ❌ | ❌ | +| Firefox | ❌ | ❌ | ❌ | ### iOS |
active-directory | Product Permissions Analytics Reports | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/cloud-infrastructure-entitlement-management/product-permissions-analytics-reports.md | You can view the Permissions Analytics Report information directly in the Permis 2. Locate the **Permissions Analytics Report** in the list, then select it. 3. Select which Authorization System you want to generate the PDF download for (AWS, Azure, or GCP). >[!NOTE]- > The PDF can only be downloaded for one Authorization System at a time. If more than one Authorization System is selected, the **Export PDF** button will be disabled. -4. To download the report in PDF format, click on **Export PDF**. + > You can download a PDF report for up to 10 authorization systems at one time. The authorization systems must be part of the same cloud environment (for example, 1-10 authorization systems that are all on Amazon Web Services (AWS)). The following message displays: **Successfully started to generate PDF report**. - Once the PDF is generated, the report is automatically sent to your email. + Once the PDF is generated, the report(s) are automatically sent to your email. <!## Add and remove tags in the Permissions analytics report |
active-directory | Msal Android Single Sign On | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-android-single-sign-on.md | The following browsers have been tested to see if they correctly redirect to the | - | :--: | -: | -: | -: | -: | -: | | Nexus 4 (API 17) | pass | pass | not applicable | not applicable | not applicable | not applicable | | Samsung S7 (API 25) | pass<sup>1</sup> | pass | pass | pass | fail | pass |-| Huawei (API 26) | pass<sup>2</sup> | pass | fail | pass | pass | pass | | Vivo (API 26) | pass | pass | pass | pass | pass | fail | | Pixel 2 (API 26) | pass | pass | pass | pass | fail | pass |-| Oppo | pass | not applicable<sup>3</sup> | not applicable | not applicable | not applicable | not applicable | +| Oppo | pass | not applicable<sup>2</sup> | not applicable | not applicable | not applicable | not applicable | | OnePlus (API 25) | pass | pass | pass | pass | fail | pass | | Nexus (API 28) | pass | pass | pass | pass | fail | pass | | MI | pass | pass | pass | pass | fail | pass | <sup>1</sup>Samsung's built-in browser is Samsung Internet.<br/>-<sup>2</sup>Huawei's built-in browser is Huawei Browser.<br/> -<sup>3</sup>The default browser can't be changed inside the Oppo device setting. +<sup>2</sup>The default browser can't be changed inside the Oppo device setting. ## Next steps |
active-directory | Howto Manage Local Admin Passwords | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/howto-manage-local-admin-passwords.md | This feature is now available in the following Azure clouds: - Azure Global - Azure Government-- Microsoft Azure operated by 21Vianetated by 21Vianet+- Microsoft Azure operated by 21Vianet ### Operating system updates |
active-directory | Manage Device Identities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/manage-device-identities.md | To view or copy BitLocker keys, you need to be the owner of the device or have o - Security Administrator - Security Reader -## View and filter your devices (preview) +## View and filter your devices -In this preview, you have the ability to infinitely scroll, reorder columns, and select all devices. You can filter the device list by these device attributes: +You can filter the device list by these attributes: - Enabled state - Compliant state - Join type (Microsoft Entra joined, Microsoft Entra hybrid joined, Microsoft Entra registered) - Activity timestamp-- OS Type and Version+- OS type and OS version - Windows is displayed for Windows 11 and Windows 10 devices (with KB5006738). - Windows Server is displayed for [supported versions managed with Microsoft Defender for Endpoint](/mem/intune/protect/mde-security-integration#supported-platforms). - Device type (printer, secure VM, shared device, registered device) In this preview, you have the ability to infinitely scroll, reorder columns, and - Administrative unit - Owner -To enable the preview in the **All devices** view: --1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least a [Global Reader](../roles/permissions-reference.md#global-reader). -1. Browse to **Identity** > **Devices** > **All devices**. -1. Select the **Preview features** button. -1. Turn on the toggle that says **Enhanced devices list experience**. Select **Apply**. -1. Refresh your browser. --You can now experience the enhanced **All devices** view. - ## Download devices Global readers, Cloud Device Administrators, Intune Administrators, and Global Administrators can use the **Download devices** option to export a CSV file that lists devices. You can apply filters to determine which devices to list. If you don't apply any filters, all devices are listed. An export task might run for as long as an hour, depending on your selections. If the export task exceeds 1 hour, it fails, and no file is output. The exported list includes these device identity attributes: `displayName,accountEnabled,operatingSystem,operatingSystemVersion,joinType (trustType),registeredOwners,userNames,mdmDisplayName,isCompliant,registrationTime,approximateLastSignInDateTime,deviceId,isManaged,objectId,profileType,systemLabels,model` +The following filters can be applied for the export task: ++- Enabled state +- Compliant state +- Join type +- Activity timestamp +- OS type +- Device type + ## Configure device settings If you want to manage device identities by using the Microsoft Entra admin center, the devices need to be either [registered or joined](overview.md) to Microsoft Entra ID. As an administrator, you can control the process of registering and joining devices by configuring the following device settings. |
active-directory | External Collaboration Settings Configure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/external-collaboration-settings-configure.md | External collaboration settings can be configured by using the Microsoft Graph A With the Guest Inviter role, you can give individual users the ability to invite guests without assigning them a global administrator or other admin role. Assign the Guest inviter role to individuals. Then make sure you set **Admins and users in the guest inviter role can invite** to **Yes**. -Here's an example that shows how to use PowerShell to add a user to the Guest Inviter role: +Here's an example that shows how to use Microsoft Graph PowerShell to add a user to the `Guest Inviter` role: +++```powershell ++Import-Module Microsoft.Graph.Identity.DirectoryManagement ++$roleName = "Guest Inviter" +$role = Get-MgDirectoryRole | where {$_.DisplayName -eq $roleName} +$userId = <User Id/User Principal Name> ++$DirObject = @{ + "@odata.id" = "https://graph.microsoft.com/v1.0/directoryObjects/$userId" + } ++New-MgDirectoryRoleMemberByRef -DirectoryRoleId $role.Id -BodyParameter $DirObject -``` -Add-MsolRoleMember -RoleObjectId 95e79109-95c0-4d8e-aee3-d01accf2d47b -RoleMemberEmailAddress <RoleMemberEmailAddress> ``` ## Sign-in logs for B2B users See the following articles on Microsoft Entra B2B collaboration: - [What is Microsoft Entra B2B collaboration?](what-is-b2b.md) - [Adding a B2B collaboration user to a role](./add-users-administrator.md)+ |
active-directory | Tenant Restrictions V2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/tenant-restrictions-v2.md | Tenant restrictions v2 policies can't be directly enforced on non-Windows 10, Wi ### Migrate tenant restrictions v1 policies to v2 +Migration of tenant restrictions from v1 to v2 is a one-time operation. Once you've moved from TRv1 to TRv2 on the proxy, no client-side changes are required, and any subsequent policy changes are made in the cloud through the Microsoft Entra portal. + On your corporate proxy, you can move from tenant restrictions v1 to tenant restrictions v2 by changing this tenant restrictions v1 header: `Restrict-Access-To-Tenants: <allowed-tenant-list>` |
active-directory | Whats New Archive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/whats-new-archive.md | All Devices List: - Columns can be reordered via drag and drop - Select all devices -For more information, see: [Manage devices in Azure AD using the Azure portal](../devices/manage-device-identities.md#view-and-filter-your-devices-preview). +For more information, see: [Manage devices in Azure AD using the Azure portal](../devices/manage-device-identities.md#view-and-filter-your-devices). Smart Lockout now synchronizes the lockout state across Azure AD data centers, s - + |
active-directory | Best Practices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/roles/best-practices.md | Some roles include privileged permissions, such as the ability to update credent :::image type="content" source="./media/best-practices/privileged-role-assignments-warning.png" alt-text="Screenshot of the Microsoft Entra roles and administrators page that shows the privileged role assignments warning." lightbox="./media/best-practices/privileged-role-assignments-warning.png"::: - You can identity roles, permissions, and role assignments that are privileged by looking for the **PRIVILEGED** label. For more information, see [Privileged roles and permissions in Microsoft Entra ID](privileged-roles-permissions.md). + You can identify roles, permissions, and role assignments that are privileged by looking for the **PRIVILEGED** label. For more information, see [Privileged roles and permissions in Microsoft Entra ID](privileged-roles-permissions.md). <a name='7-use-groups-for-azure-ad-role-assignments-and-delegate-the-role-assignment'></a> |
active-directory | Amazon Business Provisioning Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/saas-apps/amazon-business-provisioning-tutorial.md | This section guides you through the steps to configure the Microsoft Entra provi ![Screenshot of Token.](media/amazon-business-provisioning-tutorial/test-connection.png) + For **Tenant URL** and **Authorization endpoint** values please use the table below ++ |Country|Tenant URL|Authorization endpoint + |||| + |Canada|https://na.business-api.amazon.com/scim/v2/|https://www.amazon.ca/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |Germany|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.de/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |Spain|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.es/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |France|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.fr/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |GB/UK|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.co.uk/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |India|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.in/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |Italy|https://eu.business-api.amazon.com/scim/v2/|https://www.amazon.it/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |Japan|https://jp.business-api.amazon.com/scim/v2/|https://www.amazon.co.jp/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |Mexico|https://na.business-api.amazon.com/scim/v2/|https://www.amazon.com.mx/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + |US|https://na.business-api.amazon.com/scim/v2/|https://www.amazon.com/b2b/abws/oauth?state=1&redirect_uri=https://portal.azure.com/TokenAuthorize&applicationId=amzn1.sp.solution.ee27ec8c-1ee9-4c6b-9e68-26bdc37479d3| + 1. In the **Notification Email** field, enter the email address of a person or group who should receive the provisioning error notifications and select the **Send an email notification when a failure occurs** check box. ![Screenshot of Notification Email.](common/provisioning-notification-email.png) This section guides you through the steps to configure the Microsoft Entra provi |active|Boolean|| |emails[type eq "work"].value|String|| |name.givenName|String||- |name.givenName|String|| - |externalId|String|| + |name.familyName|String|| |externalId|String||+ |roles|List of appRoleAssignments [appRoleAssignments]|| 1. Under the **Mappings** section, select **Synchronize Microsoft Entra groups to Amazon Business**. |
active-directory | Diffchecker Provisioning Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/saas-apps/diffchecker-provisioning-tutorial.md | + + Title: 'Tutorial: Configure Diffchecker for automatic user provisioning with Microsoft Entra ID' +description: Learn how to automatically provision and deprovision user accounts from Microsoft Entra ID to Diffchecker. +++writer: twimmers ++ms.assetid: fe6b1d92-06e7-4933-9ef0-7aecd6b9b495 ++++ Last updated : 10/09/2023++++# Tutorial: Configure Diffchecker for automatic user provisioning ++This tutorial describes the steps you need to perform in both Diffchecker and Microsoft Entra ID to configure automatic user provisioning. When configured, Microsoft Entra ID automatically provisions and deprovisions users to [Diffchecker](https://www.diffchecker.com) using the Microsoft Entra provisioning service. For important details on what this service does, how it works, and frequently asked questions, see [Automate user provisioning and deprovisioning to SaaS applications with Microsoft Entra ID](../app-provisioning/user-provisioning.md). ++## Supported capabilities +> [!div class="checklist"] +> * Create users in Diffchecker. +> * Remove users in Diffchecker when they do not require access anymore. +> * Keep user attributes synchronized between Microsoft Entra ID and Diffchecker. +> * [Single sign-on](diffchecker-tutorial.md) to Diffchecker (recommended). ++## Prerequisites ++The scenario outlined in this tutorial assumes that you already have the following prerequisites: ++* [A Microsoft Entra tenant](../develop/quickstart-create-new-tenant.md) +* A user account in Microsoft Entra ID with [permission](../roles/permissions-reference.md) to configure provisioning (for example, Application Administrator, Cloud Application administrator, Application Owner, or Global Administrator). +* A user account in Diffchecker with Admin permissions. ++## Step 1: Plan your provisioning deployment +* Learn about [how the provisioning service works](../app-provisioning/user-provisioning.md). +* Determine who will be in [scope for provisioning](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). +* Determine what data to [map between Microsoft Entra ID and Diffchecker](../app-provisioning/customize-application-attributes.md). ++## Step 2: Configure Diffchecker to support provisioning with Microsoft Entra ID +Contact Diffchecker support to configure Diffchecker to support provisioning with Microsoft Entra ID. ++## Step 3: Add Diffchecker from the Microsoft Entra application gallery ++Add Diffchecker from the Microsoft Entra application gallery to start managing provisioning to Diffchecker. If you have previously setup Diffchecker for SSO, you can use the same application. However it's recommended that you create a separate app when testing out the integration initially. Learn more about adding an application from the gallery [here](../manage-apps/add-application-portal.md). ++## Step 4: Define who will be in scope for provisioning ++The Microsoft Entra provisioning service allows you to scope who will be provisioned based on assignment to the application and or based on attributes of the user. If you choose to scope who will be provisioned to your app based on assignment, you can use the following [steps](../manage-apps/assign-user-or-group-access-portal.md) to assign users to the application. 
If you choose to scope who will be provisioned based solely on attributes of the user, you can use a scoping filter as described [here](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++* Start small. Test with a small set of users before rolling out to everyone. When scope for provisioning is set to assigned users, you can control this by assigning one or two users to the app. When scope is set to all users, you can specify an [attribute based scoping filter](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++* If you need more roles, you can [update the application manifest](../develop/howto-add-app-roles-in-azure-ad-apps.md) to add new roles. ++## Step 5: Configure automatic user provisioning to Diffchecker ++This section guides you through the steps to configure the Microsoft Entra provisioning service to create, update, and disable users in Diffchecker based on user assignments in Microsoft Entra ID. ++<a name='to-configure-automatic-user-provisioning-for-Diffchecker-in-azure-ad'></a> ++### To configure automatic user provisioning for Diffchecker in Microsoft Entra ID: ++1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least a [Cloud Application Administrator](../roles/permissions-reference.md#cloud-application-administrator). +1. Browse to **Identity** > **Applications** > **Enterprise applications** ++ ![Screenshot of Enterprise applications blade.](common/enterprise-applications.png) ++1. In the applications list, select **Diffchecker**. ++ ![Screenshot of the Diffchecker link in the Applications list.](common/all-applications.png) ++1. Select the **Provisioning** tab. ++ ![Screenshot of Provisioning tab.](common/provisioning.png) ++1. Set the **Provisioning Mode** to **Automatic**. ++ ![Screenshot of Provisioning tab automatic.](common/provisioning-automatic.png) ++1. Under the **Admin Credentials** section, input your Diffchecker Tenant URL and Secret Token. Click **Test Connection** to ensure Microsoft Entra ID can connect to Diffchecker. If the connection fails, ensure your Diffchecker account has Admin permissions and try again. ++ ![Screenshot of Token.](common/provisioning-testconnection-tenanturltoken.png) ++1. In the **Notification Email** field, enter the email address of a person who should receive the provisioning error notifications and select the **Send an email notification when a failure occurs** check box. ++ ![Screenshot of Notification Email.](common/provisioning-notification-email.png) ++1. Select **Save**. ++1. Under the **Mappings** section, select **Synchronize Microsoft Entra users to Diffchecker**. ++1. Review the user attributes that are synchronized from Microsoft Entra ID to Diffchecker in the **Attribute-Mapping** section. The attributes selected as **Matching** properties are used to match the user accounts in Diffchecker for update operations. If you choose to change the [matching target attribute](../app-provisioning/customize-application-attributes.md), you need to ensure that the Diffchecker API supports filtering users based on that attribute. Select the **Save** button to commit any changes. ++ |Attribute|Type|Supported for filtering|Required by Diffchecker| + ||||| + |userName|String|✓|✓ + |active|Boolean|| + |emails[type eq "work"].value|String||✓ + |name.givenName|String|| + |name.familyName|String|| ++1. 
To configure scoping filters, refer to the following instructions provided in the [Scoping filter tutorial](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++1. To enable the Microsoft Entra provisioning service for Diffchecker, change the **Provisioning Status** to **On** in the **Settings** section. ++ ![Screenshot of Provisioning Status Toggled On.](common/provisioning-toggle-on.png) ++1. Define the users that you would like to provision to Diffchecker by choosing the desired values in **Scope** in the **Settings** section. ++ ![Screenshot of Provisioning Scope.](common/provisioning-scope.png) ++1. When you're ready to provision, click **Save**. ++ ![Screenshot of Saving Provisioning Configuration.](common/provisioning-configuration-save.png) ++This operation starts the initial synchronization cycle of all users defined in **Scope** in the **Settings** section. The initial cycle takes longer to perform than subsequent cycles, which occur approximately every 40 minutes as long as the Microsoft Entra provisioning service is running. ++## Step 6: Monitor your deployment +Once you've configured provisioning, use the following resources to monitor your deployment: ++* Use the [provisioning logs](../reports-monitoring/concept-provisioning-logs.md) to determine which users have been provisioned successfully or unsuccessfully +* Check the [progress bar](../app-provisioning/application-provisioning-when-will-provisioning-finish-specific-user.md) to see the status of the provisioning cycle and how close it's to completion +* If the provisioning configuration seems to be in an unhealthy state, the application goes into quarantine. Learn more about quarantine states [here](../app-provisioning/application-provisioning-quarantine-status.md). ++## More resources ++* [Managing user account provisioning for Enterprise Apps](../app-provisioning/configure-automatic-user-provisioning-portal.md) +* [What is application access and single sign-on with Microsoft Entra ID?](../manage-apps/what-is-single-sign-on.md) ++## Next steps ++* [Learn how to review logs and get reports on provisioning activity](../app-provisioning/check-status-user-account-provisioning.md) |
active-directory | Gong Provisioning Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/saas-apps/gong-provisioning-tutorial.md | The scenario outlined in this tutorial assumes that you already have the followi 1. In the **Update settings** area, define how settings can be managed for this assignment: * Select **Manual editing** to manage data capture and permission settings for users in this assignment in Gong. After you create the assignment: if you make changes to group settings in Microsoft Entra ID, they will not be pushed to Gong. However, you can edit the group settings manually in Gong.- * (Recommended) Select **Automatic updates** to give Microsoft Entra ID Governance over data capture and permission settings in Gong. + * (Recommended) Select **Automatic updates** to give Microsoft Entra ID control over data capture and permission settings in Gong. Define data capture and permission settings in Gong only when creating an assignment. Thereafter, other changes will only be applied to users in groups with this assignment when pushed from Microsoft Entra ID. 1. Click **ADD ASSIGNMENT**. 1. For orgs that don't have assignments (step 3), select the permission profile to apply to automatically provisioned users. |
active-directory | Team Today Provisioning Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/saas-apps/team-today-provisioning-tutorial.md | + + Title: 'Tutorial: Configure Team Today for automatic user provisioning with Microsoft Entra ID' +description: Learn how to automatically provision and deprovision user accounts from Microsoft Entra ID to Team Today. +++writer: twimmers ++ms.assetid: 9d3f64dc-d18a-44e4-a13b-d5e37e2aac3a ++++ Last updated : 10/09/2023++++# Tutorial: Configure Team Today for automatic user provisioning ++This tutorial describes the steps you need to perform in both Team Today and Microsoft Entra ID to configure automatic user provisioning. When configured, Microsoft Entra ID automatically provisions and deprovisions users to [Team Today](https://team-today.com) using the Microsoft Entra provisioning service. For important details on what this service does, how it works, and frequently asked questions, see [Automate user provisioning and deprovisioning to SaaS applications with Microsoft Entra ID](../app-provisioning/user-provisioning.md). ++## Supported capabilities +> [!div class="checklist"] +> * Create users in Team Today. +> * Remove users in Team Today when they do not require access anymore. +> * Keep user attributes synchronized between Microsoft Entra ID and Team Today. +> * [Single sign-on](../manage-apps/add-application-portal-setup-oidc-sso.md) to Team Today (recommended). ++## Prerequisites ++The scenario outlined in this tutorial assumes that you already have the following prerequisites: ++* [A Microsoft Entra tenant](../develop/quickstart-create-new-tenant.md) +* A user account in Microsoft Entra ID with [permission](../roles/permissions-reference.md) to configure provisioning (for example, Application Administrator, Cloud Application administrator, Application Owner, or Global Administrator). +* A user account in Team Today with Admin permissions. ++## Step 1: Plan your provisioning deployment +* Learn about [how the provisioning service works](../app-provisioning/user-provisioning.md). +* Determine who will be in [scope for provisioning](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). +* Determine what data to [map between Microsoft Entra ID and Team Today](../app-provisioning/customize-application-attributes.md). ++## Step 2: Configure Team Today to support provisioning with Microsoft Entra ID +Contact Team Today support to configure Team Today to support provisioning with Microsoft Entra ID. ++## Step 3: Add Team Today from the Microsoft Entra application gallery ++Add Team Today from the Microsoft Entra application gallery to start managing provisioning to Team Today. If you have previously setup Team Today for SSO, you can use the same application. However it's recommended that you create a separate app when testing out the integration initially. Learn more about adding an application from the gallery [here](../manage-apps/add-application-portal.md). ++## Step 4: Define who will be in scope for provisioning ++The Microsoft Entra provisioning service allows you to scope who will be provisioned based on assignment to the application and or based on attributes of the user. If you choose to scope who will be provisioned to your app based on assignment, you can use the following [steps](../manage-apps/assign-user-or-group-access-portal.md) to assign users to the application. 
If you choose to scope who will be provisioned based solely on attributes of the user, you can use a scoping filter as described [here](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++* Start small. Test with a small set of users before rolling out to everyone. When scope for provisioning is set to assigned users, you can control this by assigning one or two users to the app. When scope is set to all users, you can specify an [attribute based scoping filter](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++* If you need more roles, you can [update the application manifest](../develop/howto-add-app-roles-in-azure-ad-apps.md) to add new roles. ++## Step 5: Configure automatic user provisioning to Team Today ++This section guides you through the steps to configure the Microsoft Entra provisioning service to create, update, and disable users in Team Today based on user assignments in Microsoft Entra ID. ++<a name='to-configure-automatic-user-provisioning-for-Team Today-in-azure-ad'></a> ++### To configure automatic user provisioning for Team Today in Microsoft Entra ID: ++1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least a [Cloud Application Administrator](../roles/permissions-reference.md#cloud-application-administrator). +1. Browse to **Identity** > **Applications** > **Enterprise applications** ++ ![Screenshot of Enterprise applications blade.](common/enterprise-applications.png) ++1. In the applications list, select **Team Today**. ++ ![Screenshot of the Team Today link in the Applications list.](common/all-applications.png) ++1. Select the **Provisioning** tab. ++ ![Screenshot of Provisioning tab.](common/provisioning.png) ++1. Set the **Provisioning Mode** to **Automatic**. ++ ![Screenshot of Provisioning tab automatic.](common/provisioning-automatic.png) ++1. Under the **Admin Credentials** section, input your Team Today Tenant URL and Secret Token. Click **Test Connection** to ensure Microsoft Entra ID can connect to Team Today. If the connection fails, ensure your Team Today account has Admin permissions and try again. ++ ![Screenshot of Token.](common/provisioning-testconnection-tenanturltoken.png) ++1. In the **Notification Email** field, enter the email address of a person who should receive the provisioning error notifications and select the **Send an email notification when a failure occurs** check box. ++ ![Screenshot of Notification Email.](common/provisioning-notification-email.png) ++1. Select **Save**. ++1. Under the **Mappings** section, select **Synchronize Microsoft Entra users to Team Today**. ++1. Review the user attributes that are synchronized from Microsoft Entra ID to Team Today in the **Attribute-Mapping** section. The attributes selected as **Matching** properties are used to match the user accounts in Team Today for update operations. If you choose to change the [matching target attribute](../app-provisioning/customize-application-attributes.md), you need to ensure that the Team Today API supports filtering users based on that attribute. Select the **Save** button to commit any changes. ++ |Attribute|Type|Supported for filtering|Required by Team Today| + ||||| + |userName|String|✓|✓ + |externalId|String|✓|✓ + |active|Boolean||✓ + |name.givenName|String||✓ + |name.familyName|String||✓ + |urn:ietf:params:scim:schemas:extension:enterprise:2.0:User:department|String|| ++1. 
To configure scoping filters, refer to the following instructions provided in the [Scoping filter tutorial](../app-provisioning/define-conditional-rules-for-provisioning-user-accounts.md). ++1. To enable the Microsoft Entra provisioning service for Team Today, change the **Provisioning Status** to **On** in the **Settings** section. ++ ![Screenshot of Provisioning Status Toggled On.](common/provisioning-toggle-on.png) ++1. Define the users that you would like to provision to Team Today by choosing the desired values in **Scope** in the **Settings** section. ++ ![Screenshot of Provisioning Scope.](common/provisioning-scope.png) ++1. When you're ready to provision, click **Save**. ++ ![Screenshot of Saving Provisioning Configuration.](common/provisioning-configuration-save.png) ++This operation starts the initial synchronization cycle of all users defined in **Scope** in the **Settings** section. The initial cycle takes longer to perform than subsequent cycles, which occur approximately every 40 minutes as long as the Microsoft Entra provisioning service is running. ++## Step 6: Monitor your deployment +Once you've configured provisioning, use the following resources to monitor your deployment: ++* Use the [provisioning logs](../reports-monitoring/concept-provisioning-logs.md) to determine which users have been provisioned successfully or unsuccessfully +* Check the [progress bar](../app-provisioning/application-provisioning-when-will-provisioning-finish-specific-user.md) to see the status of the provisioning cycle and how close it's to completion +* If the provisioning configuration seems to be in an unhealthy state, the application goes into quarantine. Learn more about quarantine states [here](../app-provisioning/application-provisioning-quarantine-status.md). ++## More resources ++* [Managing user account provisioning for Enterprise Apps](../app-provisioning/configure-automatic-user-provisioning-portal.md) +* [What is application access and single sign-on with Microsoft Entra ID?](../manage-apps/what-is-single-sign-on.md) ++## Next steps ++* [Learn how to review logs and get reports on provisioning activity](../app-provisioning/check-status-user-account-provisioning.md) |
ai-services | Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/authentication.md | description: "There are three ways to authenticate a request to an Azure AI serv -+ Last updated 08/30/2023 |
ai-services | Autoscale | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/autoscale.md | Title: Use the autoscale feature description: Learn how to use the autoscale feature for Azure AI services to dynamically adjust the rate limit of your service. -+ Last updated 06/27/2022 |
ai-services | Cognitive Services And Machine Learning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-and-machine-learning.md | |
ai-services | Cognitive Services Container Support | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-container-support.md | |
ai-services | Cognitive Services Custom Subdomains | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-custom-subdomains.md | description: Custom subdomain names for each Azure AI services resource are crea -+ Last updated 12/04/2020 |
ai-services | Cognitive Services Data Loss Prevention | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-data-loss-prevention.md | Title: Data loss prevention description: Azure AI services data loss prevention capabilities allow customers to configure the list of outbound URLs their Azure AI services resources are allowed to access. This configuration creates another level of control for customers to prevent data loss. -+ Last updated 03/31/2023 |
ai-services | Cognitive Services Development Options | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-development-options.md | |
ai-services | Cognitive Services Environment Variables | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-environment-variables.md | description: "This guide shows you how to set and retrieve environment variables -+ Last updated 09/09/2022 |
ai-services | Cognitive Services Limited Access | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-limited-access.md | description: Azure AI services that are available with Limited Access are descri -+ Last updated 10/27/2022 |
ai-services | Cognitive Services Support Options | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-support-options.md | description: How to obtain help and support for questions and problems when you -+ Last updated 06/28/2022 |
ai-services | Cognitive Services Virtual Networks | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/cognitive-services-virtual-networks.md | description: Configure layered network security for your Azure AI services resou -+ Last updated 08/10/2023 |
ai-services | Commitment Tier | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/commitment-tier.md | description: Learn how to sign up for commitment tier pricing, which is differen -+ Last updated 12/01/2022 |
ai-services | Create Account Bicep | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/create-account-bicep.md | description: Create an Azure AI service resource with Bicep. keywords: Azure AI services, cognitive solutions, cognitive intelligence, cognitive artificial intelligence -+ Last updated 01/19/2023 |
ai-services | Create Account Resource Manager Template | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/create-account-resource-manager-template.md | keywords: Azure AI services, cognitive solutions, cognitive intelligence, cognit -+ Last updated 09/01/2022 |
ai-services | Create Account Terraform | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/create-account-terraform.md | Title: 'Quickstart: Create an Azure AI services resource using Terraform' description: 'In this article, you create an Azure AI services resource using Terraform' keywords: Azure AI services, cognitive solutions, cognitive intelligence, cognitive artificial intelligence -+ Last updated 4/14/2023 |
ai-services | Diagnostic Logging | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/diagnostic-logging.md | description: This guide provides step-by-step instructions to enable diagnostic -+ Last updated 07/19/2021 |
ai-services | Disable Local Auth | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/disable-local-auth.md | description: "This article describes disabling local authentication in Azure AI -+ Last updated 09/22/2023 |
ai-services | Language Support | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/language-support.md | description: Azure AI services enable you to build applications that see, hear, -+ Last updated 07/18/2023 |
ai-services | Multi Service Resource | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/multi-service-resource.md | |
ai-services | Managed Identity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/how-to/managed-identity.md | In the following sections, you'll use the Azure CLI to assign roles, and obtain - An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a> - Access granted to the Azure OpenAI Service in the desired Azure subscription-- Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at [https://aka.ms/oai/access</a>. Open an issue on this repo to contact us if you have an issue.+- Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the [Request Access to Azure OpenAI Service form](https://aka.ms/oai/access). Open an issue on this repo to contact us if you have an issue. - [Custom subdomain names are required to enable features like Azure Active Directory (Azure AD) for authentication.]( ../../cognitive-services-custom-subdomains.md) |
ai-services | Role Based Access Control | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/how-to/role-based-access-control.md | recommendations: false # Role-based access control for Azure OpenAI Service -Azure OpenAI Service supports Azure role-based access control (Azure RBAC), an authorization system for managing individual access to Azure resources. Using Azure RBAC, you assign different team members different levels of permissions based on their needs for a given project. For more information, see the [Azure RBAC documentation](../../../role-based-access-control/index.yml) for more information. +Azure OpenAI Service supports Azure role-based access control (Azure RBAC), an authorization system for managing individual access to Azure resources. Using Azure RBAC, you assign different team members different levels of permissions based on their needs for a given project. For more information, see the [Azure RBAC documentation](../../../role-based-access-control/index.yml). ## Add role assignment to an Azure OpenAI resource |
ai-services | Plan Manage Costs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/plan-manage-costs.md | description: Learn how to plan for and manage costs for Azure AI services by usi -+ Last updated 11/03/2021 |
ai-services | Policy Reference | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/policy-reference.md | description: Lists Azure Policy built-in policy definitions for Azure AI service Last updated 09/19/2023 -+ |
ai-services | Recover Purge Resources | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/recover-purge-resources.md | |
ai-services | Responsible Use Of Ai Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/responsible-use-of-ai-overview.md | description: Azure AI services provides information and guidelines on how to res -+ Last updated 1/10/2022 |
ai-services | Rotate Keys | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/rotate-keys.md | description: "Learn how to rotate API keys for better security, without interrup -+ Last updated 11/08/2022 |
ai-services | Security Controls Policy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/security-controls-policy.md | Last updated 09/19/2023 -+ # Azure Policy Regulatory Compliance controls for Azure AI services |
ai-services | Security Features | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/security-features.md | description: Learn about the security considerations for Azure AI services usage -+ Last updated 12/02/2022 |
ai-services | Create Translator Resource | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/translator/create-translator-resource.md | Title: Create a Translator resource -description: This article shows you how to create an Azure AI Translator resource. +description: Learn how to create an Azure AI Translator resource and retrieve your API key and endpoint URL in the Azure portal. Last updated 09/06/2023 # Create a Translator resource -In this article, you learn how to create a Translator resource in the Azure portal. [Azure AI Translator](translator-overview.md) is a cloud-based machine translation service that is part of the [Azure AI services](../what-are-ai-services.md) family. Azure resources are instances of services that you create. All API requests to Azure AI services require an **endpoint** URL and a read-only **key** for authenticating access. +In this article, you learn how to create a Translator resource in the Azure portal. [Azure AI Translator](translator-overview.md) is a cloud-based machine translation service that is part of the [Azure AI services](../what-are-ai-services.md) family. Azure resources are instances of services that you create. All API requests to Azure AI services require an *endpoint* URL and a read-only *key* for authenticating access. ## Prerequisites -To get started, you need an active [**Azure account**](https://azure.microsoft.com/free/cognitive-services/). If you don't have one, you can [**create a free 12-month subscription**](https://azure.microsoft.com/free/). +To get started, you need an active [**Azure account**](https://azure.microsoft.com/free/cognitive-services/). If you don't have one, you can [**create a free 12-month subscription**](https://azure.microsoft.com/free/). ## Create your resource -The Translator service can be accessed through two different resource types: +With your Azure account, you can access the Translator service through two different resource types: * [**Single-service**](https://portal.azure.com/#create/Microsoft.CognitiveServicesTextTranslation) resource types enable access to a single service API key and endpoint. -* [**Multi-service**](https://portal.azure.com/#create/Microsoft.CognitiveServicesAllInOne) resource types enable access to multiple Azure AI services using a single API key and endpoint. +* [**Multi-service**](https://portal.azure.com/#create/Microsoft.CognitiveServicesAllInOne) resource types enable access to multiple Azure AI services by using a single API key and endpoint. ## Complete your project and instance details +After you decide which resource type you want to use to access the Translator service, you can enter the details for your project and instance. + 1. **Subscription**. Select one of your available Azure subscriptions. 1. **Resource Group**. You can create a new resource group or add your resource to a pre-existing resource group that shares the same lifecycle, permissions, and policies. 1. **Resource Region**. Choose **Global** unless your business or application requires a specific region. If you're planning on using the Document Translation feature with [managed identity authorization](document-translation/how-to-guides/create-use-managed-identities.md), choose a geographic region such as **East US**. -1. **Name**. Enter the name you have chosen for your resource. The name you choose must be unique within Azure. +1. **Name**. Enter a name for your resource. The name you choose must be unique within Azure.
> [!NOTE]- > If you are using a Translator feature that requires a custom domain endpoint, such as Document Translation, the value that you enter in the Name field will be the custom domain name parameter for the endpoint. + > If you're using a Translator feature that requires a custom domain endpoint, such as Document Translation, the value that you enter in the Name field will be the custom domain name parameter for the endpoint. 1. **Pricing tier**. Select a [pricing tier](https://azure.microsoft.com/pricing/details/cognitive-services/translator) that meets your needs: * Each subscription has a free tier. * The free tier has the same features and functionality as the paid plans and doesn't expire.- * Only one free tier is available per subscription. - * Document Translation is supported in paid tiers. The Language Studio only supports the S1 or D3 instance tiers. We suggest you select the Standard S1 instance tier to try Document Translation. + * Only one free tier resource is available per subscription. + * Document Translation is supported in paid tiers. The Language Studio only supports the S1 or D3 instance tiers. If you just want to try Document Translation, select the Standard S1 instance tier. -1. If you've created a multi-service resource, you need to confirm more usage details via the check boxes. +1. If you've created a multi-service resource, the links at the bottom of the **Basics** tab provide technical documentation regarding the appropriate operation of the service. 1. Select **Review + Create**. -1. Review the service terms and select **Create** to deploy your resource. +1. Review the service terms, and select **Create** to deploy your resource. 1. After your resource has successfully deployed, select **Go to resource**. All Azure AI services API requests require an endpoint URL and a read-only key f * **Authentication keys**. Your key is a unique string that is passed on every request to the Translation service. You can pass your key through a query-string parameter or by specifying it in the HTTP request header. -* **Endpoint URL**. Use the Global endpoint in your API request unless you need a specific Azure region or custom endpoint. *See* [Base URLs](reference/v3-0-reference.md#base-urls). The Global endpoint URL is `api.cognitive.microsofttranslator.com`. +* **Endpoint URL**. Use the Global endpoint in your API request unless you need a specific Azure region or custom endpoint. For more information, see [Base URLs](reference/v3-0-reference.md#base-urls). The Global endpoint URL is `api.cognitive.microsofttranslator.com`. ## Get your authentication keys and endpoint -1. After your new resource deploys, select **Go to resource** or navigate directly to your resource page. -1. In the left rail, under *Resource Management*, select **Keys and Endpoint**. -1. Copy and paste your keys and endpoint URL in a convenient location, such as *Microsoft Notepad*. +To authenticate your connection to your Translator resource, you'll need to find its keys and endpoint. ++1. After your new resource deploys, select **Go to resource** or go to your resource page. +1. In the left navigation pane, under **Resource Management**, select **Keys and Endpoint**. +1. Copy and paste your keys and endpoint URL in a convenient location, such as Notepad. ## Create a Text Translation client Text Translation supports both [global and regional endpoints](#complete-your-pr > > Deleting a resource group also deletes all resources contained in the group.
-To remove an Azure AI multi-service or Translator resource, you can **delete the resource** or **delete the resource group**. - To delete the resource: -1. Navigate to your Resource Group in the Azure portal. +1. Search and select **Resource groups** in the Azure portal, and select your resource group. 1. Select the resources to be deleted by selecting the adjacent check box. 1. Select **Delete** from the top menu near the right edge.-1. Type *yes* in the **Deleted Resources** dialog box. +1. Enter *delete* in the **Delete Resources** dialog box. 1. Select **Delete**. To delete the resource group: -1. Navigate to your Resource Group in the Azure portal. -1. Select the **Delete resource group** from the top menu bar near the left edge. +1. Go to your Resource Group in the Azure portal. +1. Select **Delete resource group** from the top menu bar. 1. Confirm the deletion request by entering the resource group name and selecting **Delete**. -## How to get started with Translator +## How to get started with Azure AI Translator REST APIs In our quickstart, you learn how to use the Translator service with REST APIs. > [!div class="nextstepaction"] > [Get Started with Translator](quickstart-text-rest-api.md) -## More resources +## Next Steps -* [Microsoft Translator code samples](https://github.com/MicrosoftTranslator). Multi-language Translator code samples are available on GitHub. +* [Microsoft Translator code samples](https://github.com/MicrosoftTranslator). Multi-language Translator code samples are available on GitHub. * [Microsoft Translator Support Forum](https://www.aka.ms/TranslatorForum) * [Get Started with Azure (3-minute video)](https://azure.microsoft.com/get-started/?b=16.24) |
ai-services | Use Key Vault | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/use-key-vault.md | Title: Develop Azure AI services applications with Key Vault description: Learn how to develop Azure AI services applications securely by using Key Vault. -+ Last updated 09/13/2022 |
ai-services | What Are Ai Services | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/what-are-ai-services.md | |
aks | Auto Upgrade Node Image | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/aks/auto-upgrade-node-image.md | -# Auto-upgrade Azure Kubernetes Service cluster node OS images +# Auto-upgrade Azure Kubernetes Service cluster node OS images + AKS now supports the node OS auto-upgrade channel, an exclusive channel dedicated to controlling node-level OS security updates. This channel can't be used for cluster-level Kubernetes version upgrades. ## How does node OS auto-upgrade work with cluster auto-upgrade? The default cadence means there's no planned maintenance window applied. |Channel|Updates Ownership|Default cadence| |||-| `Unmanaged`|OS driven security updates. AKS has no control over these updates|Nightly around 6AM UTC for Ubuntu and Mariner, Windows every month.| -| `SecurityPatch`|AKS|Weekly| -| `NodeImage`|AKS|Weekly| +| `Unmanaged`|OS driven security updates. AKS has no control over these updates.|Nightly around 6AM UTC for Ubuntu and Azure Linux. Monthly for Windows.| +| `SecurityPatch`|AKS|Weekly.| +| `NodeImage`|AKS|Weekly.| ## Prerequisites |
aks | Windows Faq | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/aks/windows-faq.md | This article outlines some of the frequently asked questions and OS concepts for AKS uses Windows Server 2019 and Windows Server 2022 as the host OS version and only supports process isolation. Container images built by using other Windows Server versions are not supported. For more information, see [Windows container version compatibility][windows-container-compat]. For Kubernetes version 1.25 and higher, Windows Server 2022 is the default operating system. Windows Server 2019 is being retired after Kubernetes version 1.32 reaches end of life (EOL) and won't be supported in future releases. For more information about this retirement, see the [AKS release notes][aks-release-notes]. -## Is Kubernetes different on Windows and Linux? --Windows Server node pool support includes some limitations that are part of the upstream Windows Server in Kubernetes project. These limitations are not specific to AKS. For more information on the upstream support from the Kubernetes project, see the [Supported functionality and limitations][upstream-limitations] section of the [Intro to Windows support in Kubernetes][intro-windows] document. --Historically, Kubernetes is Linux-focused. Many examples used in the upstream [Kubernetes.io][kubernetes] website are intended for use on Linux nodes. When you create deployments that use Windows Server containers, the following considerations at the OS level apply: --- **Identity**: Linux identifies a user by an integer user identifier (UID). A user also has an alphanumeric user name for logging on, which Linux translates to the user's UID. Similarly, Linux identifies a user group by an integer group identifier (GID) and translates a group name to its corresponding GID.- Windows Server uses a larger binary security identifier (SID) that's stored in the Windows Security Access Manager (SAM) database. This database is not shared between the host and containers, or between containers. -- **File permissions**: Windows Server uses an access control list based on SIDs, rather than a bitmask of permissions and UID+GID.-- **File paths**: The convention on Windows Server is to use \ instead of /. - In pod specs that mount volumes, specify the path correctly for Windows Server containers. For example, rather than a mount point of */mnt/volume* in a Linux container, specify a drive letter and location such as */K/Volume* to mount as the *K:* drive. - ## What kind of disks are supported for Windows? Azure Disks and Azure Files are the supported volume types, and are accessed as NTFS volumes in the Windows Server container. |
aks | Windows Vs Linux Containers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/aks/windows-vs-linux-containers.md | + + Title: Windows container considerations in Kubernetes ++description: See the Windows container considerations in Kubernetes. + Last updated : 10/05/2023++++++# Windows container considerations in Kubernetes ++When you create deployments that use Windows Server containers on Azure Kubernetes Service (AKS), there are a few differences relative to Linux deployments you should keep in mind. For a detailed comparison of the differences between Windows and Linux in upstream Kubernetes, see [Windows containers in Kubernetes](https://kubernetes.io/docs/concepts/windows/intro/). ++Some of the major differences include: ++- **Identity**: Windows Server uses a larger binary security identifier (SID) that's stored in the Windows Security Access Manager (SAM) database. This database isn't shared between the host and containers or between containers. +- **File permissions**: Windows Server uses an access control list based on SIDs rather than a bitmask of permissions and UID+GID. +- **File paths**: The convention on Windows Server is to use \ instead of /. In pod specs that mount volumes, specify the path correctly for Windows Server containers. For example, rather than a mount point of */mnt/volume* in a Linux container, specify a drive letter and location such as */K/Volume* to mount as the *K:* drive. ++> [!NOTE] +> For Kubernetes versions 1.25 and higher, Windows Server 2022 is the default OS. Windows Server 2019 is being retired after Kubernetes version 1.32 reaches end-of-life (EOL) and won't be supported in future releases. For more information, see the [AKS release notes][aks-release-notes]. ++This article covers important considerations to keep in mind when using Windows containers instead of Linux containers in Kubernetes. For an in-depth comparison of Windows and Linux containers, see [Comparison with Linux][comparison-with-linux]. ++## Considerations ++| Feature | Windows considerations | +|--|:--| +| [Cluster creation][cluster-configuration] | • The first system node pool *must* be Linux.<br/> • AKS Windows clusters have a maximum limit of 10 node pools.<br/> • AKS Windows clusters have a maximum limit of 100 nodes in each node pool.<br/> • The Windows Server node pool name has a limit of six characters. | +| [Privileged containers][privileged-containers] | Not supported. The equivalent is **HostProcess containers (HPC)**. | +| [HPC containers][hpc-containers] | • HostProcess containers are the Windows alternative to Linux privileged containers. For more information, see [Create a Windows HostProcess pod](https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/). | +| [Azure Network Policy Manager (Azure)][azure-network-policy] | Azure Network Policy Manager doesn't support:<br/> • Named ports<br/> • SCTP protocol<br/> • Negative match labels or namespace selectors (all labels except "debug=true")<br/> • "except" CIDR blocks (a CIDR with exceptions)<br/> • Windows Server 2019<br/> | +| [Node upgrade][node-upgrade] | Windows Server nodes on AKS don't automatically apply Windows updates. Instead, you perform a node pool upgrade or [node image upgrade][node-image-upgrade]. These upgrades deploy new nodes with the latest Windows Server 2019 and Windows Server 2022 base node image and security patches. | +| [AKS Image Cleaner][aks-image-cleaner] | Not supported. | +| [BYOCNI][byo-cni] | Not supported. | +| [Open Service Mesh][open-service-mesh] | Not supported. | +| [GPU][gpu] | Not supported. | +| [Multi-instance GPU][multi-instance-gpu] | Not supported. | +| [Generation 2 VMs (preview)][gen-2-vms] | Supported in preview. | +| [Custom node config][custom-node-config] | • Custom node config has two configurations:<br/> • [kubelet][custom-kubelet-parameters]: Supported in preview.<br/> • OS config: Not supported. | ++## Next steps ++For more information on Windows containers, see the [Windows Server containers FAQ][windows-server-containers-faq]. ++<!-- LINKS - external --> +[aks-release-notes]: https://github.com/Azure/AKS/releases +[comparison-with-linux]: https://kubernetes.io/docs/concepts/windows/intro/#compatibility-linux-similarities ++<!-- LINKS - internal --> +[cluster-configuration]: ../aks/learn/quick-windows-container-deploy-cli.md#limitations +[privileged-containers]: use-windows-hpc.md#limitations +[hpc-containers]: use-windows-hpc.md#limitations +[node-upgrade]: ./manage-node-pools.md#upgrade-a-single-node-pool +[aks-image-cleaner]: image-cleaner.md#limitations +[windows-server-containers-faq]: windows-faq.md +[azure-network-policy]: use-network-policies.md#overview-of-network-policy +[node-image-upgrade]: node-image-upgrade.md +[byo-cni]: use-byo-cni.md +[open-service-mesh]: open-service-mesh-about.md +[gpu]: gpu-cluster.md +[multi-instance-gpu]: gpu-multi-instance.md +[gen-2-vms]: cluster-configuration.md#generation-2-virtual-machines +[custom-node-config]: custom-node-configuration.md +[custom-kubelet-parameters]: custom-node-configuration.md#kubelet-custom-configuration |
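The cluster-creation limits called out in the entry above (the first system node pool must be Linux, and a Windows node pool name has a six-character limit) are easiest to see in a CLI example. The following is a minimal sketch, not taken from the source article; the resource group, cluster, and node pool names are placeholder assumptions.

```azurecli
# Sketch: add a Windows Server node pool to an existing AKS cluster.
# The first system node pool stays Linux; the Windows pool name
# ("npwin" here) must be six characters or fewer.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name npwin \
    --os-type Windows \
    --node-count 1
```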
api-management | Api Management Features | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-features.md | Each API Management [pricing tier](https://aka.ms/apimpricing) offers a distinct > [!IMPORTANT] > * The Developer tier is for non-production use cases and evaluations. It doesn't offer SLA.-> * The Consumption tier isn't available in the US Government cloud or the Microsoft Azure operated by 21Vianet cloud. +> * The Consumption tier isn't available in the US Government cloud or the Microsoft Azure operated by 21Vianet cloud. +> * API Management **v2 tiers** are now in preview, with updated feature availability. [Learn more](v2-service-tiers-overview.md). + | Feature | Consumption | Developer | Basic | Standard | Premium | | -- | -- | | -- | -- | - | |
api-management | Api Management Gateways Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-gateways-overview.md | Related information: * For more information about the API Management service tiers and features, see [Feature-based comparison of the Azure API Management tiers](api-management-features.md). + ## Role of the gateway The API Management *gateway* (also called *data plane* or *runtime*) is the service component that's responsible for proxying API requests, applying policies, and collecting telemetry. |
api-management | Api Management Howto Cache | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-howto-cache.md | What you'll learn: > [!NOTE] > Internal cache is not available in the **Consumption** tier of Azure API Management. You can [use an external Azure Cache for Redis](api-management-howto-cache-external.md) instead.+> +> For feature availability in the v2 tiers (preview), see the [v2 tiers overview](v2-service-tiers-overview.md). + ## Prerequisites |
api-management | Api Management Howto Developer Portal Customize | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-howto-developer-portal-customize.md | Title: Tutorial - Access and customize the developer portal - Azure API Management | Microsoft Docs -description: Follow this to tutorial to learn how to customize the API Management developer portal, an automatically generated, fully customizable website with the documentation of your APIs. +description: In this tutorial, customize the API Management developer portal, an automatically generated, fully customizable website with the documentation of your APIs. Previously updated : 11/21/2022 Last updated : 09/06/2023 You can find more details on the developer portal in the [Azure API Management d Follow the steps below to access the managed version of the portal. 1. In the [Azure portal](https://portal.azure.com), navigate to your API Management instance.-1. Select the **Developer portal** button in the top navigation bar. A new browser tab with an administrative version of the portal will open. -+1. If you created your instance in a v2 service tier that supports the developer portal, first enable the developer portal. + 1. In the left menu, under **Developer portal**, select **Portal settings**. + 1. In the **Portal settings** window, select **Enabled**. Select **Save**. + + It might take a few minutes to enable the developer portal. +1. In the left menu, under **Developer portal**, select **Portal overview**. Then select the **Developer portal** button in the top navigation bar. A new browser tab with an administrative version of the portal will open. ## Developer portal architectural concepts |
api-management | Api Management Howto Manage Protocols Ciphers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-howto-manage-protocols-ciphers.md | By default, API Management enables TLS 1.2 for client and backend connectivity a :::image type="content" source="media/api-management-howto-manage-protocols-ciphers/api-management-protocols-ciphers.png" alt-text="Screenshot of managing protocols and ciphers in the Azure portal."::: > [!NOTE] > * If you're using the self-hosted gateway, see [self-hosted gateway security](self-hosted-gateway-overview.md#security) to manage TLS protocols and cipher suites. |
api-management | Api Management Using With Internal Vnet | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-using-with-internal-vnet.md | Title: Connect to an internal virtual network using Azure API Management -description: Learn how to set up and configure Azure API Management in a virtual network using internal mode + Title: Deploy Azure API Management instance to internal VNet +description: Learn how to deploy (inject) your Azure API instance to a virtual network in internal mode and access API backends through it. Last updated 01/03/2022 -# Connect to a virtual network in internal mode using Azure API Management -With Azure virtual networks (VNets), Azure API Management can manage internet-inaccessible APIs using several VPN technologies to make the connection. For VNet connectivity options, requirements, and considerations, see [Using a virtual network with Azure API Management](virtual-network-concepts.md). +# Deploy your Azure API Management instance to a virtual network - internal mode ++Azure API Management can be deployed (injected) inside an Azure virtual network (VNet) to access backend services within the network. For VNet connectivity options, requirements, and considerations, see [Using a virtual network with Azure API Management](virtual-network-concepts.md). This article explains how to set up VNet connectivity for your API Management instance in the *internal* mode. In this mode, you can only access the following API Management endpoints within a VNet whose access you control. * The API gateway Use API Management in internal mode to: :::image type="content" source="media/api-management-using-with-internal-vnet/api-management-vnet-internal.png" alt-text="Connect to internal VNet"::: -For configurations specific to the *external* mode, where the API Management endpoints are accessible from the public internet, and backend services are located in the network, see [Connect to a virtual network using Azure API Management](api-management-using-with-vnet.md). +For configurations specific to the *external* mode, where the API Management endpoints are accessible from the public internet, and backend services are located in the network, see [Deploy your Azure API Management instance to a virtual network - external mode](api-management-using-with-vnet.md). [!INCLUDE [updated-for-az](../../includes/updated-for-az.md)] |
api-management | Api Management Using With Vnet | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-using-with-vnet.md | Title: Connect to a virtual network using Azure API Management -description: Learn how to set up a connection to a virtual network in Azure API Management and access API backends through it. + Title: Deploy Azure API Management instance to external VNet +description: Learn how to deploy (inject) your Azure API instance to a virtual network in external mode and access API backends through it. -# Connect to a virtual network using Azure API Management +# Deploy your Azure API Management instance to a virtual network - external mode -Azure API Management can be deployed inside an Azure virtual network (VNet) to access backend services within the network. For VNet connectivity options, requirements, and considerations, see [Using a virtual network with Azure API Management](virtual-network-concepts.md). +Azure API Management can be deployed (injected) inside an Azure virtual network (VNet) to access backend services within the network. For VNet connectivity options, requirements, and considerations, see [Using a virtual network with Azure API Management](virtual-network-concepts.md). This article explains how to set up VNet connectivity for your API Management instance in the *external* mode, where the developer portal, API gateway, and other API Management endpoints are accessible from the public internet, and backend services are located in the network. :::image type="content" source="media/api-management-using-with-vnet/api-management-vnet-external.png" alt-text="Connect to external VNet"::: -For configurations specific to the *internal* mode, where the endpoints are accessible only within the VNet, see [Connect to an internal virtual network using Azure API Management](./api-management-using-with-internal-vnet.md). +For configurations specific to the *internal* mode, where the endpoints are accessible only within the VNet, see [Deploy your Azure API Management instance to a virtual network - internal mode](./api-management-using-with-internal-vnet.md). [!INCLUDE [updated-for-az](../../includes/updated-for-az.md)] |
api-management | Compute Infrastructure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/compute-infrastructure.md | description: Learn about the compute platform used to host your API Management s Previously updated : 04/17/2023 Last updated : 09/18/2023 Most new instances created in service tiers other than the Consumption tier are ## What are the compute platforms for API Management? -The following table summarizes the compute platforms currently used for instances in the different API Management service tiers. +The following table summarizes the compute platforms currently used in the **Consumption**, **Developer**, **Basic**, **Standard**, and **Premium** tiers of API Management. | Version | Description | Architecture | Tiers | | -| -| -- | - | The `stv2` platform infrastructure supports several resiliency and security feat Migration steps depend on features enabled in your API Management instance. If the instance isn't injected in a VNet, you can use a migration API. For instances that are VNet-injected, follow manual steps. For details, see the [migration guide](migrate-stv1-to-stv2.md). +## What about the v2 pricing tiers? ++The v2 pricing tiers are a new set of tiers for API Management currently in preview. Hosted on a new, highly scalable and available Azure infrastructure that's different from the `stv1` and `stv2` compute platforms, the v2 tiers aren't affected by the retirement of the `stv1` platform. ++The v2 tiers are designed to make API Management accessible to a broader set of customers and offer flexible options for a wider variety of scenarios. For more information, see [v2 tiers overview](v2-service-tiers-overview.md). + ## Next steps * [Migrate an API Management instance to the stv2 platform](migrate-stv1-to-stv2.md). |
api-management | Integrate Vnet Outbound | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/integrate-vnet-outbound.md | + + Title: Connect API Management instance to a private network | Microsoft Docs +description: Learn how to integrate an Azure API Management instance in the Standard v2 tier with a virtual network to access backend APIs hosted within the network. ++++ Last updated : 09/20/2023+++# Integrate an Azure API Management instance with a private VNet for outbound connections (preview) ++This article guides you through the process of configuring *VNet integration* for your Azure API Management instance so that your API Management instance can make outbound requests to API backends that are isolated in the network. ++When an API Management instance is integrated with a virtual network for outbound requests, the API Management itself is not deployed in a VNet; the gateway and other endpoints remain publicly accessible. In this configuration, the API Management instance can reach both public and network-isolated backend services. ++++## Prerequisites ++- An Azure API Management instance in the [Standard v2](v2-service-tiers-overview.md) pricing tier +- A virtual network with a subnet where your API Management backend APIs are hosted + - The network must be deployed in the same region as your API Management instance +- (Optional) For testing, a sample backend API hosted within a different subnet in the virtual network. For example, see [Tutorial: Establish Azure Functions private site access](../azure-functions/functions-create-private-site-access.md). ++## Delegate the subnet ++The subnet used for integration must be delegated to the **Microsoft.Web/serverFarms** service. In the subnet settings, in **Delegate subnet to a service**, select **Microsoft.Web/serverFarms**. +++For details, see [Add or remove a subnet delegation](../virtual-network/manage-subnet-delegation.md). ++## Enable VNet integration ++This section will guide you through the process of enabling VNet integration for your Azure API Management instance. ++1. In the [Azure portal](https://portal.azure.com), navigate to your API Management instance. +1. In the left menu, under **Deployment + Infrastructure**, select **Network**. +1. On the **Outbound traffic** card, select **VNET integration**. ++ :::image type="content" source="media/integrate-vnet-outbound/integrate-vnet.png" lightbox="media/integrate-vnet-outbound/integrate-vnet.png" alt-text="Screenshot of VNet integration in the portal."::: ++1. In the **Virtual network** blade, enable the **Virtual network** checkbox. +1. Select the location of your API Management instance. +1. In **Virtual network**, select the virtual network and the delegated subnet that you want to integrate. +1. Select **Apply**, and then select **Save**. The VNet is integrated. ++ :::image type="content" source="media/integrate-vnet-outbound/vnet-settings.png" lightbox="media/integrate-vnet-outbound/vnet-settings.png" alt-text="Screenshot of VNet settings in the portal."::: ++## (Optional) Test VNet integration ++If you have an API hosted in the virtual network, you can import it to your Management instance and test the VNet integration. For basic steps, see [Import and publish an API](import-and-publish.md). +++## Related content ++* [Use a virtual network with Azure API Management](virtual-network-concepts.md) +++ |
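As a companion to the subnet-delegation step described in the entry above, the delegation can also be applied with the Azure CLI instead of the portal. This is a minimal sketch rather than the article's own procedure; the resource group, VNet, and subnet names are placeholders.

```azurecli
# Sketch: delegate the integration subnet to Microsoft.Web/serverFarms
# before enabling outbound VNet integration for a Standard v2 instance.
az network vnet subnet update \
    --resource-group myResourceGroup \
    --vnet-name myVNet \
    --name apim-integration-subnet \
    --delegations Microsoft.Web/serverFarms
```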
api-management | V2 Service Tiers Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/v2-service-tiers-overview.md | + + Title: Azure API Management - v2 tiers (preview) +description: Introduction to key scenarios, capabilities, and concepts of the v2 tiers (SKUs) of the Azure API Management service. The v2 tiers are in preview. +++editor: '' + ++ Last updated : 10/02/2023+++++# New Azure API Management tiers (preview) ++We're introducing a new set of pricing tiers (SKUs) for Azure API Management: the *v2 tiers*. The new tiers are built on a new, more reliable and scalable platform and are designed to make API Management accessible to a broader set of customers and offer flexible options for a wider variety of scenarios. ++Currently in preview, the following v2 tiers are available: ++* **Basic v2** - The Basic v2 tier is designed for development and testing scenarios, and is supported with an SLA. In the Basic v2 tier, the developer portal is an optional add-on. ++* **Standard v2** - Standard v2 is a production-ready tier with support planned for advanced API Management features previously available only in a Premium tier of API Management, including high availability and networking options. ++## Key capabilities ++* **Faster deployment, configuration, and scaling** - Deploy a production-ready API Management instance in minutes. Quickly apply configurations such as certificate and hostname updates. Scale a Basic v2 or Standard v2 instance quickly to up to 10 units to meet the needs of your API management workloads. ++* **Simplified networking** - The Standard v2 tier supports [outbound connections](#networking-options) to network-isolated backends. ++* **More options for production workloads** - The v2 tiers are all supported with an SLA. Upgrade from Basic v2 to Standard v2 to add more production options. ++* **Developer portal options** - Enable the [developer portal](api-management-howto-developer-portal.md) when you're ready to let API consumers discover your APIs. The developer portal is included in the Standard v2 tier, and is an add-on in the Basic v2 tier. ++## Networking options ++In preview, the v2 tiers currently support the following options to limit network traffic from your API Management instance to protected API backends: +++* **Standard v2** ++ **Outbound** - VNet integration to allow your API Management instance to reach API backends that are isolated in a VNet. The API Management gateway, management plane, and developer portal remain publicly accessible from the internet. The VNet must be in the same region as the API Management instance. [Learn more](integrate-vnet-outbound.md). ++ +## Features and limitations ++### API version ++The v2 tiers are supported in API Management API version **2023-03-01-preview** or later. ++### Supported regions ++In preview, the v2 tiers are available in the following regions: ++* East US +* South Central US +* West US +* France Central +* North Europe +* West Europe +* UK South +* Brazil South +* Australia East +* Australia Southeast +* East Asia ++### Feature availability ++Most capabilities of the existing (v1) tiers are planned for the v2 tiers. 
However, the following capabilities aren't supported in the v2 tiers: ++* API Management service configuration using Git +* Back up and restore of API Management instance +* Enabling Azure DDoS Protection ++### Preview limitations ++Currently, the following API Management capabilities are unavailable in the v2 tiers preview and are planned for later release. Where indicated, certain features are planned only for the Standard v2 tier. Features may be enabled during the preview period. +++**Infrastructure and networking** +* Zone redundancy (*Standard v2*) +* Multi-region deployment (*Standard v2*) +* Multiple custom domain names (*Standard v2*) +* Capacity metric +* Autoscaling +* Built-in analytics +* Inbound connection using a private endpoint +* Upgrade to v2 tiers from v1 tiers +* Workspaces ++**Developer portal** +* Delegation of user registration and product subscription +* Reports ++**Gateway** +* Self-hosted gateway (*Standard v2*) +* Management of Websocket APIs +* Rate limit by key and quota by key policies +* Cipher configuration +* Client certificate renegotiation +* Requests to the gateway over localhost ++## Deployment ++Deploy an instance of the Basic v2 or Standard v2 tier using the Azure portal, Azure REST API, or Azure Resource Manager or Bicep template. ++## Frequently asked questions ++### Q: Can I migrate from my existing API Management instance to a new v2 tier instance? ++A: No. Currently you can't migrate an existing API Management instance (in the Consumption, Developer, Basic, Standard, or Premium tier) to a new v2 tier instance. Currently the new tiers are available for newly created service instances only. ++### Q: What's the relationship between the stv2 compute platform and the v2 tiers? ++A: They're not related. stv2 is a [compute platform](compute-infrastructure.md) version of the Developer, Basic, Standard, and Premium tier service instances. stv2 is a successor to the stv1 platform [scheduled for retirement in 2024](./breaking-changes/stv1-platform-retirement-august-2024.md). ++### Q: Will I still be able to provision Basic or Standard tier services? ++A: Yes, there are no changes to the Basic or Standard tiers. ++### Q: What is the difference between VNet integration in Standard v2 tier and VNet support in the Premium tier? ++A: A Standard v2 service instance can be integrated with a VNet to provide secure access to the backends residing there. A Standard v2 service instance integrated with a VNet will have a public IP address that can be secured separately, via Private Link, if necessary. The Premium tier supports a [fully private integration](api-management-using-with-internal-vnet.md) with a VNet (often referred to as injection into VNet) without exposing a public IP address. ++### Q: Can I deploy an instance of the Basic v2 or Standard v2 tier entirely in my VNet? ++A: No, such a deployment is only supported in the Premium tier. ++## Related content ++* Learn more about the API Management [tiers](api-management-features.md). |
api-management | Virtual Network Concepts | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/virtual-network-concepts.md | Title: Azure API Management with an Azure virtual network -description: Learn about scenarios and requirements to secure your API Management instance using an Azure virtual network. +description: Learn about scenarios and requirements to secure inbound and outbound traffic for your API Management instance using an Azure virtual network. Previously updated : 03/09/2023 Last updated : 09/14/2023 -# Use a virtual network with Azure API Management +# Use a virtual network to secure inbound and outbound traffic for Azure API Management -API Management provides several options to secure access to your API Management instance and APIs using an Azure virtual network. API Management supports the following options, which are mutually exclusive: +API Management provides several options to secure access to your API Management instance and APIs using an Azure virtual network. API Management supports the following options. Available options depend on the [service tier](api-management-features.md) of your API Management instance. -* **Integration (injection)** of the API Management instance into the virtual network, enabling the gateway to access resources in the network. - You can choose one of two integration modes: *external* or *internal*. They differ in whether inbound connectivity to the gateway and other API Management endpoints is allowed from the internet or only from within the virtual network. +* **Injection** of the API Management instance into a subnet in the virtual network, enabling the gateway to access resources in the network. -* **Enabling secure and private inbound connectivity** to the API Management gateway using a *private endpoint*. + You can choose one of two injection modes: *external* or *internal*. They differ in whether inbound connectivity to the gateway and other API Management endpoints is allowed from the internet or only from within the virtual network. +* **Enabling secure and private inbound connectivity** to the API Management gateway using a *private endpoint*. + The following table compares virtual networking options. For more information, see later sections of this article and links to detailed guidance. |Networking model |Supported tiers |Supported components |Supported traffic |Usage scenario | |||||-|-|**[Virtual network - external](#virtual-network-integration)** | Developer, Premium | Developer portal, gateway, management plane, and Git repository | Inbound and outbound traffic can be allowed to internet, peered virtual networks, Express Route, and S2S VPN connections. | External access to private and on-premises backends -|**[Virtual network - internal](#virtual-network-integration)** | Developer, Premium | Developer portal, gateway, management plane, and Git repository. | Inbound and outbound traffic can be allowed to peered virtual networks, Express Route, and S2S VPN connections. | Internal access to private and on-premises backends -|**[Inbound private endpoint](#inbound-private-endpoint)** | Developer, Basic, Standard, Premium | Gateway only (managed gateway supported, self-hosted gateway not supported). | Only inbound traffic can be allowed from internet, peered virtual networks, Express Route, and S2S VPN connections. 
| Secure client connection to API Management gateway | +|**[Virtual network injection - external](#virtual-network-injection)** | Developer, Premium | Developer portal, gateway, management plane, and Git repository | Inbound and outbound traffic can be allowed to internet, peered virtual networks, Express Route, and S2S VPN connections. | External access to private and on-premises backends +|**[Virtual network injection - internal](#virtual-network-injection)** | Developer, Premium | Developer portal, gateway, management plane, and Git repository. | Inbound and outbound traffic can be allowed to peered virtual networks, Express Route, and S2S VPN connections. | Internal access to private and on-premises backends +|**[Inbound private endpoint](#inbound-private-endpoint)** | Developer, Basic, Standard, Premium | Gateway only (managed gateway supported, self-hosted gateway not supported). | Only inbound traffic can be allowed from internet, peered virtual networks, Express Route, and S2S VPN connections. | Secure client connection to API Management gateway | ++ -## Virtual network integration -With Azure virtual networks (VNets), you can place ("inject") your API Management instance in a non-internet-routable network to which you control access. In a virtual network, your API Management instance can securely access other networked Azure resources and also connect to on-premises networks using various VPN technologies. To learn more about Azure VNets, start with the information in the [Azure Virtual Network Overview](../virtual-network/virtual-networks-overview.md). +## Virtual network injection +With VNet injection, deploy ("inject") your API Management instance in a subnet in a non-internet-routable network to which you control access. In the virtual network, your API Management instance can securely access other networked Azure resources and also connect to on-premises networks using various VPN technologies. To learn more about Azure VNets, start with the information in the [Azure Virtual Network Overview](../virtual-network/virtual-networks-overview.md). You can use the Azure portal, Azure CLI, Azure Resource Manager templates, or other tools for the configuration. You control inbound and outbound traffic into the subnet in which API Management is deployed by using [network security groups](../virtual-network/network-security-groups-overview.md). For detailed deployment steps and network configuration, see: -* [Connect to an external virtual network using Azure API Management](./api-management-using-with-vnet.md). -* [Connect to an internal virtual network using Azure API Management](./api-management-using-with-internal-vnet.md). +* [Deploy your API Management instance to a virtual network - external mode](./api-management-using-with-vnet.md). +* [Deploy your API Management instance to a virtual network - internal mode](./api-management-using-with-internal-vnet.md). ### Access options Using a virtual network, you can configure the developer portal, API gateway, and other API Management endpoints to be accessible either from the internet (external mode) or only within the VNet (internal mode). * **External** - The API Management endpoints are accessible from the public internet via an external load balancer. The gateway can access resources within the VNet. - :::image type="content" source="media/virtual-network-concepts/api-management-vnet-external.png" alt-text="Diagram showing a connection to external VNet." 
lightbox="media/virtual-network-concepts/api-management-vnet-external.png"::: + :::image type="content" source="media/virtual-network-concepts/api-management-vnet-external.png" alt-text="Diagram showing a connection to external VNet." ::: Use API Management in external mode to access backend services deployed in the virtual network. Using a virtual network, you can configure the developer portal, API gateway, an * Manage your APIs hosted in multiple geographic locations, using a single gateway endpoint. -### Network resource requirements +### Network resource requirements for injection -The following are virtual network resource requirements for API Management. Some requirements differ depending on the version (`stv2` or `stv1`) of the [compute platform](compute-infrastructure.md) hosting your API Management instance. +The following are virtual network resource requirements for API Management injection into a VNet. Some requirements differ depending on the version (`stv2` or `stv1`) of the [compute platform](compute-infrastructure.md) hosting your API Management instance. #### [stv2](#tab/stv2) One example is to deploy an API Management instance in an internal virtual netwo :::image type="content" source="media/virtual-network-concepts/api-management-application-gateway.png" alt-text="Diagram showing Application Gateway in front of API Management instance." lightbox="media/virtual-network-concepts/api-management-application-gateway.png"::: -For more information, see [Integrate API Management in an internal virtual network with Application Gateway](api-management-howto-integrate-internal-vnet-appgateway.md). +For more information, see [Deploy API Management in an internal virtual network with Application Gateway](api-management-howto-integrate-internal-vnet-appgateway.md). ## Next steps For more information, see [Integrate API Management in an internal virtual netwo Learn more about: Virtual network configuration with API Management:-* [Connect to an external virtual network using Azure API Management](./api-management-using-with-vnet.md). -* [Connect to an internal virtual network using Azure API Management](./api-management-using-with-internal-vnet.md). +* [Deploy your Azure API Management instance to a virtual network - external mode](./api-management-using-with-vnet.md). +* [Deploy your Azure API Management instance to a virtual network - internal mode](./api-management-using-with-internal-vnet.md). * [Connect privately to API Management using a private endpoint](private-endpoint.md) * [Defend your Azure API Management instance against DDoS attacks](protect-with-ddos-protection.md) |
azure-arc | Upgrade | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-arc/resource-bridge/upgrade.md | This article describes how Arc resource bridge (preview) is upgraded and the two ## Prerequisites -In order to upgrade resource bridge, its status must be online and the [credentials in the appliance VM](maintenance.md#update-credentials-in-the-appliance-vm) must be valid. +In order to upgrade Arc resource bridge, its status must be online and the [credentials in the appliance VM](maintenance.md#update-credentials-in-the-appliance-vm) must be valid. There must be sufficient space on the management machine and appliance VM to download required images (~3.5 GB). For VMware, a new template is created. az arcappliance show --resource-group [REQUIRED] --name [REQUIRED] ## Manual upgrade -Arc resource bridge can be manually upgraded from the management machine. The management machine must have the kubeconfig and appliance configuration files stored locally. Manual upgrade generally takes between 30-90 minutes, depending on network speeds. +Arc resource bridge can be manually upgraded from the management machine. You must meet all upgrade prerequisites before attempting to upgrade. The management machine must have the kubeconfig and appliance configuration files stored locally. Manual upgrade generally takes between 30-90 minutes, depending on network speeds. To manually upgrade your Arc resource bridge, make sure you have installed the latest `az arcappliance` CLI extension by running the extension upgrade command from the management machine: To manually upgrade your resource bridge, use the following command: az arcappliance upgrade <private cloud> --config-file <file path to ARBname-appliance.yaml> ``` -For example: `az arcappliance upgrade vmware --config-file c:\contosoARB01-appliance.yaml` +For example, to upgrade a resource bridge on VMware: `az arcappliance upgrade vmware --config-file c:\contosoARB01-appliance.yaml` ++For example, to upgrade a resource bridge on Azure Stack HCI, run: `az arcappliance upgrade hci --config-file c:\contosoARB01-appliance.yaml` ## Private cloud providers Partner products that use Arc resource bridge may choose to handle upgrades differently, including enabling cloud-managed upgrade by default. This article will be updated to reflect any such changes. -[Azure Arc VM management (preview) on Azure Stack HCI](/azure-stack/hci/manage/azure-arc-vm-management-overview) handles upgrades across all components as a "validated recipe" package, and upgrades are applied using the LCM tool. You must manually apply the packaged upgrade using the LCM tool. +[Azure Arc VM management (preview) on Azure Stack HCI](/azure-stack/hci/manage/azure-arc-vm-management-overview) supports upgrade of an Arc resource bridge on Azure Stack HCI, version 22H2 up until Arc resource bridge version 1.0.14 and `az arcappliance` CLI extension version 0.2.33. These upgrades can be done through manual upgrade or a support request for cloud-managed upgrade. For additional upgrades afterwards, you must transition to Azure Stack HCI, version 23H2 (preview). In version 23H2 (preview), the LCM tool manages upgrades across all components as a "validated recipe" package. For more information, visit the [Arc VM management FAQ page](/azure-stack/hci/manage/faqs-arc-enabled-vms). ## Version releases If an Arc resource bridge is unable to be upgraded to a supported version, you m - Learn about [Arc resource bridge maintenance operations](maintenance.md). 
- Learn about [troubleshooting Arc resource bridge](troubleshoot-resource-bridge.md).+ |
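The manual-upgrade flow in the entry above can be summarized in a few commands. The sketch below assumes a VMware deployment and reuses the example configuration file path from the entry; the resource group and appliance name are placeholders.

```azurecli
# Sketch: refresh the arcappliance CLI extension, confirm the resource
# bridge is online, then run the manual upgrade (VMware example).
az extension add --upgrade --name arcappliance
az arcappliance show --resource-group myResourceGroup --name contosoARB01
az arcappliance upgrade vmware --config-file "c:\contosoARB01-appliance.yaml"
```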
azure-arc | License Extended Security Updates | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-arc/servers/license-extended-security-updates.md | Flexibility is critical when enrolling end of support infrastructure in Extended When provisioning WS2012 ESU licenses, you need to specify: * Either virtual core or physical core license-* Standard or datacenter license -* Attest to the number of associated cores (broken down by the number of 2-core and 16-core packs). +* Standard or Datacenter license ++You'll also need to attest to the number of associated cores (broken down by the number of 2-core and 16-core packs). To assist with the license provisioning process, this article provides general guidance and sample customer scenarios for planning your deployment of WS2012 ESUs through Azure Arc. If you choose to license based on virtual cores, the licensing requires a minimu An additional scenario (scenario 1, below) is a candidate for VM/Virtual core licensing when the WS2012 VMs are running on a newer Windows Server host (that is, Windows Server 2016 or later). > [!IMPORTANT]-> In all cases, you are required to attest to their conformance with SA or SPLA. There is no exception for these requirements. Software Assurance or an equivalent Server Subscription is required for you to purchase Extended Security Updates on-premises and in hosted environments. You will be able to purchase Extended Security Updates from Enterprise Agreement (EA), Enterprise Subscription Agreement (EAS), a Server & Cloud Enrollment (SCE), and Enrollment for Education Solutions (EES). On Azure, you do not need Software Assurance to get free Extended Security Updates, but Software Assurance or Server Subscription is required to take advantage of the Azure Hybrid Benefit. -> +> Customers that choose virtual core licensing will always be charged at the Standard edition rate, even if the actual operating system used is Datacenter edition. Additionally, virtual core licensing is not available for physical servers. +> ++### SA/SPLA conformance ++In all cases, you're required to attest to conformance with SA or SPLA. There is no exception for these requirements. Software Assurance or an equivalent Server Subscription is required for you to purchase Extended Security Updates on-premises and in hosted environments. You will be able to purchase Extended Security Updates from Enterprise Agreement (EA), Enterprise Subscription Agreement (EAS), a Server & Cloud Enrollment (SCE), and Enrollment for Education Solutions (EES). On Azure, you do not need Software Assurance to get free Extended Security Updates, but Software Assurance or Server Subscription is required to take advantage of the Azure Hybrid Benefit. ## Cost savings with migration and modernization of workloads |
azure-netapp-files | Enable Continuous Availability Existing SMB | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-netapp-files/enable-continuous-availability-existing-SMB.md | You can enable the SMB Continuous Availability (CA) feature when you [create a n >[!IMPORTANT] > You should enable Continuous Availability for [Citrix App Layering](https://docs.citrix.com/en-us/citrix-app-layering/4.html), SQL Server, and [FSLogix user profile containers](../virtual-desktop/create-fslogix-profile-container.md). Using SMB Continuous Availability shares for any other workload is not supported. This feature is currently supported on Windows SQL Server. Linux SQL Server is not currently supported. > If you are using a non-administrator (domain) account to install SQL Server, ensure that the account has the required security privilege assigned. If the domain account does not have the required security privilege (`SeSecurityPrivilege`), and the privilege cannot be set at the domain level, you can grant the privilege to the account by using the **Security privilege users** field of Active Directory connections. See [Create an Active Directory connection](create-active-directory-connections.md#create-an-active-directory-connection).++>[!IMPORTANT] +> Change notifications are not supported with Continuously Available shares in Azure NetApp Files. ## Steps |
azure-resource-manager | Private Module Registry | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-resource-manager/bicep/private-module-registry.md | To see the published module in the portal: You're now ready to reference the file in the registry from a Bicep file. For examples of the syntax to use for referencing an external module, see [Bicep modules](modules.md). ++## Working with Bicep registry files ++When using Bicep files that are hosted in a remote registry, it's important to understand how your local machine interacts with the registry. When you first declare the reference to the registry, your local editor tries to communicate with the Azure Container Registry and downloads a copy of the module to your local cache. ++The local cache is found in: ++- On Windows ++ ```path + %USERPROFILE%\.bicep\br\<registry-name>.azurecr.io\<module-path>\<tag> + ``` ++- On Linux ++ ```path + /home/<username>/.bicep + ``` ++- On Mac ++ ```path + ~/.bicep + ``` ++Any changes made to the remote registry aren't recognized by your local machine until a `restore` has been run for the file that includes the registry reference. ++```azurecli +az bicep restore --file <bicep-file> [--force] +``` ++For more information, see the [`restore` command](bicep-cli.md#restore). ++ ## Next steps ++* To learn about modules, see [Bicep modules](modules.md). |
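To make the cache behavior described in the entry above concrete, the following sketch pairs a module publish with the `restore` that refreshes the local cache. The registry name, module path, tag, and file names are illustrative assumptions, not values from the article.

```azurecli
# Sketch: publish an updated module to a private registry, then restore the
# Bicep file that references it so the local cache picks up the new version.
az bicep publish --file storage.bicep --target br:exampleregistry.azurecr.io/bicep/modules/storage:v1
az bicep restore --file main.bicep --force
```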
azure-signalr | Howto Private Endpoints | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-signalr/howto-private-endpoints.md | Clients in VNets with existing private endpoints face constraints when accessing This constraint is a result of the DNS changes made when Azure SignalR Service S2 creates a private endpoint. -### Network Security Group rules for subnets with private endpoints --Currently, you can't configure [Network Security Group](../virtual-network/network-security-groups-overview.md) (NSG) rules and user-defined routes for private endpoints. NSG rules applied to the subnet hosting the private endpoint are applied to the private endpoint. A limited workaround for this issue is to implement your access rules for private endpoints on the source subnets, though this approach may require a higher management overhead. - ## Next steps - [Configure Network Access Control](howto-network-access-control.md) |
azure-web-pubsub | Howto Secure Private Endpoints | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-web-pubsub/howto-secure-private-endpoints.md | The Azure Web PubSub service free tier instance cannot integrate with private en Clients in VNets with existing private endpoints face constraints when accessing other Azure Web PubSub service instances that have private endpoints. For instance, suppose a VNet N1 has a private endpoint for an Azure Web PubSub service instance W1. If Azure Web PubSub service W2 has a private endpoint in a VNet N2, then clients in VNet N1 must also access Azure Web PubSub service W2 using a private endpoint. If Azure Web PubSub service W2 does not have any private endpoints, then clients in VNet N1 can access Azure Web PubSub service in that account without a private endpoint. This constraint is a result of the DNS changes made when Azure Web PubSub service W2 creates a private endpoint.--### Network Security Group rules for subnets with private endpoints --Currently, you can't configure [Network Security Group](../virtual-network/network-security-groups-overview.md) (NSG) rules and user-defined routes for private endpoints. NSG rules applied to the subnet hosting the private endpoint are applied to the private endpoint. A limited workaround for this issue is to implement your access rules for private endpoints on the source subnets, though this approach may require a higher management overhead. - |
backup | Backup Support Matrix | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/backup/backup-support-matrix.md | Backup supports the compression of backup traffic, as summarized in the followin **Maximum recovery points per protected instance (machine or workload)** | 9,999 **Maximum expiry time for a recovery point** | No limit **Maximum backup frequency to DPM/MABS** | Every 15 minutes for SQL Server<br/><br/> Once an hour for other workloads-**Maximum backup frequency to vault** | **On-premises Windows machines or Azure VMs running MARS:** Three per day<br/><br/> **DPM/MABS:** Two per day<br/><br/> **Azure VM backup:** One per day +**Maximum backup frequency to vault** | **On-premises Windows machines or Azure VMs running MARS:** Three per day. A maximum of 22 TB of data change is supported between backups.<br/><br/> **DPM/MABS:** Two per day<br/><br/> **Azure VM backup:** One per day **Recovery point retention** | Daily, weekly, monthly, yearly **Maximum retention period** | Depends on backup frequency **Recovery points on DPM/MABS disk** | 64 for file servers; 448 for app servers <br/><br/>Unlimited tape recovery points for on-premises DPM |
cdn | Cdn Features | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cdn/cdn-features.md | The following table compares the features available with each product. | [Query string caching](cdn-query-string.md) | **✓** |**✓** |**✓** |**✓** | | IPv4/IPv6 dual-stack | **✓** |**✓** |**✓** |**✓** | | [HTTP/2 support](cdn-http2.md) | **✓** |**✓** |**✓** |**✓** |+| [Routing preference unmetered](../virtual-network/ip-services/routing-preference-unmetered.md) | |**✓** |**✓** |**✓** | |||| **Security** | **Standard Microsoft** | **Standard Akamai** | **Standard Edgio** | **Premium Edgio** | | HTTPS support with CDN endpoint | **✓** |**✓** |**✓** |**✓** | |
cloud-shell | Quickstart | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cloud-shell/quickstart.md | description: Learn how to start using Azure Cloud Shell. ms.contributor: jahelmic Previously updated : 03/06/2023 Last updated : 10/09/2023 tags: azure-resource-manager Title: Quickstart for Azure Cloud Shell # Quickstart for Azure Cloud Shell -This document details how to use Bash and PowerShell in Azure Cloud Shell from the -[Azure portal][03]. +This document details how to get started using Azure Cloud Shell. ++## Prerequisites ++Before you can use Azure Cloud Shell, you must register the **Microsoft.CloudShell** resource +provider. Access to resources is enabled through provider namespaces that must be registered in your +subscription. You only need to register the namespace once per subscription. ++To see all resource providers, and the registration status for your subscription: ++1. Sign in to the [Azure portal][03]. +1. On the Azure portal menu, search for **Subscriptions**. Select it from the available options. +1. Select the subscription you want to view. +1. On the left menu, under **Settings**, select **Resource providers**. +1. In the search box, enter `cloudshell` to search for the resource provider. +1. Select the **Microsoft.CloudShell** resource provider register from the provider list. +1. Select **Register** to change the status from **unregistered** to **Registered**. ++ :::image type="content" source="./media/quickstart/resource-provider.png" alt-text="Screenshot of selecting resource providers in the Azure portal."::: ## Start Cloud Shell Cloud Shell allows you to select either **Bash** or **PowerShell** for your comm ![Screenshot showing the shell selector.][04] -### Registering your subscription with Azure Cloud Shell --Azure Cloud Shell needs access to manage resources. Access is provided through namespaces that must -be registered to your subscription. Use the following commands to register the -**Microsoft.CloudShell** namespace in your subscription: --<!-- markdownlint-disable MD023 --> -<!-- markdownlint-disable MD024 --> -<!-- markdownlint-disable MD051 --> -#### [Azure CLI](#tab/azurecli) --```azurecli-interactive -az account set --subscription <Subscription Name or Id> -az provider register --namespace Microsoft.CloudShell -``` --#### [Azure PowerShell](#tab/powershell) --```azurepowershell-interactive -Select-AzSubscription -SubscriptionId <SubscriptionId> -Register-AzResourceProvider -ProviderNamespace Microsoft.CloudShell -``` -<!-- markdownlint-enable MD023 --> -<!-- markdownlint-enable MD024 --> -<!-- markdownlint-enable MD051 --> ----> [!NOTE] -> You only need to register the namespace once per subscription. - ### Set your subscription 1. List subscriptions you have access to. -<!-- markdownlint-disable MD023 --> -<!-- markdownlint-disable MD024 --> -<!-- markdownlint-disable MD051 --> + <!-- markdownlint-disable MD023 MD024 MD051--> #### [Azure CLI](#tab/azurecli) ```azurecli-interactive Register-AzResourceProvider -ProviderNamespace Microsoft.CloudShell ```azurepowershell-interactive Get-AzSubscription ```-<!-- markdownlint-enable MD023 --> -<!-- markdownlint-enable MD024 --> -<!-- markdownlint-enable MD051 --> -- + <!-- markdownlint-enable MD023 MD024 MD051--> 1. 
Set your preferred subscription: -<!-- markdownlint-disable MD023 --> -<!-- markdownlint-disable MD024 --> -<!-- markdownlint-disable MD051 --> + <!-- markdownlint-disable MD023 MD024 MD051--> #### [Azure CLI](#tab/azurecli) ```azurecli-interactive Register-AzResourceProvider -ProviderNamespace Microsoft.CloudShell ```azurepowershell-interactive Set-AzContext -Subscription <SubscriptionId> ```-<!-- markdownlint-enable MD023 --> -<!-- markdownlint-enable MD024 --> -<!-- markdownlint-enable MD051 --> -+ <!-- markdownlint-enable MD023 MD024 MD051--> > [!TIP] Register-AzResourceProvider -ProviderNamespace Microsoft.CloudShell ### Get a list of Azure commands -<!-- markdownlint-disable MD023 --> -<!-- markdownlint-disable MD024--> -<!-- markdownlint-disable MD051 --> +<!-- markdownlint-disable MD023 MD024 MD051--> #### [Azure CLI](#tab/azurecli) Run the following command to see a list of all Azure CLI commands. Run the following commands to get a list the Azure PowerShell commands that appl cd 'Azure:/My Subscription/WebApps' Get-AzCommand ```-<!-- markdownlint-enable MD023 --> -<!-- markdownlint-enable MD024 --> -<!-- markdownlint-enable MD051 --> +<!-- markdownlint-enable MD023 MD024 MD051--> |
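The resource provider registration described in the entry above can also be performed from a command line if you prefer scripting over the portal steps. This sketch uses the Azure CLI commands that appeared in the earlier version of the article; the subscription name is a placeholder.

```azurecli
# Sketch: register the Microsoft.CloudShell resource provider and verify
# that its state changes to "Registered".
az account set --subscription "My Subscription"
az provider register --namespace Microsoft.CloudShell
az provider show --namespace Microsoft.CloudShell --query registrationState
```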
communication-services | Manage Audio Filters | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/communication-services/how-tos/calling-sdk/manage-audio-filters.md | Title: Outgoing Audio Filters description: Use Azure Communication Services SDKs to set outgoing audio filters. -+ Last updated 07/27/2023 +zone_pivot_groups: acs-plat-ios-android-windows # Manage audio filters Learn how to manage audio processing features with the Azure Communication Servi - A user access token to enable the calling client. For more information, see [Create and manage access tokens](../../quickstarts/identity/access-tokens.md). - Optional: Complete the quickstart to [add voice calling to your application](../../quickstarts/voice-video-calling/getting-started-with-calling.md) ++ [!INCLUDE [Manage Audio Filters Windows](./includes/manage-audio-filters/manage-audio-filters-windows.md)] ## Next steps - [Learn how to manage calls](./manage-calls.md) |
cost-management-billing | Reporting Get Started | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cost-management-billing/costs/reporting-get-started.md | Title: Get started with Cost Management + Billing reporting - Azure -description: This article helps you to get started with Cost Management + Billing to understand, report on, and analyze your invoiced Microsoft Cloud and AWS costs. + Title: Get started with Cost Management reporting - Azure +description: This article helps you to get started with Cost Management to understand, report on, and analyze your invoiced Microsoft Cloud and AWS costs. Last updated 10/18/2022-# Get started with Cost Management + Billing reporting +# Get started with Cost Management reporting -Cost Management + Billing includes several tools to help you understand, report on, and analyze your invoiced Microsoft Cloud and AWS costs. The following sections describe the major reporting components. +Cost Management includes several tools to help you understand, report on, and analyze your invoiced Microsoft Cloud and AWS costs. The following sections describe the major reporting components. ## Cost analysis The app is available for [iOS](https://itunes.apple.com/us/app/microsoft-azure/i - [Explore and analyze costs with cost analysis](quick-acm-cost-analysis.md). - [Analyze Azure costs with the Power BI App](analyze-cost-data-azure-cost-management-power-bi-template-app.md). - [Connect to Microsoft Cost Management data in Power BI Desktop](/power-bi/connect-data/desktop-connect-azure-cost-management).-- [Create and manage exported data](tutorial-export-acm-data.md).+- [Create and manage exported data](tutorial-export-acm-data.md). |
cost-management-billing | Exchange And Refund Azure Reservations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cost-management-billing/reservations/exchange-and-refund-azure-reservations.md | Azure has the following policies for cancellations, exchanges, and refunds. - The new reservation's lifetime commitment should equal or be greater than the returned reservation's remaining commitment. Example: for a three-year reservation that's $100 per month and exchanged after the 18th payment, the new reservation's lifetime commitment should be $1,800 or more (paid monthly or upfront). - The new reservation purchased as part of exchange has a new term starting from the time of exchange. - There's no penalty or annual limits for exchanges.+- Exchanges will be unavailable for all compute reservations - Azure Reserved Virtual Machine Instances, Azure Dedicated Host reservations, and Azure App Services reservations - purchased on or after **January 1, 2024**. Compute reservations purchased **prior to January 1, 2024** will reserve the right to **exchange one more time** after the policy change goes into effect. For more information about the exchange policy change, see [Changes to the Azure reservation exchange policy](reservation-exchange-policy-changes.md). **Refund policies** |
cost-management-billing | Reservation Trade In | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cost-management-billing/savings-plan/reservation-trade-in.md | Although compute reservation exchanges will end on January 1, 2024, noncompute r - You must have owner access on the Reservation Order to trade in an existing reservation. You can [Add or change users who can manage a savings plan](manage-savings-plan.md#who-can-manage-a-savings-plan). - To trade in a reservation for a savings plan, you must have Azure RBAC Owner permission on the subscription you plan to use to purchase a savings plan. - EA Admin write permission or Billing profile contributor and higher, which are Cost Management + Billing permissions, are supported only for direct Savings plan purchases. They can't be used for savings plan purchases as a part of a reservation trade-in.+- The new savings plan's lifetime commitment should equal or be greater than the returned reservation's remaining commitment. Example: for a three-year reservation that's $100 per month and exchanged after the 18th payment, the new savings plan's lifetime commitment should be $1,800 or more (paid monthly or upfront). - Microsoft isn't currently charging early termination fees for reservation trade-ins. We might begin charging fees in the future. We currently don't have a date for enabling the fee. ## How to trade in an existing reservation |
data-factory | How To Create Custom Event Trigger | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/how-to-create-custom-event-trigger.md | Data Factory expects events to follow the [Event Grid event schema](../event-gri 1. Select your custom topic from the Azure subscription dropdown or manually enter the event topic scope. > [!NOTE]- > To create or modify a custom event trigger in Data Factory, you need to use an Azure account with appropriate role-based access control (Azure RBAC). No additional permission is required. The Data Factory service principle does *not* require special permission to your Event Grid. For more information about access control, see the [Role-based access control](#role-based-access-control) section. + > To create or modify a custom event trigger in Data Factory, you need to use an Azure account with appropriate role-based access control (Azure RBAC). No additional permission is required. The Data Factory service principal does *not* require special permission to your Event Grid. For more information about access control, see the [Role-based access control](#role-based-access-control) section. 1. The **Subject begins with** and **Subject ends with** properties allow you to filter for trigger events. Both properties are optional. |
defender-for-cloud | Defender For Containers Vulnerability Assessment Azure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/defender-for-cloud/defender-for-containers-vulnerability-assessment-azure.md | Container vulnerability assessment powered by Qualys has the following capabilit | Recommendation | Description | Assessment Key |--|--|--| | [Container registry images should have vulnerability findings resolved (powered by Qualys)](https://ms.portal.azure.com/#view/Microsoft_Azure_Security_CloudNativeCompute/ContainerRegistryRecommendationDetailsBlade/assessmentKey/dbd0cb49-b563-45e7-9724-889e799fa648)| Container image vulnerability assessment scans your registry for security vulnerabilities and exposes detailed findings for each image. Resolving the vulnerabilities can greatly improve your containers' security posture and protect them from attacks. | dbd0cb49-b563-45e7-9724-889e799fa648 |- | [Running container images should have vulnerability findings resolved (powered by Qualys)](https://ms.portal.azure.com/#view/Microsoft_Azure_Security_CloudNativeCompute/KubernetesRuntimeVisibilityRecommendationDetailsBlade/assessmentKey/41503391-efa5-47ee-9282-4eff6131462c)ΓÇ»| Container image vulnerability assessment scans container images running on your Kubernetes clusters for security vulnerabilities and exposes detailed findings for each image. Resolving the vulnerabilities can greatly improve your containers' security posture and protect them from attacks. | 41503391-efa5-47ee-9282-4eff6131462c/ | + | [Running container images should have vulnerability findings resolved (powered by Qualys)](https://ms.portal.azure.com/#view/Microsoft_Azure_Security_CloudNativeCompute/KubernetesRuntimeVisibilityRecommendationDetailsBlade/assessmentKey/41503391-efa5-47ee-9282-4eff6131462c)ΓÇ»| Container image vulnerability assessment scans container images running on your Kubernetes clusters for security vulnerabilities and exposes detailed findings for each image. Resolving the vulnerabilities can greatly improve your containers' security posture and protect them from attacks. | 41503391-efa5-47ee-9282-4eff6131462c | - **Query vulnerability information via the Azure Resource Graph** - Ability to query vulnerability information via the [Azure Resource Graph](/azure/governance/resource-graph/overview#how-resource-graph-complements-azure-resource-manager). Learn how to [query recommendations via the ARG](review-security-recommendations.md#review-recommendation-data-in-azure-resource-graph-arg). |
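For the Azure Resource Graph capability mentioned in the entry above, a query along the following lines can surface the registry image vulnerability assessment results. The exact query shape is an assumption for illustration (and it requires the `resource-graph` CLI extension); only the assessment key comes from the entry itself.

```azurecli
# Sketch: query Azure Resource Graph for the registry image vulnerability
# assessment by its assessment key (requires the resource-graph extension).
az graph query -q "securityresources | where type == 'microsoft.security/assessments' | where name == 'dbd0cb49-b563-45e7-9724-889e799fa648' | project id, properties.status.code"
```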
defender-for-cloud | Permissions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/defender-for-cloud/permissions.md | Title: User roles and permissions description: This article explains how Microsoft Defender for Cloud uses role-based access control to assign permissions to users and identify the permitted actions for each role. Previously updated : 03/06/2023 Last updated : 10/09/2023 # User roles and permissions Defender for Cloud assesses the configuration of your resources to identify secu In addition to the built-in roles, there are two roles specific to Defender for Cloud: - **Security Reader**: A user that belongs to this role has read-only access to Defender for Cloud. The user can view recommendations, alerts, a security policy, and security states, but can't make changes.-- **Security Admin**: A user that belongs to this role has the same access as the Security Reader and can also update the security policy, dismiss alerts and recommendations, and apply recommendations.+- **Security Admin**: A user that belongs to this role has the same access as the Security Reader and can also update the security policy, and dismiss alerts and recommendations. We recommend that you assign the least permissive role needed for users to complete their tasks. For example, assign the Reader role to users who only need to view information about the security health of a resource but not take action, such as applying recommendations or editing policies. |
defender-for-cloud | Upcoming Changes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/defender-for-cloud/upcoming-changes.md | Title: Important upcoming changes description: Upcoming changes to Microsoft Defender for Cloud that you might need to be aware of and for which you might need to plan Previously updated : 09/04/2023 Last updated : 10/09/2023 # Important upcoming changes to Microsoft Defender for Cloud > [!IMPORTANT] -> The information on this page relates to pre-release products or features, which may be substantially modified before they are commercially released, if ever. Microsoft makes no commitments or warranties, express or implied, with respect to the information provided here. +> The information on this page relates to pre-release products or features, which might be substantially modified before they are commercially released, if ever. Microsoft makes no commitments or warranties, express or implied, with respect to the information provided here. [Defender for Servers](#defender-for-servers) On this page, you can learn about changes that are planned for Defender for Cloud. It describes planned modifications to the product that might affect things like your secure score or workflows. If you're looking for the latest release notes, you can find them in the [What's | Planned change | Estimated date for change | |--|--| | [Replacing the "Key Vaults should have purge protection enabled" recommendation with combined recommendation "Key Vaults should have deletion protection enabled"](#replacing-the-key-vaults-should-have-purge-protection-enabled-recommendation-with-combined-recommendation-key-vaults-should-have-deletion-protection-enabled) | June 2023|-| [Changes to the Defender for DevOps recommendations environment source and resource ID](#changes-to-the-defender-for-devops-recommendations-environment-source-and-resource-id) | August 2023 | | [Preview alerts for DNS servers to be deprecated](#preview-alerts-for-dns-servers-to-be-deprecated) | August 2023 |-| [Deprecate and replace recommendations App Service Client Certificates](#deprecate-and-replace-recommendations-app-service-client-certificates) | August 2023 | | [Classic connectors for multicloud will be retired](#classic-connectors-for-multicloud-will-be-retired) | September 2023 |-| [Replacing secret scanning recommendation results in Defender for DevOps from CredScan with GitHub Advanced Security for Azure DevOps powered secret scanning](#replacing-secret-scanning-recommendation-results-in-defender-for-devops-from-credscan-with-github-advanced-security-for-azure-devops-powered-secret-scanning) | September 2023 | | [Change to the Log Analytics daily cap](#change-to-the-log-analytics-daily-cap) | September 2023 |-| [Deprecating and replacing "Microsoft Defender for Storage plan should be enabled" recommendation](#deprecating-and-replacing-microsoft-defender-for-storage-plan-should-be-enabled-recommendation) | September 2023| -| [DevOps Resource Deduplication for Defender for DevOps](#devops-resource-deduplication-for-defender-for-devops) | September 2023 | +| [DevOps Resource Deduplication for Defender for DevOps](#devops-resource-deduplication-for-defender-for-devops) | November 2023 | +| [Changes to Attack Path's Azure Resource Graph table scheme](#changes-to-attack-paths-azure-resource-graph-table-scheme) | November 2023 | | [Defender for Cloud plan and strategy for the Log Analytics agent deprecation](#defender-for-cloud-plan-and-strategy-for-the-log-analytics-agent-deprecation) | August 2024 | 
-### Replacing secret scanning recommendation results in Defender for DevOps from CredScan with GitHub Advanced Security for Azure DevOps powered secret scanning +### Replacing the "Key Vaults should have purge protection enabled" recommendation with combined recommendation "Key Vaults should have deletion protection enabled" ++**Estimated date for change: June 2023** ++The `Key Vaults should have purge protection enabled` recommendation is deprecated from the (regulatory compliance dashboard/Azure security benchmark initiative) and replaced with a new combined recommendation `Key Vaults should have deletion protection enabled`. ++| Recommendation name | Description | Effect(s) | Version | +|--|--|--|--| +| [Key vaults should have deletion protection enabled](https://ms.portal.azure.com/#view/Microsoft_Azure_Policy/PolicyDetailBlade/definitionId/%2Fproviders%2FMicrosoft.Authorization%2FpolicyDefinitions%2F0b60c0b2-2dc2-4e1c-b5c9-abbed971de53)| A malicious insider in your organization can potentially delete and purge key vaults. Purge protection protects you from insider attacks by enforcing a mandatory retention period for soft deleted key vaults. No one inside your organization or Microsoft will be able to purge your key vaults during the soft delete retention period. | audit, deny, disabled | [2.0.0](https://github.com/Azure/azure-policy/blob/master/built-in-policies/policyDefinitions/Key%20Vault/KeyVault_Recoverable_Audit.json) | ++See the [full index of Azure Policy built-in policy definitions for Key Vault](../key-vault/policy-reference.md) -**Estimated date for change: September 2023** +### Preview alerts for DNS servers to be deprecated -Currently, the recommendations for secret scanning in Azure DevOps repositories by Defender for DevOps are based on the results of CredScan, which is manually run using the Microsoft Security DevOps Extension. However, this mechanism of running secret scanning is being deprecated in September 2023. Instead, you can see secret scanning results generated by GitHub Advanced Security for Azure DevOps (GHAzDO). +**Estimated date for change: August 2023** -As GHAzDO enters Public Preview, we're working towards unifying the secret scanning experience across both GitHub Advanced Security and GHAzDO. This unification enables you to receive detections across all branches, git history, and secret leak protection via push protection to your repositories. This process can all be done with a single button press, without requiring any pipeline runs. +Following quality improvement process, security alerts for DNS servers are set to be deprecated in August. For cloud resources, use [Azure DNS](defender-for-dns-introduction.md) to receive the same security value. -For more information about GHAzDO Secret Scanning, see [Set up secret scanning](/azure/devops/repos/security/configure-github-advanced-security-features#set-up-secret-scanning). 
+The following table lists the alerts to be deprecated: ++| AlertDisplayName | AlertType | +|--|--| +| Communication with suspicious random domain name (Preview) | DNS_RandomizedDomain +| Communication with suspicious domain identified by threat intelligence (Preview) | DNS_ThreatIntelSuspectDomain | +| Digital currency mining activity (Preview) | DNS_CurrencyMining | +| Network intrusion detection signature activation (Preview) | DNS_SuspiciousDomain | +| Attempted communication with suspicious sinkholed domain (Preview) | DNS_SinkholedDomain | +| Communication with possible phishing domain (Preview) | DNS_PhishingDomain| +| Possible data transfer via DNS tunnel (Preview) | DNS_DataObfuscation | +| Possible data exfiltration via DNS tunnel (Preview) | DNS_DataExfiltration | +| Communication with suspicious algorithmically generated domain (Preview) | DNS_DomainGenerationAlgorithm | +| Possible data download via DNS tunnel (Preview) | DNS_DataInfiltration | +| Anonymity network activity (Preview) | DNS_DarkWeb | +| Anonymity network activity using web proxy (Preview) | DNS_DarkWebProxy | ### Classic connectors for multicloud will be retired How to migrate to the native security connectors: - [Connect your AWS account to Defender for Cloud](quickstart-onboard-aws.md) - [Connect your GCP project to Defender for Cloud](quickstart-onboard-gcp.md) -### Defender for Cloud plan and strategy for the Log Analytics agent deprecation +### Change to the Log Analytics daily cap ++Azure monitor offers the capability to [set a daily cap](../azure-monitor/logs/daily-cap.md) on the data that is ingested on your Log analytics workspaces. However, Defender for Cloud security events are currently not supported in those exclusions. ++Starting on September 18, 2023 the Log Analytics Daily Cap will no longer exclude the following set of data types: ++- WindowsEvent +- SecurityAlert +- SecurityBaseline +- SecurityBaselineSummary +- SecurityDetection +- SecurityEvent +- WindowsFirewall +- MaliciousIPCommunication +- LinuxAuditLog +- SysmonEvent +- ProtectionStatus +- Update +- UpdateSummary +- CommonSecurityLog +- Syslog ++At that time, all billable data types will be capped if the daily cap is met. This change improves your ability to fully contain costs from higher-than-expected data ingestion. ++Learn more about [workspaces with Microsoft Defender for Cloud](../azure-monitor/logs/daily-cap.md#workspaces-with-microsoft-defender-for-cloud). -**Estimated date for change: August 2024** -The Azure Log Analytics agent, also known as the Microsoft Monitoring Agent (MMA) will be [retired in August 2024.](https://azure.microsoft.com/updates/were-retiring-the-log-analytics-agent-in-azure-monitor-on-31-august-2024/) As a result, features of the two Defender for Cloud plans that rely on the Log Analytics agent are impacted, and they have updated strategies: [Defender for Servers](#defender-for-servers) and [Defender for SQL Server on machines](#defender-for-sql-server-on-machines). #### Key strategy points The Azure Log Analytics agent, also known as the Microsoft Monitoring Agent (MMA - Defender for Servers MMA-based features and capabilities will be deprecated in their Log Analytics version in August 2024, and delivered over alternative infrastructures, before the MMA deprecation date. - In addition, the currently shared autoprovisioning process that provides the installation and configuration of both agents (MMA/AMA), will be adjusted accordingly. 
+ #### Defender for Servers The following table explains how each capability will be provided after the Log Analytics agent retirement: To ensure the security of your servers and receive all the security updates from Following that, plan your migration plan according to your organization requirements:  -||Azure Monitor agent (AMA) required (for Defender for SQL or other scenarios)|FIM/EPP discovery/Baselined is required as part of Defender for Server|What should I do| -| -- | -- | -- | -- | -| |No |Yes |You can remove MMA starting April 2024, using GA version of Defender for Server capabilities according to your needs (preview versions will be available earlier)  | -| |No |No |You can remove MMA starting now | -| |Yes |No |You can start migration from MMA to AMA now | -| |Yes |Yes |You can either start migration from MMA to AMA starting April 2024 or alternatively, you can use both agents side by side starting now. | +|Azure Monitor agent (AMA) required (for Defender for SQL or other scenarios)|FIM/EPP discovery/Baselined is required as part of Defender for Server|What should I do| +| -- | -- | -- | +|No |Yes |You can remove MMA starting April 2024, using GA version of Defender for Server capabilities according to your needs (preview versions will be available earlier)  | +|No |No |You can remove MMA starting now | +|Yes |No |You can start migration from MMA to AMA now | +|Yes |Yes |You can either start migration from MMA to AMA starting April 2024 or alternatively, you can use both agents side by side starting now. | **Customers with Log analytics Agent** **(MMA) enabled**  The following section describes the planned introduction of a new and improved S | SQL-targeted AMA autoprovisioning GA release | December 2023 | GA release of a SQL-targeted AMA autoprovisioning process. Following the release, it will be defined as the default option for all new customers. | | MMA deprecation | August 2024 | The current MMA autoprovisioning process and its related policy initiative will be deprecated. It can still be used customers, but they won't be eligible for support. | -### Replacing the "Key Vaults should have purge protection enabled" recommendation with combined recommendation "Key Vaults should have deletion protection enabled" --**Estimated date for change: June 2023** --The `Key Vaults should have purge protection enabled` recommendation is deprecated from the (regulatory compliance dashboard/Azure security benchmark initiative) and replaced with a new combined recommendation `Key Vaults should have deletion protection enabled`. --| Recommendation name | Description | Effect(s) | Version | -|--|--|--|--| -| [Key vaults should have deletion protection enabled](https://ms.portal.azure.com/#view/Microsoft_Azure_Policy/PolicyDetailBlade/definitionId/%2Fproviders%2FMicrosoft.Authorization%2FpolicyDefinitions%2F0b60c0b2-2dc2-4e1c-b5c9-abbed971de53)| A malicious insider in your organization can potentially delete and purge key vaults. Purge protection protects you from insider attacks by enforcing a mandatory retention period for soft deleted key vaults. No one inside your organization or Microsoft will be able to purge your key vaults during the soft delete retention period. 
| audit, deny, disabled | [2.0.0](https://github.com/Azure/azure-policy/blob/master/built-in-policies/policyDefinitions/Key%20Vault/KeyVault_Recoverable_Audit.json) | --See the [full index of Azure Policy built-in policy definitions for Key Vault](../key-vault/policy-reference.md) --### Changes to the Defender for DevOps recommendations environment source and resource ID --**Estimated date for change: August 2023** --The Security DevOps recommendations will be updated to align with the overall Microsoft Defender for Cloud features and experience. Affected recommendations will point to a new recommendation source environment and have an updated resource ID. --Security DevOps recommendations impacted: --- Code repositories should have code scanning findings resolved (preview)-- Code repositories should have secret scanning findings resolved (preview)-- Code repositories should have dependency vulnerability scanning findings resolved (preview)-- Code repositories should have infrastructure as code scanning findings resolved (preview)-- GitHub repositories should have code scanning enabled (preview)-- GitHub repositories should have Dependabot scanning enabled (preview)-- GitHub repositories should have secret scanning enabled (preview)--The recommendation environment source will be updated from `Azure` to `AzureDevOps` or `GitHub`. --The format for resource IDs will be changed from: --`Microsoft.SecurityDevOps/githubConnectors/owners/repos/` --To: --`Microsoft.Security/securityConnectors/devops/azureDevOpsOrgs/projects/repos` -`Microsoft.Security/securityConnectors/devops/gitHubOwners/repos` --As a part of the migration, source code management system specific recommendations will be created for security findings: --- GitHub repositories should have code scanning findings resolved (preview)-- GitHub repositories should have secret scanning findings resolved (preview)-- GitHub repositories should have dependency vulnerability scanning findings resolved (preview)-- GitHub repositories should have infrastructure as code scanning findings resolved (preview)-- GitHub repositories should have code scanning enabled (preview)-- GitHub repositories should have Dependabot scanning enabled (preview)-- GitHub repositories should have secret scanning enabled (preview)-- Azure DevOps repositories should have code scanning findings resolved (preview)-- Azure DevOps repositories should have secret scanning findings resolved (preview)-- Azure DevOps repositories should have infrastructure as code scanning findings resolved (preview)--Customers that rely on the `resourceID` to query DevOps recommendation data will be affected. For example, Azure Resource Graph queries, workbooks queries, API calls to Microsoft Defender for Cloud. --Queries will need to be updated to include both the old and new `resourceID` to show both, for example, total over time. --Additionally, customers that have created custom queries using the DevOps workbook will need to update the assessment keys for the impacted DevOps security recommendations. The template DevOps workbook is planned to be updated to reflect the new recommendations, although during the actual migration, customers may experience some errors with the workbook. --The experience on the recommendations page will be impacted and require customers to query under "All recommendations" to view the new DevOps recommendations. For Azure DevOps, deprecated assessments may continue to show for a maximum of 14 days if new pipelines are not run. 
Refer to [Defender for DevOps Common questions](/azure/defender-for-cloud/faq-defender-for-devops#why-don-t-i-see-recommendations-for-findings-) for details. --### Preview alerts for DNS servers to be deprecated --**Estimated date for change: August 2023** --Following quality improvement process, security alerts for DNS servers are set to be deprecated in August. For cloud resources, use [Azure DNS](defender-for-dns-introduction.md) to receive the same security value. --The following table lists the alerts to be deprecated: --| AlertDisplayName | AlertType | -|--|--| -| Communication with suspicious random domain name (Preview) | DNS_RandomizedDomain -| Communication with suspicious domain identified by threat intelligence (Preview) | DNS_ThreatIntelSuspectDomain | -| Digital currency mining activity (Preview) | DNS_CurrencyMining | -| Network intrusion detection signature activation (Preview) | DNS_SuspiciousDomain | -| Attempted communication with suspicious sinkholed domain (Preview) | DNS_SinkholedDomain | -| Communication with possible phishing domain (Preview) | DNS_PhishingDomain| -| Possible data transfer via DNS tunnel (Preview) | DNS_DataObfuscation | -| Possible data exfiltration via DNS tunnel (Preview) | DNS_DataExfiltration | -| Communication with suspicious algorithmically generated domain (Preview) | DNS_DomainGenerationAlgorithm | -| Possible data download via DNS tunnel (Preview) | DNS_DataInfiltration | -| Anonymity network activity (Preview) | DNS_DarkWeb | -| Anonymity network activity using web proxy (Preview) | DNS_DarkWebProxy | --### Deprecate and replace recommendations App Service Client Certificates --**Estimated date for change: August 2023** --App Service policies are set to be deprecated and replaced so that they only monitor apps using HTTP 1.1 since HTTP 2.0 on App Service doesn't support client certificates. The existing policies that enforce client certificates require an additional check to determine if Http 2.0 is being used by the app. Adding this additional check requires a change to the policy "effect" from Audit to AuditIfNotExists. Policy "effect" changes require deprecation of the old version of the policy and the creation of a replacement. --Policies in this scope: --- App Service apps should have Client Certificates (Incoming client certificates) enabled-- App Service app slots should have Client Certificates (Incoming client certificates) enabled-- Function apps should have Client Certificates (Incoming client certificates) enabled-- Function app slots should have Client Certificates (Incoming client certificates) enabled--Customers who are currently using this policy will need to ensure they have the new policies with similar names enabled and assigned to their intended scope. --### Change to the Log Analytics daily cap --Azure monitor offers the capability to [set a daily cap](../azure-monitor/logs/daily-cap.md) on the data that is ingested on your Log analytics workspaces. However, Defender for Cloud security events are currently not supported in those exclusions. --Starting on September 18, 2023 the Log Analytics Daily Cap will no longer exclude the following set of data types: --- WindowsEvent-- SecurityAlert-- SecurityBaseline-- SecurityBaselineSummary-- SecurityDetection-- SecurityEvent-- WindowsFirewall-- MaliciousIPCommunication-- LinuxAuditLog-- SysmonEvent-- ProtectionStatus-- Update-- UpdateSummary-- CommonSecurityLog-- Syslog--At that time, all billable data types will be capped if the daily cap is met. 
This change improves your ability to fully contain costs from higher-than-expected data ingestion. --Learn more about [workspaces with Microsoft Defender for Cloud](../azure-monitor/logs/daily-cap.md#workspaces-with-microsoft-defender-for-cloud). +### DevOps Resource Deduplication for Defender for DevOps -## Deprecating and replacing "Microsoft Defender for Storage plan should be enabled" recommendation +**Estimated date for change: November 2023** -**Estimated date for change: September 2023** +To improve the Defender for DevOps user experience and enable further integration with Defender for Cloud's rich set of capabilities, Defender for DevOps will no longer support duplicate instances of a DevOps organization to be onboarded to an Azure tenant. -The recommendation `Microsoft Defender for Storage plan should be enabled` will be deprecated on public clouds and will remain available on Azure Government cloud. This recommendation will be replaced by a new recommendation: `Microsoft Defender for Storage plan should be enabled with Malware Scanning and Sensitive Data Threat Detection`. This recommendation ensures that Defender for Storage is enabled at the subscription level with malware scanning and sensitive data threat detection capabilities. +If you don't have an instance of a DevOps organization onboarded more than once to your organization, no further action is required. If you do have more than one instance of a DevOps organization onboarded to your tenant, the subscription owner will be notified and will need to delete the DevOps Connector(s) they don't want to keep by navigating to Defender for Cloud Environment Settings. -| Policy Name | Description | Policy Effect | Version | -|--|--|--|--| -| [Microsoft Defender for Storage should be enabled](https://ms.portal.azure.com/#view/Microsoft_Azure_Policy/PolicyDetailBlade/definitionId/%2fproviders%2fMicrosoft.Authorization%2fpolicyDefinitions%2f640d2586-54d2-465f-877f-9ffc1d2109f4) | Microsoft Defender for Storage detects potential threats to your storage accounts. It helps prevent the three major impacts on your data and workload: malicious file uploads, sensitive data exfiltration, and data corruption. The new Defender for Storage plan includes malware scanning and sensitive data threat detection.This plan also provides a predictable pricing structure (per storage account) for control over coverage and costs. | Audit, disabled | 1.0.0 | +Customers will have until November 14, 2023 to resolve this issue. After this date, only the most recent DevOps Connector created where an instance of the DevOps organization exists will remain onboarded to Defender for DevOps. For example, if Organization Contoso exists in both connectorA and connectorB, and connectorB was created after connectorA, then connectorA will be removed from Defender for DevOps. -Learn more about [Microsoft Defender for Storage](defender-for-storage-introduction.md). +### Changes to Attack Path's Azure Resource Graph table scheme -### DevOps Resource Deduplication for Defender for DevOps +**Estimated date for change: November 2023** -**Estimated date for change: September 2023** +The Attack Path's Azure Resource Graph (ARG) table scheme will be updated. The `attackPathType` property wil be removed and additional properties will be added. 
-To improve the Defender for DevOps user experience and enable further integration with Defender for Cloud's rich set of capabilities, Defender for DevOps will no longer support duplicate instances of a DevOps organization to be onboarded to an Azure tenant. +### Defender for Cloud plan and strategy for the Log Analytics agent deprecation -If you don't have an instance of a DevOps organization onboarded more than once to your organization, no further action is required. If you do have more than one instance of a DevOps organization onboarded to your tenant, the subscription owner will be notified and will need to delete the DevOps Connector(s) they don't want to keep by navigating to Defender for Cloud Environment Settings. +**Estimated date for change: August 2024** -Customers will have until September 30, 2023 to resolve this issue. After this date, only the most recent DevOps Connector created where an instance of the DevOps organization exists will remain onboarded to Defender for DevOps. For example, if Organization Contoso exists in both connectorA and connectorB, and connectorB was created after connectorA, then connectorA will be removed from Defender for DevOps. +The Azure Log Analytics agent, also known as the Microsoft Monitoring Agent (MMA) will be [retired in August 2024.](https://azure.microsoft.com/updates/were-retiring-the-log-analytics-agent-in-azure-monitor-on-31-august-2024/) As a result, features of the two Defender for Cloud plans that rely on the Log Analytics agent are impacted, and they have updated strategies: [Defender for Servers](#defender-for-servers) and [Defender for SQL Server on machines](#defender-for-sql-server-on-machines). ## Next steps |
event-grid | Event Domains | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/event-grid/event-domains.md | Title: Event Domains in Azure Event Grid description: This article describes how to use event domains to manage the flow of custom events to your various business organizations, customers, or applications. Previously updated : 11/17/2022 Last updated : 10/09/2023 # Understand event domains for managing Event Grid topics -An event domain is a management tool for large number of Event Grid topics related to the same application. You can think of it as a meta-topic that can have thousands of individual topics. It allows an event publisher to publish events to thousands of topics at the same time. Domains also give you authentication and authorization control over each topic so you can partition your tenants. This article describes how to use event domains to manage the flow of custom events to your various business organizations, customers, or applications. Use event domains to: +An event domain is a management tool for large number of Event Grid topics related to the same application. You can think of it as a meta-topic that can have thousands of individual topics. It provides one publishing endpoint for all the topics in the domain. When publishing an event, the publisher must specify the target topic in the domain to which it wants to publish. The publisher can send an array or a batch of events where events are sent to different topics in the domain. See the [Publishing events to an event domain](#publishing-to-an-event-domain) section for details. ++Domains also give you authentication and authorization control over each topic so you can partition your tenants. This article describes how to use event domains to manage the flow of custom events to your various business organizations, customers, or applications. Use event domains to: * Manage multitenant eventing architectures at scale. * Manage your authentication and authorization. * Partition your topics without managing each individually. * Avoid individually publishing to each of your topic endpoints. +> [!NOTE] +> Event domain is not intended to support broadcast scenario where an event is sent to a domain and each topic in the domain receives a copy of the event. When publishing events, the publisher must specify the target topic in the domain to which it wants to publish. If the publisher wants to publish the same event payload to multiple topics in the domain, the publisher needs to duplicate the event payload, and change the topic name, and publish them to Event Grid using the domain endpoint, either individually or as a batch. + ## Example use case [!INCLUDE [domain-example-use-case.md](./includes/domain-example-use-case.md)] Subscribing to events for a topic within an event domain is the same as [creatin ### Domain scope subscriptions -Event domains also allow for domain-scope subscriptions. An event subscription on an event domain will receive all events sent to the domain regardless of the topic the events are sent to. Domain scope subscriptions can be useful for management and auditing purposes. +Event domains also allow for domain-scope subscriptions. An event subscription on an event domain receives all events sent to the domain regardless of the topic the events are sent to. Domain scope subscriptions can be useful for management and auditing purposes. 
## Publishing to an event domain -When you create an event domain, you're given a publishing endpoint similar to if you had created a topic in Event Grid. To publish events to any topic in an event domain, push the events to the domain's endpoint the [same way you would for a custom topic](./post-to-custom-topic.md). The only difference is that you must specify the topic you'd like the event to be delivered to. For example, publishing the following array of events would send event with `"id": "1111"` to topic `foo` while the event with `"id": "2222"` would be sent to topic `bar`: +When you create an event domain, you're given a publishing endpoint similar to if you had created a topic in Event Grid. To publish events to any topic in an event domain, push the events to the domain's endpoint the [same way you would for a custom topic](./post-to-custom-topic.md). The only difference is that you must specify the topic you'd like the event to be delivered to. For example, publishing the following array of events would send event with `"id": "1111"` to topic `foo` while the event with `"id": "2222"` would be sent to topic `bar`. +++# [Event Grid event schema](#tab/event-grid-event-schema) +When using the **Event Grid event schema**, specify the name of the Event Grid topic in the domain as a value for the `topic` property. In the following example, `topic` property is set to `foo` for the first event and to `bar` for the second event. ```json [{ When you create an event domain, you're given a publishing endpoint similar to i "dataVersion": "1.0" }] ```+# [Cloud event schema](#tab/cloud-event-schema) ++When using the **cloud event schema**, specify the name of the Event Grid topic in the domain as a value for the `source` property. In the following example, `source` property is set to `foo` for the first event and to `bar` for the second event. ++If you want to use a different field to specify the intended topic in the domain, specify input schema mapping when creating the domain. For example, if you're using the REST API, use the [properties.inputSchemaMapping](/rest/api/eventgrid/controlplane-preview/domains/create-or-update#jsoninputschemamapping) property when to map that field to `properties.topic`. If you're using the .NET SDK, use [`EventGridJsonInputSchemaMapping `](/dotnet/api/azure.resourcemanager.eventgrid.models.eventgridjsoninputschemamapping). Other SDKs also support the schema mapping. ++```json +[{ + "source": "foo", + "id": "1111", + "type": "maintenanceRequested", + "subject": "myapp/vehicles/diggers", + "time": "2018-10-30T21:03:07+00:00", + "data": { + "make": "Contoso", + "model": "Small Digger" + }, + "specversion": "1.0" +}, +{ + "source": "bar", + "id": "2222", + "type": "maintenanceCompleted", + "subject": "myapp/vehicles/tractors", + "time": "2018-10-30T21:04:12+00:00", + "data": { + "make": "Contoso", + "model": "Big Tractor" + }, + "specversion": "1.0" +}] +``` ++ Event domains handle publishing to topics for you. Instead of publishing events to each topic you manage individually, you can publish all of your events to the domain's endpoint. Event Grid makes sure each event is sent to the correct topic. |
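As a rough illustration of pushing the array above to the domain's endpoint, the following Python sketch posts it with the same `aeg-sas-key` header used for a custom topic. The endpoint and access key are placeholders; each event's `topic` value routes it to the corresponding topic in the domain.

```python
# Sketch only: post the example array to the domain endpoint, the same way you would
# publish to a custom topic. Endpoint and key are placeholders; the "topic" field in
# each event selects the topic inside the domain.
import requests

DOMAIN_ENDPOINT = "https://<your-domain>.<region>-1.eventgrid.azure.net/api/events"  # placeholder
DOMAIN_KEY = "<domain-access-key>"  # placeholder

events = [
    {
        "topic": "foo",  # delivered to topic "foo" in the domain
        "id": "1111",
        "eventType": "maintenanceRequested",
        "subject": "myapp/vehicles/diggers",
        "eventTime": "2018-10-30T21:03:07+00:00",
        "data": {"make": "Contoso", "model": "Small Digger"},
        "dataVersion": "1.0",
    },
    {
        "topic": "bar",  # delivered to topic "bar" in the domain
        "id": "2222",
        "eventType": "maintenanceCompleted",
        "subject": "myapp/vehicles/tractors",
        "eventTime": "2018-10-30T21:04:12+00:00",
        "data": {"make": "Contoso", "model": "Big Tractor"},
        "dataVersion": "1.0",
    },
]

response = requests.post(DOMAIN_ENDPOINT, json=events, headers={"aeg-sas-key": DOMAIN_KEY}, timeout=10)
response.raise_for_status()
```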
event-grid | Mqtt Publish And Subscribe Portal | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/event-grid/mqtt-publish-and-subscribe-portal.md | After a successful installation of Step, you should open a command prompt in you 2. On the **Topic spaces** page, select **+ Topic space** on the toolbar. :::image type="content" source="./media/mqtt-publish-and-subscribe-portal/create-topic-space-menu.png" alt-text="Screenshot of Topic spaces page with create button selected." lightbox="./media/mqtt-publish-and-subscribe-portal/create-topic-space-menu.png":::-1. On the **Create topic space** page, enter a name for the topic space. +1. Provide a **name** for the topic space on the **Create topic space** page. 1. Select **+ Add topic template**. :::image type="content" source="./media/mqtt-publish-and-subscribe-portal/create-topic-space-name.png" alt-text="Screenshot of Create topic space with the name."::: |
event-hubs | Event Hubs About | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/event-hubs/event-hubs-about.md | Title: What is Azure Event Hubs? - a Big Data ingestion service -description: Learn about Azure Event Hubs, a Big Data streaming service that ingests millions of events per second. + Title: Azure Event Hubs – data streaming platform with Kafka support +description: Learn about Azure Event Hubs, A real-time data streaming platform with native Apache Kafka support. Previously updated : 03/07/2023 Last updated : 10/09/2023 -# What is Azure Event Hubs? — A big data streaming platform and event ingestion service -Event Hubs is a modern big data streaming platform and event ingestion service that can seamlessly integrate with other Azure and Microsoft services, such as Stream Analytics, Power BI, and Event Grid, along with outside services like Apache Spark. The service can process millions of events per second with low latency. The data sent to an event hub (Event Hubs instance) can be transformed and stored by using any real-time analytics providers or batching or storage adapters. +# Azure Event Hubs – A real-time data streaming platform with native Apache Kafka support +Azure Event Hubs is a cloud native data streaming service that can stream millions of events per second, with low latency, from any source to any destination. Event Hubs is compatible with Apache Kafka, and it enables you to run existing Kafka workloads without any code changes. -## Why use Event Hubs? -Data is valuable only when there's an easy way to process and get timely insights from data sources. Event Hubs provides a distributed stream processing platform with low latency and seamless integration, with data and analytics services inside and outside Azure to build your complete big data pipeline. +Using Event Hubs to ingest and store streaming data, businesses can harness the power of streaming data to gain valuable insights, drive real-time analytics, and respond to events as they happen, enhancing overall efficiency and customer experience. + :::image type="content" source="./media/event-hubs-about/event-streaming-platform.png" alt-text="Diagram that shows how Azure Event Hubs fits in an event streaming platform."::: -Event Hubs represents the "front door" for an event pipeline, often called an **event ingestor** in solution architectures. An event ingestor is a component or service that sits between event publishers and event consumers to decouple the production of events from the consumption of those events. Event Hubs provides a unified streaming platform with time retention buffer, decoupling event producers from event consumers. +Azure Event Hubs is the preferred event ingestion layer of any event streaming solution that you build on top of Azure. It seamlessly integrates with data and analytics services inside and outside Azure to build your complete data streaming pipeline to serve following use cases. -The following sections describe key features of the Azure Event Hubs service: +- [Real-time analytics with Azure Stream Analytics](./process-data-azure-stream-analytics.md) to generate real-time insights from streaming data. +- Analyze and explore streaming data with Azure Data Explorer. +- Create your own cloud native applications, functions, or microservices that run on streaming data from Event Hubs. +- Stream events with schema validation using a built-in schema registry to ensure quality and compatibility of streaming data. 
-## Fully managed PaaS -Event Hubs is a fully managed Platform-as-a-Service (PaaS) with little configuration or management overhead, so you focus on your business solutions. [Event Hubs for Apache Kafka ecosystems](azure-event-hubs-kafka-overview.md) gives you the PaaS Kafka experience without having to manage, configure, or run your clusters. -## Event Hubs for Apache Kafka -Azure Event Hubs for Apache Kafka ecosystems enables [Apache Kafka (1.0 and later)](https://kafka.apache.org/) clients and applications to talk to Event Hubs. You don't need to set up, configure, and manage your own Kafka and Zookeeper clusters or use some Kafka-as-a-Service offering not native to Azure. For more information, see [Event Hubs for Apache Kafka ecosystems](azure-event-hubs-kafka-overview.md). +## Key capabilities? +### Apache Kafka on Azure Event Hubs +Azure Event Hubs is a multi-protocol event streaming engine that natively supports AMQP, Apache Kafka and HTTPs protocols. Since it supports Apache Kafka, you bring Kafka workloads to Azure Event Hubs without doing any code change. You don't need to set up, configure, and manage your own Kafka clusters or use some Kafka-as-a-Service offering not native to Azure. -## Schema Registry in Azure Event Hubs -Schema Registry in Event Hubs provides a centralized repository for managing schemas of events streaming applications. Azure Schema Registry comes free with every Event Hubs namespace, and it integrates seamlessly with your Kafka applications or Event Hubs SDK based applications. +Event Hubs is built from the ground up as a cloud native broker engine. Hence you can run Kafka workloads with better performance, better cost efficiency and with no operational overhead. -It ensures data compatibility and consistency across event producers and consumers, enabling seamless schema evolution, validation, and governance, and promoting efficient data exchange and interoperability. For more information, see [Schema Registry in Azure Event Hubs](schema-registry-overview.md). +### Schema Registry in Azure Event Hubs +Azure Schema Registry in Event Hubs provides a centralized repository for managing schemas of events streaming applications. Azure Schema Registry comes free with every Event Hubs namespace, and it integrates seamlessly with your Kafka applications or Event Hubs SDK based applications. -## Support for real-time and batch processing -Ingest, buffer, store, and process your stream in real time to get actionable insights. Event Hubs uses a [partitioned consumer model](event-hubs-scalability.md#partitions), enabling multiple applications to process the stream concurrently and letting you control the speed of processing. Azure Event Hubs also integrates with [Azure Functions](../azure-functions/index.yml) for a serverless architecture. -## Capture event data -Capture your data in near-real time in an [Azure Blob storage](https://azure.microsoft.com/services/storage/blobs/) or [Azure Data Lake Storage](https://azure.microsoft.com/services/data-lake-store/) for long-term retention or micro-batch processing. You can achieve this behavior on the same stream you use for deriving real-time analytics. Setting up capture of event data is fast. There are no administrative costs to run it, and it scales automatically with Event Hubs [throughput units](event-hubs-scalability.md#throughput-units) or [processing units](event-hubs-scalability.md#processing-units). Event Hubs enables you to focus on data processing rather than on data capture. 
For more information, see [Event Hubs Capture](event-hubs-capture-overview.md). -## Scalable -With Event Hubs, you can start with data streams in megabytes, and grow to gigabytes or terabytes. The [Autoinflate](event-hubs-auto-inflate.md) feature is one of the many options available to scale the number of throughput units or processing units to meet your usage needs. +It ensures data compatibility and consistency across event producers and consumers. Schema Registry enables seamless schema evolution, validation, and governance, and promoting efficient data exchange and interoperability. +Schema Registry seamlessly integrates with your existing Kafka applications and it supports multiple schema definitions formats including Avro and JSON Schemas. ++### Real-time event stream processing with Azure Stream Analytics +Event Hubs integrates seamlessly with Azure Stream Analytics to enable real-time stream processing. With the built-in no-code editor, you can effortlessly develop a Stream Analytics job using drag-and-drop functionality, without writing any code. + -## Rich ecosystem -With a broad ecosystem available for the industry-standard AMQP 1.0 protocol and SDKs available in various languages: [.NET](https://github.com/Azure/azure-sdk-for-net/), [Java](https://github.com/Azure/azure-sdk-for-java/), [Python](https://github.com/Azure/azure-sdk-for-python/), [JavaScript](https://github.com/Azure/azure-sdk-for-js/), you can easily start processing your streams from Event Hubs. All supported client languages provide low-level integration. The ecosystem also provides you with seamless integration with Azure services like Azure Stream Analytics and Azure Functions and thus enables you to build serverless architectures. +Alternatively, developers can use the SQL-based Stream Analytics query language to perform real-time stream processing and take advantage of a wide range of functions for analyzing streaming data. +### Exploring streaming data with Azure Data Explorer +Azure Data Explorer is a fully managed platform for big data analytics that delivers high performance and allows for the analysis of large volumes of data in near real time. By integrating Event Hubs with Azure Data Explorer, you can easily perform near real-time analytics and exploration of streaming data. -## Event Hubs premium and dedicated -Event Hubs **premium** caters to high-end streaming needs that require superior performance, better isolation with predictable latency, and minimal interference in a managed multitenant PaaS environment. On top of all the features of the standard offering, the premium tier offers several extra features such as [dynamic partition scale up](dynamically-add-partitions.md), extended retention, and [customer-managed-keys](configure-customer-managed-key.md). For more information, see [Event Hubs Premium](event-hubs-premium-overview.md). -Event Hubs **dedicated** tier offers single-tenant deployments for customers with the most demanding streaming needs. This single-tenant offering has a guaranteed 99.99% SLA and is available only on our dedicated pricing tier. An Event Hubs cluster can ingress millions of events per second with guaranteed capacity and subsecond latency. Namespaces and event hubs created within the dedicated cluster include all features of the premium offering and more. For more information, see [Event Hubs Dedicated](event-hubs-dedicated-overview.md). -For more information, see [comparison between Event Hubs tiers](event-hubs-quotas.md). 
+### Rich ecosystem– Azure functions, SDKs and Kafka ecosystem +Ingest, buffer, store, and process your stream in real time to get actionable insights. Event Hubs uses a partitioned consumer model, enabling multiple applications to process the stream concurrently and letting you control the speed of processing. Azure Event Hubs also integrates with Azure Functions for a serverless architecture. +With a broad ecosystem available for the industry-standard AMQP 1.0 protocol and SDKs available in various languages: .NET, Java, Python, JavaScript, you can easily start processing your streams from Event Hubs. All supported client languages provide low-level integration. -## Event Hubs on Azure Stack Hub -The Event Hubs service on Azure Stack Hub allows you to realize hybrid cloud scenarios. Streaming and event-based solutions are supported for both on-premises and Azure cloud processing. Whether your scenario is hybrid (connected), or disconnected, your solution can support processing of events/streams at large scale. Your scenario is only bound by the Event Hubs cluster size, which you can provision according to your needs. +The ecosystem also provides you with seamless integration Azure Functions, Azure Spring Apps, Kafka Connectors and other data analytics platforms and technologies such as Apache Spark and Apache Flink. -The Event Hubs editions (on Azure Stack Hub and on Azure) offer a high degree of feature parity. This parity means SDKs, samples, PowerShell, CLI, and portals offer a similar experience, with few differences. -For more information, see [Event Hubs on Azure Stack Hub overview](/azure-stack/user/event-hubs-overview). +### Flexible and cost-efficient event streaming +You can experience flexible and cost-efficient event streaming through Event Hubs' diverse selection of tiers – including Standard, Premium, and Dedicated. These options cater to data streaming needs ranging from a few MB/s to several GB/s, allowing you to choose the perfect match for your requirements. -## Key architecture components -Event Hubs contains the following key components. +### Scalable +With Event Hubs, you can start with data streams in megabytes, and grow to gigabytes or terabytes. The [Autoinflate](event-hubs-auto-inflate.md) feature is one of the many options available to scale the number of throughput units or processing units to meet your usage needs. ++### Capture streaming data for long term retention and batch analytics +Capture your data in near-real time in an Azure Blob storage or Azure Data Lake Storage for long-term retention or micro-batch processing. You can achieve this behavior on the same stream you use for deriving real-time analytics. Setting up capture of event data is fast. -| Component | Description | -| | -- | -| Event producers | Any entity that sends data to an event hub. Event publishers can publish events using HTTPS or AMQP 1.0 or Apache Kafka (1.0 and higher). | -| Partitions | Each consumer only reads a specific subset, or a partition, of the message stream. | -| Consumer groups | A view (state, position, or offset) of an entire event hub. Consumer groups enable consuming applications to each have a separate view of the event stream. They read the stream independently at their own pace and with their own offsets. | -| Event receivers | Any entity that reads event data from an event hub. All Event Hubs consumers connect via the AMQP 1.0 session. The Event Hubs service delivers events through a session as they become available. 
All Kafka consumers connect via the Kafka protocol 1.0 and later. | -| [Throughput units (standard tier)](event-hubs-scalability.md#throughput-units) or [processing units (premium tier)](event-hubs-scalability.md#processing-units) or [capacity units (dedicated)](event-hubs-dedicated-overview.md) | Prepurchased units of capacity that control the throughput capacity of Event Hubs. | +## How it works? +Event Hubs provides a unified event streaming platform with time retention buffer, decoupling event producers from event consumers. The producers and consumer applications can perform large scale data ingestion through multiple protocols. -The following figure shows the Event Hubs stream processing architecture: -![Event Hubs](./media/event-hubs-about/event_hubs_architecture.png) +The following figure shows the key components of Event Hubs architecture: +The key functional components of Event Hubs include: +- **Event Hub/Kafka topic**: In Event Hubs, you can organize events into event hubs or Kafka topic. It's an append only distributed log, which can comprise of one or more partitions. +- **Partitions** are used to scale an event hub. They are like lanes in a freeway. If you need more streaming throughput, you need to add more partitions. +- **Producer applications** can ingest data to an event hub using Event Hubs SDKs or any Kafka producer client. +- **Consumer applications** consume data by seeking through the event log and maintaining consumer offset. Consumers can be based on Kafka consumer clients or Event Hubs SDK as well. +- **Consumer Group** is a logical group of consumer instances that reads data from an event hub/Kafka topic. It enables multiple consumers to read the same streaming data in an event hub independently at their own pace and with their own offsets. +- **Namespace** is the management container for one or more event hubs or Kafka topics. The management tasks such as allocating streaming capacity, configuring network security, enabling Geo Disaster recovery etc. are handled at the namespace level. -> [!NOTE] -> For more information, see [Event Hubs features or components](event-hubs-features.md). ## Next steps -To get started using Event Hubs, see the **Send and receive events** tutorials: +To get started using Event Hubs, see the following quick start guides. +### Stream data using Event Hubs SDK (AMQP) +You can use any of the following samples to stream data to Event Hubs using SDKs. - [.NET Core](event-hubs-dotnet-standard-getstarted-send.md) - [Java](event-hubs-java-get-started-send.md) - [Spring](/azure/developer/java/spring-framework/configure-spring-cloud-stream-binder-java-app-azure-event-hub?toc=/azure/event-hubs/TOC.json) To get started using Event Hubs, see the **Send and receive events** tutorials: - [C](event-hubs-c-getstarted-send.md) (send only) - [Apache Storm](event-hubs-storm-getstarted-receive.md) (receive only) +### Stream data using Apache Kafka +You can use following samples to stream data from your Kafka applications to Event Hubs. +- [Using Event Hubs with Kafka applications](event-hubs-java-get-started-send.md) ++### Schema validation with Schema Registry +You can use Event Hubs Schema Registry to perform schema validation for your event streaming applications. -To learn more about Event Hubs, see the following articles: +- [Schema validation for Kafka applications](schema-registry-kafka-java-send-receive-quickstart.md) -- [Event Hubs features overview](event-hubs-features.md)-- [Frequently asked questions](event-hubs-faq.yml). |
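To make the producer, consumer, and consumer-group concepts concrete, here is a minimal sketch using the `azure-eventhub` Python SDK (v5). The connection string, event hub name, and payload are placeholders, and the consumer call blocks until interrupted.

```python
# Sketch only: send a small batch and read it back with the azure-eventhub SDK (v5).
# Connection string, event hub name, and payload are placeholders.
from azure.eventhub import EventData, EventHubConsumerClient, EventHubProducerClient

CONN_STR = "<event-hubs-namespace-connection-string>"  # placeholder
EVENT_HUB = "<event-hub-name>"                         # placeholder

# Producer: events are appended to one of the event hub's partitions.
producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=EVENT_HUB)
with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"deviceId": "sensor-1", "temperature": 21.5}'))
    producer.send_batch(batch)

# Consumer: a consumer group reads the stream at its own pace, tracking its own offset.
def on_event(partition_context, event):
    print(partition_context.partition_id, event.body_as_str())

consumer = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=EVENT_HUB
)
with consumer:
    # "-1" starts from the beginning of the partition; this call blocks until interrupted.
    consumer.receive(on_event=on_event, starting_position="-1")
```

The same event hub can also be produced to and consumed from with standard Kafka clients, since the namespace exposes a Kafka endpoint.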
hdinsight-aks | Cluster Storage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/cluster-storage.md | + + Title: Introduction to cluster storage +description: Understand how Azure HDInsight on AKS integrates with Azure Storage ++ Last updated : 08/3/2023+++# Introduction to cluster storage +++Azure HDInsight on AKS can seamlessly integrate with Azure Storage, which is a general-purpose storage solution that works well with many other Azure services. +Azure Data Lake Storage Gen2 (ADLS Gen 2) is the default file system for the clusters. ++The storage account could be used as the default location for data, cluster logs, and other output that are generated during cluster operation. It could also be a default storage for the Hive catalog that depends on the cluster type. ++For more information, see [Introduction to Azure Data Lake Storage Gen2](/azure/storage/blobs/create-data-lake-storage-account). ++## Managed identities for secure file access ++Azure HDInsight on AKS uses managed identities (MSI) to secure cluster access to files in Azure Data Lake Storage Gen2. Managed identity is a feature of Azure Active Directory that provides Azure services with a set of automatically managed credentials. These credentials can be used to authenticate to any service that supports Active Directory authentication. Moreover, managed identities don't require you to store credentials in code or configuration files. ++In Azure HDInsight on AKS, once you select a managed identity and storage during cluster creation, the managed identity can seamlessly work with storage for data management, provided the **Storage Blob Data Owner** role is assigned to the user-assigned MSI. ++The following table outlines the supported storage options for Azure HDInsight on AKS (public preview): ++|Cluster Type|Supported Storage|Connection|Role on Storage| +||||| +|Trino, Apache Flink, and Apache Spark |ADLS Gen2|Cluster user-assigned managed identity (MSI) | The user-assigned MSI needs to have **Storage Blob Data Owner** role on the storage account.| ++> [!NOTE] +> To share a storage account across multiple clusters, you can just assign the corresponding cluster user-assigned MSI ΓÇ£Storage Blob Data OwnerΓÇ¥ on the shared storage account. Learn how to [assign a role](/azure/role-based-access-control/role-assignments-portal#step-2-open-the-add-role-assignment-page). ++After that, you can use the full storage `abfs://` path to access the data via your applications. ++For more information, see [Managed identities for Azure resources](/azure/active-directory/managed-identities-azure-resources/overview). +<br>Learn how to [create an ADLS Gen2 account](/azure/storage/blobs/create-data-lake-storage-account). ++## Azure HDInsight on AKS storage architecture ++The following diagram provides an abstract view of the Azure HDInsight on AKS architecture of Azure Storage. +++### Storage management ++Currently, Azure HDInsight on AKS doesn't support storage accounts with soft delete enabled, make sure you disable soft delete for your storage account. + |
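As an illustration of the `abfs://` access pattern described above, the following PySpark sketch reads a file from ADLS Gen2 on a cluster whose user-assigned MSI holds the Storage Blob Data Owner role; the container, storage account, and file path are placeholders.

```python
# Sketch only: read data over the abfs:// path from a Spark cluster whose
# user-assigned MSI has the Storage Blob Data Owner role on the account.
# The container, storage account, and file path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-gen2-read").getOrCreate()

path = "abfs://<container>@<storage-account>.dfs.core.windows.net/data/sample.csv"  # placeholder
df = spark.read.option("header", "true").csv(path)
df.show(5)
```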
hdinsight-aks | Concept Azure Monitor Integration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/concept-azure-monitor-integration.md | + + Title: Metrics and monitoring in HDInsight on AKS +description: Learn about how HDInsight on AKS interacts with Azure Monitoring. ++ Last updated : 08/29/2023+++# Azure Monitor integration +++HDInsight on AKS offers an integration with Azure Monitor that can be used to monitor cluster pools and their clusters. ++Azure Monitor collects metrics and logs from multiple resources into an Azure Monitor Log Analytics workspace, which presents the data as structured, queryable tables that can be used to configure custom alerts. Azure Monitor logs provide an excellent overall experience for monitoring workloads and interacting with logs, especially if you have multiple clusters. ++Setting up Azure Monitor alerts is easy and beneficial. These alerts are triggered when the value of a metric or the results of a query meet certain conditions. A severity level for the alert can be added in addition to the name. The ability to specify severity is a powerful tool that can be used when creating multiple alerts. You can learn more about this topic and how to set up alerts [here](/azure/azure-monitor/alerts/alerts-log). ++The integration offers various other capabilities such as a flexible canvas for data analysis and the creation of rich visual reports using [Workbooks](/azure/azure-monitor/visualize/workbooks-overview) and as well cross cluster monitoring. ++Azure HDInsight on AKS comes with integrated monitoring experience with Azure services like Azure managed Prometheus along with Azure managed Grafana dashboards for monitoring. ++- Azure Managed Prometheus is a service that monitors your cloud environments. The monitoring is to maintain their availability, performance, and cluster metrics. It collects data generated by resources in your Azure instances and from other monitoring tools. The data is used to provide analysis across multiple sources. +- Azure Managed Grafana is a data visualization platform built on top of the Grafana software by Grafana Labs. It's built as a fully managed Azure service operated and supported by Microsoft. Grafana helps you bring together metrics, logs, and traces into a single user interface. With its extensive support for data sources and graphing capabilities, you can view and analyze your application and infrastructure telemetry data in real-time. ++HDInsight on AKS also offers an out-of-box monitoring feature that provides premade dashboards based on cluster and service health information on top of the Azure monitor integration for more flexibility and a better visualization experience. ++For more information +- [How to enable log analytics](how-to-azure-monitor-integration.md). +- [Using Azure managed Prometheus & Grafana](monitor-with-prometheus-grafana.md) |
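For the Log Analytics side of this integration, a hedged sketch of querying the workspace from Python with the `azure-monitor-query` package follows. The workspace ID and the table used in the KQL are placeholders, since the actual table names depend on what the integration emits.

```python
# Sketch only: run a KQL query against the Log Analytics workspace used by the integration.
# Requires: pip install azure-identity azure-monitor-query
# Workspace ID and the table referenced in the query are placeholders.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query="AzureDiagnostics | take 10",           # placeholder table and query
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```

The same kind of query can back a custom alert rule, with a severity chosen per the alerting guidance linked above.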
hdinsight-aks | Concept Security | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/concept-security.md | + + Title: Security in HDInsight on AKS +description: An introduction to security with managed identity from Azure Active Directory in HDInsight on AKS. ++ Last updated : 08/29/2023+++# Overview of enterprise security in Azure HDInsight on AKS +++Azure HDInsight on AKS is secure by default, and there are several methods to address your enterprise security needs. Most of these solutions are activated by default. ++This article covers the overall security architecture and security solutions, dividing them into four traditional security pillars: perimeter security, authentication, authorization, and encryption. ++## Security architecture ++Enterprise readiness for any software requires stringent security checks to prevent and address threats that may arise. HDInsight on AKS provides a multi-layered security model to protect you at multiple layers. The security architecture uses modern authorization methods based on managed identities (MSI). All storage access is through MSI, and database access is through username/password. The password is stored in Azure [Key Vault](../key-vault/general/basic-concepts.md), defined by the customer. This makes the setup robust and secure by default. ++The following diagram illustrates a high-level technical architecture of security in HDInsight on AKS. +++## Enterprise security pillars ++One way of looking at enterprise security is to divide security solutions into four main groups based on the type of control. These groups are also called security pillars and are of the following types: perimeter security, authentication, authorization, and encryption. ++### Perimeter security ++Perimeter security in HDInsight on AKS is achieved through [virtual networks](../hdinsight/hdinsight-plan-virtual-network-deployment.md). An enterprise admin can create a cluster inside a virtual network (VNET) and use [network security groups (NSG)](./secure-traffic-by-nsg.md) to restrict access to the virtual network. ++### Authentication ++HDInsight on AKS provides Azure Active Directory-based authentication for cluster login and uses managed identities (MSI) to secure cluster access to files in Azure Data Lake Storage Gen2. Managed identity is a feature of Azure Active Directory that provides Azure services with a set of automatically managed credentials. With this setup, enterprise employees can sign in to the cluster nodes by using their domain credentials. +A managed identity from Azure Active Directory (Azure AD) allows your app to easily access other Azure AD-protected resources such as Azure Key Vault, Storage, SQL Server, and Database. The identity is managed by the Azure platform and doesn't require you to provision or rotate any secrets. +This solution is key to securing access to your HDInsight on AKS cluster and other dependent resources. Managed identities make your app more secure by eliminating secrets from your app, such as credentials in the connection strings. ++As part of the cluster creation process, you create a user-assigned managed identity, a standalone Azure resource that manages the access to your dependent resources. ++### Authorization ++A best practice most enterprises follow is making sure that not every employee has full access to all enterprise resources. Likewise, the admin can define role-based access control policies for the cluster resources. ++The resource owners can configure role-based access control (RBAC). Configuring RBAC policies allows you to associate permissions with a role in the organization. This layer of abstraction makes it easier to ensure people have only the permissions needed to perform their work responsibilities. +Authorization for cluster management (control plane) is managed by ARM roles, and cluster data access (data plane) is managed by [cluster access management](./hdinsight-on-aks-manage-authorization-profile.md). +#### Cluster management roles (Control Plane / ARM Roles) ++|Action |HDInsight on AKS Cluster Pool Admin | HDInsight on AKS Cluster Admin| +|-|-|-| +|Create / Delete cluster pool |✅ | | +|Assign permission and roles on the cluster pool |✅| | +|Create/delete cluster |✅| ✅ | +| **Manage Cluster**| | ✅ | +| Configuration Management | |✅| +| Script actions | |✅| +| Library Management | |✅| +| Monitoring | |✅| +| Scaling actions | |✅| ++The above roles are from the ARM operations perspective. For more information, see [Grant a user access to Azure resources using the Azure portal - Azure RBAC](../role-based-access-control/quickstart-assign-role-user-portal.md). ++#### Cluster access (Data Plane) ++You can allow users, service principals, and managed identities to access the cluster through the portal or by using ARM. ++This access enables you to: ++* View clusters and manage jobs. +* Perform all the monitoring and management operations. +* Perform auto scale operations and update the node count. + +The access isn't provided for: +* Cluster deletion +++> [!Important] +> Any newly added user requires the additional role of “Azure Kubernetes Service RBAC Reader” to view the [service health](./service-health.md). ++## Auditing ++Auditing cluster resource access is necessary to track unauthorized or unintentional access of the resources. It's as important as protecting the cluster resources from unauthorized access. ++The resource group admin can view and report all access to the HDInsight on AKS cluster resources and data by using the activity log. The admin can also view and report changes to the access control policies. ++## Encryption ++Protecting data is important for meeting organizational security and compliance requirements. Along with restricting access to data from unauthorized employees, you should encrypt it. The storage and the disks (OS disk and persistent data disk) used by the cluster nodes and containers are encrypted. Data in Azure Storage is encrypted and decrypted transparently using 256-bit AES encryption, one of the strongest block ciphers available, and is FIPS 140-2 compliant. Azure Storage encryption is enabled for all storage accounts, which makes data secure by default. You don't need to modify your code or applications to take advantage of Azure Storage encryption. Encryption of data in transit is handled with TLS 1.2. ++## Compliance ++Azure compliance offerings are based on various types of assurances, including formal certifications, attestations, validations, and authorizations; assessments produced by independent third-party auditing firms; and contractual amendments, self-assessments, and customer guidance documents produced by Microsoft. For HDInsight on AKS compliance information, see the Microsoft [Trust Center](https://www.microsoft.com/trust-center?rtc=1) and the [Overview of Microsoft Azure compliance](/samples/browse/). ++## Shared responsibility model ++The following image summarizes the major system security areas and the security solutions that are available to you. It also highlights which security areas are your responsibilities as a customer and which areas are the responsibility of HDInsight on AKS as the service provider. +++The following table provides links to resources for each type of security solution. ++|Security area |Solutions available |Responsible party| +|-|-|-| +|Data Access Security |[Configure access control lists (ACLs)](../storage/blobs/data-lake-storage-access-control.md) for Azure Data Lake Storage Gen2 |Customer| +| |Enable the [Secure transfer required](../storage/common/storage-require-secure-transfer.md) property on storage|Customer| +| |Configure [Azure Storage firewalls](../storage/common/storage-network-security.md) and virtual networks|Customer| +|Operating system security|Create clusters with the most recent HDInsight on AKS versions|Customer| +|Network security| Configure a [virtual network](../hdinsight/hdinsight-plan-virtual-network-deployment.md)|| +| | Configure [Traffic using Firewall rules](./secure-traffic-by-firewall.md)|Customer| +| | Configure [Outbound traffic required](./required-outbound-traffic.md) |Customer| |
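As a concrete illustration of the note above about the "Azure Kubernetes Service RBAC Reader" role, the following Azure CLI sketch shows one way such an assignment can be made. It isn't taken from the article: the assignee and scope are placeholders, and the exact scope to use depends on your environment.

```azurecli
# Placeholder assignee and scope; adjust both to your own environment.
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Azure Kubernetes Service RBAC Reader" \
  --scope "<cluster-resource-id>"
```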
hdinsight-aks | Create Cluster Error Dictionary | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/create-cluster-error-dictionary.md | + + Title: Create a cluster - error dictionary in Azure HDInsight on AKS +description: Learn how to troubleshoot errors that occur when creating Azure HDInsight on AKS clusters ++ Last updated : 08/31/2023+++# Cluster creation errors on Azure HDInsight on AKS ++This article describes how to troubleshoot and resolve errors that could occur when you create Azure HDInsight on AKS clusters. ++|Sr. No|Error message|Cause|Resolution| +|-|--|--|--| +|1|InternalServerError UnrecognizableError|This error could indicate that an incorrect template was used. Currently, database connectors are allowed only through an ARM template, so validation of the configuration isn't possible in the template.| | +|2|InvalidClusterSpec - ServiceDependencyFailure - Invalid configuration|Max memory per node error.|Refer to the maximum memory configurations [property value types](https://trino.io/docs/current/admin/properties-resource-management.html#query-max-memory-per-node).| +|3|WaitingClusterResourcesReadyTimeOut - Metastoreservice unready|This error can occur because the container name may only contain lowercase letters, numbers, and hyphens, and must begin with a letter or a number.|Each hyphen must be preceded and followed by a nonhyphen character. The name must also be between 3 and 63 characters long.| +|4|InvalidClusterSpec -Invalid configuration - ClusterUpsertActivity|Error: Invalid configuration property `hive.metastore.uri: may not be null`.|[Refer to the Hive connector documentation](https://trino.io/docs/current/connector/hive.html#connector-hive--page-root).| +|5|`InternalServerError - An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call`.||Retry the operation or open a support ticket with the Azure HDInsight team.| +|6|`InternalServerError - ObjectDisposedException` occurs in RP code.||Retry the operation or open a support ticket with the Azure HDInsight team.| +|7|`PreconditionFailed` - Operation failure due to quota limits on the user subscription.|There's quota validation before cluster creation. But when several clusters are created under the same subscription at the same time, it's possible that the first cluster occupies the quota and the other fails because of quota shortage.|Confirm there's enough quota and retry cluster/cluster pool creation.| +|8|`ReconcileApplicationSecurityGroupError` - Internal AKS error||Retry the operation or open a support ticket with the Azure HDInsight team.| +|9|`ResourceGroupBeingDeleted`|During HDInsight on AKS resource creation or update, the user is also deleting resources in related resource groups.|Don't delete resources in HDInsight-related resource groups while HDInsight on AKS resources are being created or updated.| +|10|`UpsertNodePoolTimeOut - Async operation dependentArmResourceTask has timed out`.|AKS issue - could be due to high traffic in a particular region at the time of the operation.|Retry the operation after some time. If possible, use another region.| +|11|`Authorization_IdentityNotFound - {"code":null,"message":"The identity of the calling application could not be established."}`|The first-party (1P) service principal isn't onboarded to the tenant.|Execute the command to provision the first-party service principal on the new tenant to onboard it.| +|12|`NotFound - ARM/AKS sdk error`|The user tries to update an HDInsight on AKS cluster, but the corresponding agent pool has been deleted.|The corresponding agent pool has been deleted. It isn't recommended to operate on the AKS agent pool directly.| +|13|`AuthorizationFailed - Scope invalid role assignment issue with managed RG and cluster msi`|Lack of permission to perform the operation.|Check whether the service principal app ID mentioned in the error message is owned by you. If yes, grant the permission according to the error message. If not, open a support ticket with the Azure HDInsight team.| +|14|`DeleteAksClusterFailed - {"code":"DeleteAksClusterFailed","message":"An Azure service request has failed. ErrorCode: 'DeleteAksClusterFailed', ErrorMessage: 'Delete HDI cluster namespcae failed. Additional info: 'Can't access a disposed object.\\r\\nObject name: 'Microsoft.Azure.Common.Configuration.ManagedConfiguration was already disposed'.''."}`|RP switched to a new role instance unexpectedly.|Retry the operation or open a support ticket with the Azure HDInsight team.| +|15|`EntityStoreOperationError - ARM/AKS sdk error`|A database operation failed on the AKS side during cluster update.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|16|`InternalServerError - {"exception":"System.Threading.Tasks.TaskCanceledException","message":"The operation was canceled."}`|This error can be caused by various issues.|Retry the operation or open a support ticket with the Azure HDInsight team.| +|17|`InternalServerError - {"exception":"System.IO.IOException","message":"Unable to read data from the transport connection: A connection attempt failed because the connected party didn't properly respond after a period of time, or established connection failed because connected host has failed to respond."}`|This error can be caused by various issues.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|18|`InternalServerError - Null reference exception occurs in RP code`.|This error can be caused by various issues.|Retry the operation or open a support ticket with the Azure HDInsight team.| +|19|`InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'InvalidOperationException, Sequence contains no elements.'"}`|This error can be caused by various issues.|Retry the operation or open a support ticket with the Azure HDInsight team.| +|20|`InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'ArgumentNullException, Value can't be null. (Parameter 'roleAssignmentGuid')'"}`|This error can be caused by various issues.|Retry the operation or open a support ticket with the Azure HDInsight team.| +|21|`OperationNotAllowed - {"code":"OperationNotAllowed","message":"An Azure service request has failed. ErrorCode: 'OperationNotAllowed', ErrorMessage: 'Service request failed.\\r\\nStatus: 409 (Conflict)\\r\\n\\r\\nContent:\\r\\n{\\ n \\"code\\": \\"OperationNotAllowed\\",\\ n \\"details\\": null,\\ n \\"message\\": \\"Operation isn't allowed: Another agent pool operation (Scaling) is in progress, wait for it to finish before starting a new operation.`|Another agent pool operation (Scaling) is in progress. This error can be caused by an RP Service Fabric reboot.|Wait for the previous operation to finish before starting a new operation. If the issue persists after a retry, open a support ticket with the Azure HDInsight team.| +|22|`ReconcileVMSSAgentPoolFailed`|There's quota validation before cluster creation. But when several clusters are created under the same subscription at the same time, it's possible that the first cluster occupies the quota and the others fail because of quota shortage.|Confirm there's enough quota and retry cluster/cluster pool creation.| +|23|`ReconcileVMSSAgentPoolFailed` - Unable to establish outbound connection from agents|`AKS/VMSS` side issue: VM has reported a failure.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|24|`InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'SqlException'"}`|This error is caused by a transient SQL connection issue.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|25|`NotLatestOperation - ARM/AKS SDK error`|The operation can't proceed. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet).|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|26|`ReconcileVMSSAgentPoolFailed - Agent pool drain failed`|There was an issue with the scaling down operation.|Open a support ticket with the Azure HDInsight team.| +|27|`ResourceNotFound - ARM/AKS SDK error`|This error occurs when a required resource is removed or deleted by the user.|Make sure the resource that is mentioned in the error message exists, then retry the operation. If the issue persists, open a support ticket with the Azure HDInsight team.| +|28|`InvalidClusterSpec - The cluster instance deployment failed with reason 'System.DependencyFailure' and message 'Metastoreservice instance _'xyz'_ has invalid request due to - [Hive metastore storage location access check timed out.]`.|The HMS initialization might time out due to SQL Server or storage-related issues.|Open a support ticket with the Azure HDInsight team.| +|29|`InvalidClusterSpec - The cluster instance deployment failed with reason 'System.DependencyFailure' and message 'Metastoreservice instance '_xyz_' has invalid request due to - [Keyvault secrets weren't configured properly. Failed to fetch secrets from keyvault.]`.|This error can occur due to the key vault being inaccessible or the secret key not being available. In some rare cases, this error might be due to slower initialization of the pod identity infrastructure on the cluster nodes.|If you have Log Analytics enabled, check the logs of the `secretprovider-validate` job to identify the reason. Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|30|`FlinkCluster unready - {"FlinkCluster": "Status can't be determined"}`|This error can occur for various reasons, such as an image pull issue, controller pods not being ready, or an issue with MSI.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|31|`FlinkCluster unready - {"FlinkCluster": "StatefulSet instance 'flink-taskmanager' isn't ready due to - [Ready replicas don't match desired replica count]."}`|This error can occur for various reasons, such as an image pull issue, controller pods not being ready, or an issue with MSI.|Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team.| +|32|`InvalidClusterSpec (class com.microsoft.azure.hdinsight.services.spark.exception.ClusterConfigException:[SparkClusterValidator#ConfigurationValidator#][ISSUE:(1)-Component config valid:[[{serviceName='yarn-service,componentName=hadoop-config-client}, {serviceName='yarn-service,componentName=hadoop-config}]],current:[[{serviceName='yarn-service,componentName=yarn-config}'`.|This error can occur if the service config consists of components that aren't allowed.|Validate the service config components and retry. If the issue persists, open a support ticket with the Azure HDInsight team.| +|33|`InvalidClusterSpec -1,"conditions":[{"type":"RequestIsValid","status":"UNKNOWN","reason":"UNKNOWN","message":"Unable to determine status of one or more dependencies`.|This error can occur when the HMS, Spark, or YARN services aren't up; it could be related to storage.|Open a support ticket with the Azure HDInsight team.| +|34|`WaitingClusterResourcesReadyTimeOut - Failed to reconcile from generation 1 to 1.`||Open a support ticket with the Azure HDInsight team.| +|35|`WaitingClusterResourcesReadyTimeOut - {"YarnService":"StatefulSet instance 'resourcemanager' isn't ready due to - `` see service status for specific details and how to fix it. Failing services are: YarnService, SparkService"}`|This error can occur when the HMS, Spark, or YARN services aren't up; it could be related to storage.|Open a support ticket with the Azure HDInsight team.| +|36|`InvalidClusterSpec - [spec.configs[0].files[3].fileName: Invalid value: "yarn-env.sh": spec.configs[0].files[3].fileName in body should match '(^yarn-site\\.xml$)|(^capacity-scheduler\\.xml$)|(^core-site\\.xml$)|(^mapred-site\\.xml$)', spec.configs[0].files[3].values: Required value, spec.configs[1].files[2].fileName: Invalid value: "yarn-env.sh": spec.configs[1].files[2].fileName in body should match '(^yarn-site\\.xml$)|(^capacity-scheduler\\.xml$)|(^core-site\\.xml$)|(^mapred-site\\.xml$)', spec.configs[1].files[2].values: Required value]`.|This error can occur when unsupported files are passed in the services configuration.|Validate the service config components and retry. If the issue persists, open a support ticket with the Azure HDInsight team.| +|37|`InvalidClusterSpec - ".AccessDeniedException: Operation failed: "Server failed to authenticate the request. InvalidAuthenticationInfo, "Server failed to authenticate the request.."`|Invalid authentication parameters - the storage location is inaccessible.|Correct the authentication parameters and retry. If the issue persists, open a support ticket with the Azure HDInsight team.| +|38|`InvalidClusterSpec - "_xyz_.dfs.core.windows.net isn't accessible. Reason: HTTP Error -1; url=. AzureADAuthenticator.getTokenCall threw java.net.SocketTimeoutException :. AzureADAuthenticator.getTokenCall threw java.net.SocketTimeoutException : Read timed out.]`.|This error can occur when the pod identity resources take too long to start on the node when the HMS pod is scheduled.|Retry the operation. If the issue persists, open a support ticket with the Azure HDInsight team.| ++## Next steps +* [Troubleshoot cluster configuration](./trino/trino-configuration-troubleshoot.md). |
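Several of the errors above (for example, rows 7 and 22) come down to subscription quota. As a quick, illustrative check before retrying, you can list current compute usage in the target region with the Azure CLI; the region below is a placeholder.

```azurecli
# Placeholder region; shows current vCPU usage against limits so you can confirm quota before retrying.
az vm list-usage --location eastus --output table
```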
hdinsight-aks | Create Cluster Using Arm Template Script | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/create-cluster-using-arm-template-script.md | + + Title: Export ARM template in Azure HDInsight on AKS +description: How to create an ARM template for a cluster using a script in Azure HDInsight on AKS ++ Last updated : 08/29/2023+++# Export cluster ARM template using script +++This article describes how to generate an ARM template for your cluster automatically using a script. You can use the ARM template to modify, clone, or recreate a cluster starting from the existing cluster's configurations. ++## Prerequisites ++* An operational HDInsight on AKS cluster. +* Familiarity with [ARM template authoring and deployment](/azure/azure-resource-manager/templates/overview). + +## Steps to generate ARM template for the cluster ++1. Sign in to the [Azure portal](https://portal.azure.com). ++1. In the Azure portal search bar, type "HDInsight on AKS cluster" and select "Azure HDInsight on AKS clusters" from the drop-down list. + + :::image type="content" source="./media/create-cluster-using-arm-template-script/cloud-portal-search.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster." border="true" lightbox="./media/create-cluster-using-arm-template-script/cloud-portal-search.png"::: + +1. Select your cluster name from the list page. + + :::image type="content" source="./media/create-cluster-using-arm-template-script/cloud-portal-list-view.png" alt-text="Screenshot showing selecting the HDInsight on AKS Cluster you require from the list." border="true" lightbox="./media/create-cluster-using-arm-template-script/cloud-portal-list-view.png"::: + +1. Navigate to the overview blade of your cluster and click on *JSON View* at the top right. + + :::image type="content" source="./media/create-cluster-using-arm-template-script/view-cost-json-view.png" alt-text="Screenshot showing how to view cost and JSON View buttons from the Azure portal." border="true" lightbox="./media/create-cluster-using-arm-template-script/view-cost-json-view.png"::: + +1. Copy the "Resource JSON" and save it to a local JSON file. For example, `template.json`. ++1. Click the following button at the top right in the Azure portal to launch Azure Cloud Shell. ++ :::image type="content" source="./media/create-cluster-using-arm-template-script/cloud-shell.png" alt-text="Screenshot showing Cloud Shell icon."::: + +1. Make sure Cloud Shell is set to "Bash" on the top left and upload your `template.json` file. ++ :::image type="content" source="./media/create-cluster-using-arm-template-script/azure-cloud-shell-template-upload.png" alt-text="Screenshot showing how to upload your template.json file." border="true" lightbox="./media/create-cluster-using-arm-template-script/azure-cloud-shell-template-upload.png"::: + +1. Execute the following command to generate the ARM template. ++ ```azurecli + wget https://hdionaksresources.blob.core.windows.net/common/arm_transform.py ++ python arm_transform.py template.json + ``` + + :::image type="content" source="./media/create-cluster-using-arm-template-script/azure-cloud-shell-script-output.png" alt-text="Screenshot showing results after running the script." border="true" lightbox="./media/create-cluster-using-arm-template-script/azure-cloud-shell-script-output.png"::: ++This script creates an ARM template named `template-modified.json` for your cluster and generates a command to deploy the ARM template. ++Now, your cluster ARM template is ready. You can update the properties of the cluster and finally deploy the ARM template to refresh the resources. To redeploy, you can either use the Azure CLI command output by the script or [deploy an ARM template using the Azure portal](/azure/azure-resource-manager/templates/deploy-portal#deploy-resources-from-custom-template). ++> [!IMPORTANT] +> If you're cloning the cluster or creating a new cluster, you'll need to modify the `name`, `location`, and `fqdn` (the fqdn must match the cluster name). |
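The script prints the exact deployment command to run, so treat the following Azure CLI sketch as illustrative only; the resource group name is a placeholder.

```azurecli
# Placeholder resource group; the command emitted by arm_transform.py is the authoritative one.
az deployment group create \
  --resource-group myResourceGroup \
  --template-file template-modified.json
```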
hdinsight-aks | Create Cluster Using Arm Template | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/create-cluster-using-arm-template.md | + + Title: Export cluster ARM template +description: Learn how to create a cluster ARM template ++ Last updated : 08/29/2023+++# Export cluster ARM template +++This article describes how to generate an ARM template from the resource JSON of your cluster. ++## Prerequisites ++* An operational HDInsight on AKS cluster. +* Familiarity with [ARM template authoring and deployment](/azure/azure-resource-manager/templates/overview). ++## Steps to generate ARM template for the cluster ++1. Sign in to the [Azure portal](https://portal.azure.com). + +1. In the Azure portal search bar, type "HDInsight on AKS cluster" and select "Azure HDInsight on AKS clusters" from the drop-down list. + + :::image type="content" source="./media/create-cluster-using-arm-template/portal-search.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster." border="true" lightbox="./media/create-cluster-using-arm-template/portal-search.png"::: + +1. Select your cluster name from the list page. ++ :::image type="content" source="./media/create-cluster-using-arm-template/portal-search-result.png" alt-text="Screenshot showing selecting the HDInsight on AKS Cluster you require from the list." border="true" lightbox="./media/create-cluster-using-arm-template/portal-search-result.png"::: + +1. Go to the overview blade of your cluster and click on *JSON View* at the top right. ++ :::image type="content" source="./media/create-cluster-using-arm-template/view-cost-json-view.png" alt-text="Screenshot showing how to view cost and JSON View buttons from the Azure portal." border="true" lightbox="./media/create-cluster-using-arm-template/view-cost-json-view.png"::: + +1. Copy the response to an editor, for example, Visual Studio Code. +1. Modify the response with the following changes to turn it into a valid ARM template. ++ * Remove the following objects: + * `id`, `systemData` + * `deploymentId`, `provisioningState`, and `status` under the properties object. ++ * Change the "name" value to `<your clusterpool name>/<your cluster name>`. ++ :::image type="content" source="./media/create-cluster-using-arm-template/change-cluster-name.png" alt-text="Screenshot showing how to change cluster name."::: + + * Add `"apiVersion": "2023-06-01-preview"` in the same section as name, location, and so on. ++ :::image type="content" source="./media/create-cluster-using-arm-template/api-version.png" alt-text="Screenshot showing how to modify the API version."::: ++ 1. Open [custom template](/azure/azure-resource-manager/templates/deploy-portal#deploy-resources-from-custom-template) from the Azure portal and select the "Build your own template in the editor" option. + + 1. Copy the modified response to the "resources" object in the ARM template format. For example: ++ :::image type="content" source="./media/create-cluster-using-arm-template/modify-get-response.png" alt-text="Screenshot showing how to modify the get response." border="true" lightbox="./media/create-cluster-using-arm-template/modify-get-response.png"::: ++Now, your cluster ARM template is ready. You can update the properties of the cluster and finally deploy the ARM template to refresh the resources. Learn how to [deploy an ARM template](/azure/azure-resource-manager/templates/deploy-portal). |
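Before deploying a hand-edited template like the one above, it can help to validate it first. This Azure CLI sketch isn't part of the article; the file and resource group names are placeholders.

```azurecli
# Placeholder names; validation reports template errors (for example, a wrong apiVersion) without deploying anything.
az deployment group validate \
  --resource-group myResourceGroup \
  --template-file cluster-template.json
```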
hdinsight-aks | Customize Clusters | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/customize-clusters.md | + + Title: Customize Azure HDInsight on AKS clusters +description: Add custom components to HDInsight on AKS clusters by using script actions. Script actions are Bash scripts that can be used to customize the cluster configuration. ++ Last updated : 08/29/2023+++# Customize Azure HDInsight on AKS clusters using script actions ++ +Azure HDInsight on AKS provides a configuration method called Script Actions, which invokes custom scripts to customize the cluster. These scripts can be used to install additional packages or JARs and change configuration settings. Script actions can be used only during cluster creation; script actions after cluster creation are on the roadmap. Currently, script actions are available only with Spark clusters. ++## Understand script actions ++A script action is a Bash script that runs on the service components in an HDInsight on AKS cluster. ++The characteristics and features of script actions are as follows: ++- The Bash script URI (the location to access the file) has to be publicly accessible from the HDInsight on AKS resource provider and the cluster. +- The following are possible storage locations for scripts: + - An ADLS Gen2 account + - An Azure Storage account (the storage has to be publicly accessible) + - The Bash script URI format for ADLS Gen2 is `abfs://<container>@<datalakestoreaccountname>.dfs.core.windows.net/path_to_file.sh` + - The Bash script URI format for Azure Storage is `wasb://<container>@<azurestorageaccountname>.blob.core.windows.net/path_to_file.sh` +- Script actions can be restricted to run on only certain service component types, for example, Resource Manager, Node Manager, Livy, Jupyter, Zeppelin, and Metastore. +- Script actions are persisted. + - Persisted script actions must have a unique name. + - Persisted scripts are used to customize the service components. + - When the service components are scaled up, the persisted script action is applied to them as well. +- Script actions can accept parameters that are required by the script during execution. +- You need permissions to create a cluster in order to execute script actions. ++ > [!IMPORTANT] + > * Script actions that remove or modify service files on the nodes may impact service health and availability. Use discretion and review scripts before executing them. + > * There's no automatic way to undo the changes made by a script action. ++## Methods for using script actions ++You have the option of configuring a script action to run during cluster creation. ++> [!NOTE] +> Configuration of script actions on an existing cluster is part of the roadmap. ++### Script action during the cluster creation process ++In HDInsight on AKS, the script is automatically persisted. A failure in the script can cause the cluster creation process to fail. ++The following diagram illustrates when the script action runs during the creation process: ++ +**The script runs while the HDInsight on AKS cluster is being provisioned. The script runs in parallel on all the specified nodes in the cluster.** ++> [!IMPORTANT] +> * During cluster creation, you can use many script actions at once. +> * These scripts are invoked in the order in which they were specified, and not in parallel. ++### Next steps ++* How to [manage script actions](./manage-script-actions.md) |
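As an illustration of what such a script might contain (not taken from the article), here is a minimal Bash sketch of a script action that stages an extra JAR; the URL and target directory are placeholders.

```bash
#!/bin/bash
# Hypothetical script action: download an extra connector JAR into a library directory.
# The JAR URL and target directory are placeholders - adapt them to your cluster's layout.
set -euo pipefail

JAR_URL="https://example.org/libs/example-connector-1.0.0.jar"
TARGET_DIR="/opt/custom-libs"

mkdir -p "$TARGET_DIR"
curl -fsSL "$JAR_URL" -o "$TARGET_DIR/$(basename "$JAR_URL")"
echo "Script action finished: $(basename "$JAR_URL") placed in $TARGET_DIR"
```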
hdinsight-aks | Faq | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/faq.md | + + Title: HDInsight on AKS FAQ +description: HDInsight on AKS frequently asked questions. ++ Last updated : 08/29/2023+++# HDInsight on AKS - Frequently asked questions ++This article addresses some common questions about Azure HDInsight on AKS. ++## General ++* What is HDInsight on AKS? ++ HDInsight on AKS is a new HDInsight version, which provides an enterprise-ready managed cluster service with emerging open-source analytics projects like Apache Flink (for streaming), Trino (for ad hoc analytics and BI), and Apache Spark. For more information, see [Overview](./overview.md). ++* What cluster shapes does HDInsight on AKS support? ++ HDInsight on AKS supports Trino, Apache Flink, and Apache Spark to start with. Other cluster shapes, such as Kafka and Hive, are on the roadmap. + +* How do I get started with HDInsight on AKS? ++ To get started, visit Azure Marketplace, search for the Azure HDInsight on AKS service, and refer to [getting started](./quickstart-create-cluster.md). ++* What happens to existing HDInsight on VM and the clusters I'm running today? ++ There are no changes to existing HDInsight (HDInsight on VM). All your existing clusters continue to run, and you can continue to create and scale new HDInsight clusters. ++* Which operating system is supported with HDInsight on AKS? ++ HDInsight on AKS is based on Mariner OS. For more information, see [OS Version](./release-notes/hdinsight-aks-release-notes.md#operating-system-version). ++* In which regions is HDInsight on AKS available? ++ For a list of supported regions, refer to [Region availability](./overview.md#region-availability-public-preview). ++* What's the cost to deploy an HDInsight on AKS cluster? ++ For more information about pricing, see HDInsight on AKS pricing. ++## Cluster management ++* Can I run multiple clusters simultaneously? ++ Yes, you can run as many clusters as you want per cluster pool simultaneously. However, make sure you aren't constrained by the quota for your subscription. The maximum number of nodes allowed in a cluster pool is 250 (in public preview). ++* Can I install or add more plugins/libraries on my cluster? ++ Yes, you can install custom plugins and libraries depending on the cluster shape. + * For Trino, refer to [Install custom plugins](./trino/trino-custom-plugins.md). + * For Spark, refer to [Library management in Spark](./spark/library-management.md). + +* Can I SSH into my cluster? ++ Yes, you can SSH into your cluster via webssh and execute queries and submit jobs directly from there. ++## Metastore ++* Can I use an external metastore to connect to my cluster? ++ Yes, you can use an external metastore. However, we support only Azure SQL Database as an external custom metastore. ++* Can I share a metastore across multiple clusters? ++ Yes, you can share a metastore across multiple HDInsight on AKS clusters. ++* What's the version of Hive metastore supported? ++ Hive metastore version 3.1.2. ++## Workloads ++### Trino ++* What is Trino? ++ Trino is an open-source federated and distributed SQL query engine, which allows you to query data residing on different data sources without moving it to a central data warehouse. + You can query the data using ANSI SQL; there's no need to learn a new language. For more information, see [Trino overview](./trino/trino-overview.md). ++* Which connectors do you support? ++ HDInsight on AKS Trino supports multiple connectors. For more information, see this list of [Trino connectors](./trino/trino-connectors.md). + We keep adding new connectors as they become available in the open-source version. ++* Can I add catalogs to an existing cluster? ++ Yes, you can add supported catalogs to the existing cluster. For more information, see [Add catalogs to an existing cluster](./trino/trino-add-catalogs.md). + +### Apache Flink ++* What is Apache Flink? ++ Apache Flink is a best-in-class open-source analytic engine for stream processing and performing stateful computation over unbounded and bounded data streams. It can perform computations at in-memory speed and at any scale. + Flink on HDInsight on AKS offers managed open-source Apache Flink. For more information, see [Flink overview](./flink/flink-overview.md). ++* Do you support both session and app mode in Apache Flink? ++ In HDInsight on AKS, Flink currently supports session mode clusters. ++* What is state backend management, and how is it done in HDInsight on AKS? ++ Backends determine where state is stored. When checkpointing is activated, state is persisted upon checkpoints to guard against data loss and recover consistently. How the state is represented internally, and how and where it's persisted upon checkpoints, depends on the chosen state backend. For more information, see [Flink overview](./flink/flink-overview.md). ++### Apache Spark ++* What is Apache Spark? ++ Apache Spark is a data processing framework that can quickly perform processing tasks on large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. ++* What language APIs are supported in Spark? ++ Azure HDInsight on AKS supports Python and Scala. ++* Are external metastores supported in HDInsight on AKS Spark? ++ HDInsight on AKS supports external metastore connectivity. Currently, only Azure SQL Database is supported as the external metastore. ++* What are the various ways to submit jobs in HDInsight on AKS Spark? ++ You can submit jobs on HDInsight on AKS Spark using Jupyter Notebook, Zeppelin Notebook, the SDK, and the cluster terminal. For more information, see [Submit and Manage Jobs on a Spark cluster in HDInsight on AKS](./spark/submit-manage-jobs.md) |
hdinsight-aks | Assign Kafka Topic Event Message To Azure Data Lake Storage Gen2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/assign-kafka-topic-event-message-to-azure-data-lake-storage-gen2.md | + + Title: Write event messages into Azure Data Lake Storage Gen2 with DataStream API +description: Learn how to write event messages into Azure Data Lake Storage Gen2 with the DataStream API ++ Last updated : 08/29/2023+++# Write event messages into Azure Data Lake Storage Gen2 with DataStream API +++Apache Flink uses file systems to consume and persistently store data, both for the results of applications and for fault tolerance and recovery. In this article, learn how to write event messages into Azure Data Lake Storage Gen2 with the DataStream API. ++## Prerequisites ++* [HDInsight on AKS Apache Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../../hdinsight/kafk) + * Ensure the network settings are taken care of as described in [Using HDInsight Kafka](../flink/process-and-consume-data.md); that is, make sure HDInsight on AKS Flink and HDInsight Kafka are in the same virtual network +* Use MSI to access ADLS Gen2 +* IntelliJ for development on an Azure VM in the HDInsight on AKS virtual network ++## Apache Flink FileSystem connector ++This filesystem connector provides the same guarantees for both BATCH and STREAMING and is designed to provide exactly-once semantics for STREAMING execution. For more information, see [Flink DataStream Filesystem](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/filesystem). ++## Apache Kafka Connector ++Flink provides an Apache Kafka connector for reading data from and writing data to Kafka topics with exactly-once guarantees.
For more information, see [Apache Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka) ++## Build the project for Apache Flink ++**pom.xml on IntelliJ IDEA** ++``` xml +<properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + <kafka.version>3.2.0</kafka.version> + </properties> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-files --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++**Program for ADLS Gen2 Sink** ++*abfsGen2.java* ++> [!Note] +> Replace [HDInsight Kafka](../../hdinsight/kafk)bootStrapServers with your own brokers for Kafka 2.4 or 3.2 ++``` java +package contoso.example; ++import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.serialization.SimpleStringEncoder; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.configuration.MemorySize; +import org.apache.flink.connector.file.sink.FileSink; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.flink.core.fs.Path; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy; ++import java.time.Duration; ++public class KafkaSinkToGen2 { + public static void main(String[] args) throws Exception { + // 1. get stream execution env + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ // 1. 
read kafka message as stream input, update your broker ip's + String brokers = "<update-broker-ip>:9092,<update-broker-ip>:9092,<update-broker-ip>:9092"; + KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers(brokers) + .setTopics("click_events") + .setGroupId("my-group") + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStream<String> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); + stream.print(); ++ // 3. sink to gen2, update container name and storage path + String outputPath = "abfs://<container-name>@<storage-path>.dfs.core.windows.net/flink/data/click_events"; + final FileSink<String> sink = FileSink + .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8")) + .withRollingPolicy( + DefaultRollingPolicy.builder() + .withRolloverInterval(Duration.ofMinutes(2)) + .withInactivityInterval(Duration.ofMinutes(3)) + .withMaxPartSize(MemorySize.ofMebiBytes(5)) + .build()) + .build(); ++ stream.sinkTo(sink); ++ // 4. run stream + env.execute("Kafka Sink To Gen2"); + } +} ++``` ++**Submit the job on Flink Dashboard UI** ++We are using Maven to package a jar onto local and submitting to Flink, and using Kafka to sink into ADLS Gen2 +++**Validate streaming data on ADLS Gen2** ++We are seeing the `click_events` streaming into ADLS Gen2 +++You can specify a rolling policy that rolls the in-progress part file on any of the following three conditions: ++``` java +.withRollingPolicy( + DefaultRollingPolicy.builder() + .withRolloverInterval(Duration.ofMinutes(5)) + .withInactivityInterval(Duration.ofMinutes(3)) + .withMaxPartSize(MemorySize.ofMebiBytes(5)) + .build()) +``` ++## Reference +- [Apache Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka) +- [Flink DataStream Filesystem](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/filesystem) + |
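As a small illustrative addition (not from the article), packaging the project before uploading it through the Flink Dashboard UI typically looks like the following; it assumes the module layout matches the pom.xml shown above.

```bash
# Build the fat JAR defined by the pom.xml above; the resulting artifact lands in target/.
mvn clean package -DskipTests
ls target/*.jar
```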
hdinsight-aks | Azure Databricks | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/azure-databricks.md | + + Title: Incorporate Flink DataStream into Azure Databricks Delta Lake Table +description: Learn about incorporate Flink DataStream into Azure Databricks Delta Lake Table in HDInsight on AKS - Apache Flink ++ Last updated : 10/05/2023+++# Incorporate Flink DataStream into Azure Databricks Delta Lake Table ++This example shows how to sink stream data landed into Azure ADLS Gen2 from HDInsight Flink cluster on AKS applications into Delta Lake tables using Azure Databricks Auto Loader. ++## Prerequisites ++- [HDInsight Flink 1.16.0 on AKS](./flink-create-cluster-portal.md) +- [HDInsight Kafka 3.2.0](../../hdinsight/kafk) +- [Azure Databricks](/azure/databricks/getting-started/) in the same VNET as HDInsight on AKS +- [ADLS Gen2](/azure/databricks/getting-started/connect-to-azure-storage/) and Service Principal ++## Azure Databricks Auto Loader ++Databricks Auto Loader makes it easy to stream data land into object storage from Flink applications into Delta Lake tables. [Auto Loader](/azure/databricks/ingestion/auto-loader/) provides a Structured Streaming source called cloudFiles. ++Here are the steps how you can use data from Flink in Azure Databricks delta live tables. ++### Create Kafka table on Flink SQL ++In this step, you can create Kafka table and ADLS Gen2 on Flink SQL. For the purpose of this document, we are using a airplanes_state_real_time table, you can use any topic of your choice. ++You are required to update the broker IPs with your Kafka cluster in the code snippet. ++```SQL +CREATE TABLE kafka_airplanes_state_real_time ( + `date` STRING, + `geo_altitude` FLOAT, + `icao24` STRING, + `latitude` FLOAT, + `true_track` FLOAT, + `velocity` FLOAT, + `spi` BOOLEAN, + `origin_country` STRING, + `minute` STRING, + `squawk` STRING, + `sensors` STRING, + `hour` STRING, + `baro_altitude` FLOAT, + `time_position` BIGINT, + `last_contact` BIGINT, + `callsign` STRING, + `event_time` STRING, + `on_ground` BOOLEAN, + `category` STRING, + `vertical_rate` FLOAT, + `position_source` INT, + `current_time` STRING, + `longitude` FLOAT + ) WITH ( + 'connector' = 'kafka', + 'topic' = 'airplanes_state_real_time', + 'scan.startup.mode' = 'latest-offset', + 'properties.bootstrap.servers' = '10.0.0.38:9092,10.0.0.39:9092,10.0.0.40:9092', + 'format' = 'json' +); +``` +Next, you can create ADLSgen2 table on Flink SQL. ++Update the container-name and storage-account-name in the code snippet with your ADLS Gen2 details. ++```SQL +CREATE TABLE adlsgen2_airplanes_state_real_time ( + `date` STRING, + `geo_altitude` FLOAT, + `icao24` STRING, + `latitude` FLOAT, + `true_track` FLOAT, + `velocity` FLOAT, + `spi` BOOLEAN, + `origin_country` STRING, + `minute` STRING, + `squawk` STRING, + `sensors` STRING, + `hour` STRING, + `baro_altitude` FLOAT, + `time_position` BIGINT, + `last_contact` BIGINT, + `callsign` STRING, + `event_time` STRING, + `on_ground` BOOLEAN, + `category` STRING, + `vertical_rate` FLOAT, + `position_source` INT, + `current_time` STRING, + `longitude` FLOAT + ) WITH ( + 'connector' = 'filesystem', + 'path' = 'abfs://<container-name>@<storage-account-name>/flink/airplanes_state_real_time/', + 'format' = 'json' + ); +``` ++Further, you can insert Kafka table into ADLSgen2 table on Flink SQL. 
+++### Validate the streaming job on Flink +++### Check data sink from Kafka in Azure Storage on Azure portal +++### Authentication of Azure Storage and Azure Databricks notebook ++ADLS Gen2 supports OAuth 2.0 with your Azure AD application service principal for authentication from an Azure Databricks notebook, which you can then mount into Azure Databricks DBFS. ++**Let's get the service principal app ID, tenant ID, and secret key.** +++**Grant the service principal the Storage Blob Data Owner role in the Azure portal** +++**Mount ADLS Gen2 into DBFS from an Azure Databricks notebook** +++**Prepare notebook** ++Let's write the following code: +```SQL +%sql +CREATE OR REFRESH STREAMING TABLE airplanes_state_real_time2 +AS SELECT * FROM cloud_files("dbfs:/mnt/contosoflinkgen2/flink/airplanes_state_real_time/", "json") +``` ++### Define Delta Live Table Pipeline and run on Azure Databricks ++++### Check Delta Live Table on Azure Databricks Notebook + |
hdinsight-aks | Azure Iot Hub | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/azure-iot-hub.md | + + Title: Process real-time IoT data on Flink with Azure HDInsight on AKS +description: How to integrate Azure IoT Hub and Apache Flink ++ Last updated : 10/03/2023+++# Process real-time IoT data on Flink with Azure HDInsight on AKS ++Azure IoT Hub is a managed service hosted in the cloud that acts as a central message hub for communication between an IoT application and its attached devices. You can connect millions of devices and their backend solutions reliably and securely. Almost any device can be connected to an IoT hub. ++## Prerequisites ++1. [Create an Azure IoT Hub](/azure/iot-hub/iot-hub-create-through-portal/) +2. [Create an HDInsight on AKS Flink cluster](./flink-create-cluster-portal.md) ++## Configure Flink cluster ++Add the ABFS storage account key to your Flink cluster's configuration. ++Add the following configurations: ++`fs.azure.account.key.<your storage account's dfs endpoint> = <your storage account's shared access key>` +++## Writing the Flink job ++### Set up configuration for ABFS ++```java +Properties props = new Properties(); +props.put( + "fs.azure.account.key.<your storage account's dfs endpoint>", + "<your storage account's shared access key>" +); ++Configuration conf = ConfigurationUtils.createConfiguration(props); ++StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf); ++``` +++This setup is required for Flink to authenticate with your ABFS storage account so that it can write data to it. ++### Defining the IoT Hub source ++IoT Hub is built on top of Event Hubs and hence supports a Kafka-like API. So in our Flink job, we can define a `KafkaSource` with appropriate parameters to consume messages from IoT Hub. ++```java +String connectionString = "<your iot hub connection string>"; ++KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers("<your iot hub's service bus url>:9093") + .setTopics("<name of your iot hub>") + .setGroupId("$Default") + .setProperty("partition.discovery.interval.ms", "10000") + .setProperty("security.protocol", "SASL_SSL") + .setProperty("sasl.mechanism", "PLAIN") + .setProperty("sasl.jaas.config", String.format("org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"%s\";", connectionString)) + .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++DataStream<String> kafka = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); +kafka.print(); +``` ++The connection string for IoT Hub can be found here: +++Within the connection string, you can find a service bus URL (URL of the underlying Event Hubs namespace), which you need to add as a bootstrap server in your Kafka source.
In this case, it is: `iothub-ns-sagiri-iot-25146639-20dff4e426.servicebus.windows.net:9093` ++### Defining the ABFS sink ++```java +String outputPath = "abfs://<container name>@<your storage account's dfs endpoint>"; ++final FileSink<String> sink = FileSink + .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8")) + .withRollingPolicy( + DefaultRollingPolicy.builder() + .withRolloverInterval(Duration.ofMinutes(2)) + .withInactivityInterval(Duration.ofMinutes(3)) + .withMaxPartSize(MemorySize.ofMebiBytes(5)) + .build()) + .build(); ++kafka.sinkTo(sink); +``` ++### Flink job code ++```java +package org.example; ++import java.time.Duration; +import java.util.Properties; +import org.apache.flink.api.common.serialization.SimpleStringEncoder; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.configuration.ConfigurationUtils; +import org.apache.flink.configuration.MemorySize; +import org.apache.flink.connector.file.sink.FileSink; +import org.apache.flink.core.fs.Path; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy; +import org.apache.kafka.clients.consumer.OffsetResetStrategy; ++public class StreamingJob { + public static void main(String[] args) throws Throwable { ++ Properties props = new Properties(); + props.put( + "fs.azure.account.key.<your storage account's dfs endpoint>", + "<your storage account's shared access key>" + ); ++ Configuration conf = ConfigurationUtils.createConfiguration(props); ++ StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf); ++ String connectionString = "<your iot hub connection string>"; ++ + KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers("<your iot hub's service bus url>:9093") + .setTopics("<name of your iot hub>") + .setGroupId("$Default") + .setProperty("partition.discovery.interval.ms", "10000") + .setProperty("security.protocol", "SASL_SSL") + .setProperty("sasl.mechanism", "PLAIN") + .setProperty("sasl.jaas.config", String.format("org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"%s\";", connectionString)) + .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST)) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); +++ DataStream<String> kafka = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); + kafka.print(); ++ String outputPath = "abfs://<container name>@<your storage account's dfs endpoint>"; ++ final FileSink<String> sink = FileSink + .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8")) + .withRollingPolicy( + DefaultRollingPolicy.builder() + .withRolloverInterval(Duration.ofMinutes(2)) + .withInactivityInterval(Duration.ofMinutes(3)) + .withMaxPartSize(MemorySize.ofMebiBytes(5)) + .build()) + .build(); ++ kafka.sinkTo(sink); ++ env.execute("Azure-IoTHub-Flink-ABFS"); + } +} ++``` ++#### Maven dependencies ++```xml +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + 
<version>${flink.version}</version> +</dependency> +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> +</dependency> +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-scala_2.12</artifactId> + <version>${flink.version}</version> +</dependency> +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> +</dependency> +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> +</dependency> +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> +</dependency> +``` +++### Submit job ++Submit job using HDInsight on AKS's [Flink job submission API](./flink-job-management.md) + |
hdinsight-aks | Change Data Capture Connectors For Apache Flink | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/change-data-capture-connectors-for-apache-flink.md | + + Title: How to perform Change Data Capture of SQL Server with DataStream API and DataStream Source. +description: Learn how to perform Change Data Capture of SQL Server with DataStream API and DataStream Source. ++ Last updated : 08/29/2023+++# Change Data Capture of SQL Server with DataStream API and DataStream Source +++Change Data Capture (CDC) is a technique you can use to track row-level changes in database tables in response to create, update, and delete operations. In this article, we use [CDC Connectors for Apache Flink®](https://github.com/ververica/flink-cdc-connectors), which offer a set of source connectors for Apache Flink. The connectors integrate [Debezium®](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/formats/debezium/#debezium-format) as the engine to capture the data changes. ++In this article, learn how to perform Change Data Capture of SQL Server using the DataStream API. The SQLServer CDC connector can also be a DataStream source. ++## Prerequisites ++* [HDInsight on AKS Apache Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../../hdinsight/kafk) + * Ensure the network settings are taken care of as described in [Using HDInsight Kafka](../flink/process-and-consume-data.md), so that HDInsight on AKS Flink and HDInsight Kafka are in the same VNet +* Azure SQL Server +* The HDInsight Kafka cluster and the HDInsight on AKS Flink cluster are located in the same VNet +* Install [IntelliJ IDEA](https://www.jetbrains.com/idea/download/#section=windows) for development on an Azure VM that is located in the HDInsight VNet ++### SQLServer CDC Connector ++The SQLServer CDC connector is a Flink source connector that first reads a snapshot of the database and then continues to read change events with exactly-once processing, even when failures happen. The SQLServer CDC connector can also be a DataStream source. ++### Single Thread Reading ++The SQLServer CDC source can't read in parallel, because only one task can receive change events. For more information, see [SQLServer CDC Connector](https://ververica.github.io/flink-cdc-connectors/master/content/connectors/sqlserver-cdc.html). ++### DataStream Source ++The SQLServer CDC connector can also be a DataStream source. You can create a SourceFunction. ++## How does the SQLServer CDC connector work? ++To optimally configure and run a Debezium SQL Server connector, it's helpful to understand how the connector performs snapshots, streams change events, determines Kafka topic names, and uses metadata. ++- **Snapshots**: SQL Server CDC isn't designed to store a complete history of database changes. To establish a baseline for the current state of the database, the Debezium SQL Server connector uses a process called *snapshotting*. +++## Apache Flink on HDInsight on AKS ++Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Apache Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. 
++For more information, refer ++* [Apache Flink®—Stateful Computations over Data Streams](https://flink.apache.org/) +* [What is Apache Flink in HDInsight on AKS](./flink-overview.md) ++## Apache Kafka on HDInsight ++Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams. ++For more information, refer [Apache Kafka in Azure HDInsight](../../hdinsight/kafk) ++## Perform a test ++#### Prepare DB and table on Sqlserver ++``` +CREATE DATABASE inventory; +GO +``` +**CDC is enabled on the SQL Server database** ++``` +USE inventory; +EXEC sys.sp_cdc_enable_db; +GO +``` ++**Verify that the user has access to the CDC table** ++``` +USE inventory +GO +EXEC sys.sp_cdc_help_change_data_capture +GO +``` +> [!NOTE] +> The query returns configuration information for each table in the database that is enabled for CDC and that contains change data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables. ++**Create and populate products with single insert with many rows** ++``` +CREATE TABLE products ( +id INTEGER IDENTITY(101,1) NOT NULL PRIMARY KEY, +name VARCHAR(255) NOT NULL, +description VARCHAR(512), +weight FLOAT +); ++INSERT INTO products(name,description,weight) +VALUES ('scooter','Small 2-wheel scooter',3.14); +INSERT INTO products(name,description,weight) +VALUES ('car battery','12V car battery',8.1); +INSERT INTO products(name,description,weight) +VALUES ('12-pack drill bits','12-pack of drill bits with sizes ranging from #40 to #3',0.8); +INSERT INTO products(name,description,weight) +VALUES ('hammer','12oz carpenter''s hammer',0.75); +INSERT INTO products(name,description,weight) +VALUES ('hammer','14oz carpenter''s hammer',0.875); +INSERT INTO products(name,description,weight) +VALUES ('hammer','16oz carpenter''s hammer',1.0); +INSERT INTO products(name,description,weight) +VALUES ('rocks','box of assorted rocks',5.3); +INSERT INTO products(name,description,weight) +VALUES ('jacket','water resistent black wind breaker',0.1); +INSERT INTO products(name,description,weight) +VALUES ('spare tire','24 inch spare tire',22.2); ++EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'products', @role_name = NULL, @supports_net_changes = 0; ++-- Create some very simple orders +CREATE TABLE orders ( +id INTEGER IDENTITY(10001,1) NOT NULL PRIMARY KEY, +order_date DATE NOT NULL, +purchaser INTEGER NOT NULL, +quantity INTEGER NOT NULL, +product_id INTEGER NOT NULL, +FOREIGN KEY (product_id) REFERENCES products(id) +); ++INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('16-JAN-2016', 1001, 1, 102); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('17-JAN-2016', 1002, 2, 105); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('19-FEB-2016', 1002, 2, 106); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('21-FEB-2016', 1003, 1, 107); ++EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'orders', @role_name = NULL, @supports_net_changes = 0; +GO +``` +##### Maven source code on IdeaJ ++In the below snippet, we use HDInsight Kafka 2.4.1. Based on your usage, update the version of Kafka on `<kafka.version>`. 
++**maven pom.xml** ++```xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> ++ <groupId>org.example</groupId> + <artifactId>FlinkDemo</artifactId> + <version>1.0-SNAPSHOT</version> + <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + <kafka.version>2.4.1</kafka.version> // Replace with 3.2 if you're using HDInsight Kafka 3.2 + </properties> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-base --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-base</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-core</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-sql-connector-elasticsearch7 --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-sql-connector-elasticsearch7</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <!-- https://mvnrepository.com/artifact/com.ververica/flink-sql-connector-sqlserver-cdc --> + <dependency> + <groupId>com.ververica</groupId> + <artifactId>flink-sql-connector-sqlserver-cdc</artifactId> + <version>2.2.1</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-common --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-common</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-planner --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-planner_2.12</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-api-scala --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-api-scala_2.12</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> +</project> +``` ++**mssqlSinkToKafka.java** ++```java +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.connector.base.DeliveryGuarantee; +import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema; +import org.apache.flink.connector.kafka.sink.KafkaSink; 
++import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.source.SourceFunction; ++import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema; +import com.ververica.cdc.connectors.sqlserver.SqlServerSource; ++public class mssqlSinkToKafka { ++ public static void main(String[] args) throws Exception { + // 1: Stream execution environment, update the kafka brokers below. + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1); //use parallelism 1 for sink to keep message ordering ++ String kafka_brokers = "wn0-sampleka:9092,wn1-sampleka:9092,wn2-sampleka:9092"; ++ // 2. sql server source - Update your sql server name, username, password + SourceFunction<String> sourceFunction = SqlServerSource.<String>builder() + .hostname("<samplehilosqlsever>.database.windows.net") + .port(1433) + .database("inventory") // monitor sqlserver database + .tableList("dbo.orders") // monitor products table + .username("username") // username + .password("password") // password + .deserializer(new JsonDebeziumDeserializationSchema()) // converts SourceRecord to JSON String + .build(); ++ DataStream<String> stream = env.addSource(sourceFunction); + stream.print(); ++ // 3. sink order table transaction to kafka + KafkaSink<String> sink = KafkaSink.<String>builder() + .setBootstrapServers(kafka_brokers) + .setRecordSerializer(KafkaRecordSerializationSchema.builder() + .setTopic("mssql_order") + .setValueSerializationSchema(new SimpleStringSchema()) + .build() + ) + .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) + .build(); + stream.sinkTo(sink); ++ // 4. run stream + env.execute(); + } +} +``` ++### Validation ++- Insert four rows into table order on sqlserver, then check on Kafka ++ :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/check-kafka-output.png" alt-text="Screenshot showing how to check Kafka output."::: + +- Insert more rows on sqlserver ++ :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/insert-more-rows-on-sql-server.png" alt-text="Screenshot showing how to insert more rows on sqlserver."::: ++- Check changes on Kafka ++ :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/check-changes-on-kafka.png" alt-text="Screenshot showing changes made in Kafka after inserting four rows."::: + +- Update `product_id=107` on sqlserver + + :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/update-product-id-107.png" alt-text="Screenshot showing update for product ID 107."::: + + - Check changes on Kafka for the updated ID 107 + + :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/check-changes-on-kafka-for-id-107.png" alt-text="Screenshot showing changes in Kafka for updated ID 107."::: + + - Delete `product_id=107` on sqlserver ++ :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/delete-product-id-107-on-sql-server.png" alt-text="Screenshot showing how to delete product ID 107."::: + + :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/delete-product-id-107-output.png" alt-text="Screenshot showing deleted items on SQL Server."::: + + - Check changes on Kafka for the deleted `product_id=107` + + :::image type="content" 
source="./media/change-data-capture-connectors-for-apache-flink/check-changes-on-kafka-for-deleted-records.png" alt-text="Screenshot showing in Kafka for deleted items."::: + + - The following JSON message on Kafka shows the change event in JSON format. + + :::image type="content" source="./media/change-data-capture-connectors-for-apache-flink/json-output.png" alt-text="Screenshot showing JSON output."::: + +### Reference ++* [SQLServer CDC Connector](https://github.com/ververic) is licensed under [Apache 2.0 License](https://github.com/ververica/flink-cdc-connectors/blob/master/LICENSE) +* [Apache Kafka in Azure HDInsight](../../hdinsight/kafk) +* [Flink Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka/#behind-the-scene) |
hdinsight-aks | Cosmos Db For Apache Cassandra | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/cosmos-db-for-apache-cassandra.md | + + Title: Using Azure Cosmos DB (Apache Cassandra) with HDInsight on AKS - Flink +description: Learn how to Sink HDInsight Kafka message into Azure Cosmos DB for Apache Cassandra, with Apache Flink running on HDInsight on AKS. ++ Last updated : 08/29/2023+++# Sink Kafka messages into Azure Cosmos DB for Apache Cassandra, with HDInsight on AKS - Flink +++This example uses [HDInsight on AKS Flink 1.16.0](../flink/flink-overview.md) to sink [HDInsight Kafka 3.2.0](/azure/hdinsight/kafka/apache-kafka-introduction) messages into [Azure Cosmos DB for Apache Cassandra](/azure/cosmos-db/cassandra/introduction) ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight 5.1 Kafka 3.2](../../hdinsight/kafk) +* [Azure Cosmos DB for Apache Cassandra](../../cosmos-db/cassandra/index.yml) +* Prepare an Ubuntu VM as maven project development env in the same VNet as HDInsight on AKS. ++## Azure Cosmos DB for Apache Cassandra ++Azure Cosmos DB for Apache Cassandra can be used as the data store for apps written for Apache Cassandra. This compatibility means that by using existing Apache drivers compliant with CQLv4, your existing Cassandra application can now communicate with the API for Cassandra. ++For more information, see the following links. ++* [Azure Cosmos DB for Apache Cassandra](../../cosmos-db/cassandr). +* [Create a API for Cassandra account in Azure Cosmos DB](../../cosmos-db/cassandr). +++Get credentials uses it on Stream source code: +++## Implementation ++**On an Ubuntu VM, let's prepare the development environment** ++### Cloning repository of Azure Samples ++Refer GitHub readme to download maven, clone this repository using `Azure-Samples/azure-cosmos-db-cassandra-java-getting-started.git` from +[Azure Samples ](https://github.com/Azure-Samples/azure-cosmos-db-cassandra-java-getting-started) ++### Updating maven project for Cassandra ++Go to maven project folder **azure-cosmos-db-cassandra-java-getting-started-main** and update the changes required for this example ++**maven pom.xml** +``` xml ++<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> ++ <groupId>com.azure.cosmosdb.cassandra</groupId> + <artifactId>cosmosdb-cassandra-examples</artifactId> + <version>1.0-SNAPSHOT</version> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>1.16.0</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>1.16.0</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>1.16.0</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>1.16.0</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>1.16.0</version> + 
</dependency> + <dependency> + <groupId>com.datastax.cassandra</groupId> + <artifactId>cassandra-driver-core</artifactId> + <version>3.3.0</version> + </dependency> + <dependency> + <groupId>com.datastax.cassandra</groupId> + <artifactId>cassandra-driver-mapping</artifactId> + <version>3.1.4</version> + </dependency> + <dependency> + <groupId>com.datastax.cassandra</groupId> + <artifactId>cassandra-driver-extras</artifactId> + <version>3.1.4</version> + </dependency> + <dependency> + <groupId>org.slf4j</groupId> + <artifactId>slf4j-api</artifactId> + <version>1.7.5</version> + </dependency> + <dependency> + <groupId>org.slf4j</groupId> + <artifactId>slf4j-log4j12</artifactId> + <version>1.7.5</version> + </dependency> + </dependencies> ++ <build> + <plugins> + <plugin> + <artifactId>maven-assembly-plugin</artifactId> + <configuration> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + <finalName>cosmosdb-cassandra-examples</finalName> + <appendAssemblyId>false</appendAssemblyId> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-compiler-plugin</artifactId> + <configuration> + <source>1.8</source> + <target>1.8</target> + </configuration> + </plugin> + </plugins> + </build> ++</project> ++``` +**Cosmos DB for Apache Cassandra's connection configuration** ++You're required to update your host-name and user-name, and keys in the below snippet. ++``` +root@flinkvm:/home/flinkvm/azure-cosmos-db-cassandra-java-getting-started-main/src/main/resources# cat config.properties +###Cassandra endpoint details on cosmosdb +cassandra_host=<update-host-name>.cassandra.cosmos.azure.com +cassandra_port = 10350 +cassandra_username=<update-user-name> +cassandra_password=mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx +#ssl_keystore_file_path=<SSL key store file location> +#ssl_keystore_password=<SSL key store password> +``` ++**source structure** ++``` +root@flinkvm:/home/flinkvm/azure-cosmos-db-cassandra-java-getting-started-main/src/main/java/com/azure/cosmosdb/cassandra# ll +total 24 +drwxr-xr-x 5 root root 4096 May 12 12:46 ./ +drwxr-xr-x 3 root root 4096 Apr 9 2020 ../ +-rw-r--r-- 1 root root 1105 Apr 9 2020 User.java +drwxr-xr-x 2 root root 4096 May 15 03:53 examples/ +drwxr-xr-x 2 root root 4096 Apr 9 2020 repository/ +drwxr-xr-x 2 root root 4096 May 15 02:43 util/ +``` ++**util folder** +**CassandraUtils.java** ++> [!NOTE] +> Change ssl_keystore_file_path depends on the java cert location. 
On HDInsight on AKS Apache Flink, the path is `/usr/lib/jvm/msopenjdk-11-jre/lib/security` ++``` java +package com.azure.cosmosdb.cassandra.util; ++import com.datastax.driver.core.*; ++import javax.net.ssl.*; +import java.io.File; +import java.io.FileInputStream; +import java.io.InputStream; +import java.security.*; ++/** + * Cassandra utility class to handle the Cassandra Sessions + */ +public class CassandraUtils { ++ private Cluster cluster; + private Configurations config = new Configurations(); + private String cassandraHost = "<cassandra-host-ip>"; + private int cassandraPort = 10350; + private String cassandraUsername = "localhost"; + private String cassandraPassword = "<cassandra-password>"; + private File sslKeyStoreFile = null; + private String sslKeyStorePassword = "<keystore-password>"; +++ /** + * This method creates a Cassandra Session based on the the end-point details given in config.properties. + * This method validates the SSL certificate based on ssl_keystore_file_path & ssl_keystore_password properties. + * If ssl_keystore_file_path & ssl_keystore_password are not given then it uses 'cacerts' from JDK. + * @return Session Cassandra Session + */ + public Session getSession() { ++ try { + //Load cassandra endpoint details from config.properties + loadCassandraConnectionDetails(); ++ final KeyStore keyStore = KeyStore.getInstance("JKS"); + try (final InputStream is = new FileInputStream(sslKeyStoreFile)) { + keyStore.load(is, sslKeyStorePassword.toCharArray()); + } ++ final KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory + .getDefaultAlgorithm()); + kmf.init(keyStore, sslKeyStorePassword.toCharArray()); + final TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory + .getDefaultAlgorithm()); + tmf.init(keyStore); ++ // Creates a socket factory for HttpsURLConnection using JKS contents. + final SSLContext sc = SSLContext.getInstance("TLSv1.2"); + sc.init(kmf.getKeyManagers(), tmf.getTrustManagers(), new java.security.SecureRandom()); ++ JdkSSLOptions sslOptions = RemoteEndpointAwareJdkSSLOptions.builder() + .withSSLContext(sc) + .build(); + cluster = Cluster.builder() + .addContactPoint(cassandraHost) + .withPort(cassandraPort) + .withCredentials(cassandraUsername, cassandraPassword) + .withSSL(sslOptions) + .build(); ++ return cluster.connect(); + } catch (Exception ex) { + ex.printStackTrace(); + } + return null; + } ++ public Cluster getCluster() { + return cluster; + } ++ /** + * Closes the cluster and Cassandra session + */ + public void close() { + cluster.close(); + } ++ /** + * Loads Cassandra end-point details from config.properties. + * @throws Exception + */ + private void loadCassandraConnectionDetails() throws Exception { + cassandraHost = config.getProperty("cassandra_host"); + cassandraPort = Integer.parseInt(config.getProperty("cassandra_port")); + cassandraUsername = config.getProperty("cassandra_username"); + cassandraPassword = config.getProperty("cassandra_password"); + String ssl_keystore_file_path = config.getProperty("ssl_keystore_file_path"); + String ssl_keystore_password = config.getProperty("ssl_keystore_password"); ++ // If ssl_keystore_file_path, build the path using JAVA_HOME directory. 
+ if (ssl_keystore_file_path == null || ssl_keystore_file_path.isEmpty()) { + String javaHomeDirectory = System.getenv("JAVA_HOME"); + if (javaHomeDirectory == null || javaHomeDirectory.isEmpty()) { + throw new Exception("JAVA_HOME not set"); + } + ssl_keystore_file_path = new StringBuilder(javaHomeDirectory).append("/lib/security/cacerts").toString(); + } ++ sslKeyStorePassword = (ssl_keystore_password != null && !ssl_keystore_password.isEmpty()) ? + ssl_keystore_password : sslKeyStorePassword; ++ sslKeyStoreFile = new File(ssl_keystore_file_path); ++ if (!sslKeyStoreFile.exists() || !sslKeyStoreFile.canRead()) { + throw new Exception(String.format("Unable to access the SSL Key Store file from %s", ssl_keystore_file_path)); + } + } +} +``` ++**Configurations.java** ++``` java +package com.azure.cosmosdb.cassandra.util; ++import org.slf4j.Logger; +import org.slf4j.LoggerFactory; ++import java.io.IOException; +import java.io.InputStream; +import java.util.Properties; ++/** + * Configuration utility to read the configurations from properties file + */ +public class Configurations { + private static final Logger LOGGER = LoggerFactory.getLogger(Configurations.class); + private static String PROPERTY_FILE = "config.properties"; + private static Properties prop = null; ++ private void loadProperties() throws IOException { + InputStream input = getClass().getClassLoader().getResourceAsStream(PROPERTY_FILE); + if (input == null) { + LOGGER.error("Sorry, unable to find {}", PROPERTY_FILE); + return; + } + prop = new Properties(); + prop.load(input); + } ++ public String getProperty(String propertyName) throws IOException { + if (prop == null) { + loadProperties(); + } + return prop.getProperty(propertyName); ++ } +} +``` ++**Examples folder** ++**CassandraSink.java** +``` java +package com.azure.cosmosdb.cassandra.examples; ++import com.datastax.driver.core.PreparedStatement; +import com.datastax.driver.core.Session; +import org.apache.flink.api.java.tuple.Tuple3; +import org.apache.flink.streaming.api.functions.sink.SinkFunction; +import com.azure.cosmosdb.cassandra.repository.UserRepository; +import com.azure.cosmosdb.cassandra.util.CassandraUtils; ++public class CassandraSink implements SinkFunction<Tuple3<Integer, String, String>> { ++ @Override + public void invoke(Tuple3<Integer, String, String> value, Context context) throws Exception { ++ CassandraUtils utils = new CassandraUtils(); + Session cassandraSession = utils.getSession(); + try { + UserRepository repository = new UserRepository(cassandraSession); ++ //Insert rows into user table + PreparedStatement preparedStatement = repository.prepareInsertStatement(); + repository.insertUser(preparedStatement, value.f0, value.f1, value.f2); ++ } finally { + if (null != utils) utils.close(); + if (null != cassandraSession) cassandraSession.close(); + } + } +} +``` ++**main class: CassandraDemo.java** ++> [!Note] +> * Replace Kafka Broker IPs with your cluster broker IPs +> * Prepare topic +> * user `/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user --bootstrap-server wn0-flinkd:9092` ++``` java +package com.azure.cosmosdb.cassandra.examples; ++import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.api.common.typeinfo.Types; +import org.apache.flink.api.java.tuple.Tuple3; +import org.apache.flink.connector.kafka.source.KafkaSource; +import 
org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; ++public class CassandraDemo { + public static void main(String[] args) throws Exception { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1); ++ // 1. read kafka message as stream input, update the broker IPs from your Kafka setup + String brokers = "<update-broker-ips>:9092,<update-broker-ips>:9092,<update-broker-ips>:9092"; ++ KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers(brokers) + .setTopics("user") + .setGroupId("my-group") + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStream<String> kafka = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); + kafka.print(); ++ DataStream<Tuple3<Integer,String,String>> dataStream = kafka.map(line-> { + String[] fields = line.split(","); + int v1 = Integer.parseInt(fields[0]); + Tuple3<Integer,String,String> tuple3 = Tuple3.of(v1,fields[1],fields[2]); + return tuple3; + }).returns(Types.TUPLE(Types.INT,Types.STRING,Types.STRING)); +++ dataStream.addSink(new CassandraSink()); ++ // 4. run stream + env.execute("sink Kafka to Cosmos DB for Apache Cassandra"); + } +} +``` ++### Building the project ++Run **mvn clean install** from azure-cosmos-db-cassandra-java-getting-started-main folder to build the project. This command generates cosmosdb-cassandra-examples.jar under target folder ++``` +root@flinkvm:/home/flinkvm/azure-cosmos-db-cassandra-java-getting-started-main/target# ll +total 91156 +drwxr-xr-x 7 root root 4096 May 15 03:54 ./ +drwxr-xr-x 7 root root 4096 May 15 03:54 ../ +drwxr-xr-x 2 root root 4096 May 15 03:54 archive-tmp/ +drwxr-xr-x 3 root root 4096 May 15 03:54 classes/ +-rw-r--r-- 1 root root 15542 May 15 03:54 cosmosdb-cassandra-examples-1.0-SNAPSHOT.jar +-rw-r--r-- 1 root root 93290819 May 15 03:54 cosmosdb-cassandra-examples.jar +drwxr-xr-x 3 root root 4096 May 15 03:54 generated-sources/ +drwxr-xr-x 2 root root 4096 May 15 03:54 maven-archiver/ +drwxr-xr-x 3 root root 4096 May 15 03:54 maven-status/ +``` ++### Uploading the jar for Apache Flink Job submission ++Upload jar into Azure storage and wget into webssh ++``` +msdata@pod-0 [ ~ ]$ ls -l cosmosdb-cassandra-examples.jar +-rw-r-- 1 msdata msdata 93290819 May 15 04:02 cosmosdb-cassandra-examples.jar +``` ++## Preparing Cosmos DB KeyStore and Table ++Run UserProfile class in /azure-cosmos-db-cassandra-java-getting-started-main/src/main/java/com/azure/cosmosdb/cassandra/examples to create Azure Cosmos DB's keystore and table. 
++``` +bin/flink run -c com.azure.cosmosdb.cassandra.examples.UserProfile -j cosmosdb-cassandra-examples.jar +``` ++## Sink Kafka Topics into Cosmos DB (Apache Cassandra) ++Run CassandraDemo class to sink Kafka topic into Cosmos DB for Apache Cassandra ++``` +bin/flink run -c com.azure.cosmosdb.cassandra.examples.CassandraDemo -j cosmosdb-cassandra-examples.jar +``` +++## Validate Apache Flink Job Submission ++Check job on HDInsight on AKS Flink UI +++## Producing Messages in Kafka ++Produce message into Kafka topic ++``` python +sshuser@hn0-flinkd:~$ cat user.py +import time +from datetime import datetime +import random ++user_set = [ + 'John', + 'Mike', + 'Lucy', + 'Tom', + 'Machael', + 'Lily', + 'Zark', + 'Tim', + 'Andrew', + 'Pick', + 'Sean', + 'Luke', + 'Chunck' +] ++city_set = [ + 'Atmore', + 'Auburn', + 'Bessemer', + 'Birmingham', + 'Chickasaw', + 'Clanton', + 'Decatur', + 'Florence', + 'Greenville', + 'Jasper', + 'Huntsville', + 'Homer', + 'Homer' +] ++def main(): + while True: + unique_id = str(int(time.time())) + if random.randrange(10) < 4: + city = random.choice(city_set[:3]) + else: + city = random.choice(city_set) + user = random.choice(user_set) + print(unique_id + "," + user + "," + city ) + time.sleep(1) ++if __name__ == "__main__": + main() +``` ++``` +sshuser@hn0-flinkd:~$ python user.py | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-flinkd:9092 --topic user & +[2] 11516 +``` ++## Check table on Cosmos DB for Apache Cassandra on Azure portal +++### Preferences ++* [Azure Cosmos DB for Apache Cassandra](../../cosmos-db/cassandr). +* [Create a API for Cassandra account in Azure Cosmos DB](../../cosmos-db/cassandr) +* [Azure Samples ](https://github.com/Azure-Samples/azure-cosmos-db-cassandra-java-getting-started) |
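Besides checking the table in the Azure portal, you can verify row counts from the development VM with the same driver classes the sample already uses. The sketch below is an assumption: it reuses the sample's `CassandraUtils` and assumes the keyspace and table created by the `UserProfile` step are named `uprofile` and `user` (as in the Azure sample); adjust the names to match your run.

```java
package com.azure.cosmosdb.cassandra.examples;

import com.azure.cosmosdb.cassandra.util.CassandraUtils;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class VerifyRowCount {
    public static void main(String[] args) {
        CassandraUtils utils = new CassandraUtils();
        Session session = utils.getSession();
        try {
            // Assumed keyspace/table names; replace with the ones created by UserProfile.
            Row row = session.execute("SELECT COUNT(*) FROM uprofile.user").one();
            System.out.println("Rows in uprofile.user: " + row.getLong(0));
        } finally {
            if (session != null) session.close();
            utils.close();
        }
    }
}
```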
hdinsight-aks | Create Kafka Table Flink Kafka Sql Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/create-kafka-table-flink-kafka-sql-connector.md | + + Title: How to create Kafka table on Apache FlinkSQL - Azure portal +description: Learn how to create Kafka table on Apache FlinkSQL ++ Last updated : 10/06/2023+++# Create Kafka table on Apache FlinkSQL +++Using this example, learn how to Create Kafka table on Apache FlinkSQL. ++## Prerequisites ++* [HDInsight Kafka](../../hdinsight/kafk) +* [HDInsight on AKS Apache Flink 1.16.0](../flink/flink-create-cluster-portal.md) ++## Kafka SQL connector on Apache Flink ++The Kafka connector allows for reading data from and writing data into Kafka topics. For more information, refer [Apache Kafka SQL Connector](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/kafka) ++## Create a Kafka table on Apache Flink SQL ++### Prepare topic and data on HDInsight Kafka ++**Prepare messages with weblog.py** ++``` Python +import random +import json +import time +from datetime import datetime ++user_set = [ + 'John', + 'XiaoMing', + 'Mike', + 'Tom', + 'Machael', + 'Zheng Hu', + 'Zark', + 'Tim', + 'Andrew', + 'Pick', + 'Sean', + 'Luke', + 'Chunck' +] ++web_set = [ + 'https://google.com', + 'https://facebook.com?id=1', + 'https://tmall.com', + 'https://baidu.com', + 'https://taobao.com', + 'https://aliyun.com', + 'https://apache.com', + 'https://flink.apache.com', + 'https://hbase.apache.com', + 'https://github.com', + 'https://gmail.com', + 'https://stackoverflow.com', + 'https://python.org' +] ++def main(): + while True: + if random.randrange(10) < 4: + url = random.choice(web_set[:3]) + else: + url = random.choice(web_set) ++ log_entry = { + 'userName': random.choice(user_set), + 'visitURL': url, + 'ts': datetime.now().strftime("%m/%d/%Y %H:%M:%S") + } ++ print(json.dumps(log_entry)) + time.sleep(0.05) ++if __name__ == "__main__": + main() +``` ++**Pipeline to Kafka topic** ++``` +sshuser@hn0-contsk:~$ python weblog.py | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic click_events +``` ++**Other commands:** ++``` +-- create topic +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic click_events --bootstrap-server wn0-contsk:9092 ++-- delete topic +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --delete --topic click_events --bootstrap-server wn0-contsk:9092 ++-- consume topic +sshuser@hn0-contsk:~$ /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server wn0-contsk:9092 --topic click_events --from-beginning +{"userName": "Luke", "visitURL": "https://flink.apache.com", "ts": "06/26/2023 14:33:43"} +{"userName": "Tom", "visitURL": "https://stackoverflow.com", "ts": "06/26/2023 14:33:43"} +{"userName": "Chunck", "visitURL": "https://google.com", "ts": "06/26/2023 14:33:44"} +{"userName": "Chunck", "visitURL": "https://facebook.com?id=1", "ts": "06/26/2023 14:33:44"} +{"userName": "John", "visitURL": "https://tmall.com", "ts": "06/26/2023 14:33:44"} +{"userName": "Andrew", "visitURL": "https://facebook.com?id=1", "ts": "06/26/2023 14:33:44"} +{"userName": "John", "visitURL": "https://tmall.com", "ts": "06/26/2023 14:33:44"} +{"userName": "Pick", "visitURL": "https://google.com", "ts": "06/26/2023 14:33:44"} +{"userName": "Mike", "visitURL": "https://tmall.com", "ts": "06/26/2023 14:33:44"} +{"userName": "Zheng Hu", "visitURL": "https://tmall.com", "ts": 
"06/26/2023 14:33:44"} +{"userName": "Luke", "visitURL": "https://facebook.com?id=1", "ts": "06/26/2023 14:33:44"} +{"userName": "John", "visitURL": "https://flink.apache.com", "ts": "06/26/2023 14:33:44"} ++``` ++### Apache Flink SQL client ++Detailed instructions are provided on how to use Secure Shell for [Flink SQL client](./flink-web-ssh-on-portal-to-flink-sql.md) ++### Download Kafka SQL Connector & Dependencies into SSH ++We're using the **Kafka 3.2.0** dependencies in the below step, You're required to update the command based on your Kafka version on HDInsight. +``` +wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.0/kafka-clients-3.2.0.jar +wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka/1.16.0/flink-connector-kafka-1.16.0.jar +``` ++### Connect to Apache Flink SQL Client ++Let's now connect to the Flink SQL Client with Kafka SQL client jars +``` +msdata@pod-0 [ /opt/flink-webssh ]$ bin/sql-client.sh -j flink-connector-kafka-1.16.0.jar -j kafka-clients-3.2.0.jar +``` ++### Create Kafka table on Apache Flink SQL ++Let's create the Kafka table on Flink SQL, and select the Kafka table on Flink SQL. ++You're required to update your Kafka bootstrap server IPs in the below snippet. ++``` sql +CREATE TABLE KafkaTable ( +`userName` STRING, +`visitURL` STRING, +`ts` TIMESTAMP(3) METADATA FROM 'timestamp' +) WITH ( +'connector' = 'kafka', +'topic' = 'click_events', +'properties.bootstrap.servers' = '<update-kafka-bootstrapserver-ip>:9092,<update-kafka-bootstrapserver-ip>:9092,<update-kafka-bootstrapserver-ip>:9092', +'properties.group.id' = 'my_group', +'scan.startup.mode' = 'earliest-offset', +'format' = 'json' +); ++select * from KafkaTable; +``` +++### Produce Kafka messages ++Let's now produce Kafka messages to the same topic, using HDInsight Kafka +``` +python weblog.py | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic click_events +``` ++### Table on Apache Flink SQL ++You can monitor the table on Flink SQL +++Here are the streaming jobs on Flink Web UI +++## Reference ++* [Apache Kafka SQL Connector](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/kafka) |
hdinsight-aks | Datastream Api Mongodb | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/datastream-api-mongodb.md | + + Title: DataStream API for MongoDB as a source and sink on Apache Flink +description: Learn how to use DataStream API for MongoDB as a source and sink on Apache Flink ++ Last updated : 08/29/2023+++# DataStream API for MongoDB as a source and sink on Apache Flink +++Apache Flink provides a MongoDB connector for reading and writing data from and to MongoDB collections with at-least-once guarantees. ++This example demonstrates on how to use HDInsight on AKS Apache Flink 1.16.0 along with your existing MongoDB as Sink and Source with Flink DataStream API MongoDB connector. ++MongoDB is a non-relational document database that provides support for JSON-like storage that helps store complex structures easily. ++In this example, you learn how to use MongoDB to source and sink with DataStream API. ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* For this demonstration, use a Window VM as maven project develop env in the same VNET as HDInsight on AKS. +* We use the [Apache Flink - MongoDB Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/mongodb/) +* For this demonstration, use an Ubuntu VM in the same VNET as HDInsight on AKS, install a MongoDB on this VM. ++## Installation of MongoDB on Ubuntu VM ++[Install MongoDB on Ubuntu](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/) ++[MongoDB Shell commands](https://www.mongodb.com/docs/mongodb-shell/run-commands/) ++**Prepare MongoDB environment**: +``` +root@contosoubuntuvm:/var/lib/mongodb# vim /etc/mongod.conf ++# network interfaces +net: + port: 27017 + bindIp: 0.0.0.0 ++-- Start mongoDB +root@contosoubuntuvm:/var/lib/mongodb# systemctl start mongod +root@contosoubuntuvm:/var/lib/mongodb# systemctl status mongod +● mongod.service - MongoDB Database Server + Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled) + Active: active (running) since Fri 2023-06-16 00:07:39 UTC; 5s ago + Docs: https://docs.mongodb.org/manual + Main PID: 415775 (mongod) + Memory: 165.4M + CGroup: /system.slice/mongod.service + └─415775 /usr/bin/mongod --config /etc/mongod.conf ++Jun 16 00:07:39 contosoubuntuvm systemd[1]: Started MongoDB Database Server. +Jun 16 00:07:39 contosoubuntuvm mongod[415775]: {"t":{"$date":"2023-06-16T00:07:39.091Z"},"s":"I", "c":"CONTROL", "id":7484500, "ctx":"-","msg"> ++-- check connectivity +root@contosoubuntuvm:/var/lib/mongodb# telnet 10.0.0.7 27017 +Trying 10.0.0.7... +Connected to 10.0.0.7. +Escape character is '^]'. ++-- Use mongosh to connect to mongodb +root@contosoubuntuvm:/var/lib/mongodb# mongosh "mongodb://10.0.0.7:27017/test" +Current Mongosh Log ID: 648bccc3b8a6b0885614b2dc +Connecting to: mongodb://10.0.0.7:27017/test?directConnection=true&appName=mongosh+1.10.0 +Using MongoDB: 6.0.6 +Using Mongosh: 1.10.0 ++For mongosh info see: https://docs.mongodb.com/mongodb-shell/ +++ The server generated these startup warnings when booting + 2023-06-16T00:07:39.103+00:00: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem + 2023-06-16T00:07:40.108+00:00: Access control is not enabled for the database. 
Read and write access to data and configuration is unrestricted + 2023-06-16T00:07:40.108+00:00: /sys/kernel/mm/transparent_hugepage/enabled is 'always'. We suggest setting it to 'never' + 2023-06-16T00:07:40.108+00:00: vm.max_map_count is too low +++- Check `click_events` collection ++test> db.click_events.count() +0 +``` ++> [!NOTE] +> To ensure the MongoDB setup can be accessed outside, change bindIp to `0.0.0.0`. ++``` +vim /etc/mongod.conf +# network interfaces +net: + port: 27017 + bindIp: 0.0.0.0 +``` ++## Get started ++### Create a maven project on IdeaJ, to prepare the pom.xml for MongoDB Collection ++``` xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> +<modelVersion>4.0.0</modelVersion> ++<groupId>org.example</groupId> +<artifactId>MongoDBDemo</artifactId> +<version>1.0-SNAPSHOT</version> +<properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> +</properties> +<dependencies> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-files --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-mongodb</artifactId> + <version>1.0.1-1.16</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-common --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-common</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-api-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-api-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/net.java.loci/jsr308-all --> + <dependency> + <groupId>net.java.loci</groupId> + <artifactId>jsr308-all</artifactId> + <version>1.1.2</version> + </dependency> +</dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++### Generate a stream 
source and sink to the MongoDB collection:click_events +**MongoDBSinkDemo.java** +``` java +package contoso.example; ++import com.mongodb.client.model.InsertOneModel; +import org.apache.flink.connector.base.DeliveryGuarantee; +import org.apache.flink.connector.mongodb.sink.MongoSink; +import org.apache.flink.streaming.api.datastream.DataStreamSource; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.bson.BsonDocument; ++public class MongoDBSinkDemo { + public static void main(String[] args) throws Exception { + // 1. get stream env + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ // 2. event data source, update the ip address from 10.0.0.7 to your MongoDB IP + DataStreamSource<Event> stream = env.addSource(new ClickSource()); + stream.print(); ++ MongoSink<Event> sink = MongoSink.<Event>builder() + .setUri("mongodb://10.0.0.7:27017") + .setDatabase("test") + .setCollection("click_events") + .setBatchSize(1000) + .setBatchIntervalMs(1000) + .setMaxRetries(3) + .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) + .setSerializationSchema( + (input, context) -> new InsertOneModel<>(BsonDocument.parse(String.valueOf(input)))) + .build(); ++ stream.sinkTo(sink); ++ env.execute("Sink click events to MongoDB"); + } +} +``` +**Stream click event source:** +**ClickSource.java** +``` java +package contoso.example; +import org.apache.flink.streaming.api.functions.source.SourceFunction; ++import java.util.Calendar; +import java.util.Random; ++public class ClickSource implements SourceFunction<Event> { + // declare a flag + private Boolean running = true; ++ // declare a flag + public void run(SourceContext<Event> ctx) throws Exception{ + // generate random record + Random random = new Random(); + String[] users = {"Mary","Alice","Bob","Cary"}; + String[] urls = {"./home","./cart","./fav","./prod?id=100","./prod?id=10"}; ++ // loop generate + while (running) { + String user = users[random.nextInt(users.length)]; + String url = urls[random.nextInt(urls.length)]; + Long timestamp = Calendar.getInstance().getTimeInMillis(); + String ts = timestamp.toString(); + ctx.collect(new Event(user,url,ts)); + Thread.sleep(2000); + } + } + @Override + public void cancel() + { + running = false; + } +} +``` ++**Event.java** +``` java +package contoso.example; +import java.sql.Timestamp; ++public class Event { ++ public String user; + public String url; + public String ts; ++ public Event() { + } ++ public Event(String user, String url, String ts) { + this.user = user; + this.url = url; + this.ts = ts; + } ++ @Override + public String toString(){ + return "{" + + "user: \"" + user + "\"" + + ",url: \"" + url + "\"" + + ",ts: " + ts + + "}"; + } +} +``` +### Use MongoDB as a source and sink to ADLS Gen2 ++Write a program for MongoDB as a source and sink to ADLS Gen2 ++**MongoDBSourceDemo.java** +``` java +package contoso.example; ++import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.serialization.SimpleStringEncoder; +import org.apache.flink.api.common.typeinfo.BasicTypeInfo; +import org.apache.flink.api.common.typeinfo.TypeInformation; +import org.apache.flink.configuration.MemorySize; +import org.apache.flink.connector.file.sink.FileSink; +import org.apache.flink.connector.mongodb.source.MongoSource; +import org.apache.flink.connector.mongodb.source.enumerator.splitter.PartitionStrategy; +import org.apache.flink.connector.mongodb.source.reader.deserializer.MongoDeserializationSchema; 
+import org.apache.flink.core.fs.Path; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy; +import org.bson.BsonDocument; ++import java.time.Duration; ++public class MongoDBSourceDemo { + public static void main(String[] args) throws Exception { ++ StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ MongoSource<String> mongoSource = MongoSource.<String>builder() + .setUri("mongodb://10.0.0.7:27017") // update with the correct IP address + .setDatabase("test") + .setCollection("click_events") + .setFetchSize(2048) + .setLimit(10000) + .setNoCursorTimeout(true) + .setPartitionStrategy(PartitionStrategy.SAMPLE) + .setPartitionSize(MemorySize.ofMebiBytes(64)) + .setSamplesPerPartition(10) + .setDeserializationSchema(new MongoDeserializationSchema<String>() { + @Override + public String deserialize(BsonDocument document) { + return document.toJson(); + } ++ @Override + public TypeInformation<String> getProducedType() { + return BasicTypeInfo.STRING_TYPE_INFO; + } + }) + .build(); ++ DataStream stream = env.fromSource(mongoSource, WatermarkStrategy.noWatermarks(), "MongoDB-Source"); + stream.print(); + // 3. sink to gen2, update with your container name and storage path + String outputPath = "abfs://<update-container>@<storage-path>.dfs.core.windows.net/flink/mongo_click_events"; + FileSink<String> gen2 = FileSink + .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8")) + .withRollingPolicy( + DefaultRollingPolicy.builder() + .withRolloverInterval(Duration.ofMinutes(5)) + .withInactivityInterval(Duration.ofMinutes(3)) + .withMaxPartSize(MemorySize.ofMebiBytes(5)) + .build()) + .build(); ++ stream.sinkTo(gen2); ++ env.execute("MongoDB as a Source Sink to Gen2"); + } +} +``` +### Package the maven jar, and submit to Apache Flink UI ++Package the maven jar, upload it to Storage and then wget it to [Flink CLI](./flink-web-ssh-on-portal-to-flink-sql.md) or directly upload to Flink UI to run. +++**Check Flink UI** +++### Validate results ++**Sink click events to Mongo DB's admin.click_events collection** +``` +test> db.click_events.count() +24 ++test> db.click_events.find() +[ + { + _id: ObjectId("648bc933a68ca7614e1f87a2"), + user: 'Alice', + url: './prod?id=10', + ts: Long("1686882611148") + }, + { + _id: ObjectId("648bc935a68ca7614e1f87a3"), + user: 'Bob', + url: './prod?id=10', + ts: Long("1686882613148") + }, +……. ++``` +**Use Mongo DB's admin.click_events collection as a source, and sink to ADLS Gen2** + |
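The sink in `MongoDBSinkDemo` builds each document by parsing `Event.toString()`. As an alternative sketch (an assumption, not part of the original sample), the `BsonDocument` can be constructed field by field, which avoids any dependency on the exact `toString()` format; it reuses the same builder options and the `Event` class shown earlier.

```java
package contoso.example;

import com.mongodb.client.model.InsertOneModel;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.mongodb.sink.MongoSink;
import org.bson.BsonDocument;
import org.bson.BsonString;

public class ExplicitMongoSinkSketch {
    // Builds a MongoSink that writes each Event as an explicitly constructed BsonDocument.
    public static MongoSink<Event> build(String mongoUri) {
        return MongoSink.<Event>builder()
                .setUri(mongoUri) // for example "mongodb://10.0.0.7:27017"
                .setDatabase("test")
                .setCollection("click_events")
                .setBatchSize(1000)
                .setBatchIntervalMs(1000)
                .setMaxRetries(3)
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .setSerializationSchema((input, context) -> {
                    BsonDocument doc = new BsonDocument();
                    doc.put("user", new BsonString(input.user));
                    doc.put("url", new BsonString(input.url));
                    doc.put("ts", new BsonString(input.ts));
                    return new InsertOneModel<>(doc);
                })
                .build();
    }
}
```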
hdinsight-aks | Fabric Lakehouse Flink Datastream Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/fabric-lakehouse-flink-datastream-api.md | + + Title: Microsoft Fabric with Apache Flink in HDInsight on AKS +description: An introduction to lakehouse on Microsoft Fabric with Apache Flink over HDInsight on AKS ++ Last updated : 08/29/2023++# Connect to OneLake in Microsoft Fabric with HDInsight on AKS cluster for Apache Flink +++This example demonstrates how to use HDInsight on AKS Apache Flink with [Microsoft Fabric](/fabric/get-started/microsoft-fabric-overview). ++[Microsoft Fabric](/fabric/get-started/microsoft-fabric-overview) is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place. +With Fabric, you don't need to piece together different services from multiple vendors. Instead, you can enjoy a highly integrated, end-to-end, and easy-to-use product that is designed to simplify your analytics needs. ++In this example, you learn how to connect to OneLake in Microsoft Fabric with an HDInsight on AKS cluster for Apache Flink. ++## Prerequisites +* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* Create a workspace with at least a Premium capacity license mode on [Power BI](https://app.powerbi.com/) +* [Create a lakehouse](/fabric/data-engineering/tutorial-build-lakehouse) in this workspace ++## Connect to OneLake storage ++### Microsoft Fabric and Lakehouse ++**Lakehouse in Microsoft Fabric** ++[Microsoft Fabric Lakehouse](/fabric/data-engineering/lakehouse-overview) is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. ++> [!Note] +> [Microsoft Fabric](/fabric/get-started/microsoft-fabric-overview) is in [preview](/fabric/get-started/preview) ++#### Managed identity access to the Fabric workspace ++In this step, we give the *user-assigned managed identity* access to Fabric. Enter the name of the *user-assigned managed identity* and add it to your Fabric workspace. ++ :::image type="content" source="./media/fabric-lakehouse-flink-datastream-api/managed-identity-access-fabric.png" alt-text="Screenshot showing how to provide access to the user managed identity to Fabric." border="true" lightbox="./media/fabric-lakehouse-flink-datastream-api/managed-identity-access-fabric.png"::: ++#### Prepare a Delta table under the lakehouse Files folder ++In this step, we prepare a Delta table under the lakehouse Files folder on Microsoft Fabric; with this setup, Flink developers can build into a broader lakehouse architecture. +++### Apache Flink DataStream Source code ++In this step, we prepare the jar to submit to the HDInsight on AKS Apache Flink cluster. 
++This step illustrates, that we package dependencies needed for onelakeDemo ++**maven pom.xml** +``` xml + <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + </properties> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-core</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-avro --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-avro</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table-api-java-bridge --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-api-java-bridge</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-files --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table</artifactId> + <version>${flink.version}</version> + <type>pom</type> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-parquet</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.parquet</groupId> + <artifactId>parquet-avro</artifactId> + <version>1.12.2</version> + </dependency> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-core</artifactId> + <version>1.2.1</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++**main: onelakeDemo** ++In this step, we read parquet file on Fabric lakehouse and then sink to another file in the same folder: ++``` java +package contoso.example; ++import org.apache.avro.Schema; +import org.apache.avro.generic.GenericRecord; +import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.connector.file.src.FileSource; +import org.apache.flink.core.fs.Path; ++import org.apache.flink.formats.parquet.avro.AvroParquetReaders; +import 
org.apache.flink.formats.parquet.avro.ParquetAvroWriters; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.streaming.api.functions.sink.SinkFunction; +import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink; ++public class onelakeDemo { + public static void main(String[] args) throws Exception { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ Path sourcePath = new Path("abfss://contosoworkspace1@msit-onelake.dfs.fabric.microsoft.com/contosolakehouse.Lakehouse/Files/delta/tab1/"); + Path sinkPath = new Path("abfss://contosoworkspace1@msit-onelake.dfs.fabric.microsoft.com/contosolakehouse.Lakehouse/Files/delta/tab1_out/"); ++ Schema avroSchema = new Schema.Parser() + .parse("{\"type\":\"record\",\"name\":\"example\",\"fields\":[{\"name\":\"Date\",\"type\":\"string\"},{\"name\":\"Time\",\"type\":\"string\"},{\"name\":\"TargetTemp\",\"type\":\"string\"},{\"name\":\"ActualTemp\",\"type\":\"string\"},{\"name\":\"System\",\"type\":\"string\"},{\"name\":\"SystemAge\",\"type\":\"string\"},{\"name\":\"BuildingID\",\"type\":\"string\"}]}"); ++ FileSource<GenericRecord> source = + FileSource.forRecordStreamFormat( + AvroParquetReaders.forGenericRecord(avroSchema), sourcePath) + .build(); ++ StreamingFileSink<GenericRecord> sink = + StreamingFileSink.forBulkFormat( + sinkPath, + ParquetAvroWriters.forGenericRecord(avroSchema)) + .build(); ++ env.enableCheckpointing(10L); ++ DataStream<GenericRecord> stream = + env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source"); ++ stream.addSink((SinkFunction<GenericRecord>) sink); + env.execute(); + } +} +``` +### Package the jar and submit to Flink ++Here, we use the packaged jar and submit to Flink cluster ++++### Results on Microsoft Fabric ++Let's check the output on Microsoft Fabric ++++### References +* [Microsoft Fabric](/fabric/get-started/microsoft-fabric-overview) +* [Microsoft Fabric Lakehouse](/fabric/data-engineering/lakehouse-overview) |
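For reference, the packaging and submission described above can also be done from the command line. The following is a minimal sketch; the artifact name depends on your own Maven project settings, and the jar path and `-c` entry class shown here are assumptions based on the sample code.

```
# Build the fat jar defined by the pom.xml above (artifact name depends on your artifactId/version).
mvn clean package

# Submit the job to the Flink cluster; adjust the jar path and entry class to match your build output.
bin/flink run -c contoso.example.onelakeDemo target/<your-artifact>-<version>.jar
```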
hdinsight-aks | Flink Catalog Delta Hive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-catalog-delta-hive.md | + + Title: Table API and SQL - Use Delta Catalog type with Hive in HDInsight on AKS - Apache Flink +description: Learn about how to create Apache Flink-Delta Catalog in HDInsight on AKS - Apache Flink ++ Last updated : 08/29/2023+++# Create Apache Flink-Delta Catalog +++[Delta Lake](https://docs.delta.io/latest/delta-intro.html) is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes. ++In this article, we learn how Apache Flink SQL/TableAPI is used to implement a Delta catalog for Apache Flink, with Hive Catalog. Delta Catalog delegates all metastore communication to Hive Catalog. It uses the existing logic for Hive or In-Memory metastore communication that is already implemented in Flink. ++### Prerequisites +- You're required to have an operational Flink cluster with secure shell, learn how to [create a cluster](./flink-create-cluster-portal.md) +- You can refer this article on how to use CLI from [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. ++### Add dependencies ++Once you launch the Secure Shell (SSH), let us start downloading the dependencies required to the SSH node, to illustrate the Delta table managed in Hive catalog. ++ ``` + wget https://repo1.maven.org/maven2/io/delta/delta-standalone_2.12/3.0.0rc1/delta-standalone_2.12-3.0.0rc1.jar -P $FLINK_HOME/lib + wget https://repo1.maven.org/maven2/io/delta/delta-flink/3.0.0rc1/delta-flink-3.0.0rc1.jar -P $FLINK_HOME/lib + wget https://repo1.maven.org/maven2/com/chuusai/shapeless_2.12/2.3.4/shapeless_2.12-2.3.4.jar -P $FLINK_HOME/lib + wget https://repo1.maven.org/maven2/org/apache/flink/flink-parquet/1.16.0/flink-parquet-1.16.0.jar -P $FLINK_HOME/lib + wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-hadoop-bundle/1.12.2/parquet-hadoop-bundle-1.12.2.jar -P $FLINK_HOME/lib + ``` ++### Start the Apache Flink SQL Client +A detailed explanation is given on how to get started with Flink SQL Client using [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. You're required to start the SQL Client as described on the article by running the following command. +``` +./bin/sql-client.sh +``` +#### Create Delta Catalog using Hive catalog ++```sql + CREATE CATALOG delta_catalog WITH ( + 'type' = 'delta-catalog', + 'catalog-type' = 'hive'); +``` +Using the delta catalog ++```sql + USE CATALOG delta_catalog; +``` ++#### Add dependencies to server classpath ++```sql + ADD JAR '/opt/flink-webssh/lib/delta-flink-3.0.0rc1.jar'; + ADD JAR '/opt/flink-webssh/lib/delta-standalone_2.12-3.0.0rc1.jar'; + ADD JAR '/opt/flink-webssh/lib/shapeless_2.12-2.3.4.jar'; + ADD JAR '/opt/flink-webssh/lib/parquet-hadoop-bundle-1.12.2.jar'; + ADD JAR '/opt/flink-webssh/lib/flink-parquet-1.16.0.jar'; +``` +#### Create Table ++We use arrival data of flights from a sample data, you can choose a table of your choice. 
++
```sql
 + CREATE TABLE flightsintervaldata1 (arrivalAirportCandidatesCount INT, estArrivalHour INT) PARTITIONED BY (estArrivalHour) WITH ('connector' = 'delta', 'table-path' = 'abfs://container@storage_account.dfs.core.windows.net/delta-output');
+```
+> [!NOTE]
+> In the above step, the container and storage account *don't need to be the same* as those specified during cluster creation. If you want to use another storage account, update `core-site.xml` with `fs.azure.account.key.<account_name>.dfs.core.windows.net: <azure_storage_key>` by using configuration management.
++#### Insert Data into the Delta Table
++```sql
 + INSERT INTO flightsintervaldata1 SELECT 76, 12;
+```
++> [!IMPORTANT]
+> - The Delta-Flink connector has a known [issue](https://github.com/delta-io/delta/issues/1931) with the String data type: String columns aren't consumed properly by delta-flink, whether used for partitioning or otherwise.
+> - Delta-Flink has a known [issue](https://github.com/delta-io/delta/issues/1971) with viewing the table schema in Trino when the table is registered in Hive metastore (HMS) from Flink. Because of this issue, read and write operations using Trino with the same Flink HMS aren't operational.
++#### Output of the Delta Table
++You can view the Delta Table output on the ABFS container.
 |
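You can also read the rows back from the same Flink SQL client session to confirm the insert. This is a minimal sketch using the table created above:

```sql
SELECT * FROM flightsintervaldata1;
```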
hdinsight-aks | Flink Catalog Iceberg Hive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-catalog-iceberg-hive.md | + + Title: Table API and SQL - Use Iceberg Catalog type with Hive in HDInsight on AKS - Apache Flink +description: Learn how to create Apache Flink-Iceberg Catalog in HDInsight on AKS - Apache Flink ++ Last updated : 08/29/2023+++# Create Apache Flink-Iceberg Catalog +++[Apache Iceberg](https://iceberg.apache.org/) is an open table format for huge analytic datasets. Iceberg adds tables to compute engines like Flink, using a high-performance table format that works just like a SQL table. Apache Iceberg [supports](https://iceberg.apache.org/multi-engine-support/#apache-flink) both Apache FlinkΓÇÖs DataStream API and Table API. ++In this article, we learn how to use Iceberg Table managed in Hive catalog, with HDInsight on AKS - Flink ++## Prerequisites +- You're required to have an operational Flink cluster with secure shell, learn how to [create a cluster](../flink/flink-create-cluster-portal.md) + - Refer this article on how to use CLI from [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. ++### Add dependencies ++Once you launch the Secure Shell (SSH), let us start downloading the dependencies required to the SSH node, to illustrate the Iceberg table managed in Hive catalog. ++ ``` + wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/1.3.0/iceberg-flink-runtime-1.16-1.3.0.jar -P $FLINK_HOME/lib + wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-column/1.12.2/parquet-column-1.12.2.jar -P $FLINK_HOME/lib + ``` ++## Start the Apache Flink SQL Client +A detailed explanation is given on how to get started with Flink SQL Client using [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. You're required to start the SQL Client as described on the article by running the following command. +``` +./bin/sql-client.sh +``` +### Create Iceberg Table managed in Hive catalog ++With the following steps, we illustrate how you can create Flink-Iceberg Catalog using Hive catalog ++```sql + CREATE CATALOG hive_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hive', + 'uri'='thrift://hive-metastore:9083', + 'clients'='5', + 'property-version'='1', + 'warehouse'='abfs://container@storage_account.dfs.core.windows.net/ieberg-output'); +``` +> [!NOTE] +> - In the above step, the container and storage account *need not be same* as specified during the cluster creation. +> - In case you want to specify another storage account, you can update `core-site.xml` with `fs.azure.account.key.<account_name>.dfs.core.windows.net: <azure_storage_key>` using configuration management. ++```sql + USE CATALOG hive_catalog; +``` ++#### Add dependencies to server classpath ++```sql + ADD JAR '/opt/flink-webssh/lib/iceberg-flink-runtime-1.16-1.3.0.jar'; + ADD JAR '/opt/flink-webssh/lib/parquet-column-1.12.2.jar'; +``` +#### Create Database ++```sql + CREATE DATABASE iceberg_db_2; + USE iceberg_db_2; +``` +#### Create Table ++```sql + CREATE TABLE `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2` + ( + id BIGINT COMMENT 'unique id', + data STRING + ) + PARTITIONED BY (data); +``` +#### Insert Data into the Iceberg Table ++```sql + INSERT INTO `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2` VALUES (1, 'a'); +``` ++#### Output of the Iceberg Table ++You can view the Iceberg Table output on the ABFS container + |
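Similarly, a quick read from the Flink SQL client confirms the insert before you check the container. This is a minimal sketch against the table created above:

```sql
SELECT * FROM `hive_catalog`.`iceberg_db_2`.`iceberg_sample_2`;
```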
hdinsight-aks | Flink Cluster Configuration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-cluster-configuration.md | + + Title: Flink cluster configuration - HDInsight on AKS - Apache Flink
+description: Learn how to troubleshoot Flink cluster configuration in HDInsight on AKS - Apache Flink
++ Last updated : 09/26/2023
+++# Troubleshoot Flink cluster configuration
+++Incorrect cluster configuration may lead to deployment errors. Typically, those errors occur when an incorrect configuration is provided in the ARM template or entered in the Azure portal, for example, on the [Configuration management](flink-configuration-management.md) page.
++Example configuration error:
++ :::image type="image" source="./media/flink-cluster-configuration/error.png" alt-text="Screenshot shows error." border="true" lightbox="./media/flink-cluster-configuration/error.png":::
++The following tables provide error codes and their descriptions to help you diagnose and fix common errors.
++## Configuration error
++| Error Code | Description |
+|||
+| FlinkClusterValidator#IdentityValidator | Checks if the task manager (TM) and job manager (JM) process size has the suffix `mb`. |
+| |Checks if the TM and JM process size is less than the configured pod memory. |
+|FlinkClusterValidator#IdentityValidator | Verifies if the pod identity is configured correctly. |
+| FlinkClusterValidator#ClusterSpecValidator | Checks if the configured JM, TM, and history server (HS) pod CPU is within the configurable/allocatable SKU limits. |
+| |Checks if the configured JM, TM, and history server (HS) pod memory is within the configurable/allocatable SKU limits. |
+| FlinkClusterValidator#StorageSpecValidator | Validates that the storage container has an appropriate name. |
+| | Verifies the storage against the supported storage types. |
++## System error
++Some errors can occur because of environment conditions and be transient. These errors have a reason code that starts with the "System" prefix. In such cases, try the following steps:
++1. Collect the following information:
++ - The Azure request CorrelationId. It can be found in the Notifications area, on the Deployments page of the resource group where the cluster is located, or in the `az` command output.
++ - The DeploymentId. It can be found on the cluster Overview page.
++ - The detailed error message.
++1. Contact the support team with this information.
++| Error code | Description |
+|||
+| System.DependencyFailure | Failure in one of the cluster components. |
+++ |
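If you need to look up the Azure request CorrelationId for a deployment, one option is the Azure CLI. This is a minimal sketch; the resource group and deployment names are assumptions you need to replace with your own.

```
# List recent deployments in the resource group that contains the cluster (resource group name is an assumption).
az deployment group list --resource-group my-hdinsight-rg --query "[].{name:name, state:properties.provisioningState}" -o table

# Show the correlation ID for a specific deployment (deployment name is an assumption).
az deployment group show --resource-group my-hdinsight-rg --name my-flink-cluster-deployment --query properties.correlationId -o tsv
```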
hdinsight-aks | Flink Configuration Management | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-configuration-management.md | + + Title: Apache Flink Configuration Management in HDInsight on AKS
+description: Learn about Apache Flink Configuration Management in HDInsight on AKS
++ Last updated : 08/29/2023
+++# Apache Flink configuration management
+++HDInsight on AKS provides a set of default Apache Flink configurations for most properties, and a few based on common application profiles. However, if you need to tweak Flink configuration properties to improve performance for certain applications (for example, state usage, parallelism, or memory settings), you can change certain properties at the cluster level by using the **Configuration management** section of your HDInsight on AKS Flink cluster.
++1. Go to the **Configuration Management** section on your Apache Flink cluster page.
++ :::image type="content" source="./media/flink-configuration-management/configuration-page-revised.png" alt-text="Screenshot showing Apache Flink Configuration Management page." lightbox="./media/flink-configuration-management/configuration-page-revised.png":::
++2. Update the **configurations** as required at the *cluster level*.
++ :::image type="content" source="./media/flink-configuration-management/update-configuration-revised.png" alt-text="Screenshot showing Apache Flink Update configuration page." lightbox="./media/flink-configuration-management/update-configuration-revised.png":::
+
Here, the checkpoint interval is changed at the *cluster level*.
++3. Save the changes by selecting **OK** and then **Save**.
++Once saved, the new configurations are applied within a few minutes (~5 minutes).
++The following configurations can be updated by using the Configuration Management settings.
++**Process memory size**
++By default, the process memory size of the job manager and task manager is the memory configured by the user during cluster creation.
++This size can be configured by using the following configuration properties. To change the task manager process memory, use this configuration:
++`taskmanager.memory.process.size : <value>`
++Example:
+`taskmanager.memory.process.size : 2000mb`
++For the job manager:
++`jobmanager.memory.process.size : <value>`
++> [!NOTE]
+> The maximum configurable process memory is equal to the memory configured for `jobmanager/taskmanager`.
++## Checkpoint Interval
++The checkpoint interval determines how often Flink triggers a checkpoint. It's defined in milliseconds and can be set using the following configuration property:
++`execution.checkpoint.interval: <value>`
++The default setting is 60,000 milliseconds (1 minute); this value can be changed as desired.
++## State Backend
++The state backend determines how Flink manages and persists the state of your application. It impacts how checkpoints are stored. You can configure the state backend using the following property:
++`state.backend: <value>`
++By default, HDInsight on AKS Flink uses RocksDB.
++## Checkpoint Storage Path
++We allow persistent checkpoints by default by storing the checkpoints in the `abfs` storage configured by the user. Because the checkpoints are persisted, even if the job fails it can easily be restarted from the latest checkpoint.
++`state.checkpoints.dir: <path>`
+Replace `<path>` with the desired path where the checkpoints are stored.
++By default, checkpoints are stored in the storage account (ABFS) configured by the user. This value can be changed to any desired path as long as the Flink pods can access it. 
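Putting the properties discussed so far together, a cluster-level configuration update might look like the following sketch. The keys are the ones described above; the values are illustrative only, not recommendations.

```
taskmanager.memory.process.size : 2000mb
jobmanager.memory.process.size : 2000mb
execution.checkpoint.interval: 60000
state.backend: rocksdb
state.checkpoints.dir: abfs://<container>@<account>.dfs.core.windows.net/flink/checkpoints
```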
++## Maximum Concurrent Checkpoints
++You can limit the maximum number of concurrent checkpoints by setting the following property:
+`checkpoint.max-concurrent-checkpoints: <value>`
++Replace `<value>` with the desired maximum number of concurrent checkpoints. For example, set it to 1 to allow only one checkpoint at a time.
+
++## Maximum retained checkpoints
++You can limit the maximum number of checkpoints to retain by setting the following property:
+`state.checkpoints.num-retained: <value>`
+Replace `<value>` with the desired maximum number. By default, a maximum of five checkpoints are retained.
++## Savepoint Storage Path
++We allow persistent savepoints by default by storing the savepoints in the `abfs` storage configured by the user. If you want to stop a job and later start it from a particular savepoint, you can configure this location:
+`state.savepoints.dir: <path>`
+Replace `<path>` with the desired path where the savepoints are stored.
+By default, savepoints are stored in the storage account configured by the user (ABFS is supported). This value can be changed to any desired path as long as the Flink pods can access it.
++## Job manager high availability
++In HDInsight on AKS, Flink uses Kubernetes as the backend. Even if the job manager fails because of a known or unknown issue, the pod is restarted within a few seconds. Hence, even if the job restarts because of this issue, it is recovered from the **latest checkpoint**.
++### FAQ
++**Why does the job fail in between?**
++Even if jobs fail abruptly, as long as checkpoints are taken continuously, the job is restarted by default from the latest checkpoint.
++**How do I change the job strategy in between?**
++There are use cases where the job needs to be modified while in production because of a job-level bug. In that case, the user can stop the job, which automatically takes a savepoint and saves it in the savepoint location.
++`bin/flink stop <JOBID>`
++Example:
++```
+root [ ~ ]# ./bin/flink stop 60bdf21d9bc3bc65d63bc3d8fc6d5c54
+Suspending job "60bdf21d9bc3bc65d63bc3d8fc6d5c54" with a CANONICAL savepoint.
+Savepoint completed. Path: abfs://flink061920231244@f061920231244st.dfs.core.windows.net/8255a11812144c28b4ddf1068460c96b/savepoints/savepoint-60bdf2-7717485d15e3
+```
++Later, the user can start the job with the bug fix, pointing to the savepoint.
++```
+./bin/flink run <JOB_JAR> -d <SAVEPOINT_LOC>
+root [ ~ ]# ./bin/flink run examples/streaming/StateMachineExample.jar -s abfs://flink061920231244@f061920231244st.dfs.core.windows.net/8255a11812144c28b4ddf1068460c96b/savepoints/savepoint-60bdf2-7717485d15e3
+```
+Usage with built-in data generator: `StateMachineExample [--error-rate <probability-of-invalid-transition>] [--sleep <sleep-per-record-in-ms>]`
++Usage with Kafka: `StateMachineExample --kafka-topic <topic> [--brokers <brokers>]`
++Because the savepoint is provided with the job, Flink knows where to start processing the data from.
+++### Reference
+[Apache Flink Configurations](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/) |
hdinsight-aks | Flink Create Cluster Portal | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-create-cluster-portal.md | + + Title: Create an Apache Flink cluster - Azure portal +description: Creating an Apache Flink cluster in HDInsight on AKS in the Azure portal. ++ Last updated : 08/29/2023+++# Create an Apache Flink cluster in the Azure portal +++Complete the following steps to create an Apache Flink cluster by using the Azure portal. ++## Prerequisites ++Complete the prerequisites in the following sections: +* [Resource prerequisites](../prerequisites-resources.md) +* [Create a cluster pool](../quickstart-create-cluster.md#create-a-cluster-pool) ++> [!IMPORTANT] +> * For creating a cluster in new cluster pool, assign AKS agentpool MSI "Managed Identity Operator" role on the user-assigned managed identity created as part of resource prerequisite. In case you have required permissions, this step is automated during creation. +> * AKS agentpool managed identity gets created during cluster pool creation. You can identify the AKS agentpool managed identity by **(your clusterpool name)-agentpool**. Follow these steps to [assign the role](../../role-based-access-control/role-assignments-portal.md#step-2-open-the-add-role-assignment-page). ++## Create an Apache Flink cluster ++Flink clusters can be created once cluster pool deployment has been completed, let us go over the steps in case you're getting started with an existing cluster pool ++1. In the Azure portal, type *HDInsight cluster pools/HDInsight/HDInsight on AKS* and select Azure HDInsight on AKS cluster pools to go to the cluster pools page. On the HDInsight on AKS cluster pools page, select the cluster pool in which you want to create a new Flink cluster. + + :::image type="content" source="./media/create-flink-cluster/search-bar.png" alt-text="Diagram showing search bar in Azure portal."::: ++1. On the specific cluster pool page, click [**+ New cluster**](../quickstart-create-cluster.md) and provide the following information: ++ | Property| Description| + ||| + |Subscription | This field is autopopulated with the Azure subscription that was registered for the Cluster Pool.| + |Resource Group|This field is autopopulated and shows the resource group on the cluster pool.| + |Region|This field is autopopulated and shows the region selected on the cluster pool.| + |Cluster Pool|This field is autopopulated and shows the cluster pool name on which the cluster is now getting created.To create a cluster in a different pool, find that cluster pool in the portal and click **+ New cluster**.| + |HDInsight on AKS Pool Version|This field is autopopulated and shows the cluster pool version on which the cluster is now getting created.| + |HDInsight on AKS Version | Select the minor or patch version of the HDInsight on AKS of the new cluster.| + |Cluster type | From the drop-down list, select Flink.| + |Cluster name|Enter the name of the new cluster.| + |User-assigned managed identity | From the drop-down list, select the managed identity to use with the cluster. If you're the owner of the Managed Service Identity (MSI), and the MSI doesn't have Managed Identity Operator role on the cluster, click the link below the box to assign the permission needed from the AKS agent pool MSI. If the MSI already has the correct permissions, no link is shown. 
See the [Prerequisites](#prerequisites) for other role assignments required for the MSI.| + |Storage account|From the drop-down list, select the storage account to associate with the Flink cluster and specify the container name. The managed identity is further granted access to the specified storage account, using the 'Storage Blob Data Owner' role during cluster creation.| + |Virtual network | The virtual network for the cluster.| + |Subnet|The virtual subnet for the cluster.| ++1. Enabling **Hive catalog** for Flink SQL. ++ |Property| Description| + ||| + |Use Hive catalog|Enable this option to use an external Hive metastore. | + |SQL Database for Hive|From the drop-down list, select the SQL Database in which to add hive-metastore tables.| + |SQL admin username|Enter the SQL server admin username. This account is used by metastore to communicate to SQL database.| + |Key vault|From the drop-down list, select the Key Vault, which contains a secret with password for SQL server admin username. You are required to set up an access policy with all required permissions such as key permissions, secret permissions and certificate permissions to the MSI, which is being used for the cluster creation. The MSI needs a Key Vault Administrator role, add the required permissions using IAM.| + |SQL password secret name|Enter the secret name from the Key Vault where the SQL database password is stored.| ++ :::image type="content" source="./media/create-flink-cluster/flink-basics-page.png" alt-text="Screenshot showing basic tab."::: + > [!NOTE] + > By default we use the **Storage account** for Hive catalog same as the storage account and container used during cluster creation. ++1. Select **Next: Configuration** to continue. ++1. On the **Configuration** page, provide the following information: ++ |Property|Description| + ||| + |Node size|Select the node size to use for the Flink nodes both head and worker nodes.| + |Number of nodes|Select the number of nodes for Flink cluster; by default head nodes are two. The worker nodes sizing helps determine the task manager configurations for the Flink. The job manager and history server are on head nodes.| ++1. On the **Service Configuration** section, provide the following information: ++ |Property|Description| + ||| + |Task manager CPU|Integer. Enter the size of the Task manager CPUs (in cores).| + |Task manager memory in MB|Enter the Task manager memory size in MB. Min of 1800 MB.| + |Job manager CPU|Integer. Enter the number of CPUs for the Job manager (in cores).| + |Job manager memory in MB | Enter the memory size in MB. Minimum of 1800 MB.| + |History server CPU|Integer. Enter the number of CPUs for the Job manager (in cores).| + |History server memory in MB | Enter the memory size in MB. Minimum of 1800 MB.| ++ :::image type="content" source="./media/create-flink-cluster/flink-configuration-page.png" alt-text="screenshot showing configurations tab."::: ++ > [!NOTE] + > * History server can be enabled/disabled as required. + > * Schedule based autoscale is supported in Flink. You can schedule number of worker nodes as required. For example, it is enabled a schedule based autoscale with default worker node count as 3. And during weekdays from 9:00 UTC to 20:00 UTC, the worker nodes are scheduled to be 10. Later in the day, it needs to be defaulted to 3 nodes ( between 20:00 UTC to next day 09:00 UTC ). During weekends from 9:00 UTC to 20:00 UTC, worker nodes are 4. ++1. 
On the **Auto Scale & SSH** section, update the following: ++ |Property|Description| + ||| + |Auto Scale|Upon selection, you would be able to choose the schedule based autoscale to configure the schedule for scaling operations.| + |Enable SSH|Upon selection, you can opt for total number of SSH nodes required, which are the access points for the Flink CLI using Secure Shell. The maximum SSH nodes allowed is 5.| + + :::image type="content" source="./media/create-flink-cluster/service-configuration.png" alt-text="Screenshot showing autoscale service configuration."::: + + :::image type="content" source="./media/create-flink-cluster/autoscale-rules.png" alt-text="Screenshot showing auto scale rules."::: +1. Click the **Next: Integration** button to continue to the next page. ++1. On the **Integration** page, provide the following information: ++ |Property|Description| + ||| + |Log analytics| This feature is available only if the cluster pool has associated log analytics workspace, once enabled the logs to collect can be selected.| + |Azure Prometheus | This feature is to view Insights and Logs directly in your cluster by sending metrics and logs to Azure Monitor workspace.| ++ :::image type="content" source="./media/create-flink-cluster/flink-integrations-page.png" alt-text="screenshot showing integrations tab."::: + +1. Click the **Next: Tags** button to continue to the next page. + +1. On the **Tags** page, provide the following information: ++ | Property | Description| + ||| + |Name | Optional. Enter a name such as HDInsight on AKS to easily identify all resources associated with your cluster resources.| + | Value | You can leave this blank.| + | Resource | Select All resources selected.| ++1. Select **Next: Review + create** to continue. + +1. On the **Review + create** page, look for the **Validation succeeded** message at the top of the page and then click **Create**. ++The **Deployment is in process** page is displayed which the cluster is created. It takes 5-10 minutes to create the cluster. Once the cluster is created, the **"Your deployment is complete"** message is displayed. If you navigate away from the page, you can check your Notifications for the current status. |
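If you need to assign the "Managed Identity Operator" role to the AKS agent pool MSI yourself, as called out in the prerequisites above, the Azure CLI is one option. This is a minimal sketch; the principal ID and the resource ID of the user-assigned managed identity are placeholders you need to supply.

```
# Assign the Managed Identity Operator role to the AKS agentpool MSI ("<clusterpool-name>-agentpool")
# on the user-assigned managed identity created as part of the resource prerequisites.
az role assignment create \
  --assignee <agentpool-msi-object-id> \
  --role "Managed Identity Operator" \
  --scope <user-assigned-managed-identity-resource-id>
```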
hdinsight-aks | Flink How To Setup Event Hub | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-how-to-setup-event-hub.md | + + Title: How to connect HDInsight on AKS Flink with Azure Event Hubs for Apache Kafka® +description: Learn how to connect HDInsight on AKS Flink with Azure Event Hubs for Apache Kafka® ++ Last updated : 08/29/2023+++# Connect HDInsight on AKS Flink with Azure Event Hubs for Apache Kafka® +++A well known use case for Apache Flink is stream analytics. The popular choice by many users to use the data streams, which are ingested using Apache Kafka. Typical installations of Flink and Kafka start with event streams being pushed to Kafka, which can be consumed by Flink jobs. Azure Event Hubs provides an Apache Kafka endpoint on an event hub, which enables users to connect to the event hub using the Kafka protocol. ++In this article, we explore how to connect [Azure Event Hubs](/azure/event-hubs/event-hubs-about) with [HDInsight on AKS Flink](./flink-overview.md) and cover the following ++> [!div class="checklist"] +> * Create an Event Hubs namespace +> * Create a HDInsight on AKS Cluster with Apache Flink +> * Run Flink producer +> * Package Jar for Apache Flink +> * Job Submission & Validation ++## Create Event Hubs namespace and Event Hubs ++1. To create Event Hubs namespace and Event Hubs, see [here](/azure/event-hubs/event-hubs-quickstart-kafka-enabled-event-hubs?tabs=connection-string) ++ :::image type="content" source="./media/flink-eventhub/flink-setup-event-hub.png" alt-text="Screenshot showing Event Hubs setup." border="true" lightbox="./media/flink-eventhub/flink-setup-event-hub.png"::: ++## Set up Flink Cluster on HDInsight on AKS ++1. Using existing HDInsight on AKS Cluster pool you can create a [Flink cluster](./flink-create-cluster-portal.md) ++1. Run the Flink producer adding the **bootstrap.servers** and the `producer.config` info ++ ``` + bootstrap.servers={YOUR.EVENTHUBS.FQDN}:9093 + client.id=FlinkExampleProducer + sasl.mechanism=PLAIN + security.protocol=SASL_SSL + sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \ + username="$ConnectionString" \ + password="{YOUR.EVENTHUBS.CONNECTION.STRING}"; + ``` + +1. Replace `{YOUR.EVENTHUBS.CONNECTION.STRING}` with the connection string for your Event Hubs namespace. For instructions on getting the connection string, see details on how to [get an Event Hubs connection string](/azure/event-hubs/event-hubs-get-connection-string). ++ For example, + ``` + sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" + password="Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=XXXXXXXXXXXXXXXX"; + ``` +## Packaging the JAR for Flink +1. Package com.example.app; + + ``` + import org.apache.flink.api.common.functions.MapFunction; + import org.apache.flink.api.common.serialization.SimpleStringSchema; + import org.apache.flink.streaming.api.datastream.DataStream; + import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; + import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer; //v0.11.0.0 + import java.io.FileNotFoundException; + import java.io.FileReader; + import java.util.Properties; ++ public class FlinkTestProducer { ++ private static final String TOPIC = "test"; + private static final String FILE_PATH = "src/main/resources/producer.config"; ++ public static void main(String... 
args) { + try { + Properties properties = new Properties(); + properties.load(new FileReader(FILE_PATH)); ++ final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + DataStream stream = createStream(env); + FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<>( + TOPIC, + new SimpleStringSchema(), // serialization schema + properties); ++ stream.addSink(myProducer); + env.execute("Testing flink print"); ++ } catch(FileNotFoundException e){ + System.out.println("FileNotFoundException: " + e); + } catch (Exception e) { + System.out.println("Failed with exception:: " + e); + } + } ++ public static DataStream createStream(StreamExecutionEnvironment env){ + return env.generateSequence(0, 200) + .map(new MapFunction<Long, String>() { + @Override + public String map(Long in) { + return "FLINK PRODUCE " + in; + } + }); + } + } + ``` + +1. Add the snippet to run the Flink Producer. ++ :::image type="content" source="./media/flink-eventhub/testing-flink.png" alt-text="Screenshot showing how to test Flink in Event Hubs." border="true" lightbox="./media/flink-eventhub/testing-flink.png"::: ++1. Once the code is executed, the events are stored in the topic **“TEST”** ++ :::image type="content" source="./media/flink-eventhub/events-stored-in-topic.png" alt-text="Screenshot showing Event Hubs stored in topic." border="true" lightbox="./media/flink-eventhub/events-stored-in-topic.png"::: |
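Once the producer shown above is packaged into a jar, submitting it to the cluster can look like the following minimal sketch. The jar name is an assumption, the entry class follows the sample code, and `producer.config` must be reachable at the path the code expects.

```
# Submit the packaged producer to the Flink cluster (jar name is an assumption).
bin/flink run -c com.example.app.FlinkTestProducer FlinkTestProducer.jar
```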
hdinsight-aks | Flink Job Management | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-job-management.md | + + Title: Flink job management in HDInsight on AKS +description: HDInsight on AKS provides a feature to manage and submit Apache Flink jobs directly through the Azure portal ++ Last updated : 09/07/2023+++# Flink job management +++HDInsight on AKS provides a feature to manage and submit Apache Flink jobs directly through the Azure portal (user-friendly interface) and ARM Rest APIs. ++This feature empowers users to efficiently control and monitor their Flink jobs without requiring deep cluster-level knowledge. ++## Benefits ++- **Simplified job management**: With the native integration of Apache Flink in the Azure portal, users no longer require extensive knowledge of Flink clusters to submit, manage, and monitor jobs. ++- **User-Friendly REST API**: HDInsight on AKS provides user friendly ARM Rest APIs to submit and manage Flink jobs. Users can submit Flink jobs from any Azure service using these Rest APIs. ++- **Effortless job updates and state management**: The native Azure portal integration provides a hassle-free experience for updating jobs and restoring them to their last saved state (savepoint). This functionality ensures continuity and data integrity throughout the job lifecycle. ++- **Automating Flink job using Azure pipeline**: Using HDInsight on AKS, Flink users have access to user-friendly ARM Rest API, you can seamlessly integrate Flink job operations into your Azure Pipeline. Whether you're launching new jobs, updating running jobs, or performing various job operations, this streamlined approach eliminates manual steps. It empowers you to manage your Flink cluster efficiently. ++## Prerequisites ++There are some prerequisites before submitting and managing jobs from portal or Rest APIs. ++- Create a directory in the primary storage account of the cluster to upload the job jar. ++- If the user wants to take savepoints, then create a directory in the storage account for job savepoints. ++ :::image type="image" source="./media/flink-job-management/create-directory.png" alt-text="Screenshot shows directory structure." border="true" lightbox="./media/flink-job-management/create-directory.png"::: +++## Key features and operations ++- **New job submission**: Users can effortlessly submit a new Flink, eliminating the need for complex configurations or external tools. ++- **Stop and start jobs with savepoints**: Users can gracefully stop and start their Flink jobs from their previous state (Savepoint). Savepoints ensure that job progress is preserved, enabling seamless resumptions. ++- **Job updates**: User can update the running job after updating the jar on storage account. This update automatically take the savepoint and start the job with a new jar. ++- **Stateless updates**: Performing a fresh restart for a job is simplified through stateless updates. This feature allows users to initiate a clean restart using updated job jar. ++- **Savepoint management**: At any given moment, users can create savepoints for their running jobs. These savepoints can be listed and used to restart the job from a specific checkpoint as needed. ++- **Cancel**: This cancels the job permanently. ++- **Delete**: Delete job history record. ++## Options to manage jobs in HDInsight on AKS ++HDInsight on AKS provides ways to manage Flink jobs. 
++- [Azure portal](#azure-portal) +- [ARM Rest API](#arm-rest-api) ++### <a id="azure-portal">Job Management from Azure portal</a> ++To run the Flink job from portal go to: ++Portal --> HDInsight on AKS Cluster Pool --> Flink Cluster --> Settings --> Flink Jobs +++- **New job:** To submit a new job, upload the job jars to the storage account and create a savepoint directory. Complete the template with the necessary configurations and then submit the job. ++ :::image type="image" source="./media/flink-job-management/create-new-job.png" alt-text="Screenshot shows how to create new job." border="true" lightbox="./media/flink-job-management/create-new-job.png"::: ++ **Property details:** ++ | Property | Description | Default Value | Mandatory | + | -- | - | -- | - | + | Job name | Unique name for job. This is displayed on portal. Job name should be in small latter. | | Yes | + | Jar path | Storage path for job jar. Users should create directory in cluster storage and upload job jar.| Yes + | Entry class | Entry class for job from which job execution starts. | | Yes | + | Args | Argument for main program of job. Separate all arguments with spaces. | | No | + | parallelism | Job Flink Parallelism. | 2 | Yes | + | savepoint.directory | Savepoint directory for job. It is recommended that users should create a new directory for job savepoint in storage account. | `abfs://<container>@<account>/<deployment-ID>/savepoints` | No | ++ Once the job is launched, the job status on the portal is **RUNNING**. ++- **Stop:** Stop job did not require any parameter, user can stop the job by selecting the action. ++ :::image type="image" source="./media/flink-job-management/stop-job.png" alt-text="Screenshot shows how user can stop job." border="true" lightbox="./media/flink-job-management/stop-job.png"::: ++ Once the job is stopped, the job status on the portal is **STOPPED**. ++- **Start:** This action starts the job from savepoint. To start the job, select the stopped job and start it. ++ :::image type="image" source="./media/flink-job-management/start-job-savepoint.png" alt-text="Screenshot shows how user start job." border="true" lightbox="./media/flink-job-management/start-job-savepoint.png"::: ++ Fill the flow template with the required options and start it. Users need to select the savepoint from which user wants to start the job. By default, it takes the last successful savepoint. ++ :::image type="image" source="./media/flink-job-management/fill-flow-template.png" alt-text="Screenshot shows how fill flow template." border="true" lightbox="./media/flink-job-management/fill-flow-template.png"::: ++ **Property details**: ++ | Property | Description | Default Value | Mandatory | + | -- | - | -- | - | + | Args | Argument for main program of job. All arguments should be separated by space. | | No | + | Last savepoint | Last successful savepoint take before stopping job. This will used by default if not savepoint is selected. | | Not Editable | + | Save point name | Users can list the available savepoint for job and select one to start the job. | | No | ++ Once the job is started, the job status on the portal will be **RUNNING**. ++- **Update:** Update helps to restart jobs with updated job code. Users need to update the latest job jar in storage location and update the job from portal. This update stops the job with savepoint and starts again with latest jar. 
++ :::image type="image" source="./media/flink-job-management/restart-job-with-updated-code.png" alt-text="Screenshot shows how restart jobs with updated job code." border="true" lightbox="./media/flink-job-management/restart-job-with-updated-code.png"::: ++ Template for updating job. + + :::image type="image" source="./media/flink-job-management/template-for-updating-job.png" alt-text="Screenshot shows template for updating job." border="true" lightbox="./media/flink-job-management/template-for-updating-job.png"::: ++ Once the job is updated, the job status on the portal is "RUNNING." ++- **Stateless update:** This job is like an update, but it involves a fresh restart of the job with the latest code. ++ :::image type="image" source="./media/flink-job-management/stateless-update.png" alt-text="Screenshot shows fresh restart of the job with the latest code." border="true" lightbox="./media/flink-job-management/stateless-update.png"::: ++ Template for updating job. ++ :::image type="image" source="./media/flink-job-management/template-for-updating-stateless-job.png" alt-text="Screenshot shows template for updating stateless job." border="true" lightbox="./media/flink-job-management/template-for-updating-stateless-job.png"::: ++ **Property details**: + + | Property | Description | Default Value | Mandatory | + | -- | - | -- | - | + | Args | Argument for main program of job. Separate all arguments with space. | | No | ++ Once the job is updated, the job status on the portal is RUNNING. + +- **Savepoint:** Take the savepoint for the Flink Job. ++ :::image type="image" source="./media/flink-job-management/savepoint-flink-job.png" alt-text="Screenshot shows savepoint for the Flink Job." border="true" lightbox="./media/flink-job-management/savepoint-flink-job.png"::: ++ Savepoint is time consuming process, and it takes some time. You can see job action status as in-progress. ++ :::image type="image" source="./media/flink-job-management/job-action-status.png" alt-text="Screenshot shows job action status." border="true" lightbox="./media/flink-job-management/job-action-status.png"::: ++- **Cancel:** This job helps user to terminate the job. ++ :::image type="image" source="./media/flink-job-management/terminate-job.png" alt-text="Screenshot shows how user can terminate the job." border="true" lightbox="./media/flink-job-management/terminate-job.png"::: ++- **Delete:** Delete job data from portal. ++ :::image type="image" source="./media/flink-job-management/delete-job-data.png" alt-text="Screenshot shows how user can delete job data from portal." border="true" lightbox="./media/flink-job-management/delete-job-data.png"::: ++- **View Job details:** To view the job detail user can click on job name, it gives the details about the job and last action result. ++ :::image type="image" source="./media/flink-job-management/view-job-details.png" alt-text="Screenshot shows how to view job details." border="true" lightbox="./media/flink-job-management/view-job-details.png"::: ++ For any failed action, this job json give detailed exceptions and reasons for failure. ++### <a id="arm-rest-api">Job Management Using Rest API</a> ++HDInsight on AKS - Flink supports user friendly ARM Rest APIs to submit job and manage job. Using this Flink REST API, you can seamlessly integrate Flink job operations into your Azure Pipeline. 
Whether you're launching new jobs, updating running jobs, or performing various job operations, this streamlined approach eliminates manual steps and empowers you to manage your Flink cluster efficiently. ++#### Base URL format for Rest API ++See following URL for rest API, users need to replace subscription, resource group, cluster pool, cluster name and HDInsight on AKS API version in this before using it. + `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runjob?api-version={{API_VERSION}}` ++Using this REST API, users can initiate new jobs, stop jobs, start jobs, create savepoints, cancel jobs, and delete jobs. The current API_VERSION is 2023-06-01-preview. ++#### Rest API Authentication ++To authenticate Flink ARM Rest API users, need to get the bearer token or access token for ARM resource. To authenticate Azure ARM (Azure Resource Manager) REST API using a service principal, you can follow these general steps: ++- Create a Service Principal. ++ `az ad sp create-for-rbac --name <your-SP-name>` + +- Give owner permission to SP for `flink` cluster. ++- Login with service principal. ++ `az login --service-principal -u <client_id> -p <client_secret> --tenant <tenant_id>` ++- Get access token. ++ `$token = az account get-access-token --resource=https://management.azure.com/ | ConvertFrom-Json` ++ `$tok = $token.accesstoken` ++ Users can use token in URL shown. ++ `$data = Invoke-RestMethod -Uri $restUri -Method GET -Headers @{ Authorization = "Bearer $tok" }` + +**Authentication using Managed Identity:** Users can utilize resources that support Managed Identity to make calls to the Job REST API. For more details, please refer to the [Managed Identity](../../active-directory/managed-identities-azure-resources/tutorial-linux-vm-access-arm.md) documentation. + +#### LIST of APIs and Parameters ++- **New Job:** Rest API to submit new job to Flink. + + | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | + + *Request Body:* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "NEW", + "jobJarDirectory": "<JOB_JAR_STORAGE_PATH>", + "jarName": "<JOB_JAR_NAME>", + "entryClass": "<JOB_ENTRY_CLASS>", + ΓÇ£argsΓÇ¥: ΓÇ¥<JOB_JVM_ARGUMENT>ΓÇ¥ + "flinkConfiguration": { + "parallelism": "<JOB_PARALLELISM>", + "savepoint.directory": "<JOB_SAVEPOINT_DIRECTORY_STORAGE_PATH>" + } + } + } + ``` + **Property details for JSON body:** + + | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job.It should be ΓÇ£FlinkJobΓÇ¥ | | Yes| + | jobName | Unique name for job. This is displayed on portal. Job name should be in small latter.| | Yes | + | action | It indicates operation type on job. It should be ΓÇ£NEWΓÇ¥ always for new job launch. | | Yes | + | jobJarDirectory | Storage path for job jar directory. Users should create directory in cluster storage and upload job jar.| Yes | + | jarName | Name of job jar. | | Yes | + |entryClass | Entry class for job from which job execution starts. | | Yes | + | args | Argument for main program of job. Separate arguments with space. 
| | No | + | parallelism | Job Flink Parallelism. | 2 | Yes | + | savepoint.directory | Savepoint directory for job. It is recommended that users should create a new directory for job savepoint in storage account. | `abfs://<container>@<account>/<deployment-ID>/savepoints`| No | ++ Example: + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Stop job:** Rest API for stopping current running job. ++ | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "STOP" + } + } + ``` ++ Property details for JSON body: ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | Yes | + | jobName | Job Name, which is used for launching the job | Yes | + | action | It should be ΓÇ£STOPΓÇ¥ | Yes | ++ Example: + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Start job:** Rest API to start STOPPED job. + + | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | +++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "START", + "savePointName": "<SAVEPOINT_NAME>" + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | Yes | + | jobName | Job Name that is used for launching the job. | Yes | + | action | It should be ΓÇ£STARTΓÇ¥ | Yes | + | savePointName | Save point name to start the job. It is optional property, by default start operation take last successful savepoint. | No | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Update job:** Rest API for updating current running job. ++ | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "UPDATE", + ΓÇ£argsΓÇ¥ : ΓÇ£<JOB_JVM_ARGUMENT>ΓÇ¥, + "savePointName": "<SAVEPOINT_NAME>" + } + } ++ ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | | Yes | + | jobName | Job Name that is used for launching the job. | | Yes | + | action | It should be ΓÇ£UPDATEΓÇ¥ always for new job launch. 
| | Yes | + | args | Job JVM arguments | | No | + | savePointName | Save point name to start the job. It is optional property, by default start operation will take last successful savepoint.| | No | ++ Example: + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Stateless update job:** Rest API for stateless update. + + | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "STATELESS_UPDATE", + ΓÇ£argsΓÇ¥ : ΓÇ£<JOB_JVM_ARGUMENT>ΓÇ¥ + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | | Yes | + | jobName | Job Name that is used for launching the job. | | Yes | + | action | It should be ΓÇ£STATELESS_UPDATEΓÇ¥ always for new job launch. | | Yes | + | args | Job JVM arguments | | No | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Savepoint:** Rest API to trigger savepoint for job. + + | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "SAVEPOINT" + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | | Yes | + | jobName | Job Name that is used for launching the job. | | Yes | + | action | It should be ΓÇ£SAVEPOINTΓÇ¥ always for new job launch. | | Yes | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **List savepoint:** Rest API to list all the savepoint from savepoint directory. ++ | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "LIST_SAVEPOINT" + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. 
It should be ΓÇ£FlinkJobΓÇ¥ | | Yes | + | jobName | Job Name which is used for launching the job | | Yes | + | action | It should be ΓÇ£LIST_SAVEPOINTΓÇ¥ | | Yes | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Cancel:** Rest API to cancel the job. ++ | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "CANCEL" + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be `FlinkJob` | | Yes | + | jobName | Job Name that is used for launching the job. | | Yes | + | action | It should be CANCEL. | | Yes | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **Delete:** Rest API to delete job. ++ | Option | Value | + | -- | - | + | Method | POST | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/runJob?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | ++ *Request Body* ++ ``` + { + "properties": { + "jobType": "FlinkJob", + "jobName": "<JOB_NAME>", + "action": "DELETE" + } + } + ``` ++ **Property details for JSON body:** ++ | Property | Description | Default Value | Mandatory | + | -- | -- | - | | + | jobType | Type of Job. It should be ΓÇ£FlinkJobΓÇ¥ | | Yes | + | jobName | Job Name that is used for launching the job. | | Yes | + | action | It should be DELETE. | | Yes | ++ **Example:** + + `Invoke-RestMethod -Uri $restUri -Method POST -Headers @{ Authorization = "Bearer $tok" } -Body $jsonString -ContentType "application/json"` ++- **List Jobs:** Rest API to list all the jobs and status of current action. ++ | Option | Value | + | -- | - | + | Method | GET | + | URL | `https://management.azure.com/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/jobs?api-version={{API_VERSION}}` | + | Header | Authorization = "Bearer $token" | + + **Output:** ++ ``` + { + "value": [ + { + "id": "/subscriptions/{{USER_SUBSCRIPTION}}/resourceGroups/{{USER_RESOURCE_GROUP}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSER_POOL}}/clusters/{{FLINK_CLUSTER}}/jobs/job1", + "properties": { + "jobType": "FlinkJob", + "jobName": "job1", + "jobJarDirectory": "<JOB_JAR_STORAGE_PATH>", + "jarName": "<JOB_JAR_NAME>", + "action": "STOP", + "entryClass": "<JOB_ENTRY_CLASS>", + "flinkConfiguration": { + "parallelism": "2", + "savepoint.directory": "<JOB_SAVEPOINT_DIRECTORY_STORAGE_PATH>s" + }, + "jobId": "20e9e907eb360b1c69510507f88cdb7b", + "status": "STOPPED", + "jobOutput": "Savepoint completed. 
Path: <JOB_SAVEPOINT_DIRECTORY_STORAGE_PATH>s/savepoint-20e9e9-8a48c6b905e5", + "actionResult": "SUCCESS", + "lastSavePoint": "<JOB_SAVEPOINT_DIRECTORY_STORAGE_PATH>s/savepoint-20e9e9-8a48c6b905e5" + } + } + ] + } + ``` ++> [!NOTE] +> While an action is in progress, `actionResult` shows the value `IN_PROGRESS`. On successful completion, it shows `SUCCESS`, and in case of failure, it shows `FAILED`. |
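For readers scripting these job-management calls outside PowerShell, the following is a minimal Python sketch of the same pattern, shown here for the savepoint action. It is only a sketch, not part of the published sample: it assumes the `azure-identity` and `requests` packages, and every angle-bracket placeholder (subscription, resource group, cluster pool, cluster, API version, job name) is a stand-in to be replaced.

```python
# Minimal sketch (not the article's sample): issue the runJob REST call from Python.
# Assumes the azure-identity and requests packages; all angle-bracket values are placeholders.
import json

import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<USER_SUBSCRIPTION>"
RESOURCE_GROUP = "<USER_RESOURCE_GROUP>"
CLUSTER_POOL = "<CLUSTER_POOL>"
FLINK_CLUSTER = "<FLINK_CLUSTER>"
API_VERSION = "<API_VERSION>"

# Acquire a bearer token for the Azure Resource Manager endpoint.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.HDInsight"
    f"/clusterpools/{CLUSTER_POOL}/clusters/{FLINK_CLUSTER}/runJob"
    f"?api-version={API_VERSION}"
)

# Request body for the SAVEPOINT action; the other bodies from the tables above
# (STATELESS_UPDATE, LIST_SAVEPOINT, CANCEL, DELETE) can be substituted unchanged.
body = {"properties": {"jobType": "FlinkJob", "jobName": "<JOB_NAME>", "action": "SAVEPOINT"}}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    data=json.dumps(body),
)
print(response.status_code, response.text)
```

The list-jobs call shown last in the tables above is a plain GET to the `jobs` endpoint with the same bearer token.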
hdinsight-aks | Flink Job Orchestration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-job-orchestration.md | + + Title: Azure data factory managed airflow - HDInsight on AKS +description: Learn Flink job orchestration using Azure Data Factory managed airflow ++ Last updated : 09/23/2023+++# Azure data factory managed airflow +++This article covers managing HDInsight Flink job using Azure REST API ([refer Job Management REST API section in this tutorial](flink-job-management.md)) and orchestration data pipeline with Azure Data Factory Managed Airflow. [Azure Data Factory Managed Airflow](/azure/data-factory/concept-managed-airflow) service is a simple and efficient way to create and manage [Apache Airflow](https://airflow.apache.org/) environments, enabling you to run data pipelines at scale easily. ++Apache Airflow is an open-source platform that programmatically creates, schedules, and monitors complex data workflows. It allows you to define a set of tasks, called operators, that can be combined into directed acyclic graphs (DAGs) to represent data pipelines. ++The following diagram shows the placement of Airflow, Key Vault, and HDInsight on AKS in Azure. +++Multiple Azure Service Principals are created based on the scope to limit the access it needs and to manage the client credential life cycle independently. ++It is recommended to rotate access keys or secrets periodically. ++## Setup steps ++1. [Setup Flink Cluster](flink-create-cluster-portal.md) ++1. Upload your Flink Job jar to the storage account -  It can be the primary storage account associated with the Flink cluster or any other storage account, where Assign the “Storage Blob Data Owner” role to the user-assigned MSI used for the cluster to this storage account. ++1. Azure Key Vault - You can follow [this tutorial to create a new Azure Key Vault](/azure/key-vault/general/quick-create-portal/) in case, if you don't have one. ++1. Create [Azure AD Service Principal](/cli/azure/ad/sp/) to access Key Vault – Grant permission to access Azure Key Vault with the “Key Vault Secrets Officer” role, and make a note of ‘appId’, ‘password’, and ‘tenant’ from the response. We need to use the same for Airflow to use Key Vault storage as backends for storing sensitive information. ++ ``` + az ad sp create-for-rbac -n <sp name> --role “Key Vault Secrets Officer” --scopes <key vault Resource ID> + ``` +++1. Create Managed Airflow [enable with Azure Key Vault to store and manage your sensitive information in a secure and centralized manner](/azure/data-factory/enable-azure-key-vault-for-managed-airflow). By doing this, you can use variables and connections, and they automatically be stored in Azure Key Vault. The name of connections and variables need to be prefixed by variables_prefix  defined in AIRFLOW__SECRETS__BACKEND_KWARGS. For example, If variables_prefix has a value as  hdinsight-aks-variables then for a variable key of hello, you would want to store your Variable at hdinsight-aks-variable -hello. 
++ - Add the following settings for the Airflow configuration overrides in integrated runtime properties: ++ - AIRFLOW__SECRETS__BACKEND: + `"airflow.providers.microsoft.azure.secrets.key_vault.AzureKeyVaultBackend"` ++ - AIRFLOW__SECRETS__BACKEND_KWARGS: + `"{"connections_prefix": "airflow-connections", "variables_prefix": "hdinsight-aks-variables", "vault_url": <your keyvault uri>}”` ++ - Add the following setting for the Environment variables configuration in the Airflow integrated runtime properties: ++ - AZURE_CLIENT_ID = `<App Id from Create Azure AD Service Principal>` ++ - AZURE_TENANT_ID = `<Tenant from Create Azure AD Service Principal> ` ++ - AZURE_CLIENT_SECRET = `<Password from Create Azure AD Service Principal> ` ++ Add Airflow requirements - [apache-airflow-providers-microsoft-azure](https://airflow.apache.org/docs/apache-airflow-providers-microsoft-azure/stable/https://docsupdatetracker.net/index.html) ++ :::image type="content" source="./media/flink-job-orchestration/airflow-configuration-environment-variable.png" alt-text="Screenshot shows airflow configuration and environment variables." lightbox="./media/flink-job-orchestration/airflow-configuration-environment-variable.png"::: ++ +1. Create [Azure AD Service Principal](/cli/azure/ad/sp/) to access Azure – Grant permission to access HDInsight AKS Cluster with Contributor role, make a note of appId, password, and tenant from the response. ++ `az ad sp create-for-rbac -n <sp name> --role Contributor --scopes <Flink Cluster Resource ID>` ++1. Create the following secrets in your key vault with the value from the previous AD Service principal appId, password, and tenant, prefixed by property variables_prefix defined in AIRFLOW__SECRETS__BACKEND_KWARGS. The DAG code can access any of these variables without variables_prefix. ++ - hdinsight-aks-variables-api-client-id=`<App ID from previous step> ` ++ - hdinsight-aks-variables-api-secret=`<Password from previous step> ` ++ - hdinsight-aks-variables-tenant-id=`<Tenant from previous step> ` ++ ```python + from airflow.models import Variable ++ def retrieve_variable_from_akv(): ++ variable_value = Variable.get("client-id") ++ print(variable_value) + ``` ++ +## DAG definition ++A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run. ++There are three ways to declare a DAG: ++ 1. You can use a context manager, which adds the DAG to anything inside it implicitly ++ 1. You can use a standard constructor, passing the DAG into any operators you use ++ 1. You can use the @dag decorator to turn a function into a DAG generator (from airflow.decorators import dag) ++DAGs are nothing without Tasks to run, and those are come in the form of either Operators, Sensors or TaskFlow. ++You can read more details about DAGs, Control Flow, SubDAGs, TaskGroups, etc. directly from [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html).  ++## DAG execution ++Example code is available on the [git](https://github.com/Azure-Samples/hdinsight-aks/blob/main/flink/airflow-python-sample-code); download the code locally on your computer and upload the wordcount.py to a blob storage. Follow the [steps](/azure/data-factory/how-does-managed-airflow-work#steps-to-import) to import DAG into your Managed Airflow created during setup. ++The wordcount.py is an example of orchestrating a Flink job submission using Apache Airflow with HDInsight on AKS. 
The example is based on the wordcount example provided on [Apache Flink](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/dataset/examples/). ++The DAG has two tasks: ++- get OAuth Token ++- Invoke HDInsight Flink Job Submission Azure REST API to submit a new job ++The DAG expects to have setup for the Service Principal, as described during the setup process for the OAuth Client credential and pass the following input configuration for the execution. ++### Execution steps ++1. Execute the DAG from the [Airflow UI](https://airflow.apache.org/docs/apache-airflow/stable/ui.html), you can open the Azure Data Factory Managed Airflow UI by clicking on Monitor icon. ++ :::image type="content" source="./media/flink-job-orchestration/airflow-user-interface-step-1.png" alt-text="Screenshot shows open the Azure data factory managed airflow UI by clicking on monitor icon." lightbox="./media/flink-job-orchestration/airflow-user-interface-step-1.png"::: ++1. Select the “FlinkWordCountExample” DAG from the “DAGs” page. ++ :::image type="content" source="./media/flink-job-orchestration/airflow-user-interface-step-2.png" alt-text="Screenshot shows select the Flink word count example." lightbox="./media/flink-job-orchestration/airflow-user-interface-step-2.png"::: ++1. Click on the “execute” icon from the top right corner and select “Trigger DAG w/ config”. ++ :::image type="content" source="./media/flink-job-orchestration/airflow-user-interface-step-3.png" alt-text="Screenshot shows select execute icon." lightbox="./media/flink-job-orchestration/airflow-user-interface-step-3.png"::: ++ +1. Pass required configuration JSON ++ ```JSON + { ++ "jarName":"WordCount.jar", ++ "jarDirectory":"abfs://filesystem@<storageaccount>.dfs.core.windows.net", ++ "subscritpion":"<cluster subscription id>", ++ "rg":"<cluster resource group>", ++ "poolNm":"<cluster pool name>", ++ "clusterNm":"<cluster name>" ++ } + ``` ++1. Click on “Trigger” button, it starts the execution of the DAG. ++1. You can visualize the status of DAG tasks from the DAG run ++ :::image type="content" source="./media/flink-job-orchestration/dag-task-status.png" alt-text="Screenshot shows dag task status." lightbox="./media/flink-job-orchestration/dag-task-status.png"::: ++1. Validate the job execution from portal ++ :::image type="content" source="./media/flink-job-orchestration/validate-job-execution.png" alt-text="Screenshot shows validate job execution." lightbox="./media/flink-job-orchestration/validate-job-execution.png"::: ++1. Validate the job from “Apache Flink Dashboard” ++ :::image type="content" source="./media/flink-job-orchestration/apache-flink-dashboard.png" alt-text="Screenshot shows apache Flink dashboard." 
lightbox="./media/flink-job-orchestration/apache-flink-dashboard.png"::: ++## Example code ++ This is an example of orchestrating data pipeline using Airflow with HDInsight on AKS + + The example is based on wordcount example provided on [Apache Flink](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/dataset/examples/) + + The DAG expects to have setup for Service Principal for the OAuth Client credential and pass following input configuration for the execution + ```JSON + { + 'jarName':'WordCount.jar', + 'jarDirectory':'abfs://filesystem@<storageaccount>.dfs.core.windows.net', + 'subscritpion':'<cluster subscription id>', + 'rg':'<cluster resource group>', + 'poolNm':'<cluster pool name>', + 'clusterNm':'<cluster name>' + } + + ``` ++ Refer to the [sample code](https://github.com/Azure-Samples/hdinsight-aks/blob/main/flink/airflow-python-sample-code). + + |
hdinsight-aks | Flink Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-overview.md | + + Title: What is Apache Flink in Azure HDInsight on AKS? (Preview) +description: An introduction to Apache Flink in Azure HDInsight on AKS. ++ Last updated : 08/29/2023+++# What is Apache Flink in Azure HDInsight on AKS? (Preview) +++[Apache Flink](https://flink.apache.org/) is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations and stateful streaming applications at in-memory speed and at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. Therefore, an application can use unlimited amounts of vCPUs, main memory, disk and network IO. Moreover, Flink easily maintains large application state. Its asynchronous and incremental checkpointing algorithm ensures minimal influence on processing latencies while guaranteeing exactly once state consistency. ++Apache Flink is a massively scalable analytics engine for stream processing. ++Some of the key features that Flink offers are: ++- Operations on bounded and unbounded streams +- In memory performance +- Ability for both streaming and batch computations +- Low latency, high throughput operations +- Exactly once processing +- High Availability +- State and fault tolerance +- Fully compatible with Hadoop ecosystem +- Unified SQL APIs for both stream and batch +++## Why Apache Flink? ++Apache Flink is an excellent choice to develop and run many different types of applications due to its extensive features set. FlinkΓÇÖs features include support for stream and batch processing, sophisticated state management, event-time processing semantics, and exactly once consistency guarantees for state. Flink doesn't have a single point of failure. Flink has been proven to scale to thousands of cores and terabytes of application state, delivers high throughput and low latency, and powers some of the worldΓÇÖs most demanding stream processing applications. ++- **Fraud detection**: Flink can be used to detect fraudulent transactions or activities in real time by applying complex rules and machine learning models on streaming data. +- **Anomaly detection**: Flink can be used to identify outliers or abnormal patterns in streaming data, such as sensor readings, network traffic, or user behavior. +- **Rule-based alerting**: Flink can be used to trigger alerts or notifications based on predefined conditions or thresholds on streaming data, such as temperature, pressure, or stock prices. +- **Business process monitoring**: Flink can be used to track and analyze the status and performance of business processes or workflows in real time, such as order fulfillment, delivery, or customer service. +- **Web application (social network)**: Flink can be used to power web applications that require real-time processing of user-generated data, such as messages, likes, comments, or recommendations. + +Read more on common use cases described on [Apache Flink Use cases](https://flink.apache.org/use-cases/#use-cases) ++## Apache Flink Cluster Deployment Types +Flink can execute applications in Session mode or Application mode. Currently HDInsight on AKS supports only Session clusters. You can run multiple Flink jobs on a Session cluster. 
++## Apache Flink Job Management ++Flink schedules jobs using three distributed components: Job Manager, Task Manager, and Job Client, which are set up in a leader-follower pattern. ++**Flink Job**: A Flink job or program consists of multiple tasks. Tasks are the basic unit of execution in Flink. Each Flink task has multiple instances depending on the level of parallelism, and each instance is executed on a Task Manager. ++**Job manager**: The Job Manager acts as a scheduler and schedules tasks on Task Managers. ++**Task manager**: Task Managers come with one or more slots to execute tasks in parallel. ++**Job client**: The Job Client communicates with the Job Manager to submit Flink jobs. ++**Flink Web UI**: Flink features a web UI to inspect, monitor, and debug running applications. +++## Checkpoints in Apache Flink ++Every function and operator in Flink can be stateful. Stateful functions store data across the processing of individual elements/events, making state a critical building block for any type of more elaborate operation. In order to make state fault tolerant, Flink needs to **checkpoint the state**. Checkpoints allow Flink to recover state and positions in the streams to give the application the same semantics as a failure-free execution; they therefore play an important role in letting Flink recover both its state and the corresponding stream positions after a failure. ++Checkpointing is enabled in HDInsight on AKS Flink by default. Default settings on HDInsight on AKS maintain the last five checkpoints in persistent storage. If your job fails, it can be restarted from the latest checkpoint. ++## State Backends ++Backends determine where state is stored. Stream processing applications are often stateful, *remembering* information from processed events and using it to influence further event processing. In Flink, the remembered information, that is, state, is stored locally in the configured state backend. ++When checkpointing is activated, such state is persisted upon checkpoints to guard against data loss and recover consistently. How the state is represented internally, and how and where it's persisted upon checkpoints, depends on the chosen **State Backend**. HDInsight on AKS uses RocksDB as the default state backend. ++**Supported state backends:** ++* HashMapStateBackend +* EmbeddedRocksDBStateBackend ++### The HashMapStateBackend ++The `HashMapStateBackend` holds data internally as objects on the Java heap. Key/value state and window operators hold hash tables that store the values, triggers, etc. ++The HashMapStateBackend is encouraged for: ++* Jobs with large state, long windows, large key/value states. +* All high-availability setups. ++It's also recommended to set managed memory to zero. This value ensures that the maximum amount of memory is allocated for user code on the JVM. +Unlike `EmbeddedRocksDBStateBackend`, the `HashMapStateBackend` stores data as objects on the heap, so it's unsafe to reuse objects. ++### The EmbeddedRocksDBStateBackend ++The `EmbeddedRocksDBStateBackend` holds in-flight data in a [RocksDB](http://rocksdb.org) database that is (by default) stored in the Task Manager's local data directories. Unlike `HashMapStateBackend`, which stores Java objects, data is stored as serialized byte arrays, mainly defined by the type serializer, so key comparisons are byte-wise instead of using Java's `hashCode()` and `equals()` methods. ++By default, we use RocksDB as the state backend. RocksDB is an embeddable persistent key-value store for fast storage. 
++``` +state.backend: rocksdb +state.checkpoints.dir: <STORAGE_LOCATION> +``` +By default, HDInsight on AKS stores the checkpoints in the storage account configured by the user, so that the checkpoints are persisted. ++### Incremental Checkpoints ++RocksDB supports incremental checkpoints, which can dramatically reduce the checkpointing time in comparison to full checkpoints. Instead of producing a full, self-contained backup of the state backend, incremental checkpoints only record the changes that happened since the latest completed checkpoint. An incremental checkpoint builds upon (typically multiple) previous checkpoints. ++Flink applies RocksDB's internal compaction mechanism in a way that is self-consolidating over time. As a result, the incremental checkpoint history in Flink doesn't grow indefinitely, and old checkpoints are eventually subsumed and pruned automatically. Recovery time of incremental checkpoints may be longer or shorter compared to full checkpoints. If your network bandwidth is the bottleneck, it may take a bit longer to restore from an incremental checkpoint, because it implies fetching more data (more deltas). ++Restoring from an incremental checkpoint is faster if the bottleneck is your CPU or IOPS, because an incremental restore doesn't rebuild the local RocksDB tables from Flink's canonical key-value snapshot format (used in savepoints and full checkpoints). ++While we encourage the use of incremental checkpoints for large state, you need to enable this feature manually: ++* Setting `state.backend.incremental: true` in your `flink-conf.yaml` enables incremental checkpoints, unless the application overrides this setting in the code. On HDInsight on AKS, this setting is true by default. +* You can alternatively configure this value directly in the code (overrides the config default): ++``` +EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true); +``` ++By default, we preserve the last five checkpoints in the configured checkpoint directory. ++This value can be changed with the following configuration: ++`state.checkpoints.num-retained: 5` ++## Windowing in Flink ++Windowing is a key feature in stream processing systems such as Apache Flink. Windowing splits the continuous stream into finite batches on which computations can be performed. In Flink, windowing can be applied to the entire stream or on a per-key basis. ++Windowing refers to the process of dividing a stream of events into finite, nonoverlapping segments called windows. This feature allows users to perform computations on specific subsets of data based on time or key-based criteria. ++Windows allow users to split the streamed data into segments that can be processed. Due to the unbounded nature of data streams, there's no point at which all the data is available, because users would be waiting indefinitely for new data points to arrive. Instead, windowing offers a way to define a subset of data points that you can then process and analyze. The trigger defines when the window is considered ready for processing, and the function set for the window specifies how to process the data; a small sketch follows below. ++Learn [more](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/windows/) ++### Reference ++[Apache Flink](https://flink.apache.org/) |
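To make the windowing description above concrete, here is a small, hypothetical PyFlink sketch (assuming the `apache-flink` Python package is available) that counts events per user over tumbling five-second processing-time windows. It uses the built-in `datagen` connector as a throwaway source; the table and column names are made up for illustration and are not part of any HDInsight sample.

```python
# Hypothetical windowing illustration; assumes the apache-flink (PyFlink) package.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Throwaway unbounded source built with the datagen connector; ts is a processing-time column.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id INT,
        url STRING,
        ts AS PROCTIME()
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5',
        'fields.user_id.min' = '1',
        'fields.user_id.max' = '3'
    )
""")

# Tumbling five-second windows per user: each result row is one window's event count.
result = t_env.execute_sql("""
    SELECT user_id,
           TUMBLE_START(ts, INTERVAL '5' SECOND) AS window_start,
           COUNT(*) AS events_in_window
    FROM clicks
    GROUP BY user_id, TUMBLE(ts, INTERVAL '5' SECOND)
""")
result.print()  # streams window results until the job is cancelled
```

Because the source is unbounded, the query keeps emitting a row per user and per five-second window until you cancel it.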
hdinsight-aks | Flink Table Api And Sql | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-table-api-and-sql.md | + + Title: Table API and SQL - HDInsight on AKS - Apache Flink +description: Learn about Table API and SQL in HDInsight on AKS - Apache Flink ++ Last updated : 08/29/2023+++# Table API and SQL in HDInsight on AKS - Apache Flink +++Apache Flink features two relational APIs - the Table API and SQL - for unified stream and batch processing. The Table API is a language-integrated query API that allows the composition of queries from relational operators such as selection, filter, and join intuitively. FlinkΓÇÖs SQL support is based on Apache Calcite, which implements the SQL standard. ++The Table API and SQL interfaces integrate seamlessly with each other and FlinkΓÇÖs DataStream API. You can easily switch between all APIs and libraries, which build upon them. ++## Apache Flink SQL ++Like other SQL engines, Flink queries operate on top of tables. It differs from a traditional database because Flink doesn't manage data at rest locally; instead, its queries operate continuously over external tables. ++Flink data processing pipelines begin with source tables and end with sink tables. Source tables produce rows operated over during the query execution; they're the tables referenced in the *FROM* clause of a query. Connectors can be of type HDInsight Kafka, HDInsight HBase, Azure Event Hubs, databases, filesystems, or any other system whose connector lies in the classpath. ++## Using SQL Client in HDInsight on AKS - Flink ++You can refer this article on how to use CLI from [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. Here are some quick samples of how to get started. + +- To start the SQL client + + ``` + ./bin/sql-client.sh + ``` +- To pass an initialization sql file to run along with sql-client + + ``` + ./sql-client.sh -i /path/to/init_file.sql + ``` ++- To set a configuration in sql-client + + ``` + SET execution.runtime-mode = streaming; + SET sql-client.execution.result-mode = table; + SET sql-client.execution.max-table-result.rows = 10000; + ``` ++## SQL DDL ++Flink SQL supports the following CREATE statements + +- CREATE TABLE +- CREATE DATABASE +- CREATE CATALOG + +Following is an example syntax to define a source table using jdbc connector to connect to MSSQL, with id, name as columns in a **CREATE TABLE** Statement ++```sql +CREATE TABLE student_information ( + id BIGINT, + name STRING, + address STRING, + grade STRING, + PRIMARY KEY (id) NOT ENFORCED + ) WITH ( + 'connector' = 'jdbc', + 'url' = 'jdbc:sqlserver://servername.database.windows.net;database=dbname;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=30', + 'table-name' = 'students', + 'username' = 'username', + 'password' = 'password' + ); +``` ++**CREATE DATABASE** : +```sql +CREATE DATABASE students; +``` +**CREATE CATALOG**: +```sql +CREATE CATALOG myhive WITH ('type'='hive'); +``` +You can run Continuous Queries on Top of these tables +```sql + SELECT id, + COUNT(*) as student_count + FROM student_information + GROUP BY grade; +``` +Write out to **Sink Table** from **Source Table**: +```sql + INSERT INTO grade_counts + SELECT id, + COUNT(*) as student_count + FROM student_information + GROUP BY grade; +``` ++## Adding Dependencies for Apache Flink SQL ++JAR statements are used to add user jars into the classpath or remove user jars from the classpath or show added jars in the classpath in the runtime. 
++Flink SQL supports the following JAR statements: ++- ADD JAR +- SHOW JARS +- REMOVE JAR +```sql +Flink SQL> ADD JAR '/path/hello.jar'; +[INFO] Execute statement succeed. ++Flink SQL> ADD JAR 'hdfs:///udf/common-udf.jar'; +[INFO] Execute statement succeed. ++Flink SQL> SHOW JARS; ++-++| jars | ++-++| /path/hello.jar | +| hdfs:///udf/common-udf.jar | ++-+++Flink SQL> REMOVE JAR '/path/hello.jar'; +[INFO] The specified jar is removed from session classloader. +``` ++## Hive Metastore in HDInsight on AKS - Flink ++Catalogs provide metadata, such as databases, tables, partitions, views, and functions and information needed to access data stored in a database or other external systems. ++In HDInsight on AKS, Flink we support two catalog options: ++**GenericInMemoryCatalog** ++The *GenericInMemoryCatalog* is an in-memory implementation of a catalog. All the objects are available only for the lifetime of the sql session. ++**HiveCatalog** ++The *HiveCatalog* serves two purposes; as persistent storage for pure Flink metadata, and as an interface for reading and writing existing Hive metadata. ++> [!NOTE] +> In HDInsight on AKS, Flink comes with an integrated option of Hive Metastore. You can opt for Hive Metastore during [cluster creation](../flink/flink-create-cluster-portal.md) ++## How to Create and Register Flink Databases to Catalogs ++You can refer this article on how to use CLI and get started with Flink SQL Client from [Secure Shell](./flink-web-ssh-on-portal-to-flink-sql.md) on Azure portal. ++- Start `sql-client.sh` session + + :::image type="content" source="./media/flink-table-sql-api/default-catalog.png" alt-text="Screenshot showing default hive catalog."::: ++ Default_catalog is the default in-memory catalog +- Let us now check default database of in-memory catalog + :::image type="content" source="./media/flink-table-sql-api/default-database-in-memory-catalogs.png" alt-text="Screenshot showing default in-memory catalogs."::: +- Let us create Hive Catalog of version 3.1.2 and use it + + ```sql + CREATE CATALOG myhive WITH ('type'='hive'); + USE CATALOG myhive; + ``` + > [!NOTE] + > HDInsight on AKS Flink supports **Hive 3.1.2** and **Hadoop 3.3.2**. The `hive-conf-dir` is set to location `/opt/hive-conf` + +- Let us create Database in hive catalog and make it default for the session (unless changed). 
+ :::image type="content" source="./media/flink-table-sql-api/create-default-hive-catalog.png" alt-text="Screenshot showing creating database in hive catalog and making it default catalog for the session."::: ++## How to Create and Register Hive Tables to Hive Catalog ++- Follow the instructions on [How to Create and Register Flink Databases to Catalog](#how-to-create-and-register-flink-databases-to-catalogs) +- Let us create Flink Table of connector type Hive without Partition ++ ```sql + CREATE TABLE hive_table(x int, days STRING) WITH ( 'connector' = 'hive', 'sink.partition-commit.delay'='1 s', 'sink.partition-commit.policy.kind'='metastore,success-file'); + ``` +- Insert Data into hive_table + ```sql + INSERT INTO hive_table SELECT 2, '10'; + INSERT INTO hive_table SELECT 3, '20'; + ``` +- Read data from hive_table + ```sql + Flink SQL> SELECT * FROM hive_table; + 2023-07-24 09:46:22,225 INFO org.apache.hadoop.mapred.FileInputFormat[] - Total input files to process : 3 + +-+-+--+ + | op | x | days | + +-+-+--+ + | +I | 3 | 20 | + | +I | 2 | 10 | + | +I | 1 | 5 | + +-+-+--+ + Received a total of 3 rows + ``` + > [!NOTE] + > Hive Warehouse Directory is located in the designated container of storage account chosen during Apache Flink cluster creation, can be found at directory hive/warehouse/ +- Lets create Flink Table of connector type hive with Partition + ```sql + CREATE TABLE partitioned_hive_table(x int, days STRING) PARTITIONED BY (days) WITH ( 'connector' = 'hive', 'sink.partition-commit.delay'='1 s', 'sink.partition-commit.policy.kind'='metastore,success-file'); + ``` +> [!IMPORTANT] +> There is a known limitation in Flink. The last ΓÇÿnΓÇÖ columns are chosen for partitions, irrespective of user defined partition column. [FLINK-32596](https://issues.apache.org/jira/browse/FLINK-32596) The partition key will be wrong when use Flink dialect to create Hive table. |
hdinsight-aks | Flink Web Ssh On Portal To Flink Sql | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/flink-web-ssh-on-portal-to-flink-sql.md | + + Title: How to enter the HDInsight on AKS Flink CLI client using Secure Shell (SSH) on Azure portal +description: How to enter the HDInsight on AKS Flink SQL & DStream CLI client using webssh on Azure portal ++ Last updated : 08/29/2023+++# Access CLI Client using Secure Shell (SSH) on Azure portal +++This example shows how to enter the HDInsight on AKS Flink CLI client using SSH on the Azure portal; it covers both Flink SQL and Flink DataStream. ++## Prerequisites +- You're required to select SSH during [creation](./flink-create-cluster-portal.md) of the Flink cluster ++## Connecting to SSH from the Azure portal ++Once the Flink cluster is created, you can use the **Settings** option on the left pane to access **Secure Shell**. +++## Apache Flink SQL ++#### Connecting to the SQL client ++Change directory to `/opt/flink-webssh/bin` and then execute `./sql-client.sh`. ++++You're now in the Flink SQL client. ++Refer to [this](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/) document to perform a few more tests. ++## Apache Flink DataStream ++Flink provides a command-line interface (CLI) `bin/flink` to run programs that are packaged as JAR files and to control their execution. ++The CLI is part of the Secure Shell (SSH) pod; it connects to the running JobManager and uses the client configurations specified at `conf/flink-conf.yaml`. ++Submitting a job means uploading the job's JAR to the SSH pod and initiating the job execution. To illustrate an example for this article, we select a long-running job like `examples/streaming/StateMachineExample.jar`. ++> [!NOTE] +> For managing dependencies, the expectation is to build and submit a fat JAR for the job. ++- Upload the fat job JAR from ABFS to webssh. +- Based on your use case, edit the client configurations using [Flink configuration management](../flink/flink-configuration-management.md) under flink-client-configs. + +- Let us run StateMachineExample.jar ++ ``` + ./bin/flink run \ + --detached \ + ./examples/streaming/StateMachineExample.jar + ``` +> [!NOTE] +> Submitting the job with `--detached` makes the command return after the submission is done. The output contains the ID of the newly submitted job. ++## Reference ++* [Flink SQL Client](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sqlclient/) |
hdinsight-aks | Fraud Detection Flink Datastream Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/fraud-detection-flink-datastream-api.md | + + Title: Fraud detection with the Apache Flink DataStream API +description: Learn about Fraud detection with the Apache Flink DataStream API ++ Last updated : 08/29/2023+++# Fraud detection with the Apache Flink DataStream API +++In this article, learn how to run Fraud detection use case with the Apache Flink DataStream API. ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* IntelliJ Idea community edition installed locally ++## Develop code in IDE ++- For the sample job, refer [Fraud Detection with the DataStream API](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/try-flink/datastream/) +- Build the skeleton of the code using Flink Maven Archetype by using InterlliJ Idea IDE. +- Once the IDE is opened, go to **File** -> **New** -> **Project** -> **Maven Archetype**. +- Enter the details as shown in the image. ++ :::image type="content" source="./media/fraud-detection-flink-datastream-api/maven-archetype.png" alt-text="Screenshot showing Maven Archetype." border="true" lightbox="./media/fraud-detection-flink-datastream-api/maven-archetype.png"::: ++- After you create the Maven Archetype, it generates 2 java classes FraudDetectionJob and FraudDetector. +- Update the `FraudDetector` with the following code. ++ ``` + package spendreport; + + import org.apache.flink.api.common.state.ValueState; + import org.apache.flink.api.common.state.ValueStateDescriptor; + import org.apache.flink.api.common.typeinfo.Types; + import org.apache.flink.configuration.Configuration; + import org.apache.flink.streaming.api.functions.KeyedProcessFunction; + import org.apache.flink.util.Collector; + import org.apache.flink.walkthrough.common.entity.Alert; + import org.apache.flink.walkthrough.common.entity.Transaction; + + public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> { + + private static final long serialVersionUID = 1L; + + private static final double SMALL_AMOUNT = 1.00; + private static final double LARGE_AMOUNT = 500.00; + private static final long ONE_MINUTE = 60 * 1000; + + private transient ValueState<Boolean> flagState; + private transient ValueState<Long> timerState; + + @Override + public void open(Configuration parameters) { + ValueStateDescriptor<Boolean> flagDescriptor = new ValueStateDescriptor<>( + "flag", + Types.BOOLEAN); + flagState = getRuntimeContext().getState(flagDescriptor); + + ValueStateDescriptor<Long> timerDescriptor = new ValueStateDescriptor<>( + "timer-state", + Types.LONG); + timerState = getRuntimeContext().getState(timerDescriptor); + } + + @Override + public void processElement( + Transaction transaction, + Context context, + Collector<Alert> collector) throws Exception { + + // Get the current state for the current key + Boolean lastTransactionWasSmall = flagState.value(); + + // Check if the flag is set + if (lastTransactionWasSmall != null) { + if (transaction.getAmount() > LARGE_AMOUNT) { + //Output an alert downstream + Alert alert = new Alert(); + alert.setId(transaction.getAccountId()); + + collector.collect(alert); + } + // Clean up our state + cleanUp(context); + } + + if (transaction.getAmount() < SMALL_AMOUNT) { + // set the flag to true + flagState.update(true); + + long timer = context.timerService().currentProcessingTime() + ONE_MINUTE; + 
context.timerService().registerProcessingTimeTimer(timer); + + timerState.update(timer); + } + } + + @Override + public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) { + // remove flag after 1 minute + timerState.clear(); + flagState.clear(); + } + + private void cleanUp(Context ctx) throws Exception { + // delete timer + Long timer = timerState.value(); + ctx.timerService().deleteProcessingTimeTimer(timer); + + // clean up all state + timerState.clear(); + flagState.clear(); + } + } + + ``` ++This job uses a source that generates an infinite stream of credit card transactions for you to process. Each transaction contains an account ID (accountId), timestamp (timestamp) of when the transaction occurred, and US$ amount (amount). The logic is that if transaction of the small amount (< 1.00) immediately followed by a large amount (> 500) it sets off alarm and updates the output logs. It uses data from TransactionIterator following class, which is hardcoded so that account ID 3 is detected as fraudulent transaction. ++For more information, refer [Sample TransactionIterator.java](https://github.com/apache/flink/blob/master/flink-walkthroughs/flink-walkthrough-common/src/main/java/org/apache/flink/walkthrough/common/source/TransactionIterator.java) ++## Create JAR file ++After making the code changes, create the jar using the following steps in IntelliJ Idea IDE ++- Go to **File** -> **Project Structure** -> **Project Settings** -> **Artifacts** +- Click **+** (plus sign) -> **Jar** -> From modules with dependencies. +- Select a **Main Class** (the one with main() method) if you need to make the jar runnable. +- Select **Extract to the target Jar**. +- Click **OK**. +- Click **Apply** and then **OK**. +- The following step sets the "skeleton" to where the jar will be saved to. + :::image type="content" source="./media/fraud-detection-flink-datastream-api/extract-target-jar.png" alt-text="Screenshot showing how to extract target Jar." border="true" lightbox="./media/fraud-detection-flink-datastream-api/extract-target-jar.png"::: ++- To build and save + - Go to **Build -> Build Artifact -> Build** ++ :::image type="content" source="./media/fraud-detection-flink-datastream-api/build-artifact.png" alt-text="Screenshot showing how to build artifact."::: + + :::image type="content" source="./media/fraud-detection-flink-datastream-api/extract-target-jar-1.png" alt-text="Screenshot showing how to extract the target jar."::: ++## Run the job in Apache Flink environment ++- Once the jar is generated, it can be used to submit the job from Flink UI using submit job section. ++ +- After the job is submitted, it's moved to running state, and the Task manager logs will be generated. ++++- From the logs, view the alert is generated for Account ID 3. ++## Reference +* [Fraud Detector v2: State + Time](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/try-flink/datastream/#fraud-detector-v2-state--time--1008465039) +* [Sample TransactionIterator.java](https://github.com/apache/flink/blob/master/flink-walkthroughs/flink-walkthrough-common/src/main/java/org/apache/flink/walkthrough/common/source/TransactionIterator.java) |
hdinsight-aks | Hive Dialect Flink | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/hive-dialect-flink.md | + + Title: Hive dialect in Flink +description: Hive dialect in Flink HDInsight on AKS ++ Last updated : 09/18/2023+++# Hive dialect in Flink +++In this article, learn how to use Hive dialect in HDInsight on AKS - Flink. ++## Introduction ++The user cannot change the default `flink` dialect to hive dialect for their usage on HDInsight on AKS - Flink. All the SQL operations fail once changed to hive dialect with the following error. ++```Caused by: ++*java.lang.ClassCastException: class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader* +``` ++The reason for this issue arises due to an open [Hive Jira](https://issues.apache.org/jira/browse/HIVE-21584). Currently, Hive assumes that the system class loader is an instance of URLClassLoader. In `Java 11`, this assumption is not the case. ++## How to use Hive dialect in Flink ++- Execute the following steps in [webssh](./flink-web-ssh-on-portal-to-flink-sql.md): ++ 1. Remove the existing flink-sql-connector-hive*jar in lib location + ```command + rm /opt/flink-webssh/lib/flink-sql-connector-hive*jar + ``` + 1. Download the below jar in `webssh` pod and add it under the /opt/flink-webssh/lib wget https://aka.ms/hdiflinkhivejdk11jar. + (The above hive jar has the fix [https://issues.apache.org/jira/browse/HIVE-27508](https://issues.apache.org/jira/browse/HIVE-27508)) ++ 1. ``` + mv $FLINK_HOME/opt/flink-table-planner_2.12-1.16.0-0.0.18.jar $FLINK_HOME/lib/flink-table-planner_2.12-1.16.0-0.0.18.jar + ``` ++ 1. ``` + mv $FLINK_HOME/lib/flink-table-planner-loader-1.16.0-0.0.18.jar $FLINK_HOME/opt/flink-table-planner-loader-1.16.0-0.0.18.jar + ``` ++ 1. Add the following keys in the `flink` configuration management under core-site.xml section: + ``` + fs.azure.account.key.<STORAGE>.dfs.core.windows.net: <KEY> + flink.hadoop.fs.azure.account.key.<STORAGE>.dfs.core.windows.net: <KEY> + ``` ++- Here is an overview of [hive-dialect queries](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/hive-compatibility/hive-dialect/queries/overview/) + + - Executing Hive dialect in Flink without partitioning + + ```sql + root [ ~ ]# ./bin/sql-client.sh + Flink SQL> + Flink SQL> create catalog myhive with ('type' = 'hive', 'hive-conf-dir' = '/opt/hive-conf'); + [INFO] Execute statement succeed. ++ Flink SQL> use catalog myhive; + [INFO] Execute statement succeed. ++ Flink SQL> load module hive; + [INFO] Execute statement succeed. ++ Flink SQL> use modules hive,core; + [INFO] Execute statement succeed. ++ Flink SQL> set table.sql-dialect=hive; + [INFO] Session property has been set. ++ Flink SQL> set sql-client.execution.result-mode=tableau; + [INFO] Session property has been set. 
++ Flink SQL> select explode(array(1,2,3));Hive Session ID = 6ba45be2-360e-4bee-8842-2765c91581c8 + ++ > [!WARNING] + > An illegal reflective access operation has occurred ++ > [!WARNING] + > Illegal reflective access by org.apache.hadoop.hive.common.StringInternUtils (file:/opt/flink-webssh/lib/flink-sql-connector-hive-3.1.2_2.12-1.16-SNAPSHOT.jar) to field java.net.URI.string ++ > [!WARNING] + > Please consider reporting this to the maintainers of org.apache.hadoop.hive.common.StringInternUtils ++ > [!WARNING] + > `Use --illegal-access=warn` to enable warnings of further illegal reflective access operations ++ > [!WARNING] + > All illegal access operations will be denied in a future release + select explode(array(1,2,3)); +++ +-+-+ + | op | col | + +-+-+ + | +I | 1 | + | +I | 2 | + | +I | 3 | + +-+-+ ++ Received a total of 3 rows ++ Flink SQL> create table tttestHive Session ID = fb8b652a-8dad-4781-8384-0694dc16e837 ++ [INFO] Execute statement succeed. ++ Flink SQL> insert into table tttestHive Session ID = f239dc6f-4b58-49f9-ad02-4c73673737d8),(3,'c'),(4,'d'); ++ [INFO] Submitting SQL update statement to the cluster... + [INFO] SQL update statement has been successfully submitted to the cluster: + Job ID: d0542da4c4252f9494298666ff4e9f8e ++ Flink SQL> set execution.runtime-mode=batch; + [INFO] Session property has been set. ++ Flink SQL> select * from tttestHive Session ID = 61b6eb3b-90a6-499c-aced-0598366c5b31 ++ +--+-+ + | key | value | + +--+-+ + | 1 | a | + | 1 | a | + | 2 | b | + | 3 | c | + | 3 | c | + | 3 | c | + | 4 | d | + | 5 | e | + +--+-+ + 8 rows in set ++ Flink SQL> QUIT;Hive Session ID = 2dadad92-436e-426e-a88c-66eafd740d98 ++ [INFO] Exiting Flink SQL CLI Client... ++ Shutting down the session... + done. + root [ ~ ]# exit + ``` ++ The data is written in the same container configured in the hive/warehouse directory. ++ :::image type="content" source="./media/hive-dialect-flink/flink-container-table-1.png" alt-text="Screenshot shows container table 1." lightbox="./media/hive-dialect-flink/flink-container-table-1.png"::: ++ - Executing Hive dialect in Flink with partitions ++```sql + create table tblpart2 (key int, value string) PARTITIONED by ( part string ) tblproperties ('sink.partition-commit.delay'='1 s', 'sink.partition-commit.policy.kind'='metastore,success-file'); ++ insert into table tblpart2 Hive Session ID = 78fae85f-a451-4110-bea6-4aa1c172e282),(2,'b','d'),(3,'c','d'),(3,'c','a'),(4,'d','e'); +``` + :::image type="content" source="./media/hive-dialect-flink/flink-container-table-2.png" alt-text="Screenshot shows container table 2." lightbox="./media/hive-dialect-flink/flink-container-table-2.png"::: ++ :::image type="content" source="./media/hive-dialect-flink/flink-container-table-3.png" alt-text="Screenshot shows container table 3." lightbox="./media/hive-dialect-flink/flink-container-table-3.png"::: |
hdinsight-aks | Integration Of Azure Data Explorer | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/integration-of-azure-data-explorer.md | + + Title: Integration of Azure Data Explorer and Flink +description: Integration of Azure Data Explorer and Flink in HDInsight on AKS ++ Last updated : 09/18/2023+++# Integration of Azure Data Explorer and Flink ++Azure Data Explorer is a fully managed, high-performance, big data analytics platform that makes it easy to analyze high volumes of data in near real time. ++ADX helps users analyze large volumes of data from streaming applications, websites, IoT devices, and more. Integrating Flink with ADX helps you process real-time data and analyze it in ADX. ++## Prerequisites +- [Create HDInsight on AKS Flink cluster](./flink-create-cluster-portal.md) +- [Create Azure data explorer](/azure/data-explorer/create-cluster-and-database/) ++## Steps to use Azure Data Explorer as sink in Flink ++1. [Create HDInsight on AKS Flink cluster](./flink-create-cluster-portal.md). ++1. [Create ADX with database](/azure/data-explorer/create-cluster-and-database/) and table as required. ++1. Add ingestor permissions for the managed identity in Kusto. ++ ``` + .add database <DATABASE_NAME> ingestors ('aadapp=CLIENT_ID_OF_MANAGED_IDENTITY') + ``` +1. Run a sample program defining the Kusto cluster URI (Uniform Resource Identifier), the database and managed identity used, and the table it needs to write to. ++1. Clone the flink-connector-kusto project: https://github.com/Azure/flink-connector-kusto.git ++1. Create the table in ADX using the following command + + ``` + .create table CryptoRatesHeartbeatTimeBatch (processing_dttm: datetime, ['type']: string, last_trade_id: string, product_id: string, sequence: long, ['time']: datetime) + ``` + ++1. Update the FlinkKustoSinkSample.java file with the right Kusto cluster URI, database, and the managed identity used. ++ ```JAVA + String database = "sdktests"; //ADX database name ++ String msiClientId = "xxxx-xxxx-xxxx"; //Provide the client id of the Managed identity which is linked to the Flink cluster + String cluster = "https://trdp-1665b5eybxs0tbett.z8.kusto.fabric.microsoft.com/"; //Data explorer Cluster URI + KustoConnectionOptions kustoConnectionOptions = KustoConnectionOptions.builder() + .setManagedIdentityAppId(msiClientId).setClusterUrl(cluster).build(); + String defaultTable = "CryptoRatesHeartbeatTimeBatch"; //Table where the data needs to be written + KustoWriteOptions kustoWriteOptionsHeartbeat = KustoWriteOptions.builder() + .withDatabase(database).withTable(defaultTable).withBatchIntervalMs(30000) + ``` + ++ Then build the project using `mvn clean package`. ++1. Locate the JAR file named 'samples-java-1.0-SNAPSHOT-shaded.jar' under the 'sample-java/target' folder, then upload this JAR file in the Flink UI and submit the job. ++1. Query the Kusto table to verify the output; a small verification sketch also follows below. + + :::image type="content" source="./media/integration-of-azure-data-explorer/kusto-table-to-verify-output.png" alt-text="Screenshot shows query the Kusto table to verify the output." lightbox="./media/integration-of-azure-data-explorer/kusto-table-to-verify-output.png"::: ++ There is no delay in writing the data to the Kusto table from Flink. + |
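As referenced above, the sink table can also be checked programmatically rather than in the portal. The following is a small, hypothetical Python sketch using the `azure-kusto-data` package; the cluster URI is a placeholder, and the database and table names are the ones used in the sample above. It assumes you're signed in with the Azure CLI.

```python
# Hypothetical verification sketch; assumes the azure-kusto-data package and an az CLI login.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<your-adx-cluster>.<region>.kusto.windows.net"  # Data Explorer cluster URI (placeholder)
database = "sdktests"                                              # database used in the sample
query = "CryptoRatesHeartbeatTimeBatch | count"                    # row count of the sink table

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

response = client.execute(database, query)
for row in response.primary_results[0]:
    print(row)  # prints the current row count written by the Flink job
```

Rerunning the query while the Flink job is active should show the count increasing as records are ingested.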
hdinsight-aks | Join Stream Kafka Table Filesystem | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/join-stream-kafka-table-filesystem.md | + + Title: Enrich the events from Kafka with the attributes from FileSystem with Apache Flink +description: Learn how to join stream from Kafka with table from fileSystem using DataStream API ++ Last updated : 08/29/2023+++# Enrich the events from Kafka with attributes from ADLS Gen2 with Apache Flink +++In this article, you can learn how you can enrich the real time events by joining a stream from Kafka with table on ADLS Gen2 using Flink Streaming. We use Flink Streaming API to join events from HDInsight Kafka with attributes from ADLS Gen2, further we use attributes-joined events to sink into another Kafka topic. ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../../hdinsight/kafk) + * You're required to ensure the network settings are taken care as described on [Using HDInsight Kafka](../flink/process-and-consume-data.md); that's to make sure HDInsight on AKS Flink and HDInsight Kafka are in the same VNet +* For this demonstration, we're using a Window VM as maven project develop environment in the same VNet as HDInsight on AKS ++## Kafka topic preparation ++We're creating a topic called `user_events`. ++- The purpose is to read a stream of real-time events from a Kafka topic using Flink. We have every event with the following fields: + ``` + user_id, + item_id, + type, + timestamp, + ``` ++**Kafka 2.4.1** +``` +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user_events --zookeeper zk0-contos:2181 +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user_events_output --zookeeper zk0-contos:2181 +``` ++**Kafka 3.2.0** +``` +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user_events --bootstrap-server wn0-contsk:9092 +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user_events_output --bootstrap-server wn0-contsk:9092 +``` ++## Prepare file on ADLS Gen2 ++We are creating a file called `item attributes` in our storage ++- The purpose is to read a batch of `item attributes` from a file on ADLS Gen2. Each item has the following fields: + ``` + item_id, + brand, + category, + timestamp, + ``` +++## Develop the Apache Flink job ++In this step we perform the following activities +- Enrich the `user_events` topic from Kafka by joining with `item attributes` from a file on ADLS Gen2. +- We push the outcome of this step, as an enriched user activity of events into a Kafka topic. 
++### Develop Maven project ++**pom.xml** ++``` xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> ++ <groupId>contoso.example</groupId> + <artifactId>FlinkKafkaJoinGen2</artifactId> + <version>1.0-SNAPSHOT</version> ++ <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + <kafka.version>3.2.0</kafka.version> //replace with 2.4.1 if you are using HDInsight Kafka 2.4.1 + </properties> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-files --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++**Join the Kafka topic with ADLS Gen2 File** ++**KafkaJoinGen2Demo.java** ++``` java +package contoso.example; ++import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.functions.RichMapFunction; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.api.java.tuple.Tuple4; +import org.apache.flink.api.java.tuple.Tuple7; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema; +import org.apache.flink.connector.kafka.sink.KafkaSink; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; ++import java.io.BufferedReader; +import java.io.FileReader; +import java.util.HashMap; +import java.util.Map; ++public class KafkaJoinGen2Demo { + public static void main(String[] args) throws Exception { 
+ // 1. Set up the stream execution environment + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ // Kafka source configuration, update with your broker IPs + String brokers = "<broker-ip>:9092,<broker-ip>:9092,<broker-ip>:9092"; + String inputTopic = "user_events"; + String outputTopic = "user_events_output"; + String groupId = "my_group"; ++ // 2. Register the cached file, update your container name and storage name + env.registerCachedFile("abfs://<container-name>@<storagename>.dfs.core.windows.net/flink/data/item.txt", "file1"); ++ // 3. Read a stream of real-time user behavior event from a Kafka topic + KafkaSource<String> kafkaSource = KafkaSource.<String>builder() + .setBootstrapServers(brokers) + .setTopics(inputTopic) + .setGroupId(groupId) + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStream<String> kafkaData = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "Kafka Source"); ++ // Parse Kafka source data + DataStream<Tuple4<String, String, String, String>> userEvents = kafkaData.map(new MapFunction<String, Tuple4<String, String, String, String>>() { + @Override + public Tuple4<String, String, String, String> map(String value) throws Exception { + // Parse the line into a Tuple4 + String[] parts = value.split(","); + return new Tuple4<>(parts[0], parts[1], parts[2], parts[3]); + } + }); ++ // 4. Enrich the user activity events by joining the items' attributes from a file + DataStream<Tuple7<String,String,String,String,String,String,String>> enrichedData = userEvents.map(new MyJoinFunction()); ++ // 5. Output the enriched user activity events to a Kafka topic + KafkaSink<String> sink = KafkaSink.<String>builder() + .setBootstrapServers(brokers) + .setRecordSerializer(KafkaRecordSerializationSchema.builder() + .setTopic(outputTopic) + .setValueSerializationSchema(new SimpleStringSchema()) + .build() + ) + .build(); ++ enrichedData.map(value -> value.toString()).sinkTo(sink); ++ // 6. 
Execute the Flink job + env.execute("Kafka Join Batch gen2 file, sink to another Kafka Topic"); + } ++ private static class MyJoinFunction extends RichMapFunction<Tuple4<String,String,String,String>, Tuple7<String,String,String,String,String,String,String>> { + private Map<String, Tuple4<String, String, String, String>> itemAttributes; ++ @Override + public void open(Configuration parameters) throws Exception { + super.open(parameters); ++ // Read the cached file and parse its contents into a map + itemAttributes = new HashMap<>(); + try (BufferedReader reader = new BufferedReader(new FileReader(getRuntimeContext().getDistributedCache().getFile("file1")))) { + String line; + while ((line = reader.readLine()) != null) { + String[] parts = line.split(","); + itemAttributes.put(parts[0], new Tuple4<>(parts[0], parts[1], parts[2], parts[3])); + } + } + } ++ @Override + public Tuple7<String,String,String,String,String,String,String> map(Tuple4<String,String,String,String> value) throws Exception { + Tuple4<String, String, String, String> broadcastValue = itemAttributes.get(value.f1); + if (broadcastValue != null) { + return Tuple7.of(value.f0,value.f1,value.f2,value.f3,broadcastValue.f1,broadcastValue.f2,broadcastValue.f3); + } else { + return null; + } + } + } +} +``` ++## Package jar and submit to Apache Flink ++We're submitting the packaged jar to Flink: +++++### Produce real-time `user_events` topic on Kafka ++ We are able to produce real-time user behavior event `user_events` in Kafka. +++### Consume the `itemAttributes` joining with `user_events` on Kafka ++We are now using `itemAttributes` on filesystem join user activity events `user_events`. +++We continue to produce and consume the user activity and item attributes in the following images ++++## Reference ++[Flink Examples](https://github.com/flink-extended/) |
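As an aside to the screenshots above, events for the `user_events` topic can be produced from any Kafka client. The following is a minimal, hypothetical sketch using the `kafka-python` package that emits comma-separated records matching the `user_id,item_id,type,timestamp` layout the job parses; the broker addresses are placeholders, and the item IDs should match the `item_id` values in the attributes file on ADLS Gen2 for the join to produce enriched output.

```python
# Hypothetical producer sketch; assumes the kafka-python package and network access to the brokers.
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["<broker-ip>:9092"])

# Item IDs should match the item_id values in the item attributes file on ADLS Gen2.
item_ids = ["item1", "item2", "item3"]
event_types = ["view", "click", "purchase"]

for _ in range(100):
    # Comma-separated record in the user_id,item_id,type,timestamp layout the Flink job parses.
    record = (
        f"user{random.randint(1, 5)},{random.choice(item_ids)},"
        f"{random.choice(event_types)},{int(time.time())}"
    )
    producer.send("user_events", value=record.encode("utf-8"))
    time.sleep(0.5)

producer.flush()
```

While this runs, the enriched seven-field tuples should appear on the `user_events_output` topic, as shown in the consumer screenshots above.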
hdinsight-aks | Monitor Changes Postgres Table Flink | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/monitor-changes-postgres-table-flink.md | + + Title: Change Data Capture (CDC) of PostgreSQL table using Apache FlinkSQL +description: Learn how to perform CDC on PostgreSQL table using Apache FlinkSQL CDC ++ Last updated : 08/29/2023+++# Change Data Capture (CDC) of PostgreSQL table using Apache FlinkSQL +++Change Data Capture (CDC) is a technique you can use to track row-level changes in database tables in response to create, update, and delete operations. In this article, we use [CDC Connectors for Apache Flink®](https://github.com/ververica/flink-cdc-connectors), which offer a set of source connectors for Apache Flink. The connectors integrate [Debezium®](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/formats/debezium/#debezium-format) as the engine to capture the data changes. ++Flink supports to interpret Debezium JSON and Avro messages as INSERT/UPDATE/DELETE messages into Apache Flink SQL system. ++This support is useful in many cases to: ++- Synchronize incremental data from databases to other systems +- Audit logs +- Build real-time materialized views on databases +- View temporal join changing history of a database table +++Now, let's learn how to monitor changes on PostgreSQL table using Flink-SQL CDC. The PostgreSQL CDC connector allows for reading snapshot data and incremental data from PostgreSQL database. ++## Prerequisites ++* [Azure PostgresSQL flexible server Version 14.7](/azure/postgresql/flexible-server/overview) +* [HDInsight on AKS Flink 1.16.0](./flink-create-cluster-portal.md) +* Linux virtual Machine to use PostgreSQL client +* Add the NSG rule that allows inbound and outbound connections on port 5432 in HDInsight on AKS pool subnet. ++## Prepare PostgreSQL table & Client ++- Using a Linux virtual machine, install PostgreSQL client using below commands ++ ``` + sudo apt-get update + sudo apt-get install postgresql-client + ``` ++- Install the certificate to connect to PostgreSQL server using SSL ++ `wget --no-check-certificate https://dl.cacerts.digicert.com/DigiCertGlobalRootCA.crt.pem` ++- Connect to the server (replace host, username and database name accordingly) ++ ``` + psql --host=flinkpostgres.postgres.database.azure.com --port=5432 --username=admin --dbname=postgres --set=sslmode=require --set=sslrootcert=DigiCertGlobalRootCA.crt.pem + ``` +- After connecting to the database successfully, create a sample table + ``` + CREATE TABLE shipments ( + shipment_id SERIAL NOT NULL PRIMARY KEY, + order_id SERIAL NOT NULL, + origin VARCHAR(255) NOT NULL, + destination VARCHAR(255) NOT NULL, + is_arrived BOOLEAN NOT NULL + ); + ALTER SEQUENCE public.shipments_shipment_id_seq RESTART WITH 1001; + ALTER TABLE public.shipments REPLICA IDENTITY FULL; + INSERT INTO shipments + VALUES (default,10001,'Beijing','Shanghai',false), + (default,10002,'Hangzhou','Shanghai',false), + (default,10003,'Shanghai','Hangzhou',false); + ``` ++- To enable CDC on PostgreSQL database, you're required to make the following changes. + + - WAL level must be changed to **logical**. This value can be changed in server parameters section on Azure portal. ++ :::image type="content" source="./media/monitor-changes-postgres-table-flink/enable-cdc-on-postgres-database.png" alt-text="Screenshot showing how to enable-cdc-on-postgres-database." 
border="true" lightbox="./media/monitor-changes-postgres-table-flink/enable-cdc-on-postgres-database.png"::: ++ - User accessing the table must have 'REPLICATION' role added ++ ALTER USER `<username>` WITH REPLICATION; ++## Create Apache Flink PostgreSQL CDC table ++- To create Flink PostgreSQL CDC table, download all the dependent jars. Use the `pom.xml` file with the following contents. ++ ```xml + <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <groupId>com.dep.download</groupId> + <artifactId>dep-download</artifactId> + <version>1.0-SNAPSHOT</version> + <!-- https://mvnrepository.com/artifact/com.ververica/flink-sql-connector-sqlserver-cdc --> + <dependencies> + <dependency> + <groupId>com.ververica</groupId> + <artifactId>flink-sql-connector-postgres-cdc</artifactId> + <version>2.3.0</version> + </dependency> + </dependencies> + </project> + ``` +- Use maven command to download all the dependent jars ++ ``` + mvn -DoutputDirectory=target -f pom.xml dependency:copy-dependencies -X + ``` ++ > [!NOTE] + > * If your web ssh pod does not contain maven please follow the links to download and install it. + > * https://maven.apache.org/download.cgi + > * https://maven.apache.org/install.html + > * In order to download jsr jar file use the following command + > * `wget https://repo1.maven.org/maven2/net/java/loci/jsr308-all/1.1.2/jsr308-all-1.1.2.jar` ++- Once the dependent jars are downloaded start the [Flink SQL client](./flink-web-ssh-on-portal-to-flink-sql.md), with these jars to be imported into the session. Complete command as follows, ++ ```sql + /opt/flink-webssh/bin/sql-client.sh -j + /opt/flink-webssh/target/flink-sql-connector-postgres-cdc-2.3.0.jar -j + /opt/flink-webssh/target/slf4j-api-1.7.15.jar -j + /opt/flink-webssh/target/hamcrest-2.1.jar -j + /opt/flink-webssh/target/flink-shaded-guava-30.1.1-jre-16.0.jar -j + /opt/flink-webssh/target/awaitility-4.0.1.jar -j + /opt/flink-webssh/target/jsr308-all-1.1.2.jar + ``` + These commands start the sql client with the dependencies as, ++ :::image type="content" source="./media/monitor-changes-postgres-table-flink/start-the-sql-client.png" alt-text="Screenshot showing start-the-sql-client." border="true" lightbox="./media/monitor-changes-postgres-table-flink/start-the-sql-client.png"::: ++ :::image type="content" source="./media/monitor-changes-postgres-table-flink/sql-client-status.png" alt-text="Screenshot showing sql-client-status." border="true" lightbox="./media/monitor-changes-postgres-table-flink/sql-client-status.png"::: +++- Create a Flink PostgreSQL CDC table using CDC connector ++ ``` + CREATE TABLE shipments ( + shipment_id INT, + order_id INT, + origin STRING, + destination STRING, + is_arrived BOOLEAN, + PRIMARY KEY (shipment_id) NOT ENFORCED + ) WITH ( + 'connector' = 'postgres-cdc', + 'hostname' = 'flinkpostgres.postgres.database.azure.com', + 'port' = '5432', + 'username' = 'username', + 'password' = 'admin', + 'database-name' = 'postgres', + 'schema-name' = 'public', + 'table-name' = 'shipments', + 'decoding.plugin.name' = 'pgoutput' + ); + ``` +## Validation ++- Run 'select *' command to monitor the changes. + + `select * from shipments;` ++ :::image type="content" source="./media/monitor-changes-postgres-table-flink/run-select-command.png" alt-text="Screenshot showing how to run-select-command." 
border="true" lightbox="./media/monitor-changes-postgres-table-flink/run-select-command.png"::: ++### Reference ++[PostgreSQL CDC Connector](https://ververica.github.io/flink-cdc-connectors/release-2.1/content/connectors/postgres-cdc.html) is licensed under [Apache 2.0 License](https://github.com/ververica/flink-cdc-connectors/blob/master/LICENSE) |
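To see the change stream in action, apply a few changes to the source table from the PostgreSQL client; the running `select * from shipments;` query above should show the corresponding insert, update, and delete rows. A minimal example against the sample `shipments` table created earlier:

```
INSERT INTO shipments
VALUES (default,10004,'Shanghai','Beijing',false);

UPDATE shipments SET is_arrived = true WHERE shipment_id = 1001;

DELETE FROM shipments WHERE shipment_id = 1002;
```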
hdinsight-aks | Process And Consume Data | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/process-and-consume-data.md | + + Title: Using HDInsight Kafka with HDInsight on AKS Apache Flink +description: Learn how to use HDInsight Kafka with HDInsight on AKS Apache Flink ++ Last updated : 08/29/2023++ +# Using HDInsight Kafka with HDInsight on AKS - Apache Flink +++A well known use case for Apache Flink is stream analytics. The popular choice by many users to use the data streams, which are ingested using Apache Kafka. Typical installations of Flink and Kafka start with event streams being pushed to Kafka, which can be consumed by Flink jobs. ++This example uses HDInsight on AKS Flink 1.16.0 to process streaming data consuming and producing Kafka topic. ++> [!NOTE] +> FlinkKafkaConsumer is deprecated and will be removed with Flink 1.17, please use KafkaSource instead. +> FlinkKafkaProducer is deprecated and will be removed with Flink 1.15, please use KafkaSink instead. ++## Prerequisites ++* Both Kafka and Flink need to be in the same VNet or there should be vnet-peering between the two clusters. +* [Creation of VNet](../../hdinsight/hdinsight-create-virtual-network.md). +* [Create a Kafka cluster in the same VNet](../../hdinsight/kafk). You can choose Kafka 3.2 or 2.4 on HDInsight based on your current usage. ++ :::image type="content" source="./media/process-consume-data/create-kafka-cluster-in-the-same-vnet.png" alt-text="Screenshot showing how to create a Kafka cluster in the same VNet." border="true" lightbox="./media/process-consume-data/create-kafka-cluster-in-the-same-vnet.png"::: + +* Add the VNet details in the virtual network section. +* Create a [HDInsight on AKS Cluster pool](../quickstart-create-cluster.md) with same VNet. +* Create a Flink cluster to the cluster pool created. ++## Apache Flink-Kafka Connector ++Flink provides an [Apache Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka/) for reading data from and writing data to Kafka topics with exactly once guarantees. ++**Maven dependency** +``` xml + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>1.16.0</version> + </dependency> +``` ++## Building Kafka Sink ++Kafka sink provides a builder class to construct an instance of a KafkaSink. We use the same to construct our Sink and use it along with HDInsight on AKS Flink ++**SinKafkaToKafka.java** +``` java +import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.functions.MapFunction; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.connector.base.DeliveryGuarantee; ++import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema; +import org.apache.flink.connector.kafka.sink.KafkaSink; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; ++import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSource; +import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; ++public class SinKafkaToKafka { + public static void main(String[] args) throws Exception { + // 1. 
get stream execution environment + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); ++ // 2. read kafka message as stream input, update your broker IPs below + String brokers = "X.X.X.X:9092,X.X.X.X:9092,X.X.X.X:9092"; + KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers(brokers) + .setTopics("clicks") + .setGroupId("my-group") + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStream<String> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); + + // 3. transformation: + // https://www.taobao.com,1000 > + // Event{user: "Tim",url: "https://www.taobao.com",timestamp: 1970-01-01 00:00:01.0} + SingleOutputStreamOperator<String> result = stream.map(new MapFunction<String, String>() { + @Override + public String map(String value) throws Exception { + String[] fields = value.split(","); + return new Event(fields[0].trim(), fields[1].trim(), Long.valueOf(fields[2].trim())).toString(); + } + }); ++ // 4. sink click into another kafka events topic + KafkaSink<String> sink = KafkaSink.<String>builder() + .setBootstrapServers(brokers) + .setProperty("transaction.timeout.ms","900000") + .setRecordSerializer(KafkaRecordSerializationSchema.builder() + .setTopic("events") + .setValueSerializationSchema(new SimpleStringSchema()) + .build()) + .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE) + .build(); ++ result.sinkTo(sink); ++ // 5. execute the stream + env.execute("kafka Sink to other topic"); + } +} +``` +**Writing a Java program Event.java** +``` java +import java.sql.Timestamp; ++public class Event { ++ public String user; + public String url; + public Long timestamp; ++ public Event() { + } ++ public Event(String user,String url,Long timestamp) { + this.user = user; + this.url = url; + this.timestamp = timestamp; + } ++ @Override + public String toString(){ + return "Event{" + + "user: \"" + user + "\"" + + ",url: \"" + url + "\"" + + ",timestamp: " + new Timestamp(timestamp) + + "}"; + } +} +``` +## Package the jar and submit the job to Flink ++++## Produce the topic - clicks on Kafka +++## Consume the topic - events on Kafka +++## Reference ++* [Apache Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka) |
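To try the job end to end, you can produce a few `clicks` records in the `user,url,timestamp` format that the `MapFunction` parses, and watch the transformed `Event` strings arrive on the `events` topic. The host name below follows the sample cluster naming used in this article; replace it with your own broker.

```
# Produce input in the form user,url,epoch-millis
echo "Tim,https://www.taobao.com,1000" | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic clicks

# Consume the transformed events from the output topic
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server wn0-contsk:9092 --topic events --from-beginning
```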
hdinsight-aks | Sink Kafka To Kibana | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/sink-kafka-to-kibana.md | + + Title: Use Elasticsearch along with HDInsight on AKS - Apache Flink +description: Learn how to use Elasticsearch along HDInsight on AKS - Apache Flink ++ Last updated : 08/29/2023+++# Using Elasticsearch with HDInsight on AKS - Apache Flink +++Flink for real-time analytics can be used to build a dashboard application that visualizes the streaming data using Elasticsearch and Kibana. ++Flink can be used to analyze a stream of taxi ride events and compute metrics. Metrics can include number of rides per hour, the average fare per ride, or the most popular pickup locations. You can write these metrics to an Elasticsearch index using a Flink sink and use Kibana to connect and create charts or dashboards to display metrics in real-time. ++In this article, learn how to Use Elastic along HDInsight Flink. ++## Elasticsearch and Kibana ++Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including +* Textual +* Numerical +* Geospatial +* Structured +* Unstructured. ++Kibana is a free and open frontend application that sits on top of the elastic stack, providing search and data visualization capabilities for data indexed in Elasticsearch. ++For more information, refer +* [Elasticsearch](https://www.elastic.co) +* [Kibana](https://www.elastic.co/what-is/kibana) +++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](./flink-create-cluster-portal.md) +* Elasticsearch-7.13.2 +* Kibana-7.13.2 +* [HDInsight 5.0 - Kafka 2.4.1](../../hdinsight/kafk) +* IntelliJ IDEA for development on an Azure VM which in the same Vnet +++### How to Install Elasticsearch on Ubuntu 20.04 ++- APT Update & Install OpenJDK +- Add Elastic Search GPG key and Repository + - Steps for adding the GPG key + ``` + sudo apt-get install apt-transport-https + wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg + ``` + - Add Repository + ``` + echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list + ``` +- Run system update +``` +sudo apt update +``` ++- Install ElasticSearch on Ubuntu 20.04 Linux +``` +sudo apt install elasticsearch +``` +- Start ElasticSearch Services + + - Reload Daemon: + ``` + sudo systemctl daemon-reload + ``` + - Enable + ``` + sudo systemctl enable elasticsearch + ``` + - Start + ``` + sudo systemctl start elasticsearch + ``` + - Check Status + ``` + sudo systemctl status elasticsearch + ``` + - Stop + ``` + sudo systemctl stop elasticsearch + ``` ++### How to Install Kibana on Ubuntu 20.04 ++For installing and configuring Kibana Dashboard, we donΓÇÖt need to add any other repository because the packages are available through the already added ElasticSearch. 
++We use the following command to install Kibana ++``` +sudo apt install kibana +``` ++- Reload daemon + ``` + sudo systemctl daemon-reload + ``` + - Start and Enable: + ``` + sudo systemctl enable kibana + sudo systemctl start kibana + ``` + - To check the status: + ``` + sudo systemctl status kibana + ``` +### Access the Kibana Dashboard web interface ++In order to make Kibana accessible from output, need to set network.host to 0.0.0.0 ++configure /etc/kibana/kibana.yml on Ubuntu VM ++> [!NOTE] +> 10.0.1.4 is a local private IP, that we have used which can be accessed in maven project develop Windows VM. You're required to make modifications according to your network security requirements. We use the same IP later to demo for performing analytics on Kibana. ++``` +server.host: "0.0.0.0" +server.name: "elasticsearch" +server.port: 5601 +elasticsearch.hosts: ["http://10.0.1.4:9200"] +``` +++## Prepare Click Events on HDInsight Kafka ++We use python output as input to produce the streaming data ++``` +sshuser@hn0-contsk:~$ python weblog.py | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic click_events +``` +Now, lets check messages in this topic ++``` +sshuser@hn0-contsk:~$ /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server wn0-contsk:9092 --topic click_events +``` +``` +{"userName": "Tim", "visitURL": "https://www.bing.com/new", "ts": "07/31/2023 05:47:12"} +{"userName": "Luke", "visitURL": "https://github.com", "ts": "07/31/2023 05:47:12"} +{"userName": "Zark", "visitURL": "https://github.com", "ts": "07/31/2023 05:47:12"} +{"userName": "Zark", "visitURL": "https://docs.python.org", "ts": "07/31/2023 05:47:12"} +``` +++## Creating Kafka Sink to Elastic ++Let us write maven source code on the Windows VM ++**Main: kafkaSinkToElastic.java** +``` java +import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.http.HttpHost; +import org.elasticsearch.action.index.IndexRequest; +import org.elasticsearch.client.Requests; ++import java.util.HashMap; +import java.util.Map; ++public class kafkaSinkToElastic { + public static void main(String[] args) throws Exception { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1); ++ // 1. read kafka message + String kafka_brokers = "<broker1 IP>:9092,<broker2 IP>:9092,<broker3 IP>:9092"; + KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers(kafka_brokers) + .setTopics("click_events") + .setGroupId("my-group") + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStream<String> kafka = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); ++ // 2. sink to elasticsearch + kafka.sinkTo( + new Elasticsearch7SinkBuilder<String>() + .setBulkFlushMaxActions(1) + .setHosts(new HttpHost("10.0.1.4", 9200, "http")) + .setEmitter( + (element, context, indexer) -> indexer.add(createIndexRequest(element))) + .build()); ++ // 3. 
execute stream + env.execute("Kafka Sink To Elastic"); ++ } + private static IndexRequest createIndexRequest(String element) { + String[] logContent =element.replace("{","").replace("}","").split(","); + Map<String, String> esJson = new HashMap<>(); + esJson.put("username", logContent[0]); + esJson.put("visitURL", logContent[1]); + esJson.put("ts", logContent[2]); + return Requests.indexRequest() + .index("kafka_user_clicks") + .id(element) + .source(esJson); + } +} +``` ++**Creating a pom.xml on Maven** ++``` xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <groupId>contoso.example</groupId> + <artifactId>FlinkElasticSearch</artifactId> + <version>1.0-SNAPSHOT</version> + <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <kafka.version>3.2.0</kafka.version> + </properties> + <dependencies> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-core</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-elasticsearch-base --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-elasticsearch7</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++**Package the jar and submit to Flink to run on WebSSH** ++On [Secure Shell for Flink](./flink-web-ssh-on-portal-to-flink-sql.md), you can use the following commands ++``` +msdata@pod-0 [ ~ ]$ ls -l FlinkElasticSearch-1.0-SNAPSHOT.jar +-rw-r-- 1 msdata msdata 114616575 Jul 31 06:09 FlinkElasticSearch-1.0-SNAPSHOT.jar +msdatao@pod-0 [ ~ ]$ bin/flink run -c contoso.example.kafkaSinkToElastic -j FlinkElasticSearch-1.0-SNAPSHOT.jar +Job has been submitted with JobID e0eba72d5143cea53bcf072335a4b1cb +``` +## Start Elasticsearch and Kibana to perform analytics on Kibana ++**startup Elasticsearch and Kibana on Ubuntu 
VM and use Kibana to visualize results** ++- Access Kibana at the IP address that you set earlier. +- Configure an index pattern by clicking **Stack Management** in the left-side toolbar, find **Index Patterns**, then click **Create Index Pattern** and enter the full index name kafka_user_clicks to create the index pattern. +++- Once the index pattern is set up, you can explore the data in Kibana. + - Click "Discover" in the left-side toolbar. + + :::image type="content" source="./media/sink-kafka-to-kibana/kibana-discover.png" alt-text="Screenshot showing how to navigate to discover button." lightbox="./media/sink-kafka-to-kibana/kibana-discover.png"::: + + - Kibana lists the content of the created index with the kafka-click-events. + + :::image type="content" source="./media/sink-kafka-to-kibana/elastic-discover-kafka-click-events.png" alt-text="Screenshot showing elastic with the created index with the kafka-click-events." lightbox="./media/sink-kafka-to-kibana/elastic-discover-kafka-click-events.png" ::: + +- Let us create a dashboard to display various views. + +- Let's use an **Area** (area graph), then select the **kafka_click_events** index and edit the horizontal axis and vertical axis to illustrate the events. + ++- If we set an auto refresh or click **Refresh**, the plot updates in real time because the Flink streaming job keeps producing data. ++++## Validation on Apache Flink Job UI ++You can find the job in a running state on your Flink Web UI. +++## Reference +* [Apache Kafka SQL Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/kafka) +* [Elasticsearch SQL Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/elasticsearch) |
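Besides Kibana, you can confirm that documents are reaching the index by querying Elasticsearch directly from the Ubuntu VM. The address and index name below are the ones used in this walkthrough (10.0.1.4 and kafka_user_clicks); adjust them to your environment.

```
# Count documents indexed so far
curl "http://10.0.1.4:9200/kafka_user_clicks/_count?pretty"

# Fetch a few sample documents
curl "http://10.0.1.4:9200/kafka_user_clicks/_search?pretty&size=3"
```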
hdinsight-aks | Sink Sql Server Table Using Flink Sql | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/sink-sql-server-table-using-flink-sql.md | + + Title: Change Data Capture (CDC) of SQL Server using Apache Flink SQL +description: Learn how to perform CDC of SQL Server using Apache Flink SQL ++ Last updated : 08/29/2023+++# Change Data Capture (CDC) of SQL Server using Apache Flink SQL +++Change Data Capture (CDC) is a technique you can use to track row-level changes in database tables in response to create, update, and delete operations. In this article, we use [CDC Connectors for Apache Flink®](https://github.com/ververica/flink-cdc-connectors), which offer a set of source connectors for Apache Flink. The connectors integrate [Debezium®](https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/formats/debezium/#debezium-format) as the engine to capture the data changes. ++Flink supports to interpret Debezium JSON and Avro messages as INSERT/UPDATE/DELETE messages into Flink SQL system. ++This support is useful in many cases to: ++- Synchronize incremental data from databases to other systems +- Audit logs +- Build real-time materialized views on databases +- View temporal join changing history of a database table ++Now, let us learn how to use Change Data Capture (CDC) of SQL Server using Flink SQL. The SQLServer CDC connector allows for reading snapshot data and incremental data from SQLServer database. ++## Prerequisites + * [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) + * [Azure SQL Server](/azure/azure-sql/azure-sql-iaas-vs-paas-what-is-overview) ++### Apache Flink SQLServer CDC Connector ++The SQLServer CDC connector is a Flink Source connector, which reads database snapshot first and then continues to read change events with exactly once processing even failures happen. This example uses FLINK CDC to create a SQLServerCDC table on FLINK SQL ++### Use SSH to use Flink SQL-client ++We have already covered this section in detail on how to use [secure shell](./flink-web-ssh-on-portal-to-flink-sql.md) with Flink. ++## Prepare table and enable cdc feature on SQL Server sqldb ++Let us prepare a table and enable the CDC, You can refer the detailed steps listed on [SQL Documentation](/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server?) ++**Create a database** +``` SQL +CREATE DATABASE inventory; +GO +``` ++**Enable CDC on the SQL Server database** ++``` SQL +USE inventory; +EXEC sys.sp_cdc_enable_db; +GO +``` ++**Verify that the user has access to the CDC table** +``` SQL +USE inventory +GO +EXEC sys.sp_cdc_help_change_data_capture +GO +``` ++> [!NOTE] +> The query returns configuration information for each table in the database (enabled for CDC). If the result is empty, verify that the user has privileges to access both the capture instance as well as the CDC tables. 
+++**Create and populate our products using a single insert with many rows** ++``` SQL +CREATE TABLE products ( +id INTEGER IDENTITY(101,1) NOT NULL PRIMARY KEY, +name VARCHAR(255) NOT NULL, +description VARCHAR(512), +weight FLOAT +); ++INSERT INTO products(name,description,weight) +VALUES ('scooter','Small 2-wheel scooter',3.14); +INSERT INTO products(name,description,weight) +VALUES ('car battery','12V car battery',8.1); +INSERT INTO products(name,description,weight) +VALUES ('12-pack drill bits','12-pack of drill bits with sizes ranging from #40 to #3',0.8); +INSERT INTO products(name,description,weight) +VALUES ('hammer','12oz carpenter''s hammer',0.75); +INSERT INTO products(name,description,weight) +VALUES ('hammer','14oz carpenter''s hammer',0.875); +INSERT INTO products(name,description,weight) +VALUES ('hammer','16oz carpenter''s hammer',1.0); +INSERT INTO products(name,description,weight) +VALUES ('rocks','box of assorted rocks',5.3); +INSERT INTO products(name,description,weight) +VALUES ('jacket','water resistent black wind breaker',0.1); +INSERT INTO products(name,description,weight) +VALUES ('spare tire','24 inch spare tire',22.2); ++EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'products', @role_name = NULL, @supports_net_changes = 0; ++-- Creating simple orders on SQL Table ++CREATE TABLE orders ( +id INTEGER IDENTITY(10001,1) NOT NULL PRIMARY KEY, +order_date DATE NOT NULL, +purchaser INTEGER NOT NULL, +quantity INTEGER NOT NULL, +product_id INTEGER NOT NULL, +FOREIGN KEY (product_id) REFERENCES products(id) +); ++INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('16-JAN-2016', 1001, 1, 102); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('17-JAN-2016', 1002, 2, 105); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('19-FEB-2016', 1002, 2, 106); +INSERT INTO orders(order_date,purchaser,quantity,product_id) +VALUES ('21-FEB-2016', 1003, 1, 107); ++EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'orders', @role_name = NULL, @supports_net_changes = 0; +GO +``` +## Download SQLServer CDC connector and its dependencies on SSH ++**WSL to ubuntu on local to check all dependencies related *flink-sql-connector-sqlserver-cdc* jar** ++``` +myvm@MININT-481C9TJ:/mnt/c/Work/99_tools/apache-maven-3.9.0/bin$ vim pom.xml ++<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <groupId>com.dep.download</groupId> + <artifactId>dep-download</artifactId> + <version>1.0-SNAPSHOT</version> +<!-- https://mvnrepository.com/artifact/com.ververica/flink-sql-connector-sqlserver-cdc --> + <dependency> + <groupId>com.ververica</groupId> + <artifactId>flink-sql-connector-sqlserver-cdc</artifactId> + <version>2.3.0</version> + </dependency> +</project> ++myvm@MININT-481C9TJ:/mnt/c/Work/99_tools/apache-maven-3.9.0/bin$ mkdir target ++myvm@MININT-481C9TJ:/mnt/c/Work/99_tools/apache-maven-3.9.0/bin$ /mnt/c/Work/99_tools/apache-maven-3.9.0/bin/mvn -DoutputDirectory=target -f pom.xml dependency:copy-dependencies +[INFO] Scanning for projects... 
++myvm@MININT-481C9TJ:/mnt/c/Work/99_tools/apache-maven-3.9.0/bin$ cd target +myvm@MININT-481C9TJ:/mnt/c/Work/99_tools/apache-maven-3.9.0/bin/target$ ll +total 19436 +drwxrwxrwx 1 msdata msdata 4096 Feb 9 08:39 ./ +drwxrwxrwx 1 msdata msdata 4096 Feb 9 08:37 ../ +-rwxrwxrwx 1 msdata msdata 85388 Feb 9 08:39 awaitility-4.0.1.jar* +-rwxrwxrwx 1 msdata msdata 3085931 Feb 9 08:39 flink-shaded-guava-30.1.1-jre-16.0.jar* +-rwxrwxrwx 1 msdata msdata 16556459 Feb 9 08:39 flink-sql-connector-sqlserver-cdc-2.3.0.jar* +-rwxrwxrwx 1 msdata msdata 123103 Feb 9 08:39 hamcrest-2.1.jar* +-rwxrwxrwx 1 msdata msdata 40502 Feb 9 08:39 slf4j-api-1.7.15.jar* +``` +**Let us download jars to SSH** +```sql +wget https://repo1.maven.org/maven2/com/ververica/flink-connector-sqlserver-cdc/2.4.0/flink-connector-sqlserver-cdc-2.4.0.jar +wget https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-guava/30.1.1-jre-16.0/flink-shaded-guava-30.1.1-jre-16.0.jar +wget https://repo1.maven.org/maven2/org/awaitility/awaitility/4.0.1/awaitility-4.0.1.jar +wget https://repo1.maven.org/maven2/org/hamcrest/hamcrest/2.1/hamcrest-2.1.jar +wget https://repo1.maven.org/maven2/net/java/loci/jsr308-all/1.1.2/jsr308-all-1.1.2.jar ++msdata@pod-0 [ ~/jar ]$ ls -l +total 6988 +-rw-r-- 1 msdata msdata 85388 Sep 6 2019 awaitility-4.0.1.jar +-rw-r-- 1 msdata msdata 107097 Jun 25 03:47 flink-connector-sqlserver-cdc-2.4.0.jar +-rw-r-- 1 msdata msdata 3085931 Sep 27 2022 flink-shaded-guava-30.1.1-jre-16.0.jar +-rw-r-- 1 msdata msdata 123103 Dec 20 2018 hamcrest-2.1.jar +-rw-r-- 1 msdata msdata 3742993 Mar 30 2011 jsr308-all-1.1.2.jar +``` ++### Add jar into sql-client.sh and connect to Flink SQL Client ++```sql +msdata@pod-0 [ ~ ]$ bin/sql-client.sh -j jar/flink-sql-connector-sqlserver-cdc-2.4.0.jar -j jar/flink-shaded-guava-30.1.1-jre-16.0.jar -j jar/hamcrest-2.1.jar -j jar/awaitility-4.0.1.jar -j jar/jsr308-all-1.1.2.jar +``` +## Create SQLServer CDC table ++``` sql +SET 'sql-client.execution.result-mode' = 'tableau'; ++CREATE TABLE orders ( + id INT, + order_date DATE, + purchaser INT, + quantity INT, + product_id INT, + PRIMARY KEY (id) NOT ENFORCED +) WITH ( + 'connector' = 'sqlserver-cdc', + 'hostname' = '<updatehostname>.database.windows.net', //update with the host name + 'port' = '1433', + 'username' = '<update-username>', //update with the user name + 'password' = '<update-password>', //update with the password + 'database-name' = 'inventory', + 'table-name' = 'dbo.orders' +); ++select * from orders; +``` +++### Perform changes on table from SQLServer side +++## Validation ++Monitor the table on Flink SQL ++++### Reference +* [SQLServer CDC Connector](https://ververica.github.io/flink-cdc-connectors/master/content/connectors/sqlserver-cdc.html) is licensed under [Apache 2.0 License](https://github.com/ververica/flink-cdc-connectors/blob/master/LICENSE) |
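For the "Perform changes on table from SQLServer side" step, any DML against the `orders` table is captured by the connector. For example, run statements like the following in the `inventory` database and watch the `select * from orders;` output update in the Flink SQL client; the values are illustrative and reuse the sample schema created earlier.

```sql
INSERT INTO orders(order_date,purchaser,quantity,product_id)
VALUES ('22-FEB-2016', 1004, 3, 103);

UPDATE orders SET quantity = 5 WHERE id = 10001;

DELETE FROM orders WHERE id = 10002;
```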
hdinsight-aks | Use Apache Nifi With Datastream Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-apache-nifi-with-datastream-api.md | + + Title: Use Apache NiFi with HDInsight on AKS Apache Flink to publish into ADLS Gen2 +description: Learn how to use Apache NiFi to consume Processed Kafka topic from HDInsight Apache Flink on AKS and publish into ADLS Gen2 ++ Last updated : 08/29/2023+++# Use Apache NiFi to consume processed Kafka topics from Apache Flink and publish into ADLS Gen2 +++Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. ++For more information, see [Apache NiFi](https://nifi.apache.org) ++In this document, we process streaming data using HDInsight Kafka and perform some transformations on HDInsight Apache Flink on AKS, consume these topics and write the contents into ADLS Gen2 on Apache NiFi. ++By combining the low latency streaming features of Apache Flink and the dataflow capabilities of Apache NiFi, you can process events at high volume. This combination helps you to trigger, enrich, filter, to enhance overall user experience. Both these technologies complement each other with their strengths in event streaming and correlation. ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../../hdinsight/kafk) + * You're required to ensure the network settings are taken care as described on [Using HDInsight Kafka](../flink/process-and-consume-data.md); that's to make sure HDInsight on AKS Flink and HDInsight Kafka are in the same VNet +* For this demonstration, we're using a Window VM as maven project develop env in the same VNET as HDInsight on AKS +* For this demonstration, we're using an Ubuntu VM in the same VNET as HDInsight on AKS, install Apache NiFi 1.22.0 on this VM ++## Prepare HDInsight Kafka topic ++For purposes of this demonstration, we're using a HDInsight Kafka Cluster, let us prepare HDInsight Kafka topic for the demo. ++> [!NOTE] +> Setup a HDInsight [Kafka](../../hdinsight/kafk) Cluster and Replace broker list with your own list before you get started for both Kafka 2.4 and 3.2. ++**HDInsight Kafka 2.4.1** +``` +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic click_events --zookeeper zk0-contsk:2181 +``` ++**HDInsight Kafka 3.2.0** +``` +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic click_events --bootstrap-server wn0-contsk:9092 +``` +## Setup Apache NiFi 1.22.0 ++For this demo, we install Apache NiFi 1.22.0 on an Ubuntu VM in the same VNet as HDInsight Flink on AKS, or you can also use your NiFi setup. ++[Apache NiFi Downloads](https://nifi.apache.org/download.html) ++``` +root@contosoubuntuvm:/home/myvm/nifi-1.22.0/bin# ./nifi.sh start ++Java home: /home/myvm/jdk-18.0.1.1 +NiFi home: /home/myvm/nifi-1.22.0 ++Bootstrap Config File: /home/myvm/nifi-1.22.0/conf/bootstrap.conf +++root@contosoubuntuvm:/home/myvm/nifi-1.22.0/bin# jps +454421 NiFi +454467 Jps +454396 RunNiFi +``` ++**Configuring NiFi UI** ++Here, we configure NiFi properties in order to be accessed outside the localhost VM. 
++`$nifi_home/conf/nifi.properties` +++## Process streaming data from HDInsight Kafka On HDInsight on AKS Flink ++Let us develop the source code on Maven, to build the jar. ++**SinkToKafka.java** ++``` java +package contoso.example; ++import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.api.common.typeinfo.Types; +import org.apache.flink.connector.base.DeliveryGuarantee; +import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema; +import org.apache.flink.connector.kafka.sink.KafkaSink; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSource; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; +++public class SinkToKafka { + public static void main(String[] args) throws Exception { + // 1. get stream env, update the broker-ips with your own + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + String brokers = "<update-brokerip>:9092,<update-brokerip>:9092,<update-brokerip>:9092"; // Replace the broker list with your own ++ // 2. event data source + DataStreamSource<Event> stream = env.addSource(new ClickSource()); ++ DataStream<String> dataStream = stream.map(line-> { + String str1 = line.toString(); + return str1; + }).returns(Types.STRING); ++ // 3. sink click events to kafka + KafkaSink<String> sink = KafkaSink.<String>builder() + .setBootstrapServers(brokers) + .setRecordSerializer(KafkaRecordSerializationSchema.builder() + .setTopic("click_events") + .setValueSerializationSchema(new SimpleStringSchema()) + .build() + ) + .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) + .build(); ++ dataStream.sinkTo(sink); + env.execute("Sink click events to Kafka"); + } +} +``` ++**Event.java** +``` java +import java.sql.Timestamp; ++public class Event { ++ public String user; + public String url; + public String ts; + public Event() { + } ++ public Event(String user, String url, String ts) { + this.user = user; + this.url = url; + this.ts = ts; + } ++ @Override + public String toString(){ + return "\"" + ts + "\"" + "," + "\"" + user + "\"" + "," + "\"" + url + "\""; + } +} +``` ++**ClickSource.java** +``` java +import org.apache.flink.streaming.api.functions.source.SourceFunction; +import java.util.Calendar; +import java.util.Random; ++public class ClickSource implements SourceFunction<Event> { + // declare a flag + private Boolean running = true; ++ // declare a flag + public void run(SourceContext<Event> ctx) throws Exception{ + // generate random record + Random random = new Random(); + String[] users = {"Mary","Alice","Bob","Cary"}; + String[] urls = {"./home","./cart","./fav","./prod?id=100","./prod?id=10"}; ++ // loop generate + while (running) { + String user = users[random.nextInt(users.length)]; + String url = urls[random.nextInt(urls.length)]; + Long timestamp = Calendar.getInstance().getTimeInMillis(); + String ts = timestamp.toString(); + ctx.collect(new Event(user,url,ts)); +// Thread.sleep(2000); + } + } + @Override + public void cancel() + { + running = false; + } +} +``` +**Maven pom.xml** ++You can replace 2.4.1 with 3.2.0 in case you're using HDInsight Kafka 3.2.0, where applicable on the pom.xml ++``` xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + 
<modelVersion>4.0.0</modelVersion> ++ <groupId>org.example</groupId> + <artifactId>FlinkDemoKafka</artifactId> + <version>1.0-SNAPSHOT</version> + <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + <kafka.version>2.4.1</kafka.version> > Replace 2.4.1 with 3.2.0 , in case you're using HDInsight Kafka 3.2.0 + </properties> + <dependencies> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-files --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` ++## Submit streaming job to HDInsight on AKS - Flink ++Now, lets submit streaming job as mentioned in the previous step into HDInsight on AKS - Flink +++## Check the topic on HDInsight Kafka ++Check the topic on HDInsight Kafka. ++``` +root@hn0-contos:/home/sshuser# /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --topic click_events --bootstrap-server wn0-contos:9092 +"1685939238525","Cary","./home" +"1685939240527","Bob","./fav" +"1685939242528","Cary","./prod?id=10" +"1685939244528","Mary","./prod?id=100" +"1685939246529","Alice","./fav" +"1685939248530","Mary","./cart" +"1685939250530","Mary","./prod?id=100" +"1685939252530","Alice","./prod?id=100" +"1685939254530","Alice","./prod?id=10" +"1685939256530","Cary","./prod?id=100" +"1685939258531","Mary","./prod?id=10" +"1685939260531","Cary","./home" +"1685939262531","Mary","./prod?id=10" +"1685939264531","Cary","./prod?id=100" +"1685939266532","Mary","./cart" +"1685939268532","Bob","./fav" +"1685939270532","Mary","./home" +"1685939272533","Cary","./fav" +"1685939274533","Alice","./cart" +"1685939276533","Bob","./prod?id=10" +"1685939278533","Mary","./cart" +"1685939280533","Alice","./fav" +``` ++## Create flow on NiFi UI ++> [!NOTE] +> In this example, we use Azure User Managed Identity to credentials for ADLS Gen2. ++In this demonstration, we have used Apache NiFi instance installed on an Ubuntu VM. We're accessing the NiFi web interface from a Windows VM. 
The Ubuntu VM needs to have a managed identity assigned to it and network security group (NSG) rules configured. ++To use managed identity authentication with the PutAzureDataLakeStorage processor in NiFi, you're required to ensure that the Ubuntu VM on which NiFi is installed has a managed identity assigned to it; if it doesn't, assign a managed identity to the Ubuntu VM. +++Once you have assigned a managed identity to the Azure VM, you need to make sure that the VM can connect to the IMDS (Instance Metadata Service) endpoint. The IMDS endpoint is available at the IP address shown in this example. You need to update your network security group rules to allow outbound traffic from the Ubuntu VM to this IP address. +++**Run the flow:** +++[**Using Processor ConsumeKafka_2_0's properties setting:**](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.22.0/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0/index.html) ++++[**Using Processor PutAzureDataLakeStorage properties setting:**](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-azure-nar/1.22.0/org.apache.nifi.processors.azure.storage.PutAzureDataLakeStorage/index.html) +++[**Using PutAzureDataLakeStorage credential setting:**](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-azure-nar/1.22.0/org.apache.nifi.services.azure.storage.ADLSCredentialsControllerService/index.html) +++### Let's check the output in ADLS Gen2 +++## Reference ++* [Apache NiFi](https://nifi.apache.org) +* [Apache NiFi Downloads](https://nifi.apache.org/download.html) +* [Consume Kafka](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.11.4/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0/index.html) +* [Azure Data Lake Storage](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-azure-nar/1.12.0/org.apache.nifi.processors.azure.storage.PutAzureDataLakeStorage/index.html) +* [ADLS Credentials Controller Service](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-azure-nar/1.12.0/org.apache.nifi.services.azure.storage.ADLSCredentialsControllerService/index.html) +* [Download IntelliJ IDEA for development](https://www.jetbrains.com/idea/download/#section=windows) |
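If you prefer a command-line check over the portal, you can list the files written by the PutAzureDataLakeStorage processor with the Azure CLI. The container, storage account, and directory names below are placeholders for your own ADLS Gen2 setup.

```
az storage fs file list \
  --file-system <container-name> \
  --account-name <storage-account-name> \
  --path <output-directory> \
  --auth-mode login \
  --output table
```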
hdinsight-aks | Use Azure Pipelines To Run Flink Jobs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-azure-pipelines-to-run-flink-jobs.md | + + Title: How to use Azure Pipelines with HDInsight on AKS - Flink +description: Learn how to use Azure Pipelines with HDInsight on AKS - Flink ++ Last updated : 09/25/2023+++# How to use Azure Pipelines with HDInsight on AKS - Flink +++In this article, you'll learn how to use Azure Pipelines with HDInsight on AKS to submit Flink jobs via the cluster's REST API. We guide you through the process using a sample YAML pipeline and a PowerShell script, both of which streamline the automation of the REST API interactions. +++## Prerequisites ++- Azure subscription. If you do not have an Azure subscription, create a free account. ++- A GitHub account where you can create a repository. [Create one for free](https://azure.microsoft.com/free). ++- Create `.pipeline` directory, copy [flink-azure-pipelines.yml](https://hdiconfigactions.blob.core.windows.net/hiloflinkblob/flink-azure-pipelines.yml) and [flink-job-azure-pipeline.ps1](https://hdiconfigactions.blob.core.windows.net/hiloflinkblob/flink-job-azure-pipeline.ps1) ++- Azure DevOps organization. Create one for free. If your team already has one, then make sure you are an administrator of the Azure DevOps project that you want to use. ++- Ability to run pipelines on Microsoft-hosted agents. To use Microsoft-hosted agents, your Azure DevOps organization must have access to Microsoft-hosted parallel jobs. You can either purchase a parallel job or you can request a free grant. ++- A Flink Cluster. If you don’t have one, [Create a Flink Cluster in HDInsight on AKS](flink-create-cluster-portal.md). ++- Create one directory in cluster storage account to copy job jar. This directory later you need to configure in pipeline YAML for job jar location (<JOB_JAR_STORAGE_PATH>). ++## Steps to set up pipeline ++### Create a service principal for Azure Pipelines ++ Create [Azure AD Service Principal](/cli/azure/ad/sp/) to access Azure – Grant permission to access HDInsight on AKS Cluster with Contributor role, make a note of appId, password, and tenant from the response. + ``` + az ad sp create-for-rbac -n <service_principal_name> --role Contributor --scopes <Flink Cluster Resource ID>` + ``` + + Example: ++ ``` + az ad sp create-for-rbac -n azure-flink-pipeline --role Contributor --scopes /subscriptions/abdc-1234-abcd-1234-abcd-1234/resourceGroups/myResourceGroupName/providers/Microsoft.HDInsight/clusterpools/hiloclusterpool/clusters/flinkcluster` + ``` ++### Create a key vault ++ 1. Create Azure Key Vault, you can follow [this tutorial](/azure/key-vault/general/quick-create-portal) to create a new Azure Key Vault. ++ 1. Create three Secrets ++ - *cluster-storage-key* for storage key. ++ - *service-principal-key* for principal clientId or appId. ++ - *service-principal-secret* for principal secret. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/create-key-vault.png" alt-text="Screenshot showing how to create key vault." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/create-key-vault.png"::: ++ 1. Grant permission to access Azure Key Vault with the “Key Vault Secrets Officer” role to service principal. +++### Setup pipeline ++ 1. Navigate to your Project and click Project Settings. ++ 1. Scroll down and select Service Connections, and then New Service Connection. ++ 1. Select Azure Resource Manager. 
++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/select-new-service-connection.png" alt-text="Screenshot showing how to select a new service connection." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/select-new-service-connection.png"::: ++ 1. In the authentication method, select Service Principal (manual). ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/new-service-connection.png" alt-text="Screenshot shows new service connection." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/new-service-connection.png"::: ++ 1. Edit the service connection properties. Select the service principal you recently created. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/edit-service-connection.png" alt-text="Screenshot showing how to edit service connection." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/edit-service-connection.png"::: ++ 1. Click Verify to check whether the connection was set up correctly. If you encounter the following error: ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/service-connection-error-message.png" alt-text="Screenshot showing service connection error message." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/service-connection-error-message.png"::: + + 1. Then you need to assign the Reader role to the subscription. ++ 1. After that, the verification should be successful. ++ 1. Save the service connection. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/tenant-id.png" alt-text="Screenshot showing how to view the Tenant-ID." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/tenant-id.png"::: ++ 1. Navigate to pipelines and click on New Pipeline. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/create-new-pipeline.png" alt-text="Screenshot showing how to create a new pipeline." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/create-new-pipeline.png"::: ++ 1. Select GitHub as the location of your code. ++ 1. Select the repository. See [how to create a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-new-repository) in GitHub. select-github-repo image. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/search-your-code.png" alt-text="Screenshot showing how to search your code." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/search-your-code.png"::: + ++ 1. Select the repository. For more information, [see How to create a repository in GitHub](https://docs.github.com/repositories/creating-and-managing-repositories/creating-a-new-repository). ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/select-github-repo.png" alt-text="Screenshot showing how to select a GitHub repository." lightbox="./media/use-azure-pipelines-to-run-flink-jobs/select-github-repo.png"::: + + 1. From configure your pipeline option, you can choose **Existing Azure Pipelines YAML file**. Select branch and pipeline script that you copied earlier. (.pipeline/flink-azure-pipelines.yml) ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/configure-pipeline.png" alt-text="Screenshot showing how to configure pipeline."::: ++ 1. Replace value in variable section. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/replace-value.png" alt-text="Screenshot showing how to replace value."::: ++ 1. 
Adjust the code build section based on your requirements, and configure <JOB_JAR_LOCAL_PATH> in the variable section for the job jar local path. ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/code-build-section.png" alt-text="Screenshot shows code build section."::: ++ 1. Add the pipeline variable "action" and configure the value "RUN." ++ :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/pipeline-variable.png" alt-text="Screenshot shows how to add pipeline variable."::: ++ You can change the value of the variable before running the pipeline. ++ - NEW: This value is the default. It launches a new job; if the job is already running, it updates the running job with the latest jar. ++ - SAVEPOINT: This value takes a savepoint for the running job. ++ - DELETE: This value cancels or deletes the running job. ++ 1. Save and run the pipeline. You can see the running job on the portal in the Flink Job section. + + :::image type="content" source="./media/use-azure-pipelines-to-run-flink-jobs/save-run-pipeline.png" alt-text="Screenshot shows how to save and run pipeline."::: +++> [!NOTE] +> This is one sample of submitting a job using a pipeline. You can follow the Flink REST API documentation to write your own code to submit jobs. |
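The Key Vault role assignment described in the setup steps can also be granted from the Azure CLI. The following is a sketch; the service principal appId and the Key Vault resource ID components are placeholders for your own values.

```
az role assignment create \
  --assignee <service-principal-appId> \
  --role "Key Vault Secrets Officer" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.KeyVault/vaults/<key-vault-name>
```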
hdinsight-aks | Use Flink Cli To Submit Jobs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-flink-cli-to-submit-jobs.md | + + Title: How to use Apache Flink CLI to submit jobs +description: Learn how to use Apache Flink CLI to submit jobs ++ Last updated : 08/29/2023+++# Apache Flink Command-Line Interface (CLI) +++Apache Flink provides a CLI (Command Line Interface) **bin/flink** to run jobs (programs) that are packaged as JAR files and to control their execution. The CLI is part of the Flink setup and can be set up on a single-node VM. It connects to the running JobManager specified in **conf/flink-conf.yaml**. ++## Installation Steps ++To install Flink CLI on Linux, you need a **Linux VM** to execute the installation script. You need to run a bash environment if you are on **Windows**. ++> [!NOTE] +> This does NOT work on Windows **GIT BASH**, you need to install [WSL](/windows/wsl/install) to make this work on Windows. ++### Requirements +* Install JRE 11. If not installed, follow the steps described in `/java/openjdk/download`. +* Add java to PATH or define JAVA_HOME environment variable pointing to JRE installation directory, such that `$JAVA_HOME/bin/java` exists. ++### Install or update ++Both installing and updating the CLI require rerunning the install script. Install the CLI by running curl. ++```bash +curl -L https://aka.ms/hdionaksflinkcliinstalllinux | bash +``` ++This command installs Flink CLI in the user's home directory (`$HOME/flink-cli`). The script can also be downloaded and run locally. You may have to restart your shell in order for changes to take effect. ++## Run an Apache Flink command to test ++ ```bash + cd $HOME/flink-cli ++ bin/flink list -D azure.tenant.id=<update-tenant-id> -D rest.address=<flink-cluster-fqdn> + ``` + > [!NOTE] + > If executing via SSH pod, use the command ```bin/flink list``` to give you the complete output. ++ If you don't want to add those parameters every time, add them to **conf/flink-conf.yaml**. + + ```bash + rest.address: <flink-cluster-fqdn> + azure.tenant.id: <tenant-id> + ``` + Now the command becomes + + ```bash + bin/flink list + ``` ++ You should see output like the following: ++ ```output + To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code E4LW35GFD to authenticate. + ``` ++ Open [https://microsoft.com/devicelogin](https://microsoft.com/devicelogin) in your browser, and enter the code, then use your microsoft.com ID to log in. After successful login, you should see output like the following if no job is running. + + ```output + Waiting for response... + No running jobs. + No scheduled jobs. + ``` ++#### curl `Object Moved` error ++If you get an error from curl related to the -L parameter, or an error message including the text "Object Moved", try using the full URL instead of the aka.ms redirect: ++```bash +curl https://hdiconfigactions.blob.core.windows.net/hiloflinkblob/install.sh | bash +``` ++## Examples +Here are some examples of actions supported by Flink’s CLI tool: ++| Action | Purpose | +|-|-| +| run | This action executes jobs. It requires at least the jar containing the job. Flink- or job-related arguments can be passed if necessary. | +| info | This action can be used to print an optimized execution graph of the passed job. Again, the jar containing the job needs to be passed. 
| +| list | This action *lists all running or scheduled jobs*.| +| savepoint | This action can be used to *create or dispose of savepoints* for a given job. It might be necessary to specify a savepoint directory besides the JobID. | +| cancel | This action can be used to *cancel running jobs* based on their JobID. | +| stop | This action combines the *cancel and savepoint actions to stop* a running job; it also creates a savepoint to start from again. | ++All actions and their parameters can be accessed through the following command: ++```bash +bin/flink --help +``` ++To see the usage information of an individual action, run: ++```bash +bin/flink <action> --help +``` ++> [!TIP] +> * If a proxy is blocking the connection: to get the installation scripts, your proxy needs to allow HTTPS connections to the following addresses: `https://aka.ms/` and `https://hdiconfigactions.blob.core.windows.net` +> * To resolve the issue, add the user or group to the [authorization profile](../hdinsight-on-aks-manage-authorization-profile.md). |
hdinsight-aks | Use Flink Delta Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-flink-delta-connector.md | + + Title: How to use Apache Flink & Delta connector in HDInsight on AKS +description: Learn how to use Apache Flink-Delta connector ++ Last updated : 08/29/2023+++# How to use Apache Flink-Delta connector +++By using Apache Flink and Delta Lake together, you can create a reliable and scalable data lakehouse architecture. The Flink/Delta Connector allows you to write data to Delta tables with ACID transactions and exactly once processing. It means that your data streams are consistent and error-free, even if you restart your Flink pipeline from a checkpoint. The Flink/Delta Connector ensures that your data isn't lost or duplicated, and that it matches the Flink semantics. ++In this article, you learn how to use Flink-Delta connector ++> [!div class="checklist"] +> * Read the data from the delta table. +> * Write the data to a delta table. +> * Query it in Power BI. ++## What is Apache Flink-Delta connector ++Flink-Delta Connector is a JVM library to read and write data from Apache Flink applications to Delta tables utilizing the Delta Standalone JVM library. The connector provides exactly once delivery guarantee. ++## Apache Flink-Delta Connector includes ++* DeltaSink for writing data from Apache Flink to a Delta table. +* DeltaSource for reading Delta tables using Apache Flink. ++We are using the following connector, to match with the HDInsight on AKS Flink version. ++|Connector's version| Flink's version| +|-|-| +|0.6.0 |X >= 1.15.3| ++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](./flink-create-cluster-portal.md) +* storage account +* [Power BI desktop](https://www.microsoft.com/download/details.aspx?id=58494) ++## Read data from delta table ++There are two types of delta sources, when it comes to reading data from delta table. ++* Bounded: Batch processing +* Continuous: Streaming processing ++In this example, we're using a bounded state of delta source. 
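If you need the streaming variant instead, only the source builder changes: `DeltaSource.forContinuousRowData` keeps monitoring the Delta table for new versions instead of finishing. A minimal sketch, assuming the same `io.delta:delta-flink:0.6.0` dependency used in this article and a placeholder ABFS path:

```java
import io.delta.flink.source.DeltaSource;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hadoop.conf.Configuration;

public class ContinuousDeltaSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuous source: after reading the current snapshot, it keeps polling the
        // Delta log for new table versions instead of completing.
        DeltaSource<RowData> source = DeltaSource.forContinuousRowData(
                new Path("abfss://<container>@<storageaccount>.dfs.core.windows.net/<delta-table-path>"),
                new Configuration())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "continuous-delta-source")
           .print();

        env.execute("Continuous Delta source example");
    }
}
```
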
++**Sample xml file** ++```xml +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> ++ <groupId>org.example.flink.delta</groupId> + <artifactId>flink-delta</artifactId> + <version>1.0-SNAPSHOT</version> + <packaging>jar</packaging> ++ <name>Flink Quickstart Job</name> ++ <properties> + <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> + <flink.version>1.16.0</flink.version> + <target.java.version>1.8</target.java.version> + <scala.binary.version>2.12</scala.binary.version> + <maven.compiler.source>${target.java.version}</maven.compiler.source> + <maven.compiler.target>${target.java.version}</maven.compiler.target> + <log4j.version>2.17.1</log4j.version> + </properties> ++ <repositories> + <repository> + <id>apache.snapshots</id> + <name>Apache Development Snapshot Repository</name> + <url>https://repository.apache.org/content/repositories/snapshots/</url> + <releases> + <enabled>false</enabled> + </releases> + <snapshots> + <enabled>true</enabled> + </snapshots> + </repository> +<!-- <repository>--> +<!-- <id>delta-standalone_2.12</id>--> +<!-- <url>file://C:\Users\varastogi\Workspace\flink-main\flink-k8s-operator\target</url>--> +<!-- </repository>--> + </repositories> ++ <dependencies> + <!-- Apache Flink dependencies --> + <!-- These dependencies are provided, because they should not be packaged into the JAR file. --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-base</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-files</artifactId> + <version>${flink.version}</version> + </dependency> +<!-- <dependency>--> +<!-- <groupId>io.delta</groupId>--> +<!-- <artifactId>delta-standalone_2.12</artifactId>--> +<!-- <version>4.0.0</version>--> +<!-- <scope>system</scope>--> +<!-- <systemPath>C:\Users\varastogi\Workspace\flink-main\flink-k8s-operator\target\io\delta\delta-standalone_2.12\4.0.0\delta-standalone_2.12-4.0.0.jar</systemPath>--> +<!-- </dependency>--> + <dependency> + <groupId>io.delta</groupId> + <artifactId>delta-standalone_2.12</artifactId> + <version>0.6.0</version> + </dependency> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-mapreduce-client-core</artifactId> + <version>3.2.1</version> + </dependency> + <dependency> + <groupId>io.delta</groupId> + <artifactId>delta-flink</artifactId> + <version>0.6.0</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-parquet</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.parquet</groupId> + 
<artifactId>parquet-common</artifactId> + <version>1.12.2</version> + </dependency> + <dependency> + <groupId>org.apache.parquet</groupId> + <artifactId>parquet-column</artifactId> + <version>1.12.2</version> + </dependency> + <dependency> + <groupId>org.apache.parquet</groupId> + <artifactId>parquet-hadoop</artifactId> + <version>1.12.2</version> + </dependency> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-azure</artifactId> + <version>3.3.2</version> + </dependency> +<!-- <dependency>--> +<!-- <groupId>org.apache.hadoop</groupId>--> +<!-- <artifactId>hadoop-azure</artifactId>--> +<!-- <version>3.3.4</version>--> +<!-- </dependency>--> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-mapreduce-client-core</artifactId> + <version>3.2.1</version> + </dependency> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-client</artifactId> + <version>3.3.2</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-common</artifactId> + <version>${flink.version}</version> +<!-- <scope>provided</scope>--> + </dependency> + <dependency> + <groupId>org.apache.parquet</groupId> + <artifactId>parquet-hadoop-bundle</artifactId> + <version>1.10.0</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-runtime</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> +<!-- <dependency>--> +<!-- <groupId>org.apache.flink</groupId>--> +<!-- <artifactId>flink-table-common</artifactId>--> +<!-- <version>${flink.version}</version>--> +<!-- </dependency>--> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-common</artifactId> + <version>3.3.2</version> + </dependency> ++ <!-- Add connector dependencies here. They must be in the default scope (compile). --> ++ <!-- Example: ++ <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + --> ++ <!-- Add logging framework, to produce console output when running in the IDE. --> + <!-- These dependencies are excluded from the application JAR by default. --> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-slf4j-impl</artifactId> + <version>${log4j.version}</version> + <scope>runtime</scope> + </dependency> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-api</artifactId> + <version>${log4j.version}</version> + <scope>runtime</scope> + </dependency> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-core</artifactId> + <version>${log4j.version}</version> + <scope>runtime</scope> + </dependency> + </dependencies> ++ <build> + <plugins> ++ <!-- Java Compiler --> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-compiler-plugin</artifactId> + <version>3.1</version> + <configuration> + <source>${target.java.version}</source> + <target>${target.java.version}</target> + </configuration> + </plugin> ++ <!-- We use the maven-shade plugin to create a fat jar that contains all necessary dependencies. --> + <!-- Change the value of <mainClass>...</mainClass> if your program entry point changes. 
--> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-shade-plugin</artifactId> + <version>3.1.1</version> + <executions> + <!-- Run shade goal on package phase --> + <execution> + <phase>package</phase> + <goals> + <goal>shade</goal> + </goals> + <configuration> + <createDependencyReducedPom>false</createDependencyReducedPom> + <artifactSet> + <excludes> + <exclude>org.apache.flink:flink-shaded-force-shading</exclude> + <exclude>com.google.code.findbugs:jsr305</exclude> + <exclude>org.slf4j:*</exclude> + <exclude>org.apache.logging.log4j:*</exclude> + </excludes> + </artifactSet> + <filters> + <filter> + <!-- Do not copy the signatures in the META-INF folder. + Otherwise, this might cause SecurityExceptions when using the JAR. --> + <artifact>*:*</artifact> + <excludes> + <exclude>META-INF/*.SF</exclude> + <exclude>META-INF/*.DSA</exclude> + <exclude>META-INF/*.RSA</exclude> + </excludes> + </filter> + </filters> + <transformers> + <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/> + <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/> + <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"> + <mainClass>org.example.flink.delta.DataStreamJob</mainClass> + </transformer> + </transformers> + </configuration> + </execution> + </executions> + </plugin> + </plugins> ++ <pluginManagement> + <plugins> ++ <!-- This improves the out-of-the-box experience in Eclipse by resolving some warnings. --> + <plugin> + <groupId>org.eclipse.m2e</groupId> + <artifactId>lifecycle-mapping</artifactId> + <version>1.0.0</version> + <configuration> + <lifecycleMappingMetadata> + <pluginExecutions> + <pluginExecution> + <pluginExecutionFilter> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-shade-plugin</artifactId> + <versionRange>[3.1.1,)</versionRange> + <goals> + <goal>shade</goal> + </goals> + </pluginExecutionFilter> + <action> + <ignore/> + </action> + </pluginExecution> + <pluginExecution> + <pluginExecutionFilter> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-compiler-plugin</artifactId> + <versionRange>[3.1,)</versionRange> + <goals> + <goal>testCompile</goal> + <goal>compile</goal> + </goals> + </pluginExecutionFilter> + <action> + <ignore/> + </action> + </pluginExecution> + </pluginExecutions> + </lifecycleMappingMetadata> + </configuration> + </plugin> + </plugins> + </pluginManagement> + </build> +</project> +``` +* You're required to build the jar with required libraries and dependencies. +* Specify the ADLS Gen2 location in our java class to reference the source data. ++ + ```java + public StreamExecutionEnvironment createPipeline( + String tablePath, + int sourceParallelism, + int sinkParallelism) { ++ DeltaSource<RowData> deltaSink = getDeltaSource(tablePath); + StreamExecutionEnvironment env = getStreamExecutionEnvironment(); ++ env + .fromSource(deltaSink, WatermarkStrategy.noWatermarks(), "bounded-delta-source") + .setParallelism(sourceParallelism) + .addSink(new ConsoleSink(Utils.FULL_SCHEMA_ROW_TYPE)) + .setParallelism(1); ++ return env; + } ++ /** + * An example of Flink Delta Source configuration that will read all columns from Delta table + * using the latest snapshot. + */ + @Override + public DeltaSource<RowData> getDeltaSource(String tablePath) { + return DeltaSource.forBoundedRowData( + new Path(tablePath), + new Configuration() + ).build(); + } + ``` ++1. 
Call the read class while submitting the job using [Flink CLI](./flink-web-ssh-on-portal-to-flink-sql.md). ++ :::image type="content" source="./media/use-flink-delta-connector/call-the-read-class.png" alt-text="Screenshot shows how to call the read class file." lightbox="./media/use-flink-delta-connector/call-the-read-class.png"::: ++1. After submitting the job, + 1. Check the status and metrics on Flink UI. + 1. Check the job manager logs for more details. ++ :::image type="content" source="./media/use-flink-delta-connector/check-job-manager-logs.png" alt-text="Screenshot shows job manager logs." lightbox="./media/use-flink-delta-connector/check-job-manager-logs.png"::: ++## Writing to Delta sink ++The delta sink is used for writing the data to a delta table in ADLS gen2. The data stream consumed by the delta sink. +1. Build the jar with required libraries and dependencies. +1. Enable checkpoint for delta logs to commit the history. ++ :::image type="content" source="./media/use-flink-delta-connector/enable-checkpoint-for-delta-logs.png" alt-text="Screenshot shows how enable checkpoint for delta logs." lightbox="./media/use-flink-delta-connector/enable-checkpoint-for-delta-logs.png"::: + + ```java + public StreamExecutionEnvironment createPipeline( + String tablePath, + int sourceParallelism, + int sinkParallelism) { ++ DeltaSink<RowData> deltaSink = getDeltaSink(tablePath); + StreamExecutionEnvironment env = getStreamExecutionEnvironment(); ++ // Using Flink Delta Sink in processing pipeline + env + .addSource(new DeltaExampleSourceFunction()) + .setParallelism(sourceParallelism) + .sinkTo(deltaSink) + .name("MyDeltaSink") + .setParallelism(sinkParallelism); ++ return env; + } ++ /** + * An example of Flink Delta Sink configuration. + */ + @Override + public DeltaSink<RowData> getDeltaSink(String tablePath) { + return DeltaSink + .forRowData( + new Path(TABLE_PATH), + new Configuration(), + Utils.FULL_SCHEMA_ROW_TYPE) + .build(); + } + ``` +1. Call the delta sink class while submitting the job via Flink CLI. +1. Specify the account key of the storage account in `flink-client-config` using [Flink configuration management](./flink-configuration-management.md). You can specify the account key of the storage account in Flink config. `fs.azure.<storagename>.dfs.core.windows.net : <KEY >` ++ :::image type="content" source="./media/use-flink-delta-connector/call-the-delta-sink-class.png" alt-text="Screenshot shows how to call the delta sink class." lightbox="./media/use-flink-delta-connector/call-the-delta-sink-class.png"::: ++1. Specify the path of ADLS Gen2 storage account while specifying the delta sink properties. +1. Once the job is submitted, check the status and metrics on Flink UI. ++ :::image type="content" source="./media/use-flink-delta-connector/check-the-status-on-flink-ui.png" alt-text="Screenshot shows status on Flink UI." lightbox="./media/use-flink-delta-connector/check-the-status-on-flink-ui.png"::: ++ :::image type="content" source="./media/use-flink-delta-connector/view-the-checkpoints-on-flink-ui.png" alt-text="Screenshot shows the checkpoints on Flink-UI." lightbox="./media/use-flink-delta-connector/view-the-checkpoints-on-flink-ui.png"::: ++ :::image type="content" source="./media/use-flink-delta-connector/view-the-metrics-on-flink-ui.png" alt-text="Screenshot shows the metrics on Flink UI." 
lightbox="./media/use-flink-delta-connector/view-the-metrics-on-flink-ui.png"::: ++## Power BI integration ++Once the data is in delta sink, you can run the query in Power BI desktop and create a report. +1. Open your Power BI desktop and get the data using ADLS Gen2 connector. ++ :::image type="content" source="./media/use-flink-delta-connector/view-power-bi-desktop.png" alt-text="Screenshot shows Power BI desktop."::: ++ :::image type="content" source="./media/use-flink-delta-connector/view-adls-gen2-connector.png" alt-text="Screenshot shows ADLSGen 2 connector."::: ++1. URL of the storage account. ++ :::image type="content" source="./media/use-flink-delta-connector/url-of-the-storage-account.png" alt-text="Screenshot showing the URL of the storage account."::: ++ :::image type="content" source="./media/use-flink-delta-connector/adls-gen-2-details.png" alt-text="Screenshot shows ADLS Gen2-details."::: ++1. Create M-query for the source and invoke the function, which queries the data from storage account. Refer [Delta Power BI connectors](https://github.com/delta-io/connectors/tree/master/powerbi). ++1. Once the data is readily available, you can create reports. ++ :::image type="content" source="./media/use-flink-delta-connector/create-reports.png" alt-text="Screenshot shows how to create reports."::: ++## References ++* [Delta connectors](https://github.com/delta-io/connectors/tree/master/flink). +* [Delta Power BI connectors](https://github.com/delta-io/connectors/tree/master/powerbi). |
hdinsight-aks | Use Flink To Sink Kafka Message Into Hbase | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-flink-to-sink-kafka-message-into-hbase.md | + + Title: Write messages to HBase with DataStream API +description: Learn how to write messages to HBase with DataStream API ++ Last updated : 08/29/2023+++# Write messages to HBase with DataStream API +++In this article, learn how to write messages to HBase with Apache Flink DataStream API ++## Overview ++Apache Flink offers HBase connector as a sink, with this connector with Flink you can store the output of a real-time processing application in HBase. Learn how to process streaming data on HDInsight Kafka as a source, perform transformations, then sink into HDInsight HBase table. ++In a real world scenario, this example is a stream analytics layer to realize value from Internet of Things (IOT) analytics, which use live sensor data. The Flink Stream can read data from Kafka topic and write it to HBase table. If there is a real time streaming IOT application, the information can be gathered, transformed and optimized. +++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../flink/process-and-consume-data.md) +* [HDInsight HBase 2.4.11](../../hdinsight/hbase/apache-hbase-tutorial-get-started-linux.md#create-apache-hbase-cluster) + * You're required to make sure HDInsight on AKS Flink can connect to HDInsight HBase Master(zk), with same virtual network. +* Maven project on IntelliJ IDEA for development on an Azure VM in the same VNet ++## Implementation Steps ++### Use pipeline to produce Kafka topic (user click event topic) ++**weblog.py** ++``` python +import json +import random +import time +from datetime import datetime ++user_set = [ + 'John', + 'XiaoMing', + 'Mike', + 'Tom', + 'Machael', + 'Zheng Hu', + 'Zark', + 'Tim', + 'Andrew', + 'Pick', + 'Sean', + 'Luke', + 'Chunck' +] ++web_set = [ + 'https://github.com', + 'https://www.bing.com/new', + 'https://kafka.apache.org', + 'https://hbase.apache.org', + 'https://flink.apache.org', + 'https://spark.apache.org', + 'https://trino.io', + 'https://hadoop.apache.org', + 'https://stackoverflow.com', + 'https://docs.python.org', + 'https://azure.microsoft.com/products/category/storage', + '/azure/hdinsight/hdinsight-overview', + 'https://azure.microsoft.com/products/category/storage' +] ++def main(): + while True: + if random.randrange(13) < 4: + url = random.choice(web_set[:3]) + else: + url = random.choice(web_set) ++ log_entry = { + 'userName': random.choice(user_set), + 'visitURL': url, + 'ts': datetime.now().strftime("%m/%d/%Y %H:%M:%S") + } ++ print(json.dumps(log_entry)) + time.sleep(0.05) ++if __name__ == "__main__": + main() +``` ++**Use pipeline to produce Kafka topic** ++We're going to use click_events for the Kafka topic +``` +python weblog.py | /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic click_events +``` ++**Sample commands on Kafka** +``` +-- create topic (replace with your Kafka bootstrap server) +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic click_events --bootstrap-server wn0-contsk:9092 ++-- delete topic (replace with your Kafka bootstrap server) +/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --delete --topic click_events --bootstrap-server wn0-contsk:9092 ++-- produce topic (replace with your Kafka bootstrap server) +python weblog.py | 
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic click_events ++-- consume topic +/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server wn0-contsk:9092 --topic click_events --from-beginning +{"userName": "Luke", "visitURL": "https://azure.microsoft.com/products/category/storage", "ts": "07/11/2023 06:39:43"} +{"userName": "Sean", "visitURL": "https://www.bing.com/new", "ts": "07/11/2023 06:39:43"} +{"userName": "XiaoMing", "visitURL": "https://hbase.apache.org", "ts": "07/11/2023 06:39:43"} +{"userName": "Machael", "visitURL": "https://www.bing.com/new", "ts": "07/11/2023 06:39:43"} +{"userName": "Andrew", "visitURL": "https://github.com", "ts": "07/11/2023 06:39:43"} +{"userName": "Zark", "visitURL": "https://kafka.apache.org", "ts": "07/11/2023 06:39:43"} +{"userName": "XiaoMing", "visitURL": "https://trino.io", "ts": "07/11/2023 06:39:43"} +{"userName": "Zark", "visitURL": "https://flink.apache.org", "ts": "07/11/2023 06:39:43"} +{"userName": "Mike", "visitURL": "https://kafka.apache.org", "ts": "07/11/2023 06:39:43"} +{"userName": "Zark", "visitURL": "https://docs.python.org", "ts": "07/11/2023 06:39:44"} +{"userName": "John", "visitURL": "https://www.bing.com/new", "ts": "07/11/2023 06:39:44"} +{"userName": "Mike", "visitURL": "https://hadoop.apache.org", "ts": "07/11/2023 06:39:44"} +{"userName": "Tim", "visitURL": "https://www.bing.com/new", "ts": "07/11/2023 06:39:44"} +..... +``` ++**Create HBase table on HDInsight HBase** ++``` sql +root@hn0-contos:/home/sshuser# hbase shell +SLF4J: Class path contains multiple SLF4J bindings. +SLF4J: Found binding in [jar:file:/usr/hdp/5.1.1.3/hadoop/lib/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class] +SLF4J: Found binding in [jar:file:/usr/hdp/5.1.1.3/hbase/lib/client-facing-thirdparty/slf4j-reload4j-1.7.33.jar!/org/slf4j/impl/StaticLoggerBinder.class] +SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. +SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory] +HBase Shell +Use "help" to get list of supported commands. +Use "exit" to quit this interactive shell. 
+For more information, see, http://hbase.apache.org/2.0/book.html#shell +Version 2.4.11.5.1.1.3, rUnknown, Thu Apr 20 12:31:07 UTC 2023 +Took 0.0032 seconds +hbase:001:0> create 'user_click_events','user_info' +Created table user_click_events +Took 5.1399 seconds +=> Hbase::Table - user_click_events +hbase:002:0> +``` ++### Develop the project for submitting jar on Flink ++**create maven project with following pom.xml** ++``` xml +<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> ++ <groupId>contoso.example</groupId> + <artifactId>FlinkHbaseDemo</artifactId> + <version>1.0-SNAPSHOT</version> + <properties> + <maven.compiler.source>1.8</maven.compiler.source> + <maven.compiler.target>1.8</maven.compiler.target> + <flink.version>1.16.0</flink.version> + <java.version>1.8</java.version> + <scala.binary.version>2.12</scala.binary.version> + <hbase.version>2.4.11</hbase.version> + <kafka.version>3.2.0</kafka.version> // Replace with 2.4.0 for HDInsight Kafka 2.4 + </properties> + <dependencies> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-clients</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-hbase-base --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-hbase-base</artifactId> + <version>1.16.0</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client --> + <dependency> + <groupId>org.apache.hbase</groupId> + <artifactId>hbase-client</artifactId> + <version>${hbase.version}</version> + </dependency> + <dependency> + <groupId>org.apache.hadoop</groupId> + <artifactId>hadoop-common</artifactId> + <version>3.1.1</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka</artifactId> + <version>${flink.version}</version> + </dependency> + <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-base --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-base</artifactId> + <version>${flink.version}</version> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-core</artifactId> + <version>${flink.version}</version> + </dependency> + </dependencies> + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.0.0</version> + <configuration> + <appendAssemblyId>false</appendAssemblyId> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + <executions> + <execution> + <id>make-assembly</id> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + </plugin> + </plugins> + </build> +</project> +``` 
++**Source code** ++Writing HBase Sink program ++**HBaseWriterSink** +``` java +package contoso.example; ++import org.apache.flink.api.java.tuple.Tuple3; +import org.apache.flink.configuration.Configuration; +import org.apache.flink.streaming.api.functions.sink.RichSinkFunction; +import org.apache.hadoop.hbase.HBaseConfiguration; +import org.apache.hadoop.hbase.TableName; +import org.apache.hadoop.hbase.client.*; +import org.apache.hadoop.hbase.util.Bytes; ++public class HBaseWriterSink extends RichSinkFunction<Tuple3<String,String,String>> { + String hbase_zk = "<update-hbasezk-ip>:2181,<update-hbasezk-ip>:2181,<update-hbasezk-ip>:2181"; + Connection hbase_conn; + Table tb; + int i = 0; + @Override + public void open(Configuration parameters) throws Exception { + super.open(parameters); + org.apache.hadoop.conf.Configuration hbase_conf = HBaseConfiguration.create(); + hbase_conf.set("hbase.zookeeper.quorum", hbase_zk); + hbase_conf.set("zookeeper.znode.parent", "/hbase-unsecure"); + hbase_conn = ConnectionFactory.createConnection(hbase_conf); + tb = hbase_conn.getTable(TableName.valueOf("user_click_events")); + } ++ @Override + public void invoke(Tuple3<String,String,String> value, Context context) throws Exception { + byte[] rowKey = Bytes.toBytes(String.format("%010d", i++)); + Put put = new Put(rowKey); + put.addColumn(Bytes.toBytes("user_info"), Bytes.toBytes("userName"), Bytes.toBytes(value.f0)); + put.addColumn(Bytes.toBytes("user_info"), Bytes.toBytes("visitURL"), Bytes.toBytes(value.f1)); + put.addColumn(Bytes.toBytes("user_info"), Bytes.toBytes("ts"), Bytes.toBytes(value.f2)); + tb.put(put); + }; ++ public void close() throws Exception { + if (null != tb) tb.close(); + if (null != hbase_conn) hbase_conn.close(); + } +} +``` ++**main:KafkaSinkToHbase** ++Writing a Kafka Sink to HBase Program ++``` java +package contoso.example; ++import org.apache.flink.api.common.eventtime.WatermarkStrategy; +import org.apache.flink.api.common.serialization.SimpleStringSchema; +import org.apache.flink.api.common.typeinfo.Types; ++import org.apache.flink.api.java.tuple.Tuple3; +import org.apache.flink.connector.kafka.source.KafkaSource; +import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer; +import org.apache.flink.streaming.api.datastream.DataStream; +import org.apache.flink.streaming.api.datastream.DataStreamSource; +import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; ++public class KafkaSinkToHbase { + public static void main(String[] args) throws Exception { ++ StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1); + String kafka_brokers = "10.0.0.38:9092,10.0.0.39:9092,10.0.0.40:9092"; ++ KafkaSource<String> source = KafkaSource.<String>builder() + .setBootstrapServers(kafka_brokers) + .setTopics("click_events") + .setGroupId("my-group") + .setStartingOffsets(OffsetsInitializer.earliest()) + .setValueOnlyDeserializer(new SimpleStringSchema()) + .build(); ++ DataStreamSource<String> kafka = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source").setParallelism(1); + DataStream<Tuple3<String,String,String>> dataStream = kafka.map(line-> { + String[] fields = line.toString().replace("{","").replace("}",""). 
+ replace("\"","").split(","); + Tuple3<String, String,String> tuple3 = Tuple3.of(fields[0].substring(10),fields[1].substring(11),fields[2].substring(5)); + return tuple3; + }).returns(Types.TUPLE(Types.STRING,Types.STRING,Types.STRING)); ++ dataStream.addSink(new HBaseWriterSink()); ++ env.execute("Kafka Sink To Hbase"); + } +} ++``` ++### Submit job on Secure Shell ++We use [Flink CLI](./flink-web-ssh-on-portal-to-flink-sql.md) from Azure portal to submit jobs +++### Monitor job on Flink UI ++We can monitor the jobs on Flink Web UI +++## Validate HBase table data ++``` +hbase:001:0> scan 'user_click_events' +ROW COLUMN+CELL + 0000000853 column=user_info:ts, timestamp=2023-07-11T06:50:08.505, value=07/11/2023 06:39:44 + 0000000853 column=user_info:userName, timestamp=2023-07-11T06:50:08.505, value=Sean + 0000000853 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.505, value=https://kafka.apache.org + 0000000854 column=user_info:ts, timestamp=2023-07-11T06:50:08.556, value=07/11/2023 06:39:45 + 0000000854 column=user_info:userName, timestamp=2023-07-11T06:50:08.556, value=Pick + 0000000854 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.556, value=https://www.bing.com/new + 0000000855 column=user_info:ts, timestamp=2023-07-11T06:50:08.609, value=07/11/2023 06:39:45 + 0000000855 column=user_info:userName, timestamp=2023-07-11T06:50:08.609, value=Pick + 0000000855 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.609, value=https://kafka.apache.org + 0000000856 column=user_info:ts, timestamp=2023-07-11T06:50:08.663, value=07/11/2023 06:39:45 + 0000000856 column=user_info:userName, timestamp=2023-07-11T06:50:08.663, value=Andrew + 0000000856 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.663, value=https://hadoop.apache.org + 0000000857 column=user_info:ts, timestamp=2023-07-11T06:50:08.714, value=07/11/2023 06:39:45 + 0000000857 column=user_info:userName, timestamp=2023-07-11T06:50:08.714, value=Machael + 0000000857 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.714, value=https://flink.apache.org + 0000000858 column=user_info:ts, timestamp=2023-07-11T06:50:08.767, value=07/11/2023 06:39:45 + 0000000858 column=user_info:userName, timestamp=2023-07-11T06:50:08.767, value=Luke + 0000000858 column=user_info:visitURL, timestamp=2023-07-11T06:50:08.767, value=/azure/ + hdinsight/hdinsight-overview +859 row(s) +Took 0.9531 seconds +``` ++> [!NOTE] +> - FlinkKafkaConsumer is deprecated and removed with Flink 1.17, use KafkaSource instead. +> - FlinkKafkaProducer is deprecated and removed with Flink 1.15, use KafkaSink instead. ++## References +* [Apache Kafka Connector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka) +* [Download IntelliJ IDEA](https://www.jetbrains.com/idea/download/#section=windows) |
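The `KafkaSinkToHbase` sample above turns each JSON record into a `Tuple3` by stripping braces and splitting on commas, which breaks as soon as a field value contains a comma or the field order changes. A more robust sketch parses the JSON instead; it assumes you add a JSON library such as `com.fasterxml.jackson.core:jackson-databind` to the pom.xml (it isn't there today):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;

public class JsonToTuple3 extends RichMapFunction<String, Tuple3<String, String, String>> {
    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        // ObjectMapper isn't serializable, so create it per task instance in open().
        mapper = new ObjectMapper();
    }

    @Override
    public Tuple3<String, String, String> map(String value) throws Exception {
        JsonNode node = mapper.readTree(value);
        return Tuple3.of(
                node.get("userName").asText(),
                node.get("visitURL").asText(),
                node.get("ts").asText());
    }
}
```

In the job, replace the string-splitting lambda with `kafka.map(new JsonToTuple3())`; the rest of the pipeline, including `HBaseWriterSink`, stays the same.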
hdinsight-aks | Use Hive Catalog | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-hive-catalog.md | + + Title: Use Hive Catalog, Hive Read & Write demo on Apache Flink SQL +description: Learn how to use Hive Catalog, Hive Read & Write demo on Apache Flink SQL ++ Last updated : 08/29/2023+++# How to use Hive Catalog with Apache Flink SQL +++This example uses HiveΓÇÖs Metastore as a persistent catalog with Apache FlinkΓÇÖs HiveCatalog. We will use this functionality for storing Kafka table and MySQL table metadata on Flink across sessions. Flink uses Kafka table registered in Hive Catalog as a source, perform some lookup and sink result to MySQL database +++## Prerequisites ++* [HDInsight on AKS Flink 1.16.0 with Hive Metastore 3.1.2](../flink/flink-create-cluster-portal.md) +* [HDInsight Kafka](../../hdinsight/kafk) + * You're required to ensure the network settings are complete as described on [Using HDInsight Kafka](../flink/process-and-consume-data.md); that's to make sure HDInsight on AKS Flink and HDInsight Kafka are in the same VNet +* MySQL 8.0.33 ++## Apache Hive on Flink ++Flink offers a two-fold integration with Hive. ++- The first step is to use Hive Metastore (HMS) as a persistent catalog with FlinkΓÇÖs HiveCatalog for storing Flink specific metadata across sessions. + - For example, users can store their Kafka or ElasticSearch tables in Hive Metastore by using HiveCatalog, and reuse them later on in SQL queries. +- The second is to offer Flink as an alternative engine for reading and writing Hive tables. +- The HiveCatalog is designed to be ΓÇ£out of the boxΓÇ¥ compatible with existing Hive installations. You don't need to modify your existing Hive Metastore or change the data placement or partitioning of your tables. ++You may refer to this page for more details on [Apache Hive](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/hive/overview/) ++## Environment preparation ++### Create an Apache Flink cluster with HMS ++Lets create an Apache Flink cluster with HMS on Azure portal, you can refer to the detailed instructions on [Flink cluster creation](../flink/flink-create-cluster-portal.md). +++After cluster creation, check HMS is running or not on AKS side. +++### Prepare user order transaction data Kafka topic on HDInsight ++Download the kafka client jar using the following command: ++`wget https://archive.apache.org/dist/kafka/3.2.0/kafka_2.12-3.2.0.tgz` ++Untar the tar file with ++`tar -xvf kafka_2.12-3.2.0.tgz` ++Produce the messages to the Kafka topic. 
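If you'd rather produce a few test orders from Java instead of piping a script into the console producer shown in the commands that follow, here's a minimal sketch. It assumes the `org.apache.kafka:kafka-clients` dependency matching your HDInsight Kafka version, and it reuses the broker and topic names from this article; replace them with your own.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class UserOrdersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "wn0-contsk:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // One sample order in the same JSON shape used later in this article.
        String order = "{\"user_id\": \"0001\", \"user_name\": \"Jark\", \"user_email\": \"user1@example.com\", "
                + "\"order_date\": \"07/16/2023 10:08:22\", \"price\": \"50.00000\", "
                + "\"product_id\": \"102\", \"order_status\": false}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records are flushed when the producer is closed at the end of the try block.
            producer.send(new ProducerRecord<>("user_orders", order));
        }
    }
}
```
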
+++Other commands: +> [!NOTE] +> You're required to replace bootstrap-server with your own kafka brokers host name or IP +``` + delete topic +./kafka-topics.sh --delete --topic user_orders --bootstrap-server wn0-contsk:9092 ++ create topic +./kafka-topics.sh --create --replication-factor 2 --partitions 3 --topic user_orders --bootstrap-server wn0-contsk:9092 ++ produce topic +./kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic user_orders ++ consumer topic +./kafka-console-consumer.sh --bootstrap-server wn0-contsk:9092 --topic user_orders --from-beginning +``` ++### Prepare user order master data on MySQL on Azure ++Testing DB: ++++**Prepare the order table:** ++``` SQL +mysql> use mydb +Reading table information for completion of table and column names +You can turn off this feature to get a quicker startup with -A ++mysql> CREATE TABLE orders ( + order_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY, + order_date DATETIME NOT NULL, + customer_id INTEGER NOT NULL, + customer_name VARCHAR(255) NOT NULL, + price DECIMAL(10, 5) NOT NULL, + product_id INTEGER NOT NULL, + order_status BOOLEAN NOT NULL +) AUTO_INCREMENT = 10001; +++mysql> INSERT INTO orders +VALUES (default, '2023-07-16 10:08:22','0001', 'Jark', 50.00, 102, false), + (default, '2023-07-16 10:11:09','0002', 'Sally', 15.00, 105, false), + (default, '2023-07-16 10:11:09','000', 'Sally', 25.00, 105, false), + (default, '2023-07-16 10:11:09','0004', 'Sally', 45.00, 105, false), + (default, '2023-07-16 10:11:09','0005', 'Sally', 35.00, 105, false), + (default, '2023-07-16 12:00:30','0006', 'Edward', 90.00, 106, false); ++mysql> select * from orders; ++-++-++-++--++| order_id | order_date | customer_id | customer_name | price | product_id | order_status | ++-++-++-++--++| 10001 | 2023-07-16 10:08:22 | 1 | Jark | 50.00000 | 102 | 0 | +| 10002 | 2023-07-16 10:11:09 | 2 | Sally | 15.00000 | 105 | 0 | +| 10003 | 2023-07-16 10:11:09 | 3 | Sally | 25.00000 | 105 | 0 | +| 10004 | 2023-07-16 10:11:09 | 4 | Sally | 45.00000 | 105 | 0 | +| 10005 | 2023-07-16 10:11:09 | 5 | Sally | 35.00000 | 105 | 0 | +| 10006 | 2023-07-16 12:00:30 | 6 | Edward | 90.00000 | 106 | 0 | ++-++-++-++--++6 rows in set (0.22 sec) ++mysql> desc orders; +++++--++-++| Field | Type | Null | Key | Default | Extra | +++++--++-++| order_id | int | NO | PRI | NULL | auto_increment | +| order_date | datetime | NO | | NULL | | +| customer_id | int | NO | | NULL | | +| customer_name | varchar(255) | NO | | NULL | | +| price | decimal(10,5) | NO | | NULL | | +| product_id | int | NO | | NULL | | +| order_status | tinyint(1) | NO | | NULL | | +++++--++-++7 rows in set (0.22 sec) +``` ++### Using SSH download required Kafka connector and MySQL Database jars +++> [!NOTE] +> Download the correct version jar according to our HDInsight kafka version and MySQL version. ++``` +wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.0/flink-connector-jdbc-1.16.0.jar +wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.33/mysql-connector-j-8.0.33.jar +wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.0/kafka-clients-3.2.0.jar +wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka/1.16.0/flink-connector-kafka-1.16.0.jar +``` ++**Moving the planner jar** ++Move the jar flink-table-planner_2.12-1.16.0-0.0.18.jar located in webssh pod's /opt to /lib and move out the jar flink-table-planner-loader-1.16.0-0.0.18.jar from /lib. 
Please refer to [issue](https://issues.apache.org/jira/browse/FLINK-25128) for more details. Perform the following steps to move the planner jar. ++``` +mv /opt/flink-webssh/opt/flink-table-planner_2.12-1.16.0-0.0.18.jar /opt/flink-webssh/lib/ +mv /opt/flink-webssh/lib/flink-table-planner-loader-1.16.0-0.0.18.jar /opt/flink-webssh/opt/ +``` ++> [!NOTE] +> **An extra planner jar moving is only needed when using Hive dialect or HiveServer2 endpoint**. However, this is the recommended setup for Hive integration. ++## Validation +### Use bin/sql-client.sh to connect to Flink SQL ++``` +bin/sql-client.sh -j kafka-clients-3.2.0.jar -j flink-connector-kafka-1.16.0.jar -j flink-connector-jdbc-1.16.0.jar -j mysql-connector-j-8.0.33.jar +``` ++### Create Hive catalog and connect to the hive catalog on Flink SQL ++> [!NOTE] +> As we already use Flink cluster with Hive Metastore, there is no need to perform any additional configurations. ++``` SQL +CREATE CATALOG myhive WITH ( + 'type' = 'hive' +); ++USE CATALOG myhive; +``` ++### Create Kafka Table on Apache Flink SQL ++``` SQL +CREATE TABLE kafka_user_orders ( + `user_id` BIGINT, + `user_name` STRING, + `user_email` STRING, + `order_date` TIMESTAMP(3) METADATA FROM 'timestamp', + `price` DECIMAL(10,5), + `product_id` BIGINT, + `order_status` BOOLEAN +) WITH ( + 'connector' = 'kafka', + 'topic' = 'user_orders', + 'scan.startup.mode' = 'latest-offset', + 'properties.bootstrap.servers' = '10.0.0.38:9092,10.0.0.39:9092,10.0.0.40:9092', + 'format' = 'json' +); ++select * from kafka_user_orders; +``` ++### Create MySQL Table on Apache Flink SQL ++``` SQL +CREATE TABLE mysql_user_orders ( + `order_id` INT, + `order_date` TIMESTAMP, + `customer_id` INT, + `customer_name` STRING, + `price` DECIMAL(10,5), + `product_id` INT, + `order_status` BOOLEAN +) WITH ( + 'connector' = 'jdbc', + 'url' = 'jdbc:mysql://<servername>.mysql.database.azure.com/mydb', + 'table-name' = 'orders', + 'username' = '<username>', + 'password' = '<password>' +); ++select * from mysql_user_orders; +``` ++### Check tables registered in above Hive catalog on Flink SQL ++++### Sink user transaction order info into master order table in MySQL on Flink SQL ++``` SQL +INSERT INTO mysql_user_orders (order_date, customer_id, customer_name, price, product_id, order_status) + SELECT order_date, CAST(user_id AS INT), user_name, price, CAST(product_id AS INT), order_status + FROM kafka_user_orders; +``` +++### Check if user transaction order data on Kafka is added in master table order in MySQL on Azure Cloud Shell ++++### Creating three more user orders on Kafka ++``` +sshuser@hn0-contsk:~$ /usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh --bootstrap-server wn0-contsk:9092 --topic user_orders +>{"user_id": null,"user_name": "Lucy","user_email": "user8@example.com","order_date": "07/17/2023 21:33:44","price": "90.00000","product_id": "102","order_status": false} +>{"user_id": "0009","user_name": "Zark","user_email": "user9@example.com","order_date": "07/17/2023 21:52:07","price": "80.00000","product_id": "103","order_status": true} +>{"user_id": "0010","user_name": "Alex","user_email": "user10@example.com","order_date": "07/17/2023 21:52:07","price": "70.00000","product_id": "104","order_status": true} +``` ++### Check Kafka table data on Flink SQL +``` SQL +Flink SQL> select * from kafka_user_orders; +``` +++### Insert `product_id=104` into orders table on MySQL on Flink SQL ++``` SQL +INSERT INTO mysql_user_orders (order_date, customer_id, customer_name, price, product_id, 
order_status) +SELECT order_date, CAST(user_id AS INT), user_name, price, CAST(product_id AS INT), order_status +FROM kafka_user_orders where product_id = 104; +``` ++### Check `product_id = 104` record is added in order table on MySQL on Azure Cloud Shell +++### Reference +* [Apache Hive](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/hive/overview/) |
hdinsight-aks | Use Hive Metastore Datastream | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/flink/use-hive-metastore-datastream.md | + + Title: Use Hive Metastore with Apache Flink DataStream API +description: Use Hive Metastore with Apache Flink DataStream API ++ Last updated : 08/29/2023+++# Use Hive Metastore with Apache Flink DataStream API +++Over the years, Hive Metastore has evolved into a de facto metadata center in the Hadoop ecosystem. Many companies have a separate Hive Metastore service instance in their production environments to manage all their metadata (Hive or non-Hive metadata). For users who have both Hive and Flink deployments, HiveCatalog enables them to use Hive Metastore to manage Flink's metadata. +++## Supported Hive versions for HDInsight on AKS - Apache Flink ++Supported Hive versions: +- 3.1 + - 3.1.0 + - 3.1.1 + - 3.1.2 + - 3.1.3 ++If you're building your own program, you need the following dependencies in your Maven pom.xml file. It's **not recommended** to include these dependencies in the resulting jar file. You're supposed to add the dependencies at runtime. ++``` +<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-sql-connector-hive-3.1.2_2.12</artifactId> + <version>1.16.0</version> + <scope>provided</scope> +</dependency> ++<dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-table-api-java-bridge_2.12</artifactId> + <version>1.16.0</version> + <scope>provided</scope> +</dependency> +``` ++## Connect to Hive ++This example illustrates various snippets of connecting to Hive. With HDInsight on AKS - Flink, you're required to use `/opt/hive-conf` as the Hive configuration directory to connect to the Hive Metastore. ++``` +public static void main(String[] args) throws Exception + { + StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); + // start Table Environment + StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env); + env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); + String catalogName = "myhive"; + String defaultDatabase = HiveCatalog.DEFAULT_DB; + String hiveConfDir = "/opt/hive-conf"; + HiveCatalog hive = new HiveCatalog(catalogName, defaultDatabase, hiveConfDir); + // register HiveCatalog in the tableEnv + tableEnv.registerCatalog("myhive", hive); + // set the HiveCatalog as the current catalog of the session + tableEnv.useCatalog("myhive"); + // Create a table in hive catalog + tableEnv.executeSql("create table MyTable (a int, b bigint, c varchar(32)) with ('connector' = 'filesystem', 'path' = '/non', 'format' = 'csv')"); + // Create a view in hive catalog + tableEnv.executeSql("create view MyView as select * from MyTable"); + } +``` ++## References +[Apache Flink - Hive read & write](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/hive/hive_read_write/) |
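If you also want to consume a Hive-cataloged table from the DataStream API (for example, the `MyTable` registered in the snippet above), you can convert it through the Table API bridge. A minimal sketch that continues inside the same `main` method, after `tableEnv.useCatalog("myhive")`:

```java
// Additional imports for the conversion:
// import org.apache.flink.streaming.api.datastream.DataStream;
// import org.apache.flink.table.api.Table;
// import org.apache.flink.types.Row;

// Look up the table from the current (Hive) catalog and turn it into a DataStream of Rows.
Table myTable = tableEnv.from("MyTable");
DataStream<Row> rows = tableEnv.toDataStream(myTable);

rows.print();
env.execute("Read a Hive-cataloged table with the DataStream API");
```
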
hdinsight-aks | Get Started | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/get-started.md | + + Title: One-click deployment for Azure HDInsight on AKS (Preview) +description: How to create cluster pool and cluster with one-click deployment on Azure HDInsight on AKS. ++ Last updated : 08/29/2023+++# Get started with one-click deployment (Preview) +++One-click deployments are designed for users to experience zero touch creation of HDInsight on AKS. It eliminates the need to manually perform certain steps. +This article describes how to use readily available ARM templates to create a cluster pool and cluster in few clicks. ++> [!NOTE] +> - These ARM templates cover the basic requirements to create a cluster pool and cluster along with prerequisite resources. To explore advanced options, see [Create cluster pool and clusters](quickstart-create-cluster.md). +> - Necessary resources are created as part of the ARM template deployment in your resource group. For more information, see [Resource prerequisites](prerequisites-resources.md). +> - The user must have permission to create new resources and assign roles to the resources in the subscription to deploy these ARM templates. +> - Before you begin with ARM templates, please keep [object ID ready](#find-object-id-of-an-identity) for the identity you are going to use for deployment. ++|Workload|Template|Description| +|||| +|Trino| [![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickTrino.json) | Creates cluster pool and cluster **without** HMS, custom VNet, and Monitoring capability.| +|Flink|[![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickFlink.json) | Creates cluster pool and cluster **without** HMS, custom VNet, and Monitoring capability.| +|Spark| [![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickSpark.json) | Creates cluster pool and cluster **without** HMS, custom VNet, and Monitoring capability.| +|Trino|[![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickTrino_WithVnet.json) | Creates cluster pool and cluster with an existing custom VNet.| +|Flink| [![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickFlink_WithVnet.json) | Creates cluster pool and cluster with an existing custom VNet.| +|Spark| [![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FoneClickSpark_WithVnet.json)| Creates cluster pool and cluster with an existing custom VNet.| ++When you click on one of these templates, it launches Custom deployment page in the Azure portal. 
You need to provide the details for the following parameters based on the template used. ++|**Property**|**Description**| +||| +|Subscription| Select the Azure subscription in which resources are to be created.| +|Resource Group|Create a new resource group, or select the resource group in your subscription from the drop-down list under which resources are to be created.| +|Region|Select the region where the resource group is deployed.| +|Cluster Pool Name| Enter the name of the cluster pool to be created. Cluster pool name length can't be more than 26 characters. It must start with an alphabet, end with an alphanumeric character, and must only contain alphanumeric characters and hyphens.| +|Cluster Pool Version| Select the HDInsight on AKS cluster pool version. | +|Cluster Pool Node VM Size|From the drop-down list, select the virtual machine size for the cluster pool based on your requirement.| +|Location|Select the region where the cluster and necessary resources are to be deployed.| +|Resource Prefix|Provide a prefix for creating necessary resources for cluster creation, resources are named as [prefix + predefined string].| +|Cluster Name |Enter the name of the new cluster.| +|HDInsight on AKS Version | Select the minor or patch version of the HDInsight on AKS of the new cluster. For more information, see [versioning](./versions.md).| +|Cluster Node VM Size |Provide the VM size for the cluster. For example: Standard_D8ds_v5.| +|Cluster OSS Version |Provide the cluster type supported OSS version in three part naming format. For example: Trino - 0.410.0, Flink - 1.16.0, Spark - 3.3.1| +|Custom VNet Name |Provide custom virtual network to be associated with the cluster pool. It should be in the same resource group as your cluster pool. | +|Subnet Name in Custom Vnet |Provide subnet name defined in your custom virtual network. | +|User Object ID| Provide user alias object ID from Microsoft Entra ID [(Azure Active Directory)](https://www.microsoft.com/security/business/identity-access/azure-active-directory).| + + ### Find Object ID of an identity + + 1. In the top search bar in the Azure portal, enter your user ID. (For example, john@contoso.com) ++ :::image type="content" source="./media/get-started/search-object-id.png" alt-text="Screenshot showing how to search object ID."::: + + 2. From Azure Active Directory box, click on your user ID. + + :::image type="content" source="./media/get-started/view-object-id.png" alt-text="Screenshot showing how to view object ID."::: + + 1. Copy the Object ID. + + ### Deploy + + 1. Select **Next: Review + create** to continue. + 1. On the **Review + create** page, based on validation status, continue to click **Create**. ++ :::image type="content" source="./media/get-started/custom-deployment-summary.png" alt-text="Screenshot showing custom deployment summary."::: ++ The **Deployment is in progress** page is displayed while the resources are getting created, and the **"Your deployment is complete"** page is displayed once the cluster pool and cluster are fully deployed and ready for use. ++ :::image type="content" source="./media/get-started/custom-deployment-complete.png" alt-text="Screenshot showing custom deployment complete."::: ++ ++ If you navigate away from the page, you can check the status of the deployment by clicking Notifications :::image type="icon" source="./media/get-started/notifications.png" alt-text="Screenshot showing notifications icon in the Azure portal."::: in the Azure portal. 
+ + > [!TIP] + > + > For troubleshooting any deployment errors, you can refer to this [page](./create-cluster-error-dictionary.md). + |
hdinsight-aks | Hdinsight Aks Support Help | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/hdinsight-aks-support-help.md | + + Title: Support and troubleshooting for HDInsight on AKS +description: This article provides support and troubleshooting options for HDInsight on AKS. ++ Last updated : 10/06/2023+++# Support and troubleshooting for HDInsight on AKS ++## Self help troubleshooting +++The [HDInsight on AKS troubleshooting documentation](./create-cluster-error-dictionary.md) provides guidance for how to diagnose and resolve issues that you might encounter when using HDInsight on AKS. These articles cover how to troubleshoot deployment failures, connection issues, and more. ++For specific component pages, you can always refer: ++- [Flink](./flink/flink-cluster-configuration.md) +- [Trino](./trino/trino-configuration-troubleshoot.md) ++## Post a question on Microsoft Q&A +++Azure's preferred destination for community support, [Microsoft Q&A](/answers/products/azure), allows you to ask technical questions and engage with Azure engineers, Most Valuable Professionals (MVPs), partners, and customers. When you ask a question, make sure you use the `HDInsight` tag. You can also submit your own answers and help other community members with their questions. ++- [Microsoft Q&A for HDInsight on AKS](/answers/tags/453/azure-hdinsight-aks) ++If you can't find an answer to your problem using search, you can submit a new question to Microsoft Q&A and tag it with the appropriate Azure service and area. ++The following table lists the tags for HDInsight on AKS and related ++| Area | Tag | +|-|-| +| [Azure Kubernetes Service](/azure/aks/intro-kubernetes) | [azure-kubernetes-service](/answers/topics/azure-kubernetes-service.html)| +| [Azure HDInsight on AKS](./overview.md) | [azure-hdinsight-aks](/answers/topics/azure-hdinsight-aks.html) | +| [Azure storage accounts](/azure/storage/common/storage-account-overview) | [azure-storage-accounts](/answers/topics/azure-storage-accounts.html)| +| [Azure Managed Identities](/azure/active-directory/managed-identities-azure-resources/overview) | [azure-managed-identity](/answers/topics/azure-managed-identity.html) | +| [Azure RBAC](/azure/role-based-access-control/overview) | [azure-rbac](/answers/topics/azure-rbac.html)| +| [Azure Active Directory](/azure/active-directory/fundamentals/whatis) | [azure-active-directory](/answers/topics/azure-active-directory.html)| +| [Azure Virtual Network](/azure/virtual-network/network-overview) | [azure-virtual-network](/answers/topics/azure-virtual-network.html)| ++## Create an Azure support request +++Explore the range of [Azure support options](https://azure.microsoft.com/support/plans) and choose a plan that best fits your needs. Azure customers can create and manage support requests in the Azure portal. ++If you already have an Azure Support Plan, you can [open a support request](https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/newsupportrequest). ++## Create a GitHub issue +++If you need help with the languages and tools for developing and managing HDInsight on AKS, you can open an issue in its GitHub repository. 
++The following table lists the GitHub repositories for HDInsight on AKS and related ++| Library | GitHub issues URL| +| | | +| Azure PowerShell | https://github.com/Azure/azure-powershell/issues | +| Azure CLI | https://github.com/Azure/azure-cli/issues | +| Azure REST API | https://github.com/Azure/azure-rest-api-specs/issues | +| Azure SDK for Java | https://github.com/Azure/azure-sdk-for-java/issues | +| Azure SDK for Python | https://github.com/Azure/azure-sdk-for-python/issues | +| Azure SDK for .NET | https://github.com/Azure/azure-sdk-for-net/issues | +| Azure SDK for JavaScript | https://github.com/Azure/azure-sdk-for-js/issues | +| Terraform | https://github.com/Azure/terraform/issues | ++## Stay informed of updates and new releases +++Learn about important product updates, roadmap, and announcements in [Azure Updates](https://azure.microsoft.com/updates/?query=HDInsight), [Release notes](./release-notes/hdinsight-aks-release-notes.md) and [Social Channels](https://www.linkedin.com/groups/14313521/). ++## Next steps ++Visit the [HDInsight on AKS documentation](./index.yml). |
hdinsight-aks | Hdinsight On Aks Autoscale Clusters | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/hdinsight-on-aks-autoscale-clusters.md | + + Title: Automatically scale Azure HDInsight on AKS clusters +description: Use the Auto scale feature to automatically scale Azure HDInsight clusters on AKS based on a schedule or load based metrics. ++ Last updated : 08/29/2023+++# Auto Scale HDInsight on AKS Clusters +++The sizing of any cluster to meet job performance and manage costs ahead of time is always tricky, and hard to determine! One of the lucrative benefits of building data lake house over Cloud is its elasticity, which means to use autoscale feature to maximize the utilization of resources at hand. Auto scale with Kubernetes is one key to establishing a cost optimized ecosystem. With varied usage patterns in any enterprise, there could be variations in cluster loads over time that could lead to clusters being under-provisioned (lousy performance) or overprovisioned (unnecessary costs due to idle resources). ++The autoscale feature offered in HDInsight on AKS can automatically increase or decrease the number of worker nodes in your cluster. Auto scale uses the cluster metrics and scaling policy used by the customers. ++This feature is well suited for mission-critical workloads, which may have +- Variable or unpredictable traffic patterns and require SLAs on high performance and scale or +- Predetermined schedule for required worker nodes to be available to successfully execute the jobs on the cluster. ++Auto Scale with HDInsight on AKS Clusters makes the clusters cost efficient, and elastic on Azure. ++With Auto scale, customers can scale down clusters without affecting workloads. It's enabled with advanced capabilities such as graceful decommissioning and cooling period. These capabilities empower users to make informed choices on addition and removal of nodes based on the current load of the cluster. ++## How it works ++This feature works by scaling the number of nodes within preset limits based on cluster metrics or a defined schedule of scale-up and scale-down operations. There are two types of conditions to trigger autoscale events: threshold-based triggers for various cluster performance metrics (called load-based scaling) and time-based triggers (called schedule-based scaling). ++Load-based scaling changes the number of nodes in your cluster, within a range that you set, to ensure optimal CPU usage and minimize running cost. ++Schedule-based scaling changes the number of nodes in your cluster based on a schedule of scale-up and scale-down operations. ++> [!NOTE] +> Auto scale does not support changing SKU type of an existing cluster. ++### Cluster compatibility ++The following table describes the cluster types that are compatible with the Auto scale feature, and whatΓÇÖs available or planned. ++|Workload |Load Based |Schedule Base| +|-|-|-| +|Flink |Planned |Yes| +|Trino |Planned |Yes**| +|Spark |Yes** |Yes**| ++**Graceful decommissioning is configurable. ++## Scaling Methods ++* **Schedule-based scaling**: + * When your jobs are expected to run on fixed schedules and for a predictable duration or when you anticipate low usage during specific times of the day For example, test and dev environments in post-work hours, end-of day jobs. + + :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/schedule-based-concept-step-1.png" alt-text="Screenshot showing how to select schedule-based-scaling." 
border="true" lightbox="./media/hdinsight-on-aks-autoscale-clusters/schedule-based-concept-step-1.png"::: ++* **Load based scale**: + * When the load patterns fluctuate substantially and unpredictably during the day, for example, Order data processing with random fluctuations in load patterns based on various factors. + + :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/load-based-concept-step-2.png" alt-text="Screenshot showing how to select load based scaling." border="true" lightbox="./media/hdinsight-on-aks-autoscale-clusters/load-based-concept-step-2.png"::: ++ With the new, configure scale rule option, you can now customize the scale rules. ++ :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/configure-scale-rule-concept-step-3.png" alt-text="Screenshot showing how to configure scale rule in load based scaling."::: + + :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/configure-scale-rule-add-rule-concept-step-4.png" alt-text="Screenshot showing how to add rules in configure scale rules for load based scaling."::: ++ > [!TIP] + > * Scale Up rules take precedence when one or more rules are triggered. Even if only one of the rules for scale up suggest cluster being under-provisioned, cluster will try to scale up. For scale down to happen, no scale up rule should be triggered. ++### Load-based scale conditions ++When the following conditions are detected, Auto scale issues a scale request ++|Scale-up|Scale-down| +|-|-| +|Allocated Cores are greater than 80% for 5-minutes poll interval (1-minute check period)|Allocated Cores are less than or equal to 20% for 5-minutes poll interval (1-minute check period) | ++* For scale-up, Auto scale issues a scale-up request to add the required number of nodes. The scale-up is based on how many new worker nodes are needed to meet the current CPU and memory requirements. This value is capped to maximum number of worker nodes set. + +* For scale-down, Auto scale issues a request to remove some nodes. The scale-down considerations include the number of pods per node, the current CPU and memory requirements, and worker nodes, which are candidates for removal based on current job execution. The scale down operation first decommissions the nodes, and then removes them from the cluster. ++ > [!IMPORTANT] + > The Auto scale Rule Engine proactively flushes old events every **30 minutes** to optimize system memory. As a result, there exists an upper bound limit of 30 minutes on the scaling rule interval. To ensure the consistent and reliable triggering of scaling actions, it's imperative to set the scaling rule interval to a value which is lesser than the limit. By adhering to this guideline, you can guarantee a smooth and efficient scaling process while effectively managing system resources. 
++#### Cluster metrics ++Auto scale continuously monitors the cluster and collects the following metrics for Load based autoscale: ++Cluster Metrics Available for Scaling Purposes ++|Metric|Description| +|-|-| +|Available Cores Percentage|The total number of cores available in the cluster compared to the total number of cores in the cluster.| +|Available Memory Percentage|The total memory (in MB) available in the cluster compared to the total amount of memory in the cluster.| +|Allocated Cores Percentage|The total number of cores allocated in the cluster compared to the total number of cores in the cluster.| +|Allocated Memory Percentage|The amount of memory allocated in the cluster compared to the total amount of memory in the cluster.| ++By default, the above metrics are checked every **300 seconds**, it is also configurable when you customize the poll interval with customize autoscale option. Auto scale makes scale-up or scale-down decisions based on these metrics. ++> [!NOTE] +> By default Auto scale uses default resource calculator for YARN for Apache Spark. Load based scaling is available for Apache Spark Clusters. +++#### Graceful Decommissioning ++Enterprises need ways to achieve petabyte scale with autoscaling and to decommission resources gracefully when they're no longer needed. In such scenario, graceful decommissioning feature comes handy. ++Graceful decommissioning allows jobs to complete even after autoscale has triggered decommissioning of the worker nodes. This feature allows nodes to continue to be provisioned until jobs are complete. ++ - **Trino** : Workers have Graceful Decommission enabled by default. Coordinator allows terminating worker to finish its tasks for configured amount of time before removing the worker from the cluster. You can configure the timeout either using native Trino parameter `shutdown.grace-period`, or on Azure portal service configuration page. ++ - **Apache Spark** : Scaling down may impact/stop any running jobs in the cluster. If you enable Graceful Decommissioning settings on the Azure portal, it incorporates Graceful Decommission of YARN Nodes and ensures that any work in progress on a worker node is complete before the node removed from the HDInsight on AKS cluster. ++##### Cool down period ++To avoid continuous scale up operations, autoscale engine waits for a configurable interval before initiating another set of scale up operations. +The default value is set to **180 seconds** ++> [!Note] +> * In custom scale rules, no rule trigger can have a trigger interval greater than 30 minutes. After an auto scaling event occurs, the amount of time to wait before enforcing another scaling policy. +> * Cool down period should be greater than policy interval, so the cluster metrics can get reset. +++## Get started ++1. For autoscale to function, you're required to assign the **owner** or **contributor** permission to the MSI (used during cluster creation) at the cluster level, using IAM on the left pane. ++1. Refer to the following illustration and steps listed on how to add role assignment ++ :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/add-permissions-concept-step-5.png" alt-text="Screenshot showing how to add role assignment." border="true" lightbox="./media/hdinsight-on-aks-autoscale-clusters/add-permissions-concept-step-5.png"::: + +1. Select the **add role assignment**, + 1. Assignment type: Privileged administrator roles + 1. Role: **Owner** or **Contributor** + 1. 
Members: Choose Managed identity and select the **User-assigned managed identity**, which was given during cluster creation phase. + 1. Assign the role. ++### Create a cluster with Schedule based Auto scale ++1. Once your cluster pool is created, create a [new cluster](./quickstart-create-cluster.md) with your desired workload (on the Cluster type), and complete the other steps as part of the normal cluster creation process. +1. On the **Configuration** tab, enable **Auto scale** toggle. +1. Select **Schedule based** autoscale +1. Select your timezone and then click **+ Add rule** +1. Select the days of the week that the new condition should apply to. +1. Edit the time the condition should take effect and the number of nodes that the cluster should be scaled to. ++ :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/schedule-based-get-started-step-6.png" alt-text="Screenshot showing how to get started with schedule based autoscale." border="true" lightbox="./media/hdinsight-on-aks-autoscale-clusters/schedule-based-get-started-step-6.png"::: ++ > [!NOTE] + > * User should have ΓÇ£ownerΓÇ¥ or ΓÇ£contributorΓÇ¥ role on the cluster MSI for autoscale to work. + > * The default value defines the initial size of the cluster when it's created. + > * The difference between two schedules is set to default by 30 minutes. + > * The time value follows 24-hour format + > * In case of a continuous window of beyond 24 hours across days, you're required to set Auto scale schedule across days, and autoscale assumes 23:59 as 00:00 (with same node count) spanning across two days from 22:00 to 23:59, 00:00 to 02:00 as 22:00 to 02:00. + > * The schedules are set in Coordinated Universal Time (UTC), by default. You can always update to time zone that corresponds to your local time zone in the drop down available. When you are on a time zone that observes Daylight Savings, the schedule does not adjust automatically, you are required to manage the schedule updates accordingly. ++### Create a cluster with Load based Auto scale ++1. Once your cluster pool is created, create a [new cluster](./quickstart-create-cluster.md) with your desired workload (on the Cluster type), and complete the other steps as part of the normal cluster creation process. +1. On the **Configuration** tab, enable **Auto scale** toggle. +1. Select **Load based** autoscale +1. Based on the type of workload, you have options to add **graceful decommission timeout**, **cool down period** +1. Select the **minimum** and **maximum** nodes, and if necessary **configure the scale rules** to customize Auto scale to your needs. ++ :::image type="content" source="./media/hdinsight-on-aks-autoscale-clusters/load-based-get-started-step-7.png" alt-text="Screenshot showing how to get started with load based autoscale." border="true" lightbox="./media/hdinsight-on-aks-autoscale-clusters/load-based-get-started-step-7.png"::: ++ > [!TIP] + > * Your subscription has a capacity quota for each region. The total number of cores of your head nodes and the maximum worker nodes can't exceed the capacity quota. However, this quota is a soft limit; you can always create a support ticket to get it increased easily. + > * If you exceed the total core quota limit, You'll receive an error message saying `The maximum node count you can select is {maxCount} due to the remaining quota in the selected subscription ({remaining} cores)`. + > * Scale Up rules take precedence when one or more rules are triggered. 
Even if only one of the rules for scale up suggest cluster being under-provisioned, cluster will try to scale up. For scale down to happen, no scale up rule should be triggered. + > * The maximum number of nodes allowed in a cluster pool is 250 in public preview. ++### Create a cluster with a Resource Manager template ++**Schedule based auto scale** ++You can create an HDInsight on AKS cluster with schedule-based Autoscaling using an Azure Resource Manager template, by adding an autoscale to the clusterProfile -> autoscaleProfile section. ++The autoscale node contains a recurrence that has a timezone and schedule that describes when the change takes place. For a complete Resource Manager template, see sample JSON ++```json +{ + "autoscaleProfile": { + "enabled": true, + "autoscaleType": "ScheduleBased", + "gracefulDecommissionTimeout": 60, + "scheduleBasedConfig": { + "schedules": [ + { + "days": [ + "Monday", + "Tuesday", + "Wednesday" + ], + "startTime": "09:00", + "endTime": "10:00", + "count": 2 + }, + { + "days": [ + "Sunday", + "Saturday" + ], + "startTime": "12:00", + "endTime": "22:00", + "count": 5 + }, + { + "days": [ + "Monday", + "Tuesday", + "Wednesday", + "Thursday", + "Friday" + ], + "startTime": "22:00", + "endTime": "23:59", + "count": 6 + }, + { + "days": [ + "Monday", + "Tuesday", + "Wednesday", + "Thursday", + "Friday" + ], + "startTime": "00:00", + "endTime": "05:00", + "count": 6 + } + ], + "timeZone": "UTC", + "defaultCount": 110 + } + } +} +``` +> [!TIP] +> * You are required to set non-conflicting schedules using ARM deployments, to avoid scaling operation failures. ++**Load based auto scale** ++You can create an HDInsight on AKS cluster with load-based Autoscaling using an Azure Resource Manager template, by adding an autoscale to the clusterProfile -> autoscaleProfile section. ++The autoscale node contains ++* a poll interval, cool down period, +* graceful decommission, +* minimum and maximum nodes, +* standard threshold rules, +* scaling metrics that describes when the change takes place. ++For a complete Resource Manager template, see sample JSON as follows ++```json + { + "autoscaleProfile": { + "enabled": true, + "autoscaleType": "LoadBased", + "gracefulDecommissionTimeout": 60, + "loadBasedConfig": { + "minNodes": 2, + "maxNodes": 157, + "pollInterval": 300, + "cooldownPeriod": 180, + "scalingRules": [ + { + "actionType": "scaleup", + "comparisonRule": { + "threshold": 80, + "operator": " greaterThanOrEqual" + }, + "evaluationCount": 1, + "scalingMetric": "allocatedCoresPercentage" + }, + { + "actionType": "scaledown", + "comparisonRule": { + "threshold": 20, + "operator": " lessThanOrEqual" + }, + "evaluationCount": 1, + "scalingMetric": "allocatedCoresPercentage" + } + ] + } + } +} +``` ++### Using the REST API +To enable or disable Auto scale on a running cluster using the REST API, make a PATCH request to your Auto scale endpoint: ```https://management.azure.com/subscriptions/{{USER_SUB}}/resourceGroups/{{USER_RG}}/providers/Microsoft.HDInsight/clusterpools/{{CLUSTER_POOL_NAME}}/clusters/{{CLUSTER_NAME}}?api-version={{HILO_API_VERSION}}``` ++- Use the appropriate parameters in the request payload. The json payload could be used to enable Auto scale. +- Use the payload (**autoscaleProfile**: null) or use flag (**enabled**, false) to disable Auto scale. +- Refer to the JSON samples mentioned on the above step for reference. ++### Pause Auto scale for a running cluster ++We have introduced pause feature in Auto scale. 
Now, using the Azure portal, you can pause Auto scale on a running cluster. The following diagram illustrates how to pause and resume Auto scale. +++You can resume Auto scale whenever you want scaling operations to continue. +++> [!TIP] +> When you configure multiple schedules and pause Auto scale, the next schedule isn't triggered. The node count remains the same, even if the nodes are in a decommissioned state. ++### Copy Auto Scale Configurations ++Using the Azure portal, you can copy the same Auto scale configuration across clusters of the same shape in your cluster pool. You can use this feature to export or import the configurations. +++## Monitoring Auto scale activities ++### Cluster status ++The cluster status listed in the Azure portal can help you monitor Auto scale activities. All of the cluster status messages that you might see are explained in the following list. ++|Cluster status |Description| +|-|-| +|Succeeded |The cluster is operating normally. All of the previous Auto scale activities have been completed successfully.| +|Accepted |The cluster operation (for example, scale up) is accepted and is waiting to complete.| +|Failed |The current operation failed, and the cluster might not be functional. | +|Canceled |The current operation was canceled.| +++To view the current number of nodes in your cluster, go to the **Cluster size** chart on the **Overview** page for your cluster. ++++### Operation history ++You can view the cluster scale-up and scale-down history as part of the cluster metrics. You can also list all scaling actions over the past day, week, or another period. ++++**Additional resources** ++[Manual scale - Azure HDInsight on AKS](./manual-scale.md) |
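As a companion to the REST API section earlier in this article, the following is a minimal sketch of a PATCH request body that disables Auto scale by setting the `enabled` flag to `false`. The outer `properties` and `clusterProfile` wrapping is an assumption based on the clusterProfile -> autoscaleProfile structure described above; as noted earlier, passing `"autoscaleProfile": null` is the other documented way to disable it.

```json
{
  "properties": {
    "clusterProfile": {
      "autoscaleProfile": {
        "enabled": false
      }
    }
  }
}
```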
hdinsight-aks | Hdinsight On Aks Manage Authorization Profile | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/hdinsight-on-aks-manage-authorization-profile.md | + + Title: Manage cluster access +description: How to manage cluster access in HDInsight on AKS ++ Last updated : 08/4/2023+++# Manage cluster access +++This article provides an overview of the mechanisms available to manage access for HDInsight on AKS cluster pools and clusters. +It also covers how to assign permission to users, groups, user-assigned managed identity, and service principals to enable access to cluster data plane. ++When a user creates a cluster, then that user is authorized to perform the operations with data accessible to the cluster. However, to allow other users to execute queries and jobs on the cluster, access to cluster data plane is required. +++## Manage cluster pool or cluster access (Control plane) ++The following HDInsight on AKS and Azure built-in roles are available for cluster management to manage the cluster pool or cluster resources. ++|Role|Description| +|-|-| +|Owner |Grants full access to manage all resources, including the ability to assign roles in Azure RBAC.| +|Contributor |Grants full access to manage all resources but doesn't allow you to assign roles in Azure RBAC.| +|Reader |View all resources but doesn't allow you to make any changes.| +|HDInsight on AKS Cluster Pool Admin |Grants full access to manage a cluster pool including ability to delete the cluster pool.| +|HDInsight on AKS Cluster Admin |Grants full access to manage a cluster including ability to delete the cluster.| ++You can use Access control (IAM) blade to manage the access for cluster poolΓÇÖs and control plane. ++Refer: [Grant a user access to Azure resources using the Azure portal - Azure RBAC](/azure/role-based-access-control/quickstart-assign-role-user-portal). ++## Manage cluster access (Data plane) ++This access enables you to do the following actions: +* View clusters and manage jobs. +* All the monitoring and management operations. +* To enable auto scale and update the node count. + +The access is restricted for: +* Cluster deletion. ++To assign permission to users, groups, user-assigned managed identity, and service principals to enable access to clusterΓÇÖs data plane, the following options are available: ++ * [Azure portal](#using-azure-portal) + * [ARM template](#using-arm-template) ++### Using Azure portal ++#### How to grant access + +The following steps describe how to provide access to other users, groups, user-assigned managed identity, and service principals. ++1. Navigate to the **Cluster access** blade of your cluster in the Azure portal and click **Add**. ++ :::image type="content" source="./media/hdinsight-on-aks-manage-authorization-profile/cluster-access.png" alt-text="Screenshot showing how to provide access to a user for cluster access."::: ++1. Search for the user/group/user-assigned managed identity/service principal to grant access and click **Add**. ++ :::image type="content" source="./media/hdinsight-on-aks-manage-authorization-profile/add-members.png" alt-text="Screenshot showing how to add member for cluster access."::: ++#### How to remove access ++1. Select the members to be removed and click **Remove**. 
++ :::image type="content" source="./media/hdinsight-on-aks-manage-authorization-profile/remove-access.png" alt-text="Screenshot showing how to remove cluster access for a member."::: ++### Using ARM template ++#### Prerequisites ++* An operational HDInsight on AKS cluster. +* [ARM template](./create-cluster-using-arm-template-script.md) for your cluster. +* Familiarity with [ARM template authoring and deployment](/azure/azure-resource-manager/templates/overview). + +Follow the steps to update `authorizationProfile` object under `clusterProfile` section in your cluster ARM template. ++1. In the Azure portal search bar, search for user/group/user-assigned managed identity/service principal. ++ :::image type="content" source="./media/hdinsight-on-aks-manage-authorization-profile/search-object-id.png" alt-text="Screenshot showing how to search object ID."::: + +1. Copy the **Object ID** or **Principal ID**. ++ :::image type="content" source="./media/hdinsight-on-aks-manage-authorization-profile/view-object-id.png" alt-text="Screenshot showing how to view object ID."::: ++1. Modify the `authorizationProfile` section in your cluster ARM template. ++ 1. Add user/user-assigned managed identity/service principal Object ID or Principal ID under `userIds` property. + + 1. Add groups Object ID under `groupIds` property. + + ```json + "authorizationProfile": { + "userIds": [ + "abcde-12345-fghij-67890", + "a1b1c1-12345-abcdefgh-12345" + ], + "groupIds": [] + }, + ``` + + 1. Deploy the updated ARM template to reflect the changes in your cluster. Learn how to [deploy an ARM template](/azure/azure-resource-manager/templates/deploy-portal). |
hdinsight-aks | How To Azure Monitor Integration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/how-to-azure-monitor-integration.md | + + Title: How to integrate with Azure Monitor +description: Learn how to integrate with Azure Monitoring. ++ Last updated : 08/29/2023+++# How to integrate with Log Analytics +++This article describes how to enable Log Analytics to monitor & collect logs for cluster pool and cluster operations on HDInsight on AKS. You can enable the integration during cluster pool creation or post the creation. +Once the integration at cluster pool is enabled, it isn't possible to disable the integration. However, you can disable the log analytics for individual clusters, which are part of the same pool. ++## Prerequisites ++* Log Analytics workspace. You can think of this workspace as a unique logs environment with its own data repository, data sources, and solutions. Learn how to [create a Log Analytics workspace](/azure/azure-monitor/logs/quick-create-workspace?tabs=azure-portal) . ++ > [!NOTE] + > 1. Log Analytics must be enabled at cluster pool level first so as to enable it at a cluster level. + > + > 2. The configuration at cluster pool level is a global switch for all clusters in the cluster pool, therefore all clusters in the same cluster pool can only flow log to one Log Analytics workspace. ++## Enable Log Analytics using the portal during cluster pool creation ++1. Sign in to [Azure portal](https://portal.azure.com). + +1. Select **Create a resource** and search for *cluster pool* in marketplace and select **Azure HDInsight on AKS Cluster Pool**. For more information on starting the cluster pool creation process, see [Create a cluster pool](quickstart-create-cluster.md#create-a-cluster-pool). + +1. Navigate to the Integrations blade, select **Enable Log Analytics**. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/enable-log-analytics.png" alt-text="Screenshot showing how to enable log analytics option." border="true" lightbox="./media/how-to-azure-monitor-integration/enable-log-analytics.png"::: + +1. From the drop-down list, select an existing Log Analytics workspace. Complete the remaining details required to finish cluster pool creation and select **Create**. + +1. Log Analytics is enabled when the cluster pool is successfully created. All monitoring capabilities can be accessed under your cluster poolΓÇÖs **Monitoring** section. + + :::image type="content" source="./media/how-to-azure-monitor-integration/monitor-section.png" alt-text="Screenshot showing monitoring section in the Azure portal." lightbox="./media/how-to-azure-monitor-integration/monitor-section.png"::: ++## Enable Log Analytics using portal after cluster pool creation ++1. In the Azure portal search bar, type "HDInsight on AKS cluster pools" and select *Azure HDInsight on AKS cluster pools* to go to the cluster pools page. On the HDInsight on AKS cluster pools page, select your cluster pool. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/cluster-pool-get-started.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster pool." border="true" lightbox="./media/how-to-azure-monitor-integration/cluster-pool-get-started.png"::: ++ :::image type="content" source="./media/how-to-azure-monitor-integration/cluster-pool-in-list-view.png" alt-text="Screenshot showing cluster pools in a list view." 
border="true" lightbox="./media/how-to-azure-monitor-integration/cluster-pool-in-list-view.png"::: ++1. Navigate to the "Monitor settings" blade on the left side menu and click on "Configure" to enable Log Analytics. + + :::image type="content" source="./media/how-to-azure-monitor-integration/cluster-pool-integration.png" alt-text="Screenshot showing cluster pool integration blade." border="true" lightbox="./media/how-to-azure-monitor-integration/cluster-pool-integration.png"::: ++1. Select an existing Log Analytics workspace, and click **Ok**. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/enable-cluster-pool-log-analytics.png" alt-text="Screenshot showing how to enable cluster pool log analytics." border="true" lightbox="./media/how-to-azure-monitor-integration/enable-cluster-pool-log-analytics.png"::: + +## Enable Log Analytics using the portal during cluster creation ++1. In the Azure portal search bar, type "HDInsight on AKS cluster pools" and select *Azure HDInsight on AKS cluster pools* to go to the cluster pools page. On the HDInsight on AKS cluster pools page, select the cluster pool in which you want to create a cluster. + + :::image type="content" source="./media/how-to-azure-monitor-integration/cluster-pool-get-started.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster pool." border="true" lightbox="./media/how-to-azure-monitor-integration/cluster-pool-get-started.png"::: ++ :::image type="content" source="./media/how-to-azure-monitor-integration/cluster-pool-in-list-view.png" alt-text="Screenshot showing cluster pools in a list view." border="true" lightbox="./media/how-to-azure-monitor-integration/cluster-pool-in-list-view.png"::: ++ > [!NOTE] + > It is important to make sure that the selected cluster pool has Log Analytics enabled. ++1. Select **New Cluster** to start the creation process. For more information on starting the cluster creation process, see [Create a cluster](./quickstart-create-cluster.md). ++ :::image type="content" source="./media/how-to-azure-monitor-integration/new-cluster.png" alt-text="Screenshot showing New cluster button in the Azure portal." border="true" lightbox="./media/how-to-azure-monitor-integration/new-cluster.png"::: ++1. Navigate to the Integrations blade, select **Enable Log Analytics**. ++1. Select one or more type of logs you would like to collect. Complete the remaining details required to finish the cluster creation and select **Create**. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/select-log-type.png" alt-text="Screenshot showing how to select log type." border="true" lightbox="./media/how-to-azure-monitor-integration/select-log-type.png"::: + + > [!NOTE] + > If no option is selected, then only AKS service logs will be available. ++2. Log Analytics is enabled when the cluster is successfully created. All monitoring capabilities can be accessed under your cluster's **Monitoring** section. + + :::image type="content" source="./media/how-to-azure-monitor-integration/monitor-section-cluster.png" alt-text="Screenshot showing Monitoring section for cluster in the Azure portal."::: ++## Enable Log Analytics using portal after cluster creation ++1. In the Azure portal top search bar, type "HDInsight on AKS clusters" and select *Azure HDInsight on AKS clusters* from the drop-down list. On the HDInsight on AKS cluster pools page, select your cluster name from the list page. 
+ + :::image type="content" source="./media/how-to-azure-monitor-integration/get-started-portal-search-step-1.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster." lightbox="./media/how-to-azure-monitor-integration/get-started-portal-search-step-1.png"::: ++ :::image type="content" source="./media/how-to-azure-monitor-integration/get-started-portal-list-view-step-2.png" alt-text="Screenshot showing selecting the HDInsight on AKS Cluster you require from the list." border="true" lightbox="./media/how-to-azure-monitor-integration/get-started-portal-list-view-step-2.png"::: ++1. Navigate to the "Monitor settings" blade, select **Enable Log Analytics**. Choose one or more type of logs you would like to collect, and click **Save**. + + :::image type="content" source="./media/how-to-azure-monitor-integration/select-more-log-types.png" alt-text="Screenshot showing how to select more log types." border="true" lightbox="./media/how-to-azure-monitor-integration/select-more-log-types.png"::: + + > [!NOTE] + > If no option is selected, then only AKS service logs will be available. ++## Access the log tables and run queries using the portal ++1. From the Azure portal, select your cluster pool or cluster of choice to open it. ++1. Navigate to the **Monitoring** section and select the **Logs** blade to query and analyze the collected data. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/monitoring-logs.png" alt-text="Screenshot showing logs in the Azure portal."::: ++1. A list of commonly used query templates is provided to choose from to simplify the process or you can write your own query using the provided console. ++ :::image type="content" source="./media/how-to-azure-monitor-integration/queries.png" alt-text="Screenshot showing queries in the Azure portal." border="true" lightbox="./media/how-to-azure-monitor-integration/queries.png"::: ++ :::image type="content" source="./media/how-to-azure-monitor-integration/new-query.png" alt-text="Screenshot showing New queries in the Azure portal." border="true" lightbox="./media/how-to-azure-monitor-integration/new-query.png"::: + |
hdinsight-aks | Manage Cluster Pool | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/manage-cluster-pool.md | + + Title: Manage cluster pools +description: Manage cluster pools in HDInsight on AKS. ++ Last updated : 08/29/2023+++# Manage cluster pools +++Cluster pools are a logical grouping of clusters and maintain a set of clusters in the same pool. It helps in building robust interoperability across multiple cluster types and allow enterprises to have the clusters in the same virtual network. One cluster pool corresponds to one cluster in AKS infrastructure. ++This article describes how to manage a cluster pool. ++> [!NOTE] +> You are required to have an operational cluster pool, Learn how to create a [cluster pool](./quickstart-create-cluster.md). ++## Get started ++1. Sign in to [Azure portal](https://portal.azure.com). + +1. In the Azure portal search bar, type "HDInsight on AKS cluster pool" and select "Azure HDInsight on AKS cluster pools" from the drop-down list. + + :::image type="content" source="./media/manage-cluster-pool/cluster-pool-get-started.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster pool." border="true" lightbox="./media/manage-cluster-pool/cluster-pool-get-started.png"::: + +1. Select your cluster pool name from the list page. + + :::image type="content" source="./media/manage-cluster-pool/cluster-pool-in-list-view.png" alt-text="Screenshot showing cluster pools in a list view." border="true" lightbox="./media/manage-cluster-pool/cluster-pool-in-list-view.png"::: ++## Create new cluster + +In a cluster pool, you can add multiple clusters of different types. For example, you can have a Trino cluster and an Apache Flink cluster inside the same pool. ++To create a new cluster, click on the **+New cluster** on the Azure portal and continue to use the Azure portal to create a Trino, Apache Flink, and Apache Spark cluster. ++Learn more on how to [create a cluster](./quickstart-create-cluster.md). +++## View the list of existing clusters ++You can view the list of clusters in the cluster pool on the **Overview** tab. ++ +## Manage access to the cluster pool + +HDInsight on AKS supports both Azure built-in roles and certain roles specific to HDInsight on AKS. In the Azure portal, you can use Access control (IAM) blade in your pool to manage the access for cluster pool’s control plane. ++For more information, see [manage access](./hdinsight-on-aks-manage-authorization-profile.md). ++## Enable integration with Azure services ++ In the Azure portal, use Monitor settings blade in your cluster pool to configure the supported Azure services. Currently, we support Log Analytics and Azure managed Prometheus and Grafana, which has to be configured at cluster pool before you can enable at cluster level. ++ * Learn more about [Azure Monitor Integration](./how-to-azure-monitor-integration.md). + * For more information, see [how to enable Log Analytics](./how-to-azure-monitor-integration.md). + * For more information, see [how to enable Azure managed Prometheus and Grafana](./monitor-with-prometheus-grafana.md). + + +## Delete cluster pool ++ Deleting the cluster pool deletes the following resources: + + * All the clusters that are part of the cluster pool. + * Managed resource groups created during cluster pool creation to hold the ancillary resources. + + However, it doesn't delete the external resources associated with the cluster pool or cluster. 
For example, Key Vault, storage accounts, and monitoring workspaces. ++ Each cluster pool version is associated with an AKS version. When an AKS version is deprecated, you'll be notified. In this case, you need to delete the cluster pool and recreate it to move to a supported AKS version. +++ > [!Note] + > You can't recover a deleted cluster pool. Be careful when deleting a cluster pool. ++ 1. To delete the cluster pool, click on "Delete" at the top left in the "Overview" blade in the Azure portal. + + :::image type="content" source="./media/manage-cluster-pool/delete-cluster-pool.png" alt-text="Screenshot showing how to delete cluster pool."::: + + 1. Enter the pool name to be deleted, and then click **Delete**. + + :::image type="content" source="./media/manage-cluster-pool/cluster-pool-delete-cluster.png" alt-text="Screenshot showing how to delete cluster pool, and updating your cluster pool name once you click delete."::: ++ Once the deletion is successful, you can check the status by clicking the Notifications icon ![Screenshot showing the Notifications icon in the Azure portal.](./media/manage-cluster-pool/notifications.png) in the Azure portal. ++ :::image type="content" source="./media/manage-cluster-pool/cluster-pool-delete-cluster-notification.png" alt-text="Screenshot showing a notification alert of a successful cluster pool deletion."::: |
hdinsight-aks | Manage Cluster | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/manage-cluster.md | + + Title: Manage clusters +description: Manage clusters in HDInsight on AKS. ++ Last updated : 08/29/2023+++# Manage clusters +++Clusters are individual compute workloads such as Apache Spark, Apache Flink, and Trino, which can be created rapidly in few minutes with preset configurations and few clicks. ++This article describes how to manage a cluster using Azure portal. + +> [!NOTE] +> You are required to have an operational cluster, Learn how to create a [cluster](./quickstart-create-cluster.md). ++## Get started ++1. Sign in to [Azure portal](https://portal.azure.com). + +1. In the Azure portal search bar, type "HDInsight on AKS clusters" and select "Azure HDInsight on AKS clusters" from the drop-down list. + + :::image type="content" source="./media/manage-cluster/get-started-portal-search-step-1.png" alt-text="Screenshot showing search option for getting started with HDInsight on AKS Cluster." border="true" lightbox="./media/manage-cluster/get-started-portal-search-step-1.png"::: + +1. Select your cluster name from the list page. + + :::image type="content" source="./media/manage-cluster/get-started-portal-list-view-step-2.png" alt-text="Screenshot showing selecting the HDInsight on AKS Cluster you require from the list." border="true" lightbox="./media/manage-cluster/get-started-portal-list-view-step-2.png"::: ++## View cluster details ++You can view the cluster details in the "Overview" blade of your cluster. It provides general information and easy access to the tools that are part of the cluster. ++|Property|Description| +|-|-| +|Resource group| The resource group in which cluster is created.| +|Cluster pool name| Cluster pool name inside which the cluster is created.| +|Cluster type| The type of the cluster such as Spark, Trino, or Flink.| +|HDInsight on AKS version| HDInsight on AKS cluster version. For more information. see [versioning](./versions.md).| +|Cluster endpoint| The endpoint of the cluster.| +|Cluster package| Component versions associated with the cluster.| +|Subscription details| Subscription name and subscription ID.| +|Location| Region in which the cluster is deployed.| +|Cluster size details| Node size, node type, and number of nodes.| +++## Manage cluster size ++You can check and modify the number of worker nodes for your cluster using "Cluster size" blade in the Azure portal. There are two options to scale up/down your cluster: ++* [Manual scale](./manual-scale.md) +* [Auto scale](./hdinsight-on-aks-autoscale-clusters.md) +++## Manage cluster access ++HDInsight on AKS provides a comprehensive and fine-grained access control at both control plane and data plane, which allows you to manage cluster resources and provide access to cluster data plane. ++Learn how to [manage access to your cluster](./hdinsight-on-aks-manage-authorization-profile.md). ++## Configure secure shell (SSH) ++Secure shell (SSH) allows you to submit jobs and queries to your cluster directly. You can enable or disable SSH using "Secure shell (SSH)" blade in the Azure portal. + +>[!NOTE] +>Enabling SSH will create additional VMs in the cluster. The maximum allowed secure shell nodes are 5. +++## Manage cluster configuration ++HDInsight on AKS allows you to tweak the configuration properties to improve performance of your cluster with certain settings. For example, usage or memory settings. 
+In the Azure portal, use "Configuration management" blade of your cluster to manage the configurations. ++You can do the following actions: ++* Update the existing service configurations or add new configurations. +* Export the service configurations using RestAPI. ++Learn how to manage the [cluster configuration](./service-configuration.md). ++## View service details ++In the Azure portal, use "Services" blade in your cluster to check the health of the services running in your cluster. It includes the collection of the services and the status of each service running in the cluster. You can drill down on each service to check instance level details. ++Learn how to check [service health](./service-health.md). ++## Enable integration with Azure services ++In the Azure portal, use "Integrations" blade in your cluster pool to configure the supported Azure services. Currently, we support Log Analytics and Azure managed Prometheus and Grafana, which has to be configured at cluster pool before you can enable at cluster level. ++* Learn more about [Azure Monitor Integration](./how-to-azure-monitor-integration.md). +* For more information, see [how to enable Log Analytics](./how-to-azure-monitor-integration.md). +* For more information, see [how to enable Azure managed Prometheus and Grafana](./monitor-with-prometheus-grafana.md). +++## Delete cluster ++Deleting a cluster doesn't delete the default storage account nor any linked storage accounts. You can re-create the cluster by using the same storage accounts and the same metastores. ++From the "Overview" blade in the Azure portal: ++1. Select **Delete** from the top menu. ++ :::image type="content" source="./media/manage-cluster/delete-cluster-step-7.png" alt-text="Screenshot showing how to delete the cluster on HDInsight on AKS Cluster." border="true" ::: +1. Status can be checked on Notification icon ![Screenshot showing the Notifications icon in the Azure portal.](./media/manage-cluster/notifications.png). |
hdinsight-aks | Manage Script Actions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/manage-script-actions.md | + + Title: Manage script actions on Azure HDInsight on AKS clusters +description: An introduction on how to manage script actions in Azure HDInsight on AKS. ++ Last updated : 08/29/2023++# Script actions during cluster creation +++Azure HDInsight on AKS provides a mechanism called **Script Actions** that invoke custom scripts to customize the cluster. These scripts are used to install additional components and change configuration settings. Script actions can be provisioned only during cluster creation as of now. Post cluster creation, Script Actions are part of the roadmap. +This article explains how you can provision script actions when you create an HDInsight on AKS cluster. ++## Use a script action during cluster creation using Azure portal ++1. Upload the script action in a `ADLS/WASB` storage(does not have to be the primary cluster storage). In this example we will consider an `ADLS` storage. + To upload a script into your storage, navigate into the target storage and the container where you want to upload it. ++ :::image type="content" source="./media/manage-script-actions/upload-script-action-1.png" alt-text="Screenshot showing the how to select container." border="true" lightbox="./media/manage-script-actions/upload-script-action-1.png"::: ++1. To upload a script into your storage, navigate into the target storage and the container. Click on the upload button and select the script from your local drive. + After the script gets uploaded you should be able to see it in the container(see below image). ++ :::image type="content" source="./media/manage-script-actions/upload-script-action-2.png" alt-text="Screenshot showing how to upload the script." border="true" lightbox="./media/manage-script-actions/upload-script-action-2.png"::: + +1. Create a new cluster as described [here](./quickstart-create-cluster.md). + +1. From the Configuration tab, select **+ Add script action**. + + :::image type="content" source="./media/manage-script-actions/manage-script-action-creation-step-1.png" alt-text="Screenshot showing the New cluster page with Add Script action button in the Azure portal." border="true" lightbox="./media/manage-script-actions/manage-script-action-creation-step-1.png"::: ++ This action opens the Script Action window. Provide the following details: ++ :::image type="content" source="./media/manage-script-actions/manage-script-action-add-step-2.png" alt-text="Screenshot showing the Add Script action window opens in the Azure portal."::: + + |Property|Description| + |-|-| + |Script Action Name| Unique name of the script action.| + |Bash Script URL| Location where the script is stored. For example - `abfs://<CONTAINER>@<DATALAKESTOREACCOUNTNAME>.dfs.core.windows.net/<file_path>`, update the data lake storage name and file path.| + |Services| Select the specific service components where the Script Action needs to run.| + |Parameters| Specify the parameters, if necessary for the script.| + |`TimeOutInMinutes`|Choose the timeout for each script| + + :::image type="content" source="./media/manage-script-actions/manage-script-action-add-node-type-step-3.png" alt-text="Screenshot showing the list of services where to the apply the script actions." border="true" lightbox="./media/manage-script-actions/manage-script-action-add-node-type-step-3.png"::: ++ > [!NOTE] + > * All the Script Actions will be persisted. 
+ > * Script actions are available only for Apache Spark cluster type. + +1. Select ‘OK’ to save the script. +1. Then you can again use **+ Add Script Action** to add another script if necessary. + + :::image type="content" source="./media/manage-script-actions/manage-script-action-view-scripts-step-4.png" alt-text="Screenshot showing the View scripts section in the integration tab." border="true" lightbox="./media/manage-script-actions/manage-script-action-view-scripts-step-4.png"::: + +1. Complete the remaining cluster creation steps to create a cluster. ++ >[!Important] + >* There's no automatic way to undo the changes made by a script action. + >* Script actions must finish within **40 minutes**, or they time out causing cluster creation to fail. + >* During cluster provisioning, the script runs concurrently with other setup and configuration processes. + >* Competition for resources such as CPU time or network bandwidth might cause the script to take longer to finish. + >* To minimize the time it takes to run the script, avoid tasks like downloading and compiling applications from the source. Precompile applications and store the binary in Azure Data Lake Store Gen2. ++### View the list of Script Actions ++1. You can view the list of Script Actions in the "Configuration" tab. ++ :::image type="content" source="./media/manage-script-actions/manage-script-action-view-scripts-step-5.png" alt-text="Screenshot showing the Create to save Script actions page." border="true" lightbox="./media/manage-script-actions/manage-script-action-view-scripts-step-5.png"::: + + |
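For planning purposes, the details the script action form asks for can be summarized as a small JSON record. The snippet below is purely hypothetical: every field name mirrors the portal form above (Script Action Name, Bash Script URL, Services, Parameters, TimeOutInMinutes) rather than any confirmed ARM schema, and the storage account, container, and service names are placeholders.

```json
{
  "scriptActionName": "install-custom-libs",
  "bashScriptUrl": "abfs://scripts@contosostorage.dfs.core.windows.net/actions/install-libs.sh",
  "services": [ "spark-head", "spark-worker" ],
  "parameters": "--version 1.2",
  "timeoutInMinutes": 30
}
```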
hdinsight-aks | Manual Scale | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/manual-scale.md | + + Title: Manual scale +description: How to manually scale in HDInsight on AKS. ++ Last updated : 08/29/2023+++# Manual scale +++HDInsight on AKS provides elasticity with options to scale up and scale down the number of cluster nodes. This elasticity works to help increase resource utilization and improve cost efficiency. ++## Utility to scale clusters ++HDInsight on AKS provides the following methods to manually scale clusters: ++| Utility| Description| +||| +|Azure portal| Open your HDInsight on AKS cluster pane, select **Cluster size** on the left-hand menu, then on the Cluster size pane, type in the number of worker nodes, and select Save | +|REST API|To scale a running HDInsight on AKS cluster using the REST API, make a subsequent POST request on the same resource with the updated count in the compute profile.| ++You can use the Azure portal to access the “Cluster size” menu in the cluster navigation page. In Cluster size blade, change the “Number of worker nodes,” and save the change to scale up or down the cluster. +++## Impact of scaling operation on a cluster ++Any scaling operation triggers a restart of the service, which can lead to errors on jobs already running. ++When you **add nodes** to an operational HDInsight on AKS cluster (scale up): ++- Successful scaling operation using manual scale will add worker nodes to the cluster. +- New jobs can be safely submitted when the scaling process is completed. +- If the scaling operation fails, the failure leaves your cluster in the "Failed” state. +- You can expect to experience job failures during the scaling operation as services get restarted. ++If you **remove nodes** (scale down) from an HDInsight on AKS cluster: + +- Pending or running jobs fails when the scaling operation completes. This failure is because of some of the services restarting during the scaling process. The impact of changing the number of cluster nodes varies for each cluster type. ++>[!IMPORTANT] +>- To avoid quota errors during scaling operations, please plan for quota in your subscription. In case you have insufficient quota, you can increase quota with this [documentation](/azure/quotas/regional-quota-requests). +>- In case scale down selects a head node, which hosts coordinator/ingress and other services, it will result in downtime. ++## Frequently Asked Questions ++### General ++|Question|Answer| +| -- | -- | +|What are the minimum nodes that I can add/remove during scale operations?|One Node.| +|What's the maximum limit to scale up an HDInsight on AKS Trino cluster?|100 nodes (in public preview).| +|How do I manually scale down my cluster?|In the ARM request, update `computeProfile.count` or follow the steps mentioned to scale down using Azure portal.| +|Can I add custom script actions to a cluster during manual scale?|Script actions are applicable for Apache Spark cluster type| +|How do I get logs for manual scale failures for the cluster nodes?|Logs are available in Log analytics module, refer the [Azure Monitor Integration](./how-to-azure-monitor-integration.md).| +|Is load based or schedule based autoscaling supported?|Yes. 
For more information, see [Autoscale](./hdinsight-on-aks-autoscale-clusters.md).| ++### Trino ++|Question|Answer| +| -- | -- | +|Will my Trino service restart after a scaling operation?|Yes, the service restarts during the scaling operation.| ++### Apache Flink ++|Question|Answer| +| -- | -- | +|What’s the impact of scaling operations on an Apache Flink cluster?|Any scaling operation is likely to trigger a restart of the service, which causes job failures. New jobs can be submitted when the scaling process is completed. In Apache Flink, scale down triggers job restarts, while scale up doesn't trigger job restarts.| +++### Apache Spark ++|Question|Answer| +| -- | -- | +|What’s the impact of scaling operations on a Spark cluster?|A manual scale down operation may trigger a restart of head node services.| +++> [!NOTE] +> It is recommended that you manage the quotas set on the subscription prior to scaling operations to avoid quota errors. +> Before scaling down, note that an HDInsight on AKS Trino cluster requires a minimum of **five** active nodes to be operational. + |
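As a sketch of the REST approach mentioned in the table above (updating the compute profile with the new worker count), the request body would echo the cluster's existing compute profile with only the count changed. The exact property layout below is an assumption; retrieve your cluster resource with a GET first and modify only the worker node count before sending it back.

```json
{
  "properties": {
    "computeProfile": {
      "nodes": [
        {
          "type": "worker",
          "vmSize": "Standard_D8ds_v5",
          "count": 6
        }
      ]
    }
  }
}
```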
hdinsight-aks | Monitor With Prometheus Grafana | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/monitor-with-prometheus-grafana.md | + + Title: Monitoring with Azure Managed Prometheus and Grafana +description: Learn how to use monitor With Azure Managed Prometheus and Grafana ++ Last updated : 08/29/2023+++# Monitoring with Azure Managed Prometheus and Grafana +++Cluster and service Monitoring is integral part of any organization. Azure HDInsight on AKS comes with integrated monitoring experience with Azure services. In this article, we use managed Prometheus service with Azure Grafana dashboards for monitoring. ++[Azure Managed Prometheus](../azure-monitor/essentials/prometheus-metrics-overview.md) is a service that monitors your cloud environments. The monitoring is to maintain their availability and performance and workload metrics. It collects data generated by resources in your Azure instances and from other monitoring tools. The data is used to provide analysis across multiple sources. ++[Azure Managed Grafana](../managed-grafan) is a data visualization platform built on top of the Grafana software by Grafana Labs. It's built as a fully managed Azure service operated and supported by Microsoft. Grafana helps you bring together metrics, logs, and traces into a single user interface. With its extensive support for data sources and graphing capabilities, you can view and analyze your application and infrastructure telemetry data in real-time. ++This article covers the details of enabling the monitoring feature in HDInsight on AKS. ++## Prerequisites ++* An Azure Managed Prometheus workspace. You can think of this workspace as a unique Azure Monitor logs environment with its own data repository, data sources, and solutions. For the instructions, see [Create a Azure Managed Prometheus workspace](../azure-monitor/essentials/azure-monitor-workspace-manage.md). +* Azure Managed Grafana workspace. For the instructions, see [Create a Azure Managed Grafana workspace](../managed-grafan). +* An [HDInsight on AKS cluster](./quickstart-create-cluster.md). Currently, you can use Azure Managed Prometheus with the following HDInsight on AKS cluster types: + * Apache Spark + * Apache Flink + * Trino ++For the instructions on how to create an HDInsight on AKS cluster, see [Get started with Azure HDInsight on AKS](./overview.md). ++## Enabling Azure Managed Prometheus and Grafana ++The Azure Managed Prometheus and Grafana Monitoring must be configured at cluster pool level to enable it at cluster level. You need to consider various stages while enabling the Monitoring Solution. +++|#| Scenario |Enable |Disable | +|-|-|-|-| +|1| Cluster Pool -During Creation | `Not Supported` |`Default` | +|2| Cluster Pool – Post Creation | `Supported` | `Not Supported` | +|3| Cluster – During Creation | `Supported` | `Default` | +|4| Cluster – Post Creation |`Supported` |`Supported` | ++## During cluster pool creation ++Currently, Managed Prometheus **CANNOT** be enabled during Cluster Pool creation time. You can configure it post cluster pool creation. ++## Post cluster pool creation ++Monitoring can be enabled from the **Integrations** tab on an existing Cluster Pool View available in Azure portal. +You can use pre created workspaces or create a new one while your'e configuring the monitoring for the cluster pool. ++### Use precreated workspace ++1. Click on **configure** to enable Azure Prometheus monitoring. 
++ :::image type="content" source="./media/monitor-with-prometheus-grafana/integration-configure-tab.png" alt-text="Screenshot showing integration configure tab." border="true" lightbox="./media/monitor-with-prometheus-grafana/integration-configure-tab.png"::: ++1. Click on **Advanced Settings** to attach your pre created workspaces. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/advanced-settings.png" alt-text="Screenshot showing advanced settings." border="true" lightbox="./media/monitor-with-prometheus-grafana/advanced-settings.png"::: ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/configure-prometheus-step-1.png" alt-text="Screenshot showing configure Prometheus step 1." border="true" lightbox="./media/monitor-with-prometheus-grafana/configure-prometheus-step-1.png"::: ++### Create Azure Prometheus and Grafana Workspace while enabling Monitoring in Cluster Pool ++You can create the workspaces from the HDI on AKS cluster pool page. ++1. Click on **Configure** next to the Azure Prometheus option. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/configure-prometheus-step-2.png" alt-text="Screenshot showing configure Prometheus step 2." border="true" lightbox="./media/monitor-with-prometheus-grafana/configure-prometheus-step-2.png"::: ++1. Click on **Create New** workspace for Azure Managed Prometheus. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/configure-prometheus-step-3.png" alt-text="Screenshot showing configure Prometheus step 3." border="true" lightbox="./media/monitor-with-prometheus-grafana/configure-prometheus-step-3.png"::: ++1. Fill in the name, region and click on **Create** for Prometheus. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/configure-prometheus-step-4.png" alt-text="Screenshot showing configure Prometheus step 4." border="true" lightbox="./media/monitor-with-prometheus-grafana/configure-prometheus-step-4.png"::: ++1. Click on **Create New** workspace for Azure Managed Grafana. +1. Fill in Name, Region and click on **Create** for Grafana. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/configure-prometheus-step-5.png" alt-text="Screenshot showing configure Prometheus step 5." border="true" lightbox="./media/monitor-with-prometheus-grafana/configure-prometheus-step-5.png"::: +++ > [!NOTE] + > 1. Managed Grafana can be enabled only if Managed Prometheus is enabled. + > 1. Once Azure Managed Prometheus workspace and Azure Managed Grafana workspace is enabled from the HDInsight on AKS cluster pool, it cannot be disabled from the cluster pool again. It must be disabled from the cluster level. ++## During cluster creation ++### Enable Azure Managed Prometheus during cluster creation ++1. Once the cluster pool is created and the Azure Managed Prometheus enabled, user must [create a HDI on AKS cluster in the same cluster pool](./trino/trino-create-cluster.md). ++1. During the cluster creation process, navigate to the **Integration** page and enable **Azure Prometheus.** ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/enable-prometheus-monitoring.png" alt-text="Screenshot showing enable prometheus monitoring." border="true" lightbox="./media/monitor-with-prometheus-grafana/enable-prometheus-monitoring.png"::: ++## Post cluster creation ++You can also enable Azure Managed Prometheus post HDI on AKS cluster creation ++1. Navigate to the Integrations tab in the cluster page. ++1. 
Enable Azure Prometheus monitoring with the toggle button and click **Save**. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/save-configuration.png" alt-text="Screenshot showing how to save configuration." border="true" lightbox="./media/monitor-with-prometheus-grafana/save-configuration.png"::: ++ > [!NOTE] + > Similarly, you can disable Azure Prometheus monitoring by turning off the toggle and clicking **Save**. ++### Enabling required permissions ++To view Azure Managed Prometheus and Azure Managed Grafana from the HDInsight on AKS portal, you need the following permissions. ++User permission: To view Azure Managed Grafana, the user needs the "Grafana Viewer" role on the Azure Managed Grafana workspace, assigned under Access control (IAM). Learn how to grant user access [here](../role-based-access-control/quickstart-assign-role-user-portal.md). ++1. Open the Grafana workspace configured in the cluster pool. +1. Select the role **Grafana Viewer**. +1. Select the username that accesses the Grafana dashboard. +1. Select the user and click **Review + assign**. ++ > [!NOTE] + > If you pre-create the Azure Managed Prometheus workspace, the Grafana identity requires the additional **Monitoring Reader** permission. + +1. In the Grafana workspace page (the one linked to the cluster), grant the **Monitoring Reader** permission on the Identity tab. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/role-assignment.png" alt-text="Screenshot showing how to assign role." border="true" lightbox="./media/monitor-with-prometheus-grafana/role-assignment.png"::: ++1. Click **Add role assignment.** +1. Select the following parameters: + 1. Scope as **Subscription**. + 1. The subscription name. + 1. Role as **Monitoring Reader**. + ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/role-assignment.png" alt-text="Screenshot showing how to assign role." border="true" lightbox="./media/monitor-with-prometheus-grafana/role-assignment.png"::: ++ > [!NOTE] + > For other roles available to Grafana users, see [here](../managed-grafan). + +## View metrics +You can use the Grafana dashboard to view service and system metrics. The following steps use a Trino cluster as an example and assume a few jobs have run on the cluster. ++1. Open the Grafana link on the cluster overview page. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/view-metrics.png" alt-text="Screenshot showing how to view metrics." border="true" lightbox="./media/monitor-with-prometheus-grafana/view-metrics.png"::: ++1. The default value on the Explore tab is **Grafana**. +1. Open the dropdown, select the `Managed Prometheus.…. <workspace name>` option, and select the required time frame. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/set-time-frame.png" alt-text="Screenshot showing how to set time frame." border="true" lightbox="./media/monitor-with-prometheus-grafana/set-time-frame.png"::: ++1. Select the metric you want to see. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/metric-type.png" alt-text="Screenshot showing how to select the metric type." border="true" lightbox="./media/monitor-with-prometheus-grafana/metric-type.png"::: ++1. Click **Run Query** and select how often the query should be run. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/run-query.png" alt-text="Screenshot showing how to run query." border="true" lightbox="./media/monitor-with-prometheus-grafana/run-query.png"::: ++1. View the metric for your selection. ++ :::image type="content" source="./media/monitor-with-prometheus-grafana/view-output.png" alt-text="Screenshot showing how to view the output." border="true" lightbox="./media/monitor-with-prometheus-grafana/view-output.png"::: |
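If you prefer to script the role assignments described in the permissions section, the following Azure CLI sketch grants the **Grafana Viewer** role to a user and the **Monitoring Reader** role to the Grafana workspace identity. The resource IDs and object IDs are placeholders, and the commands assume you are allowed to create role assignments at the chosen scopes.

```azurecli
# Placeholder IDs; replace them with your own values.
GRAFANA_ID="/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Dashboard/grafana/<grafana-name>"
USER_ID="<user-object-id-or-upn>"
GRAFANA_MSI="<grafana-workspace-identity-principal-id>"
SUBSCRIPTION_ID="<subscription-id>"

# Grant the user the Grafana Viewer role on the Grafana workspace.
az role assignment create \
  --assignee "$USER_ID" \
  --role "Grafana Viewer" \
  --scope "$GRAFANA_ID"

# Grant the Grafana workspace identity Monitoring Reader at subscription scope.
# This is needed when the Azure Managed Prometheus workspace was pre-created.
az role assignment create \
  --assignee-object-id "$GRAFANA_MSI" \
  --assignee-principal-type ServicePrincipal \
  --role "Monitoring Reader" \
  --scope "/subscriptions/$SUBSCRIPTION_ID"
```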
hdinsight-aks | Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/overview.md | + + Title: What is Azure HDInsight on AKS? (Preview) +description: An introduction to Azure HDInsight on AKS. +++ Last updated : 08/29/2023+++# What is HDInsight on AKS? (Preview) ++ +HDInsight on AKS is a modern, reliable, secure, and fully managed Platform as a Service (PaaS) that runs on Azure Kubernetes Service (AKS). HDInsight on AKS allows you to deploy popular Open-Source Analytics workloads like Apache Spark, Apache Flink, and Trino without the overhead of managing and monitoring containers. +You can build end-to-end, petabyte-scale Big Data applications spanning streaming through Apache Flink, data engineering and machine learning using Apache Spark, and Trino's powerful query engine. ++All these capabilities combined with HDInsight on AKSΓÇÖs strong developer focus enables enterprises and digital natives with deep technical expertise to build and operate applications that are right fit for their needs. HDInsight on AKS allows developers to access all the rich configurations provided by open-source software and the extensibility to seamlessly include other ecosystem offerings. This offering empowers developers to test and tune their applications to extract the best performance at optimal cost. ++HDInsight on AKS integrates with the entire Azure ecosystem, shortening implementation cycles and improving time to realize value. + + + ## Technical architecture ++ HDInsight on AKS introduces the concept of cluster pools and clusters, which allow you to realize the complete value of data lakehouse. Cluster pools allow you to use multiple compute workloads on a single data lake, thereby removing the overhead of network management and resource planning. + +* **Cluster pools** are a logical grouping of clusters that help build robust interoperability across multiple cluster types and allow enterprises to have the clusters in the same virtual network. Cluster pools provide rapid and cost-effective access to all the cluster types created on-demand and at scale. +<br>One cluster pool corresponds to one cluster in AKS infrastructure. +* **Clusters** are individual compute workloads, such as Apache Spark, Apache Flink, and Trino, that can be created rapidly in few minutes with preset configurations. ++You can create the pool with a single cluster or a combination of cluster types, which are based on the need and can custom configure the following options: ++* Storage +* Network +* Logging +* Monitoring ++The following diagram shows the logical technical architecture of components installed in a default cluster pool. The clusters are isolated using [namespaces](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) in AKS clusters. +++## Modernized cloud-native compute platform ++The latest version of HDInsight is orchestrated using AKS, which enables the platform to be more robust and empowers the users to handle the clusters effectively. Provisioning of clusters on HDInsight on AKS is fast and reliable, making it easy to manage clusters and perform in-place upgrades. With vast SKU choices and flexible subscription models, modernizing data lakehouses using open-source, cloud-native, and scalable infrastructure on HDInsight on AKS can meet all your analytics needs. + + +**Key features include:** +* Fast cluster creation and scaling. +* Ease of maintenance and periodic security updates. +* Cluster resiliency powered by modern cloud-native AKS. 
+* Native support for modern authentication with OAuth and Microsoft Entra ID (Azure Active Directory). +* Deep integration with Azure services: Azure Data Factory (ADF), Power BI, Azure Monitor. ++## Connectivity to HDInsight ++HDInsight on AKS can connect seamlessly with HDInsight. You can use the cluster types you need in a hybrid model and interoperate with HDInsight cluster types by using the same storage and metastore across both offerings. +++**The following scenarios are supported:** ++* [Flink connecting to HBase](./flink/use-flink-to-sink-kafka-message-into-hbase.md) +* [Flink connecting to Kafka](./flink/join-stream-kafka-table-filesystem.md) +* Spark connecting to HBase +* Spark connecting to Kafka ++## Security architecture ++HDInsight on AKS is secure by default. It enables enterprises to protect enterprise data assets with Azure Virtual Network, encryption, and integration with Microsoft Entra ID (Azure Active Directory). It also meets the most popular industry and government compliance standards, upholding Azure standards with over 30 certifications that help protect data, along with periodic updates, health advisor notifications, service health analytics, and best-in-class Azure security standards. HDInsight on AKS offers several methods to address your enterprise security needs by default. +For more information, see [HDInsight on AKS security](./concept-security.md). ++ +## Region availability (public preview) ++* West Europe +* Central India +* UK South +* Korea Central +* East US 2 +* West US 2 +* West US 3 +* East US |
hdinsight-aks | Prerequisites Resources | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/prerequisites-resources.md | + + Title: Resource prerequisites for Azure HDInsight on AKS +description: Prerequisite steps to complete for Azure resources before working with HDInsight on AKS. ++ Last updated : 08/29/2023+++# Resource prerequisites +++This article details the resources required for getting started with HDInsight on AKS. It covers the necessary and the optional resources and how to create them. ++## Necessary resources +The following table depicts the necessary resources that are required for cluster creation based on the cluster types. ++|Workload|Managed Service Identity (MSI)|Storage|SQL Server - SQL Database|Key Vault| +| ||||| +|Trino| ✅ | | | | +|Flink| ✅ | ✅ | | | +|Spark| ✅ | ✅ | | | +|Trino, Flink, or Spark with Hive Metastore (HMS)| ✅ | ✅ | ✅ | ✅ | ++> [!NOTE] +> MSI is used as a security standard for authentication and authorization across resources, except SQL Database. The role assignment occurs prior to deployment to authorize MSI to storage and the secrets are stored in Key vault for SQL Database. Storage support is with ADLS Gen2, and is used as data store for the compute engines, and SQL Database is used for table management on Hive Metastore. ++## Optional resources ++* Virtual Network (VNet) and Subnet: [Create virtual network](/azure/virtual-network/quick-create-portal) +* Log Analytics Workspace: [Create Log Analytics workspace](/azure/azure-monitor/logs/quick-create-workspace?tabs=azure-portal) ++> [!NOTE] +> +> * HDInsight on AKS allows you to bring your own VNet and Subnet, enabling you to customize your [network requirements](./secure-traffic-by-firewall.md) to suit the needs of your enterprise. +> * Log Analytics workspace is optional and needs to be created ahead in case you would like to use Azure Monitor capabilities like [Azure Log Analytics](./how-to-azure-monitor-integration.md). ++You can create the necessary resources in two ways: ++* [Ready-to-use ARM templates](#using-arm-templates) +* [Using Azure portal](#using-azure-portal) ++### Using ARM templates ++The following ARM templates allow you to create the specified necessary resources, in one click using a resource prefix and more details as required. ++For example, if you provide resource prefix as “demo” then, following resources are created in your resource group depending on the template you select - +* MSI is created with name as `demoMSI`. +* Storage is created with name as `demostore` along with a container as `democontainer`. +* Key vault is created with name as `demoKeyVault` along with the secret provided as parameter in the template. +* Azure SQL database is created with name as `demoSqlDB` along with SQL server with name as `demoSqlServer`. + +|Workload|Prerequisites| +||| +|Trino|**Create the resources mentioned as follows:** <br> 1. Managed Service Identity (MSI): user-assigned managed identity. <br><br> [![Deploy Trino to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FprerequisitesTrino.json)| +|Flink |**Create the resources mentioned as follows:** <br> 1. Managed Service Identity (MSI): user-assigned managed identity. <br> 2. ADLS Gen2 storage account and a container. <br><br> **Role assignments:** <br> 1. Assigns “Storage Blob Data Owner” role to user-assigned MSI on storage account. 
<br><br> [![Deploy Apache Flink to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FprerequisitesFlink.json)| +|Spark| **Create the resources mentioned as follows:** <br> 1. Managed Service Identity (MSI): user-assigned managed identity. <br> 2. ADLS Gen2 storage account and a container. <br><br> **Role assignments:** <br> 1. Assigns “Storage Blob Data Owner” role to user-assigned MSI on storage account. <br><br> [![Deploy Spark to Azure](https://aka.ms/deploytoazurebutton)]( https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2FprerequisitesSpark.json)| +|Trino, Flink, or Spark with Hive Metastore (HMS)|**Create the resources mentioned as follows:** <br> 1. Managed Service Identity (MSI): user-assigned managed identity. <br> 2. ADLS Gen2 storage account and a container. <br> 3. Azure Key Vault and a secret to store SQL Server admin credentials. <br><br> **Role assignments:** <br> 1. Assigns “Storage Blob Data Owner” role to user-assigned MSI on storage account. <br> 2. Assigns “Key Vault Secrets User” role to user-assigned MSI on Key Vault. <br><br> [![Deploy Trino HMS to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FAzure-Samples%2Fhdinsight-aks%2Fmain%2FARM%2520templates%2Fprerequisites_WithHMS.json)| ++> [!NOTE] +> Using these ARM templates require a user to have permission to create new resources and assign roles to the resources in the subscription. ++### Using Azure portal ++#### [Create user-assigned managed identity (MSI)](/azure/active-directory/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azp#create-a-user-assigned-managed-identity) ++ A managed identity is an identity registered in Microsoft Entra ID [(Azure Active Directory)](https://www.microsoft.com/security/business/identity-access/azure-active-directory) whose credentials managed by Azure. With managed identities, you need not register service principals in Azure AD to maintain credentials such as certificates. ++ HDInsight on AKS relies on user-assigned MSI for communication among different components. ++#### [Create storage account – ADLS Gen 2](/azure/storage/blobs/create-data-lake-storage-account) ++ The storage account is used as the default location for cluster logs and other outputs. + Enable hierarchical namespace during the storage account creation to use as ADLS Gen2 storage. ++ 1. [Assign a role](/azure/role-based-access-control/role-assignments-portal#step-2-open-the-add-role-assignment-page): Assign “Storage Blob Data Owner” role to the user-assigned MSI created to this storage account. + + 1. [Create a container](/azure/storage/blobs/blob-containers-portal#create-a-container): After creating the storage account, create a container in the storage account. + + > [!NOTE] + > Option to create a container during cluster creation is also available. ++#### [Create Azure SQL Database](/azure/azure-sql/database/single-database-create-quickstart) + + Create an Azure SQL Database to be used as an external metastore during cluster creation or you can use an existing SQL Database. However, ensure the following properties are set. 
+ + **Necessary properties to be enabled for SQL Server and SQL Database**- + + |Resource Type| Property| Description| + |-|-|-| + |SQL Server |Authentication method |While creating a SQL Server, use "Authentication method" as <br> :::image type="content" source="./media/prerequisites-resources/authentication-method.png" alt-text="Screenshot showing how to select authentication method." border="true" lightbox="media/prerequisites-resources/authentication-method.png":::| + |SQL Database |Allow Azure services and resources to access this server |Enable this property under Networking blade in your SQL database in the Azure portal.| ++ > [!NOTE] + > * Currently, we support only Azure SQL Database as inbuilt metastore. + > * Due to Hive limitation, "-" (hyphen) character in metastore database name is not supported. + > * Azure SQL Database should be in the same region as your cluster. + > * Option to create a SQL Database during cluster creation is also available. However, you need to refresh the cluster creation page to get the newly created database appear in the dropdown list. ++#### [Create Azure Key Vault](/azure/key-vault/general/quick-create-portal#create-a-vault) ++ Key Vault allows you to store the SQL Server admin password set during SQL Database creation. + HDInsight on AKS platform doesn’t deal with the credential directly. Hence, it's necessary to store your important credentials in the Key Vault. ++ 1. [Assign a role](/azure/role-based-access-control/role-assignments-portal#step-2-open-the-add-role-assignment-page): Assign “Key Vault Secrets User” role to the user-assigned MSI created as part of necessary resources to this Key Vault. + + 1. [Create a secret](/azure/key-vault/secrets/quick-create-portal#add-a-secret-to-key-vault): This step allows you to keep your SQL Server admin password as a secret in Azure Key Vault. Add your password in the “Value” field while creating a secret. ++ > [!NOTE] + > * Make sure to note the secret name, as this is required during cluster creation. + > * You need to have a “Key Vault Administrator” role assigned to your identity or account to add a secret in the Key Vault using Azure portal. Navigate to the Key Vault and follow the steps on [how to assign the role](/azure/role-based-access-control/role-assignments-portal#step-2-open-the-add-role-assignment-page). |
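As an alternative to the portal steps above, the following Azure CLI sketch creates the same set of prerequisite resources and role assignments. All names mirror the "demo" prefix example and are placeholders; the commands assume you can create resources and assign Azure RBAC roles in the subscription, and that your own account has rights to add secrets to the Key Vault.

```azurecli
# Placeholder names; adjust to your environment.
RG="demo-rg"
LOC="westeurope"
MSI_NAME="demoMSI"
STORAGE="demostore"
CONTAINER="democontainer"
KEYVAULT="demoKeyVault"
SQL_SERVER="demosqlserver"
SQL_DB="demoSqlDB"

# User-assigned managed identity.
az identity create --resource-group $RG --name $MSI_NAME
MSI_PRINCIPAL=$(az identity show --resource-group $RG --name $MSI_NAME --query principalId -o tsv)

# ADLS Gen2 storage account (hierarchical namespace enabled) and a container.
az storage account create --resource-group $RG --name $STORAGE --location $LOC --sku Standard_LRS --kind StorageV2 --hns true
az storage container create --account-name $STORAGE --name $CONTAINER

# Assign "Storage Blob Data Owner" on the storage account to the managed identity.
STORAGE_ID=$(az storage account show --resource-group $RG --name $STORAGE --query id -o tsv)
az role assignment create --assignee-object-id $MSI_PRINCIPAL --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Owner" --scope $STORAGE_ID

# Key Vault (RBAC authorization) and a secret holding the SQL admin password.
# Adding the secret requires that your own identity has a Key Vault data-plane role such as Key Vault Administrator.
az keyvault create --resource-group $RG --name $KEYVAULT --location $LOC --enable-rbac-authorization true
az keyvault secret set --vault-name $KEYVAULT --name sqlpassword --value "<sql-admin-password>"
KV_ID=$(az keyvault show --resource-group $RG --name $KEYVAULT --query id -o tsv)
az role assignment create --assignee-object-id $MSI_PRINCIPAL --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" --scope $KV_ID

# Azure SQL server and database for the Hive Metastore (avoid "-" in the database name).
az sql server create --resource-group $RG --name $SQL_SERVER --location $LOC \
  --admin-user sqladmin --admin-password "<sql-admin-password>"
az sql db create --resource-group $RG --server $SQL_SERVER --name $SQL_DB

# Allow Azure services and resources to access the SQL server (0.0.0.0 start/end is the "Azure services" rule).
az sql server firewall-rule create --resource-group $RG --server $SQL_SERVER \
  --name AllowAzureServices --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
```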
hdinsight-aks | Prerequisites Subscription | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/prerequisites-subscription.md | + + Title: Subscription prerequisites for Azure HDInsight on AKS. +description: Prerequisite steps to complete on your subscription before working with Azure HDInsight on AKS. ++ Last updated : 08/29/2023+++# Subscription prerequisites +++If you're using Azure subscription first time for HDInsight on AKS, the following features might need to be enabled. ++## Enable features ++1. Sign in to [Azure portal](https://portal.azure.com). + +1. Click the Cloud Shell icon (:::image type="icon" source="./media/prerequisites-subscription/cloud-shell.png" alt-text="Screenshot screenshot showing Cloud Shell icon.":::) at the top right, and select **PowerShell** or **Bash** as your environment depending on the command you use. ++At the next command prompt, enter each of the following commands: + +1. **Register your subscription for 'AKS-AzureKeyVaultSecretsProvider' feature.** ++ ```azurecli + az feature register --name AKS-AzureKeyVaultSecretsProvider --namespace "Microsoft.ContainerService" --subscription <Your Subscription> + ``` + + ```powershell + Register-AzProviderFeature -FeatureName AKS-AzureKeyVaultSecretsProvider -ProviderNamespace Microsoft.ContainerService + ``` ++ **Output:** All requests for this feature should be automatically approved. The state in the response should show as **Registered**. + <br>If you receive a response that the registration is still on-going (state in the response shows as "Registering"), wait for a few minutes. <br>Run the command again in few minutes and the state changes to "Registered" once feature registration is completed. ++1. **Register your subscription for 'EnablePodIdentityPreview' feature.** ++ ```azurecli + az feature register --name EnablePodIdentityPreview --namespace "Microsoft.ContainerService" --subscription <Your Subscription> + ``` + + ```powershell + Register-AzProviderFeature -FeatureName EnablePodIdentityPreview -ProviderNamespace Microsoft.ContainerService + ``` + **Output:** The response indicates the registration is in progress (state in the response shows as "Registering"). It might take a few minutes to register the feature. + <br>Run the command again in few minutes and the state changes to "Registered" once feature registration is completed. ++1. **Register your subscription for 'KubeletDisk' feature.** + + ```azurecli + az feature register --name KubeletDisk --namespace "Microsoft.ContainerService" --subscription <Your Subscription> + ``` + + ```powershell + Register-AzProviderFeature -FeatureName KubeletDisk -ProviderNamespace Microsoft.ContainerService + ``` + + **Output:** The response indicates the registration is in progress (state in the response shows as "Registering"). It might take a few minutes to register the feature. + <br>Run the command again in few minutes and the state changes to "Registered" once feature registration is completed. ++1. **Register with 'Microsoft.ContainerService' provider to propagate the features registered in the previous steps.** + + ```azurecli + az provider register -n Microsoft.ContainerService --subscription <Your Subscription> + ``` + + ```powershell + Register-AzResourceProvider -ProviderNamespace Microsoft.ContainerService + ``` + + **Output:** No response means the feature registration propagated and you can proceed. 
If you receive a response that the registration is still ongoing, wait a few minutes, and then run the command again until you receive no response. ++## Next steps +* [One-click deployment](./get-started.md) |
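To verify that each feature has finished registering before you move on, you can query the registration state directly. The following sketch checks the three features registered above and the resource provider itself; repeat it until every state shows **Registered**.

```azurecli
# Check the registration state of each feature.
for FEATURE in AKS-AzureKeyVaultSecretsProvider EnablePodIdentityPreview KubeletDisk; do
  az feature show \
    --namespace Microsoft.ContainerService \
    --name $FEATURE \
    --query "{feature: name, state: properties.state}" \
    --output table
done

# Confirm the resource provider itself is registered.
az provider show --namespace Microsoft.ContainerService --query registrationState --output tsv
```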
hdinsight-aks | Preview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/preview.md | + + Title: HDInsight on AKS preview information +description: This article explains what public preview means in HDInsight on AKS. ++ Last updated : 09/05/2023+++# Microsoft HDInsight on AKS preview information ++Azure HDInsight on AKS is currently in public preview and may be substantially modified before it's released. Preview online service products and features aren't complete but are made available on a preview basis so that customers can get early access and provide feedback. ++This article describes the Azure HDInsight on AKS preview state and provides disclaimers related to the preview. ++## Terms of use ++Your use of the Microsoft Azure HDInsight on AKS Cluster Pool or Microsoft Azure HDInsight on AKS Clusters preview experiences and features is governed by the preview online service terms and conditions of the agreement(s) under which you obtained the services and the [supplemental preview terms](https://go.microsoft.com/fwlink/?linkid=2240967). ++Previews are provided "as-is," "with all faults," and "as available," and are excluded from the service level agreements and limited warranty. Customer support may not cover previews. We may change or discontinue previews at any time without notice. We also may choose not to release a preview into "General Availability." ++Previews may be subject to reduced or different security, compliance, and privacy commitments, as further explained in the [Microsoft Privacy Statement](https://go.microsoft.com/fwlink/?LinkId=521839), [Microsoft Trust Center](https://go.microsoft.com/fwlink/?linkid=2179910), the [Product Terms](https://go.microsoft.com/fwlink/?linkid=2173816), the [Microsoft Products and Services Data Protection Addendum ("DPA")](https://go.microsoft.com/fwlink/?linkid=2153219), and any extra notices provided with the preview. ++## Functionality ++During preview, Microsoft Azure HDInsight on AKS may have limited or restricted functionality. ++## Availability ++During public preview, HDInsight on AKS may not be available in all geographic areas. For more information, see [region availability](./overview.md#region-availability-public-preview). ++ |
hdinsight-aks | Quickstart Create Cluster | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/quickstart-create-cluster.md | + + Title: Create cluster pool and cluster +description: Creating a cluster pool and cluster in HDInsight on AKS. ++ Last updated : 08/29/2023+++# Create cluster pool and cluster +++HDInsight on AKS has the concept of cluster pools and clusters. ++- **Cluster pools** are a logical grouping of clusters and maintain a set of clusters in the same pool, which helps in building robust interoperability across multiple cluster types. It can be created within an existing virtual network or outside a virtual network. ++ A cluster pool in HDInsight on AKS corresponds to one cluster in AKS infrastructure. ++- **Clusters** are individual compute workloads, such as Apache Spark, Apache Flink, or Trino, which can be created in the same cluster pool. ++For creating Apache Spark, Apache Flink, or Trino clusters, you need to first create a cluster pool. ++## Prerequisites ++Ensure that you have completed the [subscription prerequisites](prerequisites-subscription.md) and [resource prerequisites](prerequisites-resources.md) before creating a cluster pool. ++## Create a cluster pool ++1. Sign in to [Azure portal](https://portal.azure.com). + +1. In the Azure portal search bar, type "HDInsight on AKS cluster pool" and select "Azure HDInsight on AKS cluster pools" from the drop-down list. + + :::image type="content" source="./media/quickstart-create-cluster/search-bar.png" alt-text="Diagram showing search bar in Azure portal." border="true" lightbox="./media/quickstart-create-cluster/search-bar.png" + +1. Click **+ Create**. ++ :::image type="content" source="./media/quickstart-create-cluster/create-button.png" alt-text="Diagram showing create button." border="true" lightbox="./media/quickstart-create-cluster/create-button.png"::: + +1. In the **Basics** tab, enter the following information: ++ :::image type="content" source="./media/quickstart-create-cluster/cluster-pool-basic-tab.png" alt-text="Diagram showing cluster pool creation basic tab." border="true" lightbox="./media/quickstart-create-cluster/cluster-pool-basic-tab.png"::: ++ |Property|Description| + ||| + |Subscription| From the drop-down list, select the Azure subscription under which you want to create HDInsight on AKS cluster pool.| + |Resource group|From the drop-down list, select an existing resource group, or select **Create new.**| + |Pool name| Enter the name of the cluster pool to be created. Cluster pool name length can't be more than 26 characters. It must start with an alphabet, end with an alphanumeric character, and must only contain alphanumeric characters and hyphens.| + |Region|From the drop-down list, select the region for the cluster pool. Check [region availability](./overview.md#region-availability-public-preview). For cluster pools in a virtual network, the region for the virtual network and the cluster pool must be same. | + |Cluster pool version|From the drop-down list, select the HDInsight on AKS cluster pool version. | + |Virtual machine|From the drop-down list, select the virtual machine size for the cluster pool based on your requirement.| + |Managed resource group|(Optional) Provide a name for managed resource group. It holds ancillary resources created by HDInsight on AKS.| + ++ Select **Next: Security + networking** to continue. + +1. 
On the **Security + networking** page, provide the following information: + + :::image type="content" source="./media/quickstart-create-cluster/cluster-pool-security-tab.png" alt-text="Diagram showing cluster pool creation network and security tab." border="true" lightbox="./media/quickstart-create-cluster/cluster-pool-security-tab.png"::: ++ |Property|Description| + ||| + |Virtual network (VNet) | From the drop-down list, select a virtual network, which is in the same region as the cluster pool.| + |Subnet | From the drop-down list, select the name of the subnet that you plan to associate with the cluster pool.| ++ Select **Next: Integrations** to continue. + + +1. On the **Integrations** page, provide the following information: ++ :::image type="content" source="./media/quickstart-create-cluster/create-cluster-pool-integration-tab.png" alt-text="Diagram showing cluster pool creation integration tab." border="true" lightbox="./media/quickstart-create-cluster/create-cluster-pool-integration-tab.png"::: ++ |Property|Description| + ||| + |Log Analytics| (Optional) Select this option to enable Log analytics to view insights and logs directly in your cluster by sending metrics and logs to a Log Analytics Workspace.| + |Azure Prometheus| You can enable this option after cluster pool creation is completed. | + + Select **Next: Tags** to continue. + +1. On the **Tags** page, enter any tags (optional) youΓÇÖd like to assign to the cluster pool. + + :::image type="content" source="./media/quickstart-create-cluster/create-cluster-pool-tags-page.png" alt-text="Diagram showing cluster pool creation tags tab." border="true" lightbox="./media/quickstart-create-cluster/create-cluster-pool-tags-page.png"::: ++ | Property | Description| + ||| + |Name | Enter a name (key) that help you identify resources based on settings that are relevant to your organization. For example, "Environment" to track the deployment environment for your resources.| + | Value | Enter the value that helps to relate to the resources. For example, "Production" to identify the resources deployed to production.| + | Resource | Select the applicable resource type.| ++ Select **Next: Review + create** to continue. + +1. On the **Review + create** page, look for the **Validation succeeded** message at the top of the page and then click **Create**. ++ The **Deployment is in process** page is displayed while the cluster pool is being created, and the **Your deployment is complete page** is displayed once the cluster pool is fully deployed and ready for use. ++ :::image type="content" source="./media/quickstart-create-cluster/create-cluster-pool-review-create-page.png" alt-text="Diagram showing cluster pool review and create tab." lightbox="./media/quickstart-create-cluster/create-cluster-review-create-page.png"::: ++ If you navigate away from the page, you can check the status of the deployment by clicking Notifications icon. + + > [!TIP] + > For troubleshooting any deployment errors, you can refer this [page](./create-cluster-error-dictionary.md). ++## Create a cluster ++Once the cluster pool deployment completes, continue to use the Azure portal to create a [Trino](./trino/trino-create-cluster.md#create-a-trino-cluster), [Flink](./flink/flink-create-cluster-portal.md#create-an-apache-flink-cluster), and [Spark](./spark/hdinsight-on-aks-spark-overview.md) cluster. 
++> [!IMPORTANT] +> For creating a cluster in a new cluster pool, assign AKS agentpool MSI "Managed Identity Operator" role on the user-assigned managed identity created as part of resource prerequisites. +> When a user has permission to assign the Azure RBAC roles, it's assigned automatically. +> +> AKS agentpool managed identity is created during cluster pool creation. You can identify the AKS agentpool managed identity by **(your clusterpool name)-agentpool**. +> Follow these steps to [assign the role](/azure/role-based-access-control/role-assignments-portal#step-2-open-the-add-role-assignment-page). ++For a quickstart, refer to the following steps. ++1. When the cluster pool creation completes, click **Go to resource** from the **Your deployment is complete** page or the **Notifications** area. If the **Go to resource option** isn't available, type *HDInsight on AKS cluster pool* in the search bar on the Azure portal, and then select the cluster pool you created. ++1. Click **+ New cluster** from and then provide the following information: ++ :::image type="content" source="./media/quickstart-create-cluster/create-new-cluster.png" alt-text="Screenshot showing create new cluster option."::: + + :::image type="content" source="./media/quickstart-create-cluster/create-cluster-basic-page.png" alt-text="Diagram showing how to create a new cluster." border="true" lightbox="./media/quickstart-create-cluster/create-cluster-basic-page.png"::: + + | Property| Description| + ||| + |Subscription | By default, it's populated with the subscription used for the cluster pool.| + |Resource group| By default, it's populated with the resource group used for the cluster pool.| + |Cluster pool|Represents the cluster pool in which the cluster has to be created. To create a cluster in a different pool, find that cluster pool in the portal and click **+ New cluster**.| + |Region| By default, it's populated with the region used for the cluster pool.| + |Cluster pool version|By default, it's populated with the version used for the cluster pool.| + |HDInsight on AKS version| From the drop-down list, select the HDInsight on AKS version. For more information, see [versioning](./versions.md).| + |Cluster type | From the drop-down list, select the type of Cluster you want to create: Trino, Flink, or Spark.| + |Cluster package| Select the cluster package with component version available for the selected cluster type. | + |Cluster name|Enter the name of the new cluster.| + |User-assigned managed identity | Select the managed identity to use with the cluster.| + |Storage account (ADLS Gen2) | Select a storage account and a container that is the default location for cluster logs and other output. It's mandatory for Apache Flink and Spark cluster type.| + |Virtual network (VNet) | The virtual network for the cluster. It's derived from the cluster pool.| + |Subnet|The virtual network subnet for the cluster. It's derived from the cluster pool.| + + Click **Next: Configuration** to continue. ++1. 
On the **Configuration** page, provide the following information: ++ :::image type="content" source="./media/quickstart-create-cluster/configuration-and-pricing-tab.png" alt-text="Diagram showing configuration tab."::: + ++ |Property|Description| + ||| + |Head node size| This value is same as the worker node size.| + |Number of head nodes|This value is set by default based on the cluster type.| + |Worker node size| From the drop-down list, select the recommended SKU or you can choose the SKU available in your subscription by clicking **Select VM size**.| + |Number of worker nodes|Select the number of worker nodes required for your cluster.| + |Autoscale|(Optional) Select this option to enable the autoscale capability| + |Secure shell (SSH) configuration|(Optional) Select this option to enable SSH node. By enabling SSH, more VM nodes are created.| ++ > [!NOTE] + > You will see extra section to provide service configurations for Apache Flink clusters. ++ Click **Next: Integrations** to continue. ++1. On the **Integrations** page, provide the following information: ++ :::image type="content" source="./media/quickstart-create-cluster/cluster-integration-tab.png" alt-text="Diagram showing integration tab."::: ++ |Property|Description| + ||| + |Log Analytics|(Optional) Select this option to enable Log analytics to view insights and logs directly in your cluster by sending metrics and logs to a Log Analytics Workspace.| + |Azure Prometheus|(Optional) Select this option to enable Azure Managed Prometheus to view Insights and Logs directly in your cluster by sending metrics and logs to an Azure Monitor workspace.| ++ > [!NOTE] + > To enable Log Analytics and Azure Prometheus, it should be first enabled at the cluster pool level. + + Click **Next: Tags** to continue. + +1. On the **Tags** page, enter any tags(optional) youΓÇÖd like to assign to the cluster. + + :::image type="content" source="./media/quickstart-create-cluster/create-cluster-tags-page.png" alt-text="Screenshot showing tags page."::: ++ | Property | Description| + ||| + |Name | Enter a name (key) that help you identify resources based on settings that are relevant to your organization. "Environment" to track the deployment environment for your resources.| + | Value | Enter the value that helps to relate to the resources. "Production" to identify the resources deployed to production.| + | Resource | Select the applicable resource type.| ++ Select **Next: Review + create** to continue. ++1. On the **Review + create** page, look for the **Validation succeeded** message at the top of the page and then click **Create**. ++ :::image type="content" source="./media/quickstart-create-cluster/create-cluster-review-create-page.png" alt-text="Diagram showing cluster review and create tab." lightbox="./media/quickstart-create-cluster/create-cluster-review-create-page.png"::: ++ The **Deployment is in process** page is displayed while the cluster is being created, and the **"Your deployment is complete"** page is displayed once the cluster is fully deployed and ready for use. ++ > [!TIP] + > For troubleshooting any deployment errors, you can refer to this [page](./create-cluster-error-dictionary.md). |
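If the **Managed Identity Operator** role assignment called out in the important note above isn't applied automatically, you can grant it yourself. The following Azure CLI sketch assumes the agentpool identity (named `<your clusterpool name>-agentpool`) is visible in your subscription and uses placeholder resource names.

```azurecli
# Placeholder values; adjust to your environment.
POOL_NAME="<your-clusterpool-name>"
USER_MSI_ID=$(az identity show --resource-group <your-rg> --name <your-user-assigned-msi> --query id -o tsv)

# Find the AKS agentpool identity created with the cluster pool.
AGENTPOOL_OBJECT_ID=$(az identity list \
  --query "[?name=='${POOL_NAME}-agentpool'].principalId | [0]" -o tsv)

# Grant the agentpool identity "Managed Identity Operator" on your user-assigned managed identity.
az role assignment create \
  --assignee-object-id $AGENTPOOL_OBJECT_ID \
  --assignee-principal-type ServicePrincipal \
  --role "Managed Identity Operator" \
  --scope $USER_MSI_ID
```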
hdinsight-aks | Hdinsight Aks Release Notes Archive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/release-notes/hdinsight-aks-release-notes-archive.md | + + Title: Archived release notes for Azure HDInsight on AKS +description: Archived release notes for Azure HDInsight on AKS. Get development tips and details for Trino, Flink, and Spark. ++ Last updated : 08/29/2023+++# Azure HDInsight on AKS archived release notes ++There are currently no archived release notes for HDInsight on AKS. ++++ |
hdinsight-aks | Hdinsight Aks Release Notes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/release-notes/hdinsight-aks-release-notes.md | + + Title: Release notes for Azure HDInsight on AKS +description: Latest release notes for Azure HDInsight on AKS. Get development tips and details for Trino, Flink, Spark, and more. ++ Last updated : 08/29/2023+++# Azure HDInsight on AKS release notes ++This article provides information about the **most recent** Azure HDInsight on AKS release updates. ++## Summary ++Azure HDInsight on AKS is a new version of HDInsight, which runs on Kubernetes and brings the best of open source to Kubernetes. It's gaining popularity among enterprise customers for open-source analytics on Azure Kubernetes Service. ++## Release date: October 10, 2023 ++**This release applies to the following:** ++- Cluster Pool Version: 1.0 +- Cluster Version: 1.0.6 ++> [!NOTE] +> To understand HDInsight on AKS versioning and support, see the **[versioning page](../versions.md)**. ++You can refer to the [What's new](../whats-new.md) page for details of the features currently in public preview for this release. ++## Release Information ++### Operating System version ++- Mariner OS 2.0 ++**Workload versions** ++|Workload|Version| +| -- | -- | +|Trino | 410 | +|Flink | 1.16 | +|Apache Spark | 3.3.1 | ++**Supported Java and Scala versions** ++|Java|Scala| +| -- | -- | +|8, JDK 1.8.0_345 |2.12.10 | ++The preview is available in the following [regions](../overview.md#region-availability-public-preview). ++If you have any more questions, contact [Azure Support](https://ms.portal.azure.com/#view/Microsoft_Azure_Support/HelpAndSupportBlade/~/overview) or refer to the [Support options](../hdinsight-aks-support-help.md) page. ++### Next steps ++- [Azure HDInsight on AKS: Frequently asked questions](../faq.md) +- [Create a cluster pool and cluster](../quickstart-create-cluster.md) |
hdinsight-aks | Required Outbound Traffic | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/required-outbound-traffic.md | + + Title: Outbound traffic on HDInsight on AKS +description: Learn required outbound traffic on HDInsight on AKS. ++ Last updated : 08/29/2023+++# Required outbound traffic for HDInsight on AKS +++This article outlines the networking information to help manage the network policies at enterprise and make necessary changes to the network security groups (NSGs) for smooth functioning of HDInsight on AKS. ++If you use firewall to control outbound traffic to your HDInsight on AKS cluster, you must ensure that your cluster can communicate with critical Azure services. +Some of the security rules for these services are region-specific, and some of them apply to all Azure regions. ++You need to configure the following network and application security rules in your firewall to allow outbound traffic. ++## Common traffic ++|Type| Destination Endpoint | Protocol | Port | Azure Firewall Rule Type | Use | +|-|--|-||--| -| +| ServiceTag | AzureCloud.`<Region>` | UDP | 1194 | Network security rule| Tunneled secure communication between the nodes and the control plane.| +| ServiceTag | AzureCloud.`<Region>` | TCP | 9000 | Network security rule|Tunneled secure communication between the nodes and the control plane.| +| FQDN Tag| AzureKubernetesService | HTTPS | 443 |Application security rule| Required by AKS Service.| +| Service Tag | AzureMonitor | TCP | 443 |Application security rule| Required for integration with Azure Monitor.| +| FQDN| hiloprodrpacr00.azurecr.io|HTTPS|443|Application security rule| Downloads metadata info of the docker image for setup of HDInsight on AKS and monitoring.| +| FQDN| *.blob.core.windows.net|HTTPS|443|Application security rule| Monitoring and setup of HDInsight on AKS.| +| FQDN|graph.microsoft.com|HTTPS|443|Application security rule| Authentication.| +| FQDN|*.servicebus.windows.net|HTTPS|443|Application security rule| Monitoring.| +| FQDN|*.table.core.windows.net|HTTPS|443|Application security rule| Monitoring. +| FQDN|gcs.prod.monitoring.core.windows.net|HTTPS|443|Application security rule| Monitoring.| +| FQDN|API Server FQDN (available once AKS cluster is created)|TCP|443|Network security rule| Required as the running pods/deployments use it to access the API Server. You can get this information from the AKS cluster running behind the cluster pool. For more information, see [how to get API Server FQDN](secure-traffic-by-firewall-azure-portal.md#get-aks-cluster-details-created-behind-the-cluster-pool) using Azure portal.| +++## Cluster specific traffic ++The below section outlines any specific network traffic, which a cluster shape requires, to help enterprises plan and update the network rules accordingly. ++### Trino ++| Type | Destination Endpoint | Protocol | Port | Azure Firewall Rule Type |Use | +||--|-||--|-| +| FQDN|*.dfs.core.windows.net|HTTPS|443|Application security rule|Required if Hive is enabled. It's user's own Storage account, such as contosottss.dfs.core.windows.net| +| FQDN|*.database.windows.net|mysql|1433|Application security rule|Required if Hive is enabled. It's user's own SQL server, such as contososqlserver.database.windows.net| +|Service Tag | Sql.`<Region>`|TCP|11000-11999|Network security rule|Required if Hive is enabled. It's used in connecting to SQL server. 
It's recommended to allow outbound communication from the client to all Azure SQL IP addresses in the region on ports in the range of 11000 to 11999. Use the Service Tags for SQL to make this process easier to manage. When using the Redirect connection policy, refer to the [Azure IP Ranges and Service Tags - Public Cloud](https://www.microsoft.com/download/details.aspx?id=56519) for a list of your region's IP addresses to allow.| ++### Spark ++| Type | Destination Endpoint| Protocol | Port | Azure Firewall Rule Type |Use | +|||-||--|| +| FQDN|*.dfs.core.windows.net|HTTPS|443|Application security rule|Spark Azure Data Lake Storage Gen2. It's the user's storage account, such as contosottss.dfs.core.windows.net| +|Service Tag | Storage.`<Region>`|TCP|445|Network security rule|Uses the SMB protocol to connect to Azure Files| +| FQDN|*.database.windows.net|mysql|1433|Application security rule|Required if Hive is enabled. It's the user's own SQL server, such as contososqlserver.database.windows.net| +|Service Tag | Sql.`<Region>`|TCP|11000-11999|Network security rule|Required if Hive is enabled. It's used to connect to SQL server. It's recommended to allow outbound communication from the client to all Azure SQL IP addresses in the region on ports in the range of 11000 to 11999. Use the Service Tags for SQL to make this process easier to manage. When using the Redirect connection policy, refer to the [Azure IP Ranges and Service Tags - Public Cloud](https://www.microsoft.com/download/details.aspx?id=56519) for a list of your region's IP addresses to allow. | ++### Apache Flink ++|Type|Destination Endpoint|Protocol|Port|Azure Firewall Rule Type |Use| +|-|-|-|-|-|--| +|FQDN|`*.dfs.core.windows.net`|HTTPS|443|Application security rule|Flink Azure Data Lake Storage Gen2. It's the user's storage account, such as contosottss.dfs.core.windows.net| ++## Next steps +* [How to use firewall to control outbound traffic and apply rules](./secure-traffic-by-firewall.md). +* [How to use NSG to restrict traffic](./secure-traffic-by-nsg.md). |
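As an illustration of how these rules translate into Azure Firewall configuration, the following Azure CLI sketch creates a few of the network and application rules from the common traffic table. The firewall name, resource group, collection names, priorities, and source addresses are placeholders; in a real deployment, restrict the source addresses to your subnet range and add the remaining rules the same way.

```azurecli
# Placeholder names; adjust region, source addresses, and priorities to your environment.
RG="hdiaks-egress-rg"
FWNAME="hdiaks-egress-fw"
REGION="WestEurope"

# Network rules for tunneled communication between the nodes and the AKS control plane (UDP 1194, TCP 9000).
az network firewall network-rule create \
  --resource-group $RG --firewall-name $FWNAME \
  --collection-name aksfwnr --name apiudp --priority 100 --action Allow \
  --source-addresses "*" --destination-addresses "AzureCloud.$REGION" \
  --protocols UDP --destination-ports 1194

az network firewall network-rule create \
  --resource-group $RG --firewall-name $FWNAME \
  --collection-name aksfwnr --name apitcp --priority 100 --action Allow \
  --source-addresses "*" --destination-addresses "AzureCloud.$REGION" \
  --protocols TCP --destination-ports 9000

# Application rule for FQDNs used for setup, monitoring, and authentication of HDInsight on AKS.
az network firewall application-rule create \
  --resource-group $RG --firewall-name $FWNAME \
  --collection-name aksfwar --name hdiaks-fqdns --priority 100 --action Allow \
  --source-addresses "*" \
  --protocols Https=443 \
  --target-fqdns hiloprodrpacr00.azurecr.io "*.blob.core.windows.net" graph.microsoft.com
```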
hdinsight-aks | Secure Traffic By Firewall Azure Portal | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/secure-traffic-by-firewall-azure-portal.md | + + Title: Use firewall to restrict outbound traffic on HDInsight on AKS, using Azure portal +description: Learn how to secure traffic using firewall on HDInsight on AKS using Azure portal ++ Last updated : 08/3/2023+++# Use firewall to restrict outbound traffic using Azure portal +++When an enterprise wants to use their own virtual network for the cluster deployments, securing the traffic of the virtual network becomes important. +This article provides the steps to secure outbound traffic from your HDInsight on AKS cluster via Azure Firewall using Azure portal. ++The following diagram illustrates the example used in this article to simulate an enterprise scenario: ++++## Create a virtual network and subnets + + 1. Create a virtual network and two subnets. + + In this step, set up a virtual network and two subnets for configuring the egress specifically. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-2.png" alt-text="Diagram showing creating a virtual network in the resource group using Azure portal step number 2." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-2.png"::: + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-3.png" alt-text="Diagram showing creating a virtual network and setting IP address using Azure portal step 3." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-3.png"::: + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-4.png" alt-text="Diagram showing creating a virtual network and setting IP address using Azure portal in step number four." border= "true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-virtual-network-step-4.png"::: + + > [!IMPORTANT] + > * If you add NSG in subnet , you need to add certain outbound and inbound rules manually. Follow [use NSG to restrict the traffic](./secure-traffic-by-nsg.md). + > * Don't associate subnet `hdiaks-egress-subnet` with a route table because HDInsight on AKS creates cluster pool with default outbound type and can't create the cluster pool in a subnet already associated with a route table. ++## Create HDInsight on AKS cluster pool using Azure portal ++ 1. Create a cluster pool. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-step-5.png" alt-text="Diagram showing creating a HDInsight on AKS cluster pool using Azure portal in step number five." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-step-5.png"::: + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-networking-step-6.png" alt-text="Diagram showing creating a HDInsight on AKS cluster pool networking using Azure portal step 6." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-networking-step-6.png"::: + + 2. When HDInsight on AKS cluster pool is created, you can find a route table in subnet `hdiaks-egress-subnet`. 
+ + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-networking-step-7.png" alt-text="Diagram showing creating a HDInsight on AKS cluster pool networking using Azure portal step 7." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-pool-networking-step-7.png"::: ++### Get AKS cluster details created behind the cluster pool ++You can search your cluster pool name in portal, and go to AKS cluster. For example, +++Get AKS API Server details. +++## Create firewall ++ 1. Create firewall using Azure portal. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-firewall-step-10.png" alt-text="Diagram showing creating a firewall using Azure portal step 10." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-firewall-step-10.png"::: + + 3. Enable DNS proxy server of firewall. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-firewall-dns-proxy-step-11.png" alt-text="Diagram showing creating a firewall and DNS proxy using Azure portal step 11." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-firewall-dns-proxy-step-11.png"::: + + 5. Once the firewall is created, find the firewall internal IP and public IP. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-firewall-dns-proxy-step-12.png" alt-text="Diagram showing creating a firewall and DNS proxy internal and public IP using Azure portal step 12." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-firewall-dns-proxy-step-12.png"::: ++### Add network and application rules to the firewall ++ 1. Create the network rule collection with following rules. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-firewall-rules-step-13.png" alt-text="Diagram showing adding firewall rules using Azure portal step 13." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-firewall-rules-step-13.png"::: + + 2. Create the application rule collection with following rules. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-firewall-rules-step-14.png" alt-text="Diagram showing adding firewall rules using Azure portal step 14." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-firewall-rules-step-14.png"::: ++### Create route in the route table to redirect the traffic to firewall ++Add new routes to route table to redirect the traffic to the firewall. ++++## Create cluster ++In the previous steps, we have routed the traffic to firewall. ++The following steps provide details about the specific network and application rules needed by each cluster type. You can refer to the cluster creation pages for creating [Apache Flink](./flink/flink-create-cluster-portal.md), [Trino](./trino/trino-create-cluster.md), and [Apache Spark](./spark/hdinsight-on-aks-spark-overview.md) clusters based on your need. ++> [!IMPORTANT] +> Before creating the cluster, make sure to add the following cluster specific rules to allow the traffic. ++### Trino ++ 1. Add the following rules to application rule collection `aksfwar`. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-trino-step-16.png" alt-text="Diagram showing adding application rules for Trino Cluster using Azure portal step 16." 
border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-trino-step-16.png"::: + + 2. Add the following rule to network rule collection `aksfwnr`. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-trino-1-step-16.png" alt-text="Diagram showing how to add application rules to network rule collection for Trino Cluster using Azure portal step 16." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-trino-1-step-16.png"::: + + > [!NOTE] + > Change the `Sql.<Region>` to your region as per your requirement. For example: `Sql.WestEurope` ++### Apache Flink ++ 1. Add the following rule to application rule collection `aksfwar`. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-flink-step-17.png" alt-text="Diagram showing adding application rules for Apache Flink Cluster using Azure portal step 17." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-flink-step-17.png"::: ++### Apache Spark ++ 1. Add the following rules to application rule collection `aksfwar`. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-spark-step-18.png" alt-text="Diagram showing adding application rules for Apache Flink Cluster using Azure portal step 18." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal/create-cluster-spark-1-step-18.png"::: + + 2. Add the following rules to network rule collection `aksfwnr`. + + :::image type="content" source="./media/secure-traffic-by-firewall-azure-portal/create-cluster-spark-1-step-18.png" alt-text="Diagram showing how to add application rules for Apache Flink Cluster using Azure portal step 18." border="true" lightbox="./media/secure-traffic-by-firewall-azure-portal//create-cluster-spark-1-step-18.png"::: + + > [!NOTE] + > 1. Change the `Sql.<Region>` to your region as per your requirement. For example: `Sql.WestEurope` + > 2. Change the `Storage.<Region>` to your region as per your requirement. For example: `Storage.WestEurope` +++## Solving symmetric routing issue ++The following steps allow us to request cluster by cluster load balancer ingress service and ensure the network response traffic doesn't flow to firewall. ++Add a route to the route table to redirect the response traffic to your client IP to Internet and then, you can reach the cluster directly. +++ If you can't reach the cluster and have configured NSG, follow [use NSG to restrict the traffic](./secure-traffic-by-nsg.md) to allow the traffic. ++> [!TIP] +> If you want to permit more traffic, you can configure it over the firewall. ++## How to Debug +If you find the cluster works unexpectedly, you can check the firewall logs to find which traffic is blocked. |
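The two routing changes shown in the portal screenshots (sending outbound traffic to the firewall and returning the response traffic for your client IP straight to the internet) can also be scripted. A minimal sketch, assuming the route table that HDInsight on AKS created for the `hdiaks-egress-subnet` and placeholder IP addresses:

```azurecli
# Placeholder names; align these with the route table created for the hdiaks-egress-subnet.
RG="hdiaks-egress-rg"
ROUTE_TABLE="<route-table-in-hdiaks-egress-subnet>"
FW_PRIVATE_IP="<firewall-internal-ip>"
CLIENT_IP="<your-client-public-ip>"

# Send all outbound traffic from the subnet to the firewall.
az network route-table route create \
  --resource-group $RG --route-table-name $ROUTE_TABLE \
  --name to-firewall --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance --next-hop-ip-address $FW_PRIVATE_IP

# Return traffic for your client IP goes directly to the internet, avoiding the symmetric routing issue.
az network route-table route create \
  --resource-group $RG --route-table-name $ROUTE_TABLE \
  --name to-client --address-prefix ${CLIENT_IP}/32 \
  --next-hop-type Internet
```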
hdinsight-aks | Secure Traffic By Firewall | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/hdinsight-aks/secure-traffic-by-firewall.md | + + Title: Use firewall to restrict outbound traffic on HDInsight on AKS using Azure CLI +description: Learn how to secure traffic using firewall on HDInsight on AKS using Azure CLI ++ Last updated : 08/3/2023+++# Use firewall to restrict outbound traffic using Azure CLI +++When an enterprise wants to use its own virtual network for cluster deployments, securing the traffic of the virtual network becomes important. +This article provides the steps to secure outbound traffic from your HDInsight on AKS cluster via Azure Firewall using [Azure CLI](/azure/cloud-shell/quickstart?tabs=azurecli). ++The following diagram illustrates the example used in this article to simulate an enterprise scenario: +++The example demonstrated in this article uses **Azure Cloud Shell**. ++## Define the variables ++**Copy and execute in the Azure Cloud Shell to set the values of these variables.** ++```azurecli +PREFIX="hdiaks-egress" +RG="${PREFIX}-rg" +LOC="eastus" +HDIAKS_CLUSTER_POOL=${PREFIX} +VNET_NAME="${PREFIX}-vnet" +HDIAKS_SUBNET_NAME="${PREFIX}-subnet" +# DO NOT CHANGE FWSUBNET_NAME - This is currently a requirement for Azure Firewall. +FWSUBNET_NAME="AzureFirewallSubnet" +FWNAME="${PREFIX}-fw" +FWPUBLICIP_NAME="${PREFIX}-fwpublicip" +FWIPCONFIG_NAME="${PREFIX}-fwconfig" +FWROUTE_NAME="${PREFIX}-fwrn" +FWROUTE_NAME_INTERNET="${PREFIX}-fwinternet" +``` ++## Create a virtual network and subnets ++1. Create a resource group using the `az group create` command. ++ ```azurecli + az group create --name $RG --location $LOC + ``` ++1. Create a virtual network and two subnets. + + 1. Virtual network with a subnet for the HDInsight on AKS cluster pool. ++ ```azurecli + az network vnet create \ + --resource-group $RG \ + --name $VNET_NAME \ + --location $LOC \ + --address-prefixes 10.0.0.0/8 \ + --subnet-name $HDIAKS_SUBNET_NAME \ + --subnet-prefix 10.1.0.0/16 + ``` + + 1. Subnet for Azure Firewall. + ```azurecli + az network vnet subnet create \ + --resource-group $RG \ + --vnet-name $VNET_NAME \ + --name $FWSUBNET_NAME \ + --address-prefix 10.2.0.0/16 + ``` + > [!Important] + > 1. If you add an NSG to the subnet `HDIAKS_SUBNET_NAME`, you need to add certain outbound and inbound rules manually. Follow [use NSG to restrict the traffic](./secure-traffic-by-nsg.md). + > 1. Don't associate the subnet `HDIAKS_SUBNET_NAME` with a route table, because HDInsight on AKS creates the cluster pool with the default outbound type and can't create the cluster pool in a subnet that's already associated with a route table. ++## Create HDInsight on AKS cluster pool using Azure portal ++ 1. Create a cluster pool. + + :::image type="content" source="./media/secure-traffic-by-firewall/basic-tab.png" alt-text="Diagram showing the cluster pool basic tab." border="true" lightbox="./media/secure-traffic-by-firewall/basic-tab.png"::: + + :::image type="content" source="./media/secure-traffic-by-firewall/security-tab.png" alt-text="Diagram showing the security tab." border="true" lightbox="./media/secure-traffic-by-firewall/security-tab.png"::: + + 1. When the HDInsight on AKS cluster pool is created, you can find a route table in the subnet `HDIAKS_SUBNET_NAME`. + + :::image type="content" source="./media/secure-traffic-by-firewall/route-table.png" alt-text="Diagram showing the route table." border="true" lightbox="./media/secure-traffic-by-firewall/route-table.png"::: +
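If you'd rather confirm from Cloud Shell which route table the cluster pool attached to the subnet, a quick check along these lines should work, reusing the variables defined earlier. The `routeTable.id` query here is an illustrative choice; the next section queries `routeTable.resourceGroup` from the same subnet object.

```azurecli
# Show the route table that HDInsight on AKS associated with the cluster pool subnet.
az network vnet subnet show \
    --resource-group $RG \
    --vnet-name $VNET_NAME \
    --name $HDIAKS_SUBNET_NAME \
    --query "routeTable.id" -o tsv
```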
border="true" lightbox="./media/secure-traffic-by-firewall/route-table.png"::: ++### Get AKS cluster details created behind the cluster pool ++ Follow the steps to get the AKS cluster information, which is useful in the subsequent steps. + + ```azurecli + AKS_MANAGED_RG=$(az network vnet subnet show --name $HDIAKS_SUBNET_NAME --vnet-name $VNET_NAME --resource-group $RG --query routeTable.resourceGroup -o tsv) ++ AKS_ID=$(az group show --name $AKS_MANAGED_RG --query managedBy -o tsv) ++ HDIAKS_MANAGED_RG=$(az resource show --ids $AKS_ID --que |