Service | Microsoft Docs article | Related commit history on GitHub | Change details |
---|---|---|---|
active-directory-b2c | Troubleshoot With Application Insights | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory-b2c/troubleshoot-with-application-insights.md | For more information about querying, see [Overview of log queries in Azure Monit We recommend that you install the [Azure AD B2C extension](https://marketplace.visualstudio.com/items?itemName=AzureADB2CTools.aadb2c) for [VS Code](https://code.visualstudio.com/). With the Azure AD B2C extension, the logs are organized for you by the policy name, correlation ID (Application Insights presents the first digit of the correlation ID), and the log timestamp. This feature helps you find the relevant log based on the local timestamp and see the user journey as executed by Azure AD B2C. > [!NOTE]-> The community has developed the vs code extension for Azure AD B2C to help identity developers. The extension is not supported by Microsoft, and is made available strictly as-is. +> The community has developed the VS Code extension to help people implementing and maintaining Azure AD B2C solutions. The extension is not supported by Microsoft, and is made available strictly as-is. ### Set Application Insights API access |
active-directory | On Premises Scim Provisioning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/app-provisioning/on-premises-scim-provisioning.md | -The Microsoft Entra provisioning service supports a [SCIM 2.0](https://techcommunity.microsoft.com/t5/security-compliance-and-identity/provisioning-with-scim-getting-started/ba-p/880010) client that can be used to automatically provision users into cloud or on-premises applications. This article outlines how you can use the Microsoft Entra provisioning service to provision users into an on-premises application that's SCIM enabled. If you want to provision users into non-SCIM on-premises applications that use SQL as a data store, see the [Microsoft Entra ECMA Connector Host Generic SQL Connector tutorial](tutorial-ecma-sql-connector.md). If you want to provision users into cloud apps such as DropBox and Atlassian, review the app-specific [tutorials](../../active-directory/saas-apps/tutorial-list.md). +The Microsoft Entra provisioning service supports a [SCIM 2.0](https://techcommunity.microsoft.com/t5/security-compliance-and-identity/provisioning-with-scim-getting-started/ba-p/880010) client that can be used to automatically provision users into cloud or on-premises applications. This article outlines how you can use the Microsoft Entra provisioning service to provision users into an on-premises application that's SCIM enabled. If you want to provision users into non-SCIM on-premises applications that use SQL as a data store, see the [Microsoft Entra ECMA Connector Host Generic SQL Connector tutorial](tutorial-ecma-sql-connector.md). If you want to provision users into cloud apps such as DropBox and Atlassian, review the app-specific [tutorials](../saas-apps/tutorial-list.md). ![Diagram that shows SCIM architecture.](./media/on-premises-scim-provisioning/scim-4.png) Once the agent is installed, no further configuration is necessary on-premises, 1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least an [Application Administrator](../roles/permissions-reference.md#application-administrator). 1. Browse to **Identity** > **Applications** > **Enterprise applications**.-1. Add the **On-premises SCIM app** from the [gallery](../../active-directory/manage-apps/add-application-portal.md). +1. Add the **On-premises SCIM app** from the [gallery](../manage-apps/add-application-portal.md). 1. From the left-hand menu, navigate to the **Provisioning** option and select **Get started**. 1. Select **Automatic** from the dropdown list and expand the **On-Premises Connectivity** option. 1. Select the agent that you installed from the dropdown list and select **Assign Agent(s)**. Once the agent is installed, no further configuration is necessary on-premises, > If the test connection fails, you will see the request made. Please note that while the URL in the test connection error message is truncated, the actual request sent to the application contains the entire URL provided above. 1. Configure any [attribute mappings](customize-application-attributes.md) or [scoping](define-conditional-rules-for-provisioning-user-accounts.md) rules required for your application.-1. Add users to scope by [assigning users and groups](../../active-directory/manage-apps/add-application-portal-assign-users.md) to the application. +1. Add users to scope by [assigning users and groups](../manage-apps/add-application-portal-assign-users.md) to the application. 1. 
Test provisioning a few users [on demand](provision-on-demand.md). 1. Add more users into scope by assigning them to your application. 1. Go to the **Provisioning** pane, and select **Start provisioning**.-1. Monitor using the [provisioning logs](../../active-directory/reports-monitoring/concept-provisioning-logs.md). +1. Monitor using the [provisioning logs](../reports-monitoring/concept-provisioning-logs.md). The following video provides an overview of on-premises provisioning. |
active-directory | User Provisioning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/app-provisioning/user-provisioning.md | In Microsoft Entra ID, the term *app provisioning* refers to automatically creat ![Diagram that shows provisioning scenarios.](../governance/media/what-is-provisioning/provisioning.png) -Microsoft Entra application provisioning refers to automatically creating user identities and roles in the applications that users need access to. In addition to creating user identities, automatic provisioning includes the maintenance and removal of user identities as status or roles change. Common scenarios include provisioning a Microsoft Entra user into SaaS applications like [Dropbox](../../active-directory/saas-apps/dropboxforbusiness-provisioning-tutorial.md), [Salesforce](../../active-directory/saas-apps/salesforce-provisioning-tutorial.md), [ServiceNow](../../active-directory/saas-apps/servicenow-provisioning-tutorial.md), and many more. +Microsoft Entra application provisioning refers to automatically creating user identities and roles in the applications that users need access to. In addition to creating user identities, automatic provisioning includes the maintenance and removal of user identities as status or roles change. Common scenarios include provisioning a Microsoft Entra user into SaaS applications like [Dropbox](../saas-apps/dropboxforbusiness-provisioning-tutorial.md), [Salesforce](../saas-apps/salesforce-provisioning-tutorial.md), [ServiceNow](../saas-apps/servicenow-provisioning-tutorial.md), and many more. Microsoft Entra ID also supports provisioning users into applications hosted on-premises or in a virtual machine, without having to open up any firewalls. The table below provides a mapping of protocols to connectors supported. To help automate provisioning and deprovisioning, apps expose proprietary user a To address these challenges, the System for Cross-domain Identity Management (SCIM) specification provides a common user schema to help users move into, out of, and around apps. SCIM is becoming the de facto standard for provisioning and, when used with federation standards like Security Assertions Markup Language (SAML) or OpenID Connect (OIDC), provides administrators an end-to-end standards-based solution for access management. -For detailed guidance on developing a SCIM endpoint to automate the provisioning and deprovisioning of users and groups to an application, see [Build a SCIM endpoint and configure user provisioning](use-scim-to-provision-users-and-groups.md). Many applications integrate directly with Microsoft Entra ID. Some examples include Slack, Azure Databricks, and Snowflake. For these apps, skip the developer documentation and use the tutorials provided in [Tutorials for integrating SaaS applications with Microsoft Entra ID](../../active-directory/saas-apps/tutorial-list.md). +For detailed guidance on developing a SCIM endpoint to automate the provisioning and deprovisioning of users and groups to an application, see [Build a SCIM endpoint and configure user provisioning](use-scim-to-provision-users-and-groups.md). Many applications integrate directly with Microsoft Entra ID. Some examples include Slack, Azure Databricks, and Snowflake. For these apps, skip the developer documentation and use the tutorials provided in [Tutorials for integrating SaaS applications with Microsoft Entra ID](../saas-apps/tutorial-list.md). ## Manual vs. 
automatic provisioning Applications in the Microsoft Entra gallery support one of two provisioning modes: * **Manual** provisioning means there's no automatic Microsoft Entra provisioning connector for the app yet. You must create user accounts manually. Examples are adding users directly into the app's administrative portal or uploading a spreadsheet with user account details. Consult the documentation provided by the app, or contact the app developer to determine what mechanisms are available.-* **Automatic** means that a Microsoft Entra provisioning connector is available for this application. Follow the setup tutorial specific to setting up provisioning for the application. Find the app tutorials at [Tutorials for integrating SaaS applications with Microsoft Entra ID](../../active-directory/saas-apps/tutorial-list.md). +* **Automatic** means that a Microsoft Entra provisioning connector is available for this application. Follow the setup tutorial specific to setting up provisioning for the application. Find the app tutorials at [Tutorials for integrating SaaS applications with Microsoft Entra ID](../saas-apps/tutorial-list.md). The provisioning mode supported by an application is also visible on the **Provisioning** tab after you've added the application to your enterprise apps. |
active-directory | Application Proxy Integrate With Microsoft Cloud Application Security | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/app-proxy/application-proxy-integrate-with-microsoft-cloud-application-security.md | After adding your application to Microsoft Entra ID, use the steps in [Test the ## Deploy Conditional Access App Control -To configure your application with the Conditional Access Application Control, follow the instructions in [Deploy Conditional Access Application Control for Microsoft Entra apps](/cloud-app-security/proxy-deployment-aad). +To configure your application with the Conditional Access Application Control, follow the instructions in [Deploy Conditional Access Application Control for Microsoft Entra apps](/defender-cloud-apps/proxy-deployment-aad). ## Test Conditional Access App Control |
active-directory | 2 Secure Access Current State | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/2-secure-access-current-state.md | If you use entitlement management, you can confine access packages to a subset o With an inventory of external users and organizations, determine the access to grant to the users. You can use the Microsoft Graph API to determine Microsoft Entra group membership or application assignment. -* [Working with groups in Microsoft Graph](/graph/api/resources/groups-overview?context=graph%2Fcontext&view=graph-rest-1.0&preserve-view=true) +* [Working with groups in Microsoft Graph](/graph/api/resources/groups-overview?context=graph/context&view=graph-rest-1.0&preserve-view=true) * [Applications API overview](/graph/applications-concept-overview?view=graph-rest-1.0&preserve-view=true) ### Enumerate application permissions |
active-directory | 8 Secure Access Sensitivity Labels | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/8-secure-access-sensitivity-labels.md | -See, [Learn about sensitivity labels](/microsoft-365/compliance/sensitivity-labels?view=o365-worldwide&preserve-view=true) +See, [Learn about sensitivity labels](/purview/sensitivity-labels?preserve-view=true&view=o365-worldwide) ## Before you begin Enforce protection settings such as encryption, watermarks, and access restricti Learn more: -* [Restrict access to content by using sensitivity labels to apply encryption](/microsoft-365/compliance/encryption-sensitivity-labels?view=o365-worldwide&preserve-view=true) -* [Use sensitivity labels to protect content in Microsoft Teams, Microsoft 365 Groups, and SharePoint sites](/microsoft-365/compliance/sensitivity-labels-teams-groups-sites) +* [Restrict access to content by using sensitivity labels to apply encryption](/purview/encryption-sensitivity-labels?preserve-view=true&view=o365-worldwide) +* [Use sensitivity labels to protect content in Microsoft Teams, Microsoft 365 Groups, and SharePoint sites](/purview/sensitivity-labels-teams-groups-sites) Sensitivity labels on containers can restrict access to the container, but content in the container doesn't inherit the label. For example, a user takes content from a protected site, downloads it, and then shares it without restrictions, unless the content had a sensitivity label. As you plan the governance of external access to your content, consider content, To define High, Medium, or Low Business Impact (HBI, MBI, LBI) for data, sites, and groups, consider the effect on your organization if the wrong content types are shared. * Credit card, passport, national/regional ID numbers- * [Apply a sensitivity label to content automatically](/microsoft-365/compliance/apply-sensitivity-label-automatically?view=o365-worldwide&preserve-view=true) + * [Apply a sensitivity label to content automatically](/purview/apply-sensitivity-label-automatically?preserve-view=true&view=o365-worldwide) * Content created by corporate officers: compliance, finance, executive, etc. * Strategic or financial data in libraries or sites. A sensitivity label in a document or email is customizable, clear text, and pers Determine the access criteria if Microsoft 365 Groups, Teams, or SharePoint sites are restricted with sensitivity labels. You can label content in containers or use automatic labeling for files in SharePoint, OneDrive, etc. -Learn more: [Get started with sensitivity labels](/microsoft-365/compliance/get-started-with-sensitivity-labels?view=o365-worldwide&preserve-view=true) +Learn more: [Get started with sensitivity labels](/purview/get-started-with-sensitivity-labels?preserve-view=true&view=o365-worldwide) #### Sensitivity labels on containers |
active-directory | Multi Tenant Common Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/multi-tenant-common-considerations.md | Microsoft Teams has features to limit access based on user type. Changes to The tenant switching mechanism for Microsoft Teams might require users to manually switch the context of their Teams client when working in Teams outside their home tenant. -You can enable Teams users from another entire external domain to find, call, chat, and set up meetings with your users with Teams Federation. [Manage external meetings and chat with people and organizations using Microsoft identities](/microsoftteams/manage-external-access) describes how you can allow users in your organization to chat and meet with people outside the organization who are using Microsoft as an identity provider. +You can enable Teams users from another entire external domain to find, call, chat, and set up meetings with your users with Teams Federation. [Manage external meetings and chat with people and organizations using Microsoft identities](/microsoftteams/trusted-organizations-external-meetings-chat) describes how you can allow users in your organization to chat and meet with people outside the organization who are using Microsoft as an identity provider. ### Licensing considerations for guest users in Teams |
active-directory | Ops Guide Auth | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/ops-guide-auth.md | If you're managing devices with MDM or Microsoft Intune, but not using device co #### Device trust access policies recommended reading - [How To: Plan your Microsoft Entra hybrid join implementation](../devices/hybrid-join-plan.md)-- [Identity and device access configurations](/microsoft-365/enterprise/microsoft-365-policies-configurations)+- [Identity and device access configurations](/microsoft-365/security/office-365-security/microsoft-365-policies-configurations) ### Windows Hello for Business |
active-directory | Ops Guide Iam | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/ops-guide-iam.md | As you review your list, you may find you need to either assign an owner for tas ### Identify and resolve synchronization issues -Microsoft recommends you have a good baseline and understanding of the issues in your on-premises environment that can result in synchronization issues to the cloud. Since automated tools such as [IdFix](/office365/enterprise/prepare-directory-attributes-for-synch-with-idfix) and [Microsoft Entra Connect Health](../hybrid/connect/whatis-azure-ad-connect.md#why-use-azure-ad-connect-health) can generate a high volume of false positives, we recommend you identify synchronization errors that have been left unaddressed for more than 100 days by cleaning up those objects in error. Long term unresolved synchronization errors can generate support incidents. [Troubleshooting errors during synchronization](../hybrid/connect/tshoot-connect-sync-errors.md) provides an overview of different types of sync errors, some of the possible scenarios that cause those errors and potential ways to fix the errors. +Microsoft recommends you have a good baseline and understanding of the issues in your on-premises environment that can result in synchronization issues to the cloud. Since automated tools such as [IdFix](/microsoft-365/enterprise/set-up-directory-synchronization) and [Microsoft Entra Connect Health](../hybrid/connect/whatis-azure-ad-connect.md#why-use-azure-ad-connect-health) can generate a high volume of false positives, we recommend you identify synchronization errors that have been left unaddressed for more than 100 days by cleaning up those objects in error. Long term unresolved synchronization errors can generate support incidents. [Troubleshooting errors during synchronization](../hybrid/connect/tshoot-connect-sync-errors.md) provides an overview of different types of sync errors, some of the possible scenarios that cause those errors and potential ways to fix the errors. <a name='azure-ad-connect-sync-configuration'></a> |
active-directory | Protect M365 From On Premises Attacks | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/protect-m365-from-on-premises-attacks.md | In Microsoft Entra ID, users who have privileged roles, such as administrators, - Use cloud-only accounts for Microsoft Entra ID and Microsoft 365 privileged roles. -- Deploy privileged access devices for privileged access to manage Microsoft 365 and Microsoft Entra ID. See [Device roles and profiles](/security/compass/privileged-access-devices#device-roles-and-profiles).+- Deploy privileged access devices for privileged access to manage Microsoft 365 and Microsoft Entra ID. See [Device roles and profiles](/security/privileged-access-workstations/privileged-access-devices#device-roles-and-profiles). Deploy Microsoft Entra Privileged Identity Management (PIM) for just-in-time access to all human accounts that have privileged roles. Require strong authentication to activate roles. See [What is Microsoft Entra Privileged Identity Management](../privileged-identity-management/pim-configure.md). |
active-directory | Recoverability Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/recoverability-overview.md | Microsoft Graph APIs are highly customizable based on your organizational needs. | Resource types| Reference links | | - | - | -| Users, groups, and other directory objects| [directoryObject API](/graph/api/resources/directoryObject) | +| Users, groups, and other directory objects| [directoryObject API](/graph/api/resources/directoryobject) | | Directory roles| [directoryRole API](/graph/api/resources/directoryrole) | | Conditional Access policies| [Conditional Access policy API](/graph/api/resources/conditionalaccesspolicy) | | Devices| [devices API](/graph/api/resources/device) | |
active-directory | Resilience B2c Developer Best Practices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/resilience-b2c-developer-best-practices.md | The Azure AD B2C directory service supports billions of authentications a day. I ### How to optimize directory reads and writes -- **Avoid write functions to the directory on sign-in**: Never execute a write on sign-in without a precondition (if clause) in your custom policies. One use case that requires a write on a sign-in is [just-in-time migration of user passwords](https://github.com/azure-ad-b2c/user-migration/tree/master/seamless-account-migration). Avoid any scenario that requires a write on every sign-in. [Preconditions](../../active-directory-b2c/userjourneys.md) in a user journey will look like this:+- **Avoid write functions to the directory on sign-in**: Never execute a write on sign-in without a precondition (if clause) in your custom policies. One use case that requires a write on a sign-in is [just-in-time migration of user passwords](https://github.com/azure-ad-b2c/user-migration/tree/master/seamless-account-migration). Avoid any scenario that requires a write on every sign-in. [Preconditions](/azure/active-directory-b2c/userjourneys) in a user journey will look like this: ```xml <Precondition Type="ClaimEquals" ExecuteActionsIf="true"> The Azure AD B2C directory service supports billions of authentications a day. I - Understand and plan your migration timeline. When planning to migrate users to Azure AD B2C using Microsoft Graph, consider the application and tenant limits to calculate the time needed to complete the migration of users. If you split your user creation job or script using two applications, you can use the per application limit. It would still need to remain below the per tenant threshold. - Understand the effects of your migration job on other applications. Consider the live traffic served by other relying applications to make sure you don't cause throttling at the tenant level and resource starvation for your live application. For more information, see the [Microsoft Graph throttling guidance](/graph/throttling). - Use a [load test sample](https://github.com/azure-ad-b2c/load-tests) to simulate sign-up and sign-in. - - Learn more about [Azure Active Directory B2C service limits and restrictions](../../active-directory-b2c/service-limits.md?pivots=b2c-custom-policy). + - Learn more about [Azure Active Directory B2C service limits and restrictions](/azure/active-directory-b2c/service-limits?pivots=b2c-custom-policy). ## Extend token lifetimes -In an unlikely event, when the Azure AD B2C authentication service is unable to complete new sign-ups and sign-ins, you can still provide mitigation for users who are signed in. With [configuration](../../active-directory-b2c/configure-tokens.md), you can allow users that are already signed in to continue using the application without any perceived disruption until the user signs out from the application or the [session](../../active-directory-b2c/session-behavior.md) times out due to inactivity. +In an unlikely event, when the Azure AD B2C authentication service is unable to complete new sign-ups and sign-ins, you can still provide mitigation for users who are signed in. 
With [configuration](/azure/active-directory-b2c/configure-tokens), you can allow users who are already signed in to continue using the application without any perceived disruption until the user signs out from the application or the [session](/azure/active-directory-b2c/session-behavior) times out due to inactivity. Your business requirements and desired end-user experience will dictate your frequency of token refresh for both web and single-page applications (SPAs). |
active-directory | Road To The Cloud Implement | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/road-to-the-cloud-implement.md | Client workstations are traditionally joined to Active Directory and managed via * Manage workstations from the cloud by using unified endpoint management (UEM) solutions such as [Intune](/mem/intune/fundamentals/what-is-intune). -[Windows Autopilot](/mem/autopilot/windows-autopilot) can help you establish a streamlined onboarding and device provisioning, which can enforce these directives. +[Windows Autopilot](/autopilot/windows-autopilot) can help you establish a streamlined onboarding and device provisioning, which can enforce these directives. [Windows Local Administrator Password Solution](../devices/howto-manage-local-admin-passwords.md) (LAPS) enables a cloud-first solution to manage the passwords of local administrator accounts. |
active-directory | Road To The Cloud Migrate | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/road-to-the-cloud-migrate.md | You can integrate non-Windows workstations with Microsoft Entra ID to enhance th * Plan to deploy [Platform SSO for macOS 13](https://techcommunity.microsoft.com/t5/microsoft-intune-blog/microsoft-simplifies-endpoint-manager-enrollment-for-apple/ba-p/3570319). -* For Linux, you can [sign in to a Linux virtual machine (VM) by using Microsoft Entra credentials](../../active-directory/devices/howto-vm-sign-in-azure-ad-linux.md). +* For Linux, you can [sign in to a Linux virtual machine (VM) by using Microsoft Entra credentials](../devices/howto-vm-sign-in-azure-ad-linux.md). ### Replace other Windows versions for workstations |
active-directory | Secure Best Practices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/secure-best-practices.md | When designing isolated environments, it's important to consider the following p * **Use only modern authentication** - Applications deployed in isolated environments must use claims-based modern authentication (for example, SAML, OAuth, OAuth2, and OpenID Connect) to use capabilities such as federation, Microsoft Entra B2B collaboration, delegation, and the consent framework. This way, legacy applications that have dependency on legacy authentication methods such as NT LAN Manager (NTLM) won't carry forward in isolated environments. -* **Enforce strong authentication** - Strong authentication must always be used when accessing the isolated environment services and infrastructure. Whenever possible, [passwordless authentication](../authentication/concept-authentication-passwordless.md) such as [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/hello-overview) or [FIDO2 security keys](../authentication/howto-authentication-passwordless-security-key.md)) should be used. +* **Enforce strong authentication** - Strong authentication must always be used when accessing the isolated environment services and infrastructure. Whenever possible, [passwordless authentication](../authentication/concept-authentication-passwordless.md) such as [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/hello-overview) or [FIDO2 security keys](../authentication/howto-authentication-passwordless-security-key.md) should be used. -* **Deploy secure workstations** - [Secure workstations](/security/compass/privileged-access-devices) provide the mechanism to ensure that the platform and the identity that platform represents is properly attested and secured against exploitation. Two other approaches to consider are: +* **Deploy secure workstations** - [Secure workstations](/security/privileged-access-workstations/privileged-access-devices) provide the mechanism to ensure that the platform and the identity that platform represents is properly attested and secured against exploitation. Two other approaches to consider are: * Use Windows 365 Cloud PCs (Cloud PC) with the Microsoft Graph API. In addition to the guidance in the [Microsoft Entra general operations guide](./ ### Privileged Accounts -Provision accounts in the isolated environment for administrative personnel and IT teams who operate the environment. This enables you to add stronger security policies such as device-based access control for [secure workstations](/security/compass/privileged-access-deployment). As discussed in previous sections, nonproduction environments can potentially utilize Microsoft Entra B2B collaboration to onboard privileged accounts to the non-production tenants using the same posture and security controls designed for privileged access in their production environment. +Provision accounts in the isolated environment for administrative personnel and IT teams who operate the environment. This enables you to add stronger security policies such as device-based access control for [secure workstations](/security/privileged-access-workstations/privileged-access-deployment). 
As discussed in previous sections, nonproduction environments can potentially utilize Microsoft Entra B2B collaboration to onboard privileged accounts to the non-production tenants using the same posture and security controls designed for privileged access in their production environment. Cloud-only accounts are the simplest way to provision human identities in a Microsoft Entra tenant and it's a good fit for green field environments. However, if there's an existing on-premises infrastructure that corresponds to the isolated environment (for example, pre-production or management Active Directory forest), you could consider synchronizing identities from there. This holds especially true if the on-premises infrastructure described herein is used for IaaS solutions that require server access to manage the solution data plane. For more information on this scenario, see [Protecting Microsoft 365 from on-premises attacks](./protect-m365-from-on-premises-attacks.md). Synchronizing from isolated on-premises environments might also be needed if there are specific regulatory compliance requirements such as smart-card only authentication. All human identities (local accounts and external identities provisioned through #### Passwordless credentials -A [passwordless solution](../authentication/concept-authentication-passwordless.md) is the best solution for ensuring the most convenient and secure method of authentication. Passwordless credentials such as [FIDO security keys](../authentication/howto-authentication-passwordless-security-key.md) and [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/hello-overview) are recommended for human identities with privileged roles. +A [passwordless solution](../authentication/concept-authentication-passwordless.md) is the best solution for ensuring the most convenient and secure method of authentication. Passwordless credentials such as [FIDO security keys](../authentication/howto-authentication-passwordless-security-key.md) and [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/) are recommended for human identities with privileged roles. #### Password protection |
active-directory | Secure Fundamentals | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/secure-fundamentals.md | Non-production environments are commonly referred to as sandbox environments. **Human identities** are user objects that generally represent people in an organization. These identities are either created and managed directly in Microsoft Entra ID or are synchronized from an on-premises Active Directory to Microsoft Entra ID for a given organization. These types of identities are referred to as **local identities**. There can also be user objects invited from a partner organization or a social identity provider using [Microsoft Entra B2B collaboration](../external-identities/what-is-b2b.md). In this content, we refer to these types of identity as **external identities**. -**Non-human identities** include any identity not associated with a human. This type of identity is an object such as an application that requires an identity to run. In this content, we refer to this type of identity as a **workload identity**. Various terms are used to describe this type of identity, including [application objects and service principals](../../marketplace/manage-aad-apps.md). +**Non-human identities** include any identity not associated with a human. This type of identity is an object such as an application that requires an identity to run. In this content, we refer to this type of identity as a **workload identity**. Various terms are used to describe this type of identity, including [application objects and service principals](/partner-center/marketplace/manage-aad-apps). * **Application object**. A Microsoft Entra application is defined by its application object. The object resides in the Microsoft Entra tenant where the application registered. The tenant is known as the application's "home" tenant. Non-production environments are commonly referred to as sandbox environments. * **Multi-tenant** applications allow identities from any Microsoft Entra tenant to authenticate. -* **Service principal object**. Although there are [exceptions](../../marketplace/manage-aad-apps.md), application objects can be considered the *definition* of an application. Service principal objects can be considered an instance of an application. Service principals generally reference an application object, and one application object is referenced by multiple service principals across directories. +* **Service principal object**. Although there are [exceptions](/partner-center/manage-aad-apps), application objects can be considered the *definition* of an application. Service principal objects can be considered an instance of an application. Service principals generally reference an application object, and one application object is referenced by multiple service principals across directories. **Service principal objects** are also directory identities that can perform tasks independently from human intervention. The service principal defines the access policy and permissions for a user or application in the Microsoft Entra tenant. This mechanism enables core features such as authentication of the user or application during sign-in and authorization during resource access. Some legacy scenarios required a human identity to be used in *non-human* scenar * **Microsoft Entra Domain Joined**. Devices that are owned by the organization and joined to the organization's Microsoft Entra tenant. 
Typically a device purchased and managed by an organization that is joined to Microsoft Entra ID and managed by a service such as [Microsoft Intune](https://www.microsoft.com/microsoft-365/enterprise-mobility-security/microsoft-intune). - * **Microsoft Entra registered**. Devices not owned by the organization, for example, a personal device, used to access company resources. Organizations may require the device be enrolled via [Mobile Device Management (MDM)](https://www.microsoft.com/itshowcase/mobile-device-management-at-microsoft), or enforced through [Mobile Application Management (MAM)](/office365/enterprise/office-365-client-support-mobile-application-management) without enrollment to access resources. This capability can be provided by a service such as Microsoft Intune. + * **Microsoft Entra registered**. Devices not owned by the organization, for example, a personal device, used to access company resources. Organizations may require the device be enrolled via [Mobile Device Management (MDM)](https://www.microsoft.com/itshowcase/mobile-device-management-at-microsoft), or enforced through [Mobile Application Management (MAM)](/mem/intune/apps/apps-supported-intune-apps) without enrollment to access resources. This capability can be provided by a service such as Microsoft Intune. * **Group objects** contain objects for the purposes of assigning resource access, applying controls, or configuration. Group objects contain attributes that have the required information about the group including the name, description, group members, group owners, and the group type. Groups in Microsoft Entra ID take multiple forms based on an organization's requirements and can be mastered in Microsoft Entra ID or synchronized from on-premises Active Directory Domain Services (AD DS). |
active-directory | Secure Single Tenant | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/secure-single-tenant.md | In this diagram, there are nonproduction Azure resources and nonproduction insta >[!NOTE] >You cannot have more than one Microsoft 365 environment in a single Microsoft Entra tenant. However, you can have multiple Dynamics 365 environments in a single Microsoft Entra tenant. -Another scenario for isolation within a single tenant could be separation between locations, subsidiaries, or implementation of tiered administration (according to the "[Enterprise Access Model](/security/compass/privileged-access-access-model)"). +Another scenario for isolation within a single tenant could be separation between locations, subsidiaries, or implementation of tiered administration (according to the "[Enterprise Access Model](/security/privileged-access-workstations/privileged-access-access-model)"). Azure RBAC role assignments allow scoped administration of Azure resources. Similarly, Microsoft Entra ID allows granular management of Microsoft Entra ID trusting applications through multiple capabilities such as Conditional Access, user and group filtering, administrative unit assignments and application assignments. |
active-directory | Security Operations Applications | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/security-operations-applications.md | Many applications use credentials to authenticate in Microsoft Entra ID. Any oth * Azure Monitor – [Microsoft Entra workbook to help you assess Solorigate risk - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/azure-active-directory-identity/azure-ad-workbook-to-help-you-assess-solorigate-risk/ba-p/2010718) -* Defender for Cloud Apps – [Defender for Cloud Apps anomaly detection alerts investigation guide](/cloud-app-security/investigate-anomaly-alerts) +* Defender for Cloud Apps – [Defender for Cloud Apps anomaly detection alerts investigation guide](/defender-cloud-apps/investigate-anomaly-alerts) * PowerShell - [Sample PowerShell script to find credential lifetime](https://github.com/madansr7/appCredAge). |
active-directory | Security Operations Devices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/security-operations-devices.md | The log files you use for investigation and monitoring are: * [Sign-in logs](../reports-monitoring/concept-sign-ins.md) -* [Microsoft 365 Audit logs](/microsoft-365/compliance/auditing-solutions-overview) +* [Microsoft 365 Audit logs](/purview/audit-solutions-overview) * [Azure Key Vault logs](/azure/key-vault/general/logging?tabs=Vault) From the Azure portal, you can view the Microsoft Entra audit logs and download * **[Azure Event Hubs](/azure/event-hubs/event-hubs-about) -integrated with a SIEM**- [Microsoft Entra logs can be integrated to other SIEMs](../reports-monitoring/howto-stream-logs-to-event-hub.md) such as Splunk, ArcSight, QRadar, and Sumo Logic via the Azure Event Hubs integration. -* **[Microsoft Defender for Cloud Apps](/cloud-app-security/what-is-cloud-app-security)** – enables you to discover and manage apps, govern across apps and resources, and check your cloud apps' compliance. +* **[Microsoft Defender for Cloud Apps](/defender-cloud-apps/what-is-defender-for-cloud-apps)** – enables you to discover and manage apps, govern across apps and resources, and check your cloud apps' compliance. * **[Securing workload identities with Identity Protection Preview](..//identity-protection/concept-workload-identity-risk.md)** - Used to detect risk on workload identities across sign-in behavior and offline indicators of compromise. |
active-directory | Security Operations Infrastructure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/security-operations-infrastructure.md | The remainder of this article describes what to monitor and alert on. It is orga In hybrid environments that contain both on-premises and cloud-based resources and accounts, the Active Directory infrastructure is a key part of the authentication stack. The stack is also a target for attacks so must be configured to maintain a secure environment and must be monitored properly. Examples of current types of attacks used against your authentication infrastructure use Password Spray and Solorigate techniques. The following are links to articles we recommend: -* [Securing privileged access overview](/security/compass/overview) – This article provides an overview of current techniques using Zero Trust techniques to create and maintain secure privileged access. +* [Securing privileged access overview](/security/privileged-access-workstations/overview) – This article provides an overview of current techniques using Zero Trust techniques to create and maintain secure privileged access. * [Microsoft Defender for Identity monitored domain activities](/defender-for-identity/monitored-activities) - This article provides a comprehensive list of activities to monitor and set alerts for. To configure monitoring for Application Proxy, see [Troubleshoot Application Pro | - | - | - | - | - | | Kerberos errors| Medium | Various tools| Medium | Kerberos authentication error guidance under Kerberos errors on [Troubleshoot Application Proxy problems and error messages](../app-proxy/application-proxy-troubleshoot.md). | | DC security issues| High| DC Security Audit logs| Event ID 4742(S): A computer account was changed<br>-and-<br>Flag – Trusted for Delegation<br>-or-<br>Flag – Trusted to Authenticate for Delegation| Investigate any flag change. |-| Pass-the-ticket like attacks| High| | | Follow guidance in:<br>[Security principal reconnaissance (LDAP) (external ID 2038)](/defender-for-identity/reconnaissance-discovery-alerts)<br>[Tutorial: Compromised credential alerts](/defender-for-identity/credential-access-alerts)<br>[Understand and use Lateral Movement Paths with Microsoft Defender for Identity](/defender-for-identity/use-case-lateral-movement-path)<br>[Understanding entity profiles](/defender-for-identity/investigate-assets) | +| Pass-the-ticket like attacks| High| | | Follow guidance in:<br>[Security principal reconnaissance (LDAP) (external ID 2038)](/defender-for-identity/reconnaissance-discovery-alerts)<br>[Tutorial: Compromised credential alerts](/defender-for-identity/credential-access-alerts)<br>[Understand and use Lateral Movement Paths with Microsoft Defender for Identity](/defender-for-identity/understand-lateral-movement-paths)<br>[Understanding entity profiles](/defender-for-identity/investigate-assets) | ### Legacy authentication settings For multifactor authentication (MFA) to be effective, you also need to block legacy authentication. You then need to monitor your environment and alert on any use of legacy authentication. Legacy authentication protocols like POP, SMTP, IMAP, and MAPI can't enforce MFA. This makes these protocols the preferred entry points for attackers. 
For more information on tools that you can use to block legacy authentication, see [New tools to block legacy authentication in your organization](https://techcommunity.microsoft.com/t5/azure-active-directory-identity/new-tools-to-block-legacy-authentication-in-your-organization/ba-p/1225302). -Legacy authentication is captured in the Microsoft Entra sign-in log as part of the detail of the event. You can use the Azure Monitor workbook to help with identifying legacy authentication usage. For more information, see [Sign-ins using legacy authentication](../reports-monitoring/howto-use-azure-monitor-workbooks.md), which is part of [How to use Azure Monitor Workbooks for Microsoft Entra reports](../reports-monitoring/howto-use-azure-monitor-workbooks.md). You can also use the Insecure protocols workbook for Microsoft Sentinel. For more information, see [Microsoft Sentinel Insecure Protocols Workbook Implementation Guide](https://techcommunity.microsoft.com/t5/azure-sentinel/azure-sentinel-insecure-protocols-workbook-implementation-guide/ba-p/1197564). Specific activities to monitor include: +Legacy authentication is captured in the Microsoft Entra sign-in log as part of the detail of the event. You can use the Azure Monitor workbook to help with identifying legacy authentication usage. For more information, see [Sign-ins using legacy authentication](../reports-monitoring/howto-use-workbooks.md), which is part of [How to use Azure Monitor Workbooks for Microsoft Entra reports](../reports-monitoring/howto-use-workbooks.md). You can also use the Insecure protocols workbook for Microsoft Sentinel. For more information, see [Microsoft Sentinel Insecure Protocols Workbook Implementation Guide](https://techcommunity.microsoft.com/t5/azure-sentinel/azure-sentinel-insecure-protocols-workbook-implementation-guide/ba-p/1197564). Specific activities to monitor include: | What to monitor| Risk level| Where| Filter/sub-filter| Notes | | - | - | - | - | - | |
active-directory | Security Operations Introduction | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/security-operations-introduction.md | For more information, see [What is Identity Protection](../identity-protection/o For the best results, we recommend that you monitor your domain controllers using Microsoft Defender for Identity. This approach enables the best detection and automation capabilities. Follow the guidance from these resources: * [Microsoft Defender for Identity architecture](/defender-for-identity/architecture)-* [Connect Microsoft Defender for Identity to Active Directory quickstart](/defender-for-identity/install-step2) +* [Connect Microsoft Defender for Identity to Active Directory quickstart](/defender-for-identity/directory-service-accounts) If you don't plan to use Microsoft Defender for Identity, monitor your domain controllers by one of these approaches: |
active-directory | Sync Directory | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/sync-directory.md | Explore the following resources to learn more about directory synchronization wi ## Next steps -* [What is hybrid identity with Microsoft Entra ID?](../../active-directory/hybrid/whatis-hybrid-identity.md) Microsoft's identity solutions span on-premises and cloud-based capabilities. Hybrid identity solutions create a common user identity for authentication and authorization to all resources, regardless of location. +* [What is hybrid identity with Microsoft Entra ID?](../hybrid/whatis-hybrid-identity.md) Microsoft's identity solutions span on-premises and cloud-based capabilities. Hybrid identity solutions create a common user identity for authentication and authorization to all resources, regardless of location. * [Install the Microsoft Entra Connect provisioning agent](../hybrid/cloud-sync/how-to-install.md) walks you through the installation process for the Microsoft Entra Connect provisioning agent and how to initially configure it in the Azure portal. * [Microsoft Entra Connect cloud sync new agent configuration](../hybrid/cloud-sync/how-to-configure.md) guides you through configuring Microsoft Entra Connect cloud sync. * [Microsoft Entra authentication and synchronization protocol overview](auth-sync-overview.md) describes integration with authentication and synchronization protocols. Authentication integrations enable you to use Microsoft Entra ID and its security and management features with little or no changes to your applications that use legacy authentication methods. Synchronization integrations enable you to sync user and group data to Microsoft Entra ID and then use Microsoft Entra management capabilities. Some sync patterns enable automated provisioning. |
active-directory | Sync Ldap | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/architecture/sync-ldap.md | Explore the following resources to learn more about LDAP synchronization with Mi ## Next steps -* [What is hybrid identity with Microsoft Entra ID?](../../active-directory/hybrid/whatis-hybrid-identity.md) Microsoft's identity solutions span on-premises and cloud-based capabilities. Hybrid identity solutions create a common user identity for authentication and authorization to all resources, regardless of location. +* [What is hybrid identity with Microsoft Entra ID?](../hybrid/whatis-hybrid-identity.md) Microsoft's identity solutions span on-premises and cloud-based capabilities. Hybrid identity solutions create a common user identity for authentication and authorization to all resources, regardless of location. * [Microsoft Entra authentication and synchronization protocol overview](auth-sync-overview.md) describes integration with authentication and synchronization protocols. Authentication integrations enable you to use Microsoft Entra ID and its security and management features with little or no changes to your applications that use legacy authentication methods. Synchronization integrations enable you to sync user and group data to Microsoft Entra ID and then use Microsoft Entra management capabilities. Some sync patterns enable automated provisioning. |
active-directory | Concept Registration Mfa Sspr Combined | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/concept-registration-mfa-sspr-combined.md | Users can access manage mode by going to [Security info](https://aka.ms/mysecuri ## Key usage scenarios ### Update a password in MySignIns (preview)-A user navigates to [Security info](https://aka.ms/mysecurityinfo). After signing in, the user can update their password. For more information about different authentication methods that you can require by using Conditional Access policies, see [How to secure the registration of security info](/azure/active-directory/conditional-access/howto-conditional-access-policy-registration). When finished, the user has the new password updated on the Security info page. +A user navigates to [Security info](https://aka.ms/mysecurityinfo). After signing in, the user can update their password. For more information about different authentication methods that you can require by using Conditional Access policies, see [How to secure the registration of security info](../conditional-access/howto-conditional-access-policy-registration.md). When finished, the user has the new password updated on the Security info page. ### Protect Security info registration with Conditional Access |
active-directory | Concept Sspr Policy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/concept-sspr-policy.md | This guidance applies to other providers, such as Intune and Microsoft 365, whic ### Set or check the password policies by using PowerShell -To get started, [download and install the Azure AD PowerShell module](/powershell/module/Azuread/) and [connect it to your Microsoft Entra tenant](/powershell/module/azuread/connect-azuread#examples). +To get started, [download and install the Azure AD PowerShell module](/powershell/module/azuread/) and [connect it to your Microsoft Entra tenant](/powershell/module/azuread/connect-azuread#examples). After the module is installed, use the following steps to complete each task as needed. |
active-directory | How To Mfa Server Migration Utility | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/how-to-mfa-server-migration-utility.md | A few important points: During the previous phases, you can remove users from the Staged Rollout folders to take them out of scope of Microsoft Entra multifactor authentication and route them back to your on-premises Azure MFA server for all MFA requests originating from Microsoft Entra ID. -**Phase 3** requires moving all clients that authenticate to the on-premises MFA Server (VPNs, password managers, and so on) to Microsoft Entra federation via SAML/OAUTH. If modern authentication standards aren't supported, you're required to stand up NPS server(s) with the Microsoft Entra multifactor authentication extension installed. Once dependencies are migrated, users should no longer use the User portal on the MFA Server, but rather should manage their authentication methods in Microsoft Entra ID ([aka.ms/mfasetup](https://aka.ms/mfasetup)). Once users begin managing their authentication data in Microsoft Entra ID, those methods won't be synced back to MFA Server. If you roll back to the on-premises MFA Server after users have made changes to their Authentication Methods in Microsoft Entra ID, those changes will be lost. After user migrations are complete, change the [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0#federatedidpmfabehavior-values&preserve-view=true) domain federation setting. The change tells Microsoft Entra ID to no longer perform MFA on-premises and to perform _all_ MFA requests with Microsoft Entra multifactor authentication, regardless of group membership. +**Phase 3** requires moving all clients that authenticate to the on-premises MFA Server (VPNs, password managers, and so on) to Microsoft Entra federation via SAML/OAUTH. If modern authentication standards aren't supported, you're required to stand up NPS server(s) with the Microsoft Entra multifactor authentication extension installed. Once dependencies are migrated, users should no longer use the User portal on the MFA Server, but rather should manage their authentication methods in Microsoft Entra ID ([aka.ms/mfasetup](https://aka.ms/mfasetup)). Once users begin managing their authentication data in Microsoft Entra ID, those methods won't be synced back to MFA Server. If you roll back to the on-premises MFA Server after users have made changes to their Authentication Methods in Microsoft Entra ID, those changes will be lost. After user migrations are complete, change the [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0&preserve-view=true#federatedidpmfabehavior-values) domain federation setting. The change tells Microsoft Entra ID to no longer perform MFA on-premises and to perform _all_ MFA requests with Microsoft Entra multifactor authentication, regardless of group membership. The following sections explain the migration steps in more detail. Using the data points you collected in [Authentication services](#authentication ### Update domain federation settings Once you've completed user migrations, and moved all of your [Authentication services](#authentication-services) off of MFA Server, it's time to update your domain federation settings. After the update, Microsoft Entra no longer sends MFA request to your on-premises federation server. 
-To configure Microsoft Entra ID to ignore MFA requests to your on-premises federation server, install the [Microsoft Graph PowerShell SDK](/powershell/microsoftgraph/installation?view=graph-powershell-1.0&preserve-view=true&viewFallbackFrom=graph-powershell-) and set [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0#federatedidpmfabehavior-values&preserve-view=true) to `rejectMfaByFederatedIdp`, as shown in the following example. +To configure Microsoft Entra ID to ignore MFA requests to your on-premises federation server, install the [Microsoft Graph PowerShell SDK](/powershell/microsoftgraph/installation?view=graph-powershell-1.0&preserve-view=true&viewFallbackFrom=graph-powershell-) and set [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0&preserve-view=true#federatedidpmfabehavior-values) to `rejectMfaByFederatedIdp`, as shown in the following example. #### Request <!-- { If the upgrade had issues, follow these steps to roll back: >Any changes since the backup was made will be lost, but should be minimal if backup was made right before upgrade and upgrade was unsuccessful. 1. Run the installer for your previous version (for example, 8.0.x.x).-1. Configure Microsoft Entra ID to accept MFA requests to your on-premises federation server. Use Graph PowerShell to set [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0#federatedidpmfabehavior-values&preserve-view=true) to `enforceMfaByFederatedIdp`, as shown in the following example. +1. Configure Microsoft Entra ID to accept MFA requests to your on-premises federation server. Use Graph PowerShell to set [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0&preserve-view=true#federatedidpmfabehavior-values) to `enforceMfaByFederatedIdp`, as shown in the following example. **Request** <!-- { |
active-directory | How To Migrate Mfa Server To Mfa With Federation | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/how-to-migrate-mfa-server-to-mfa-with-federation.md | This section covers final steps before migrating user MFA settings. ### Set federatedIdpMfaBehavior to enforceMfaByFederatedIdp -For federated domains, MFA may be enforced by Microsoft Entra Conditional Access or by the on-premises federation provider. Each federated domain has a Microsoft Graph PowerShell security setting named **federatedIdpMfaBehavior**. You can set **federatedIdpMfaBehavior** to `enforceMfaByFederatedIdp` so Microsoft Entra ID accepts MFA that's performed by the federated identity provider. If the federated identity provider didn't perform MFA, Microsoft Entra ID redirects the request to the federated identity provider to perform MFA. For more information, see [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta#federatedidpmfabehavior-values&preserve-view=true). +For federated domains, MFA may be enforced by Microsoft Entra Conditional Access or by the on-premises federation provider. Each federated domain has a Microsoft Graph PowerShell security setting named **federatedIdpMfaBehavior**. You can set **federatedIdpMfaBehavior** to `enforceMfaByFederatedIdp` so Microsoft Entra ID accepts MFA that's performed by the federated identity provider. If the federated identity provider didn't perform MFA, Microsoft Entra ID redirects the request to the federated identity provider to perform MFA. For more information, see [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta&preserve-view=true#federatedidpmfabehavior-values). >[!NOTE] > The **federatedIdpMfaBehavior** setting is a new version of the **SupportsMfa** property of the [New-MgDomainFederationConfiguration](/powershell/module/microsoft.graph.identity.directorymanagement/new-mgdomainfederationconfiguration) cmdlet. |
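Before changing the setting described in that row, it can help to confirm the current value. A small sketch, assuming the Microsoft Graph PowerShell SDK and a placeholder domain name:

```PowerShell
# Sketch: read the current federatedIdpMfaBehavior for a federated domain before updating it.
Connect-MgGraph -Scopes "Domain.Read.All"
Get-MgDomainFederationConfiguration -DomainId "contoso.com" |
    Select-Object DisplayName, FederatedIdpMfaBehavior
```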
active-directory | Howto Mfa Adfs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/howto-mfa-adfs.md | ->Set the domain setting [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta#federatedidpmfabehavior-values&preserve-view=true) to `enforceMfaByFederatedIdp` (recommended) or **SupportsMFA** to `$True`. The **federatedIdpMfaBehavior** setting overrides **SupportsMFA** when both are set. +>Set the domain setting [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta&preserve-view=true#federatedidpmfabehavior-values) to `enforceMfaByFederatedIdp` (recommended) or **SupportsMFA** to `$True`. The **federatedIdpMfaBehavior** setting overrides **SupportsMFA** when both are set. <a name='secure-azure-ad-resources-using-ad-fs'></a> |
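For the legacy **SupportsMFA** alternative mentioned in that note, here is a hedged sketch using the MSOnline module. The domain name is a placeholder, and **federatedIdpMfaBehavior** set through Microsoft Graph overrides this value when both are present.

```PowerShell
# Sketch only: set the legacy SupportsMfa flag with the MSOnline module.
Connect-MsolService
Set-MsolDomainFederationSettings -DomainName "contoso.com" -SupportsMfa $true

# Confirm the value.
Get-MsolDomainFederationSettings -DomainName "contoso.com" | Select-Object SupportsMfa
```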
active-directory | Howto Mfa Mfasettings | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/howto-mfa-mfasettings.md | When a user reports a MFA prompt as suspicious, the event shows up in the Sign-i - To view fraud reports in the Audit logs, select **Identity** > **Monitoring & health** > **Audit logs**. The fraud report appears under Activity type Fraud reported - user is blocked for MFA or Fraud reported - no action taken based on the tenant-level settings for fraud report. +>[!NOTE] +>A user is not reported as High Risk if they perform passwordless authentication. + ### Manage suspicious activity events Once a user has reported a prompt as suspicious, the risk should be investigated and remediated with [Identity Protection](../identity-protection/howto-identity-protection-remediate-unblock.md). To enable trusted IPs by using Conditional Access policies, complete the followi `c:[Type== "https://schemas.microsoft.com/ws/2012/01/insidecorporatenetwork"] => issue(claim = c);` + >[!NOTE] + >The **Skip multi-factor authentication for requests from federated users on my intranet** option will affect the Conditional Access evaluation for locations. + * **For requests from a specific range of public IPs**: To choose this option, enter the IP addresses in the text box, in CIDR notation. * For IP addresses that are in the range *xxx.xxx.xxx*.1 through *xxx.xxx.xxx*.254, use notation like ***xxx.xxx.xxx*.0/24**. * For a single IP address, use notation like ***xxx.xxx.xxx.xxx*/32**. |
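As a rough illustration of the CIDR-based trusted IP ranges described in that row, the following sketch creates a trusted named location for Conditional Access with the Microsoft Graph PowerShell SDK. The cmdlet usage, scopes, and IP ranges are assumptions for the example, not text from the article.

```PowerShell
# Sketch: define a trusted named location using the CIDR notation described above.
Connect-MgGraph -Scopes "Policy.ReadWrite.ConditionalAccess"

$params = @{
    "@odata.type" = "#microsoft.graph.ipNamedLocation"
    displayName   = "Corporate egress IPs (example)"
    isTrusted     = $true
    ipRanges      = @(
        @{ "@odata.type" = "#microsoft.graph.iPv4CidrRange"; cidrAddress = "203.0.113.0/24" }   # a /24 range
        @{ "@odata.type" = "#microsoft.graph.iPv4CidrRange"; cidrAddress = "198.51.100.10/32" } # a single address
    )
}
New-MgIdentityConditionalAccessNamedLocation -BodyParameter $params
```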
active-directory | Howto Mfa Userstates | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/howto-mfa-userstates.md | After you enable users, notify them via email. Tell the users that a prompt is d If your users were enabled using per-user enabled and enforced Microsoft Entra multifactor authentication the following PowerShell can assist you in making the conversion to Conditional Access based Microsoft Entra multifactor authentication. -Run this PowerShell in an ISE window or save as a `.PS1` file to run locally. The operation can only be done by using the [MSOnline module](/powershell/module/msonline#msonline). +Run this PowerShell in an ISE window or save as a `.PS1` file to run locally. The operation can only be done by using the [MSOnline module](/powershell/module/msonline/#msonline). ```PowerShell # Connect to tenant |
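The row above truncates the article's script. The following is only a minimal sketch of the same idea (clear the per-user MFA requirement so Conditional Access takes over), assuming the legacy MSOnline module:

```PowerShell
# Minimal sketch only - the article's full script handles more cases.
Connect-MsolService

# Find users that are per-user Enabled or Enforced and clear that state,
# leaving MFA to be enforced by Conditional Access policies instead.
Get-MsolUser -All |
    Where-Object { $_.StrongAuthenticationRequirements.Count -gt 0 } |
    ForEach-Object {
        Set-MsolUser -UserPrincipalName $_.UserPrincipalName -StrongAuthenticationRequirements @()
    }
```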
active-directory | Howto Registration Mfa Sspr Combined Troubleshoot | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/howto-registration-mfa-sspr-combined-troubleshoot.md | The following table lists all audit events generated by combined registration: | Symptom | Troubleshooting steps | | | |-| I'm not seeing the methods I expected to see. | 1. Check if the user has a Microsoft Entra admin role. If yes, view the SSPR admin policy differences. <br> 2. Determine whether the user is being interrupted because of multifactor authentication registration enforcement or SSPR registration enforcement. See the [flowchart](../../active-directory/authentication/concept-registration-mfa-sspr-combined.md#combined-registration-modes) under "Combined registration modes" to determine which methods should be shown. <br> 3. Determine how recently the multifactor authentication or SSPR policy was changed. If the change was recent, it might take some time for the updated policy to propagate.| +| I'm not seeing the methods I expected to see. | 1. Check if the user has a Microsoft Entra admin role. If yes, view the SSPR admin policy differences. <br> 2. Determine whether the user is being interrupted because of multifactor authentication registration enforcement or SSPR registration enforcement. See the [flowchart](../authentication/concept-registration-mfa-sspr-combined.md#combined-registration-modes) under "Combined registration modes" to determine which methods should be shown. <br> 3. Determine how recently the multifactor authentication or SSPR policy was changed. If the change was recent, it might take some time for the updated policy to propagate.| ## Troubleshooting manage mode |
active-directory | Troubleshoot Sspr | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/authentication/troubleshoot-sspr.md | Use the following information to understand the problem and what needs to be cor ## Microsoft Entra forums -If you have general questions about Microsoft Entra ID and self-service password reset, you can ask the community for assistance on the [Microsoft Q&A question page for Microsoft Entra ID](/answers/topics/azure-active-directory.html). Members of the community include engineers, product managers, MVPs, and fellow IT professionals. +If you have general questions about Microsoft Entra ID and self-service password reset, you can ask the community for assistance on the [Microsoft Q&A question page for Microsoft Entra ID](/answers/tags/455/entra-id). Members of the community include engineers, product managers, MVPs, and fellow IT professionals. ## Contact Microsoft support |
active-directory | Howto Get Appsource Certified | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/azuread-dev/howto-get-appsource-certified.md | For more information about the AppSource trial experience, see [this video](http ## Get support -For Azure AD integration, we use [Microsoft Q&A](/answers/products/) with the community to provide support. +For Azure AD integration, we use [Microsoft Q&A](/answers/) with the community to provide support. We highly recommend you ask your questions on Microsoft Q&A first and browse existing issues to see if someone has asked your question before. Make sure that your questions or comments are tagged with [`[azure-active-directory]`](/answers/topics/azure-active-directory.html). Use the following comments section to provide feedback and help us refine and sh [AAD-Dev-Guide]: v1-overview.md [AAD-QuickStart-Web-Apps]: v1-overview.md#get-started -<!--Image references--> +<!--Image references--> |
active-directory | Block Legacy Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/block-legacy-authentication.md | The following messaging protocols support legacy authentication: - Authenticated SMTP - Used to send authenticated email messages. - Autodiscover - Used by Outlook and EAS clients to find and connect to mailboxes in Exchange Online. - Exchange ActiveSync (EAS) - Used to connect to mailboxes in Exchange Online.-- Exchange Online PowerShell - Used to connect to Exchange Online with remote PowerShell. If you block Basic authentication for Exchange Online PowerShell, you need to use the Exchange Online PowerShell Module to connect. For instructions, see [Connect to Exchange Online PowerShell using multifactor authentication](/powershell/exchange/exchange-online/connect-to-exchange-online-powershell/mfa-connect-to-exchange-online-powershell).+- Exchange Online PowerShell - Used to connect to Exchange Online with remote PowerShell. If you block Basic authentication for Exchange Online PowerShell, you need to use the Exchange Online PowerShell Module to connect. For instructions, see [Connect to Exchange Online PowerShell using multifactor authentication](/powershell/exchange/connect-to-exchange-online-powershell). - Exchange Web Services (EWS) - A programming interface that's used by Outlook, Outlook for Mac, and third-party apps. - IMAP4 - Used by IMAP email clients. - MAPI over HTTP (MAPI/HTTP) - Primary mailbox access protocol used by Outlook 2010 SP2 and later. Before you can block legacy authentication in your directory, you need to first 1. Browse to **Identity** > **Monitoring & health** > **Sign-in logs**. 1. Add the **Client App** column if it isn't shown by clicking on **Columns** > **Client App**. 1. Select **Add filters** > **Client App** > choose all of the legacy authentication protocols and select **Apply**.-1. If you've activated the [new sign-in activity reports preview](../reports-monitoring/concept-all-sign-ins.md), repeat the above steps also on the **User sign-ins (non-interactive)** tab. +1. If you've activated the [new sign-in activity reports preview](../reports-monitoring/concept-sign-ins.md), repeat the above steps also on the **User sign-ins (non-interactive)** tab. Filtering shows you sign-in attempts made by legacy authentication protocols. Clicking on each individual sign-in attempt shows you more details. The **Client App** field under the **Basic Info** tab indicates which legacy authentication protocol was used. 
You can select all available grant controls for the **Other clients** condition; - [Determine effect using Conditional Access report-only mode](howto-conditional-access-insights-reporting.md) - If you aren't familiar with configuring Conditional Access policies yet, see [require MFA for specific apps with Microsoft Entra Conditional Access](../authentication/tutorial-enable-azure-mfa.md) for an example.-- For more information about modern authentication support, see [How modern authentication works for Office client apps](/office365/enterprise/modern-auth-for-office-2013-and-2016) +- For more information about modern authentication support, see [How modern authentication works for Office client apps](/microsoft-365/enterprise/modern-auth-for-office-2013-and-2016) - [How to set up a multifunction device or application to send email using Microsoft 365](/exchange/mail-flow-best-practices/how-to-set-up-a-multifunction-device-or-application-to-send-email-using-microsoft-365-or-office-365) - [Enable modern authentication in Exchange Online](/exchange/clients-and-mobile-in-exchange-online/enable-or-disable-modern-authentication-in-exchange-online)-- [Enable Modern Authentication for Office 2013 on Windows devices](/office365/admin/security-and-compliance/enable-modern-authentication)-- [How to configure Exchange Server on-premises to use Hybrid Modern Authentication](/office365/enterprise/configure-exchange-server-for-hybrid-modern-authentication)-- [How to use Modern Authentication with Skype for Business](/skypeforbusiness/manage/authentication/use-adal)+- [Enable Modern Authentication for Office 2013 on Windows devices](/microsoft-365/admin/) +- [How to configure Exchange Server on-premises to use Hybrid Modern Authentication](/microsoft-365/enterprise/configure-exchange-server-for-hybrid-modern-authentication) +- [How to use Modern Authentication with Skype for Business](/microsoft-365/enterprise/hybrid-modern-auth-overview) |
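The sign-in log filtering by **Client App** described in this row can also be done from the command line. A sketch, assuming the Microsoft Graph PowerShell SDK reporting cmdlets and example protocol names:

```PowerShell
# Sketch: list recent sign-ins that used legacy authentication protocols.
Connect-MgGraph -Scopes "AuditLog.Read.All", "Directory.Read.All"

$legacyProtocols = "Exchange ActiveSync", "IMAP4", "POP3", "Authenticated SMTP", "Other clients"
$filter = ($legacyProtocols | ForEach-Object { "clientAppUsed eq '$_'" }) -join " or "

Get-MgAuditLogSignIn -Filter $filter -Top 100 |
    Select-Object CreatedDateTime, UserPrincipalName, ClientAppUsed, AppDisplayName
```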
active-directory | Concept Conditional Access Conditions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/concept-conditional-access-conditions.md | This setting has an effect on access attempts made from the following mobile app | Mail/Calendar/People app, Outlook 2016, Outlook 2013 (with modern authentication)| Exchange Online | Windows 10 | | MFA and location policy for apps. Device-based policies aren't supported.| Any My Apps app service | Android and iOS | | Microsoft Teams Services - this client app controls all services that support Microsoft Teams and all its Client Apps - Windows Desktop, iOS, Android, WP, and web client | Microsoft Teams | Windows 10, Windows 8.1, Windows 7, iOS, Android, and macOS |-| Office 2016 apps, Office 2013 (with modern authentication), [OneDrive sync client](/onedrive/enable-conditional-access) | SharePoint | Windows 8.1, Windows 7 | 
+| Office 2016 apps, Office 2013 (with modern authentication), [OneDrive sync client](/sharepoint/enable-conditional-access) | SharePoint | Windows 8.1, Windows 7 | | Office 2016 apps, Universal Office apps, Office 2013 (with modern authentication), [OneDrive sync client](/sharepoint/enable-conditional-access) | SharePoint Online | Windows 10 | | Office 2016 (Word, Excel, PowerPoint, OneNote only). | SharePoint | macOS | | Office 2019| SharePoint | Windows 10, macOS | |
active-directory | Concept Conditional Access Grant | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/concept-conditional-access-grant.md | Administrators can choose to enforce one or more controls when granting access. - [Require multifactor authentication (Microsoft Entra multifactor authentication)](../authentication/concept-mfa-howitworks.md) - [Require authentication strength](#require-authentication-strength)-- [Require device to be marked as compliant (Microsoft Intune)](/intune/protect/device-compliance-get-started)+- [Require device to be marked as compliant (Microsoft Intune)](/mem/intune/protect/device-compliance-get-started) - [Require Microsoft Entra hybrid joined device](../devices/concept-hybrid-join.md) - [Require approved client app](./howto-policy-approved-app-or-app-protection.md) - [Require app protection policy](./howto-policy-approved-app-or-app-protection.md) The **Require Microsoft Entra hybrid joined device** control: ### Require approved client app -Organizations can require that an approved client app is used to access selected cloud apps. These approved client apps support [Intune app protection policies](/intune/app-protection-policy) independent of any mobile device management solution. +Organizations can require that an approved client app is used to access selected cloud apps. These approved client apps support [Intune app protection policies](/mem/intune/apps/app-protection-policy) independent of any mobile device management solution. To apply this grant control, the device must be registered in Microsoft Entra ID, which requires using a broker app. The broker app can be Microsoft Authenticator for iOS, or either Microsoft Authenticator or Microsoft Company Portal for Android devices. If a broker app isn't installed on the device when the user attempts to authenticate, the user is redirected to the appropriate app store to install the required broker app. |
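To make the grant controls listed in that row concrete, here is a hedged sketch that creates a report-only Conditional Access policy requiring either a compliant device or MFA. The policy name, scope, and state are assumptions for the example, not part of the article.

```PowerShell
# Sketch: report-only policy requiring a compliant device OR multifactor authentication.
Connect-MgGraph -Scopes "Policy.ReadWrite.ConditionalAccess"

$policy = @{
    displayName = "Require compliant device or MFA (example)"
    state       = "enabledForReportingButNotEnforced"   # report-only while you evaluate impact
    conditions  = @{
        users        = @{ includeUsers = @("All") }
        applications = @{ includeApplications = @("All") }
    }
    grantControls = @{
        operator        = "OR"
        builtInControls = @("compliantDevice", "mfa")
    }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy
```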
active-directory | Concept Conditional Access Session | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/concept-conditional-access-session.md | Organizations can use this control to require Microsoft Entra ID to pass device For more information on the use and configuration of app-enforced restrictions, see the following articles: - [Enabling limited access with SharePoint Online](/sharepoint/control-access-from-unmanaged-devices)-- [Enabling limited access with Exchange Online](/microsoft-365/security/office-365-security/secure-email-recommended-policies?view=o365-worldwide#limit-access-to-exchange-online-from-outlook-on-the-web&preserve-view=true)+- [Enabling limited access with Exchange Online](/microsoft-365/security/office-365-security/secure-email-recommended-policies?view=o365-worldwide&preserve-view=true#limit-access-to-exchange-online-from-outlook-on-the-web) ## Conditional Access application control |
active-directory | Howto Conditional Access Policy Compliant Device | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/howto-conditional-access-policy-compliant-device.md | Organizations that use the [Subscription Activation](/windows/deployment/windows [Use report-only mode for Conditional Access to determine the results of new policy decisions.](concept-conditional-access-report-only.md) -[Device compliance policies work with Microsoft Entra ID](/intune/device-compliance-get-started#device-compliance-policies-work-with-azure-ad) +[Device compliance policies work with Microsoft Entra ID](/mem/intune/protect/device-compliance-get-started#device-compliance-policies-work-with-azure-ad) |
active-directory | What If Tool | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/conditional-access/what-if-tool.md | You start an evaluation by clicking **What If**. The evaluation result provides - Policies that will apply to your user or workload identity. - Policies that don't apply to your user or workload identity. -If [classic policies](policy-migration.md#classic-policies) exist for the selected cloud apps, an indicator is presented to you. By clicking the indicator, you're redirected to the classic policies page. On the classic policies page, you can migrate a classic policy or just disable it. You can return to your evaluation result by closing this page. +If [classic policies](./policy-migration-mfa.md) exist for the selected cloud apps, an indicator is presented to you. By clicking the indicator, you're redirected to the classic policies page. On the classic policies page, you can migrate a classic policy or just disable it. You can return to your evaluation result by closing this page. :::image type="content" source="media/what-if-tool/conditional-access-what-if-evaluation-result-example.png" alt-text="Screenshot of an example of the policy evaluation in the What If tool showing policies that would apply." lightbox="media/what-if-tool/conditional-access-what-if-evaluation-result-example.png"::: |
active-directory | Authentication Flows App Scenarios | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/authentication-flows-app-scenarios.md | Though we don't recommend that you use it, the [username/password flow](scenario Using the username/password flow constrains your applications. For instance, applications can't sign in a user who needs to use multifactor authentication or the Conditional Access tool in Microsoft Entra ID. Your applications also don't benefit from single sign-on. Authentication with the username/password flow goes against the principles of modern authentication and is provided only for legacy reasons. -In desktop apps, if you want the token cache to persist, you can customize the [token cache serialization](msal-net-token-cache-serialization.md). By implementing dual token cache serialization, you can use backward-compatible and forward-compatible token caches. +In desktop apps, if you want the token cache to persist, you can customize the [token cache serialization](/entra/msal/dotnet/how-to/token-cache-serialization). By implementing dual token cache serialization, you can use backward-compatible and forward-compatible token caches. For more information, see [Desktop app that calls web APIs](scenario-desktop-overview.md). Some scenarios, like those that involve Conditional Access related to a device I For more information, see [Mobile app that calls web APIs](scenario-mobile-overview.md). > [!NOTE]-> A mobile app that uses MSAL.iOS, MSAL.Android, or MSAL.NET on Xamarin can have app protection policies applied to it. For instance, the policies might prevent a user from copying protected text. The mobile app is managed by Intune and is recognized by Intune as a managed app. For more information, see [Microsoft Intune App SDK overview](/intune/app-sdk). +> A mobile app that uses MSAL.iOS, MSAL.Android, or MSAL.NET on Xamarin can have app protection policies applied to it. For instance, the policies might prevent a user from copying protected text. The mobile app is managed by Intune and is recognized by Intune as a managed app. For more information, see [Microsoft Intune App SDK overview](/mem/intune/developer/app-sdk). >-> The [Intune App SDK](/intune/app-sdk-get-started) is separate from MSAL libraries and interacts with Microsoft Entra ID on its own. +> The [Intune App SDK](/mem/intune/developer/app-sdk-get-started) is separate from MSAL libraries and interacts with Microsoft Entra ID on its own. ### Protected web API For more information about authentication, see: - [Authentication vs. authorization.](./authentication-vs-authorization.md) - [Microsoft identity platform access tokens.](access-tokens.md)-- [Securing access to IoT apps.](/azure/architecture/example-scenario/iot-aad/iot-aad#security)+- [Securing access to IoT apps.](/azure/architecture/reference-architectures/iot#security) |
active-directory | Certificate Credentials | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/certificate-credentials.md | If you're interested in using a JWT issued by another identity provider as a cre ## Assertion format -To compute the assertion, you can use one of the many JWT libraries in the language of your choice - [MSAL supports this using `.WithCertificate()`](msal-net-client-assertions.md). The information is carried by the token in its **Header**, **Claims**, and **Signature**. +To compute the assertion, you can use one of the many JWT libraries in the language of your choice - [MSAL supports this using `.WithCertificate()`](/entra/msal/dotnet/acquiring-tokens/msal-net-client-assertions). The information is carried by the token in its **Header**, **Claims**, and **Signature**. ### Header Client assertions can be used anywhere a client secret would be used. For exampl ## Next steps -The [MSAL.NET library handles this scenario](msal-net-client-assertions.md) in a single line of code. +The [MSAL.NET library handles this scenario](/entra/msal/dotnet/acquiring-tokens/web-apps-apis/confidential-client-assertions) in a single line of code. The [.NET Core daemon console application using Microsoft identity platform](https://github.com/Azure-Samples/active-directory-dotnetcore-daemon-v2) code sample on GitHub shows how an application uses its own credentials for authentication. It also shows how you can [create a self-signed certificate](https://github.com/Azure-Samples/active-directory-dotnetcore-daemon-v2/tree/master/1-Call-MSGraph#optional-use-the-automation-script) using the `New-SelfSignedCertificate` PowerShell cmdlet. You can also use the [app creation scripts](https://github.com/Azure-Samples/active-directory-dotnetcore-daemon-v2/blob/master/1-Call-MSGraph/AppCreationScripts/AppCreationScripts.md) in the sample repo to create certificates, compute the thumbprint, and so on. |
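Since that row points to the `New-SelfSignedCertificate` approach, here is a small sketch of creating a certificate whose public key can be uploaded as the app's credential. The subject name and file path are placeholders.

```PowerShell
# Sketch: create a self-signed certificate for client-assertion authentication.
$cert = New-SelfSignedCertificate -Subject "CN=DaemonConsoleCert" `
    -CertStoreLocation "Cert:\CurrentUser\My" `
    -KeyExportPolicy Exportable `
    -KeySpec Signature `
    -KeyLength 2048 `
    -HashAlgorithm SHA256

# Export only the public key (.cer) for upload to the app registration;
# the private key stays in the current user's certificate store.
Export-Certificate -Cert $cert -FilePath ".\DaemonConsoleCert.cer"
```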
active-directory | Developer Guide Conditional Access Authentication Context | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/developer-guide-conditional-access-authentication-context.md | The table below will show all corner cases where ACRS is added to the token's cl ## Next steps - [Granular Conditional Access for sensitive data and actions (Blog)](https://techcommunity.microsoft.com/t5/azure-active-directory-identity/granular-conditional-access-for-sensitive-data-and-actions/ba-p/1751775)-- [Zero trust with the Microsoft identity platform](/security/zero-trust/identity-developer)+- [Zero trust with the Microsoft identity platform](/security/zero-trust/develop/identity) - [Building Zero Trust ready apps with the Microsoft identity platform](/security/zero-trust/develop/identity) - [Conditional Access authentication context](../conditional-access/concept-conditional-access-cloud-apps.md#authentication-context) - [authenticationContextClassReference resource type - MS Graph](/graph/api/conditionalaccessroot-list-authenticationcontextclassreferences) |
active-directory | Howto Convert App To Be Multi Tenant | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/howto-convert-app-to-be-multi-tenant.md | To learn more about making API calls to Microsoft Entra ID and Microsoft 365 ser [AAD-App-SP-Objects]:app-objects-and-service-principals.md [AAD-Auth-Scenarios]:./authentication-vs-authorization.md [AAD-Consent-Overview]:./application-consent-experience.md-[AAD-Dev-Guide]:azure-ad-developers-guide.md +[AAD-Dev-Guide]:./index.yml [AAD-Integrating-Apps]:./quickstart-register-app.md [AAD-Samples-MT]: /samples/browse/?products=azure-active-directory [AAD-Why-To-Integrate]: ./how-to-integrate.md |
active-directory | Howto Create Service Principal Portal | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/howto-create-service-principal-portal.md | To configure access policies: ## Next steps -- Learn how to use [Azure PowerShell](howto-authenticate-service-principal-powershell.md) or [Azure CLI](/cli/azure/create-an-azure-service-principal-azure-cli) to create a service principal.+- Learn how to use [Azure PowerShell](howto-authenticate-service-principal-powershell.md) or [Azure CLI](/cli/azure/azure-cli-sp-tutorial-1) to create a service principal. - To learn about specifying security policies, see [Azure role-based access control (Azure RBAC)](/azure/role-based-access-control/role-assignments-portal). - For a list of available actions that can be granted or denied to users, see [Azure Resource Manager Resource Provider operations](/azure/role-based-access-control/resource-provider-operations). - For information about working with app registrations by using **Microsoft Graph**, see the [Applications](/graph/api/resources/application) API reference. |
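As a command-line counterpart to the portal steps in that row, a sketch using Az PowerShell. The display name, role, and resource group are placeholders.

```PowerShell
# Sketch: create a service principal and grant it Reader on a resource group.
Connect-AzAccount
$sp = New-AzADServicePrincipal -DisplayName "example-automation-sp"

New-AzRoleAssignment -ApplicationId $sp.AppId `
    -RoleDefinitionName "Reader" `
    -ResourceGroupName "example-rg"
```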
active-directory | Identity Platform Integration Checklist | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/identity-platform-integration-checklist.md | Use the following checklist to ensure that your application is effectively integ ## Branding -![checkbox](./media/integration-checklist/checkbox-two.svg) Adhere to the [Branding guidelines for applications](/azure/active-directory/develop/howto-add-branding-in-apps). +![checkbox](./medi). ![checkbox](./medi). Make sure your name and logo are representative of your company/product so that users can make informed decisions. Ensure that you're not violating any trademarks. Use the following checklist to ensure that your application is effectively integ ![checkbox](./medi). If you must hand-code for the authentication protocols, you should follow the [Microsoft SDL](https://www.microsoft.com/sdl/default.aspx) or similar development methodology. Pay close attention to the security considerations in the standards specifications for each protocol. -![checkbox](./medi) apps. +![checkbox](./medi) apps. ![checkbox](./media/integration-checklist/checkbox-two.svg) For mobile apps, configure each platform using the application registration experience. In order for your application to take advantage of the Microsoft Authenticator or Microsoft Company Portal for single sign-in, your app needs a "broker redirect URI" configured. This allows Microsoft to return control to your application after authentication. When configuring each platform, the app registration experience will guide you through the process. Use the quickstart to download a working example. On iOS, use brokers and system webview whenever possible. -![checkbox](./medi). +![checkbox](./media/integration-checklist/checkbox-two.svg) In web apps or web APIs, keep one token cache per account. For web apps, the token cache should be keyed by the account ID. For web APIs, the account should be keyed by the hash of the token used to call the API. MSAL.NET provides custom token cache serialization in the .NET Framework and .NET Core subplatforms. For security and performance reasons, our recommendation is to serialize one cache per user. For more information, read about [token cache serialization](/entra/msal/dotnet/how-to/token-cache-serialization). ![checkbox](./media/integration-checklist/checkbox-two.svg) If the data your app requires is available through [Microsoft Graph](https://developer.microsoft.com/graph), request permissions for this data using the Microsoft Graph endpoint rather than the individual API. |
active-directory | Jwt Claims Customization | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/jwt-claims-customization.md | You can use the following functions to transform claims. | **ExtractNumeric() - Suffix** | Returns the suffix numerical part of the string.<br/>For example, if the input's value is `BSimon_123`, then it returns `123`. | | **IfEmpty()** | Outputs an attribute or constant if the input is null or empty.<br/>For example, if you want to output an attribute stored in an extension attribute if the employee ID for a given user is empty. To perform this function, configure the following values:<br/>Parameter 1(input): user.employeeid<br/>Parameter 2 (output): user.extensionattribute1<br/>Parameter 3 (output if there's no match): user.employeeid | | **IfNotEmpty()** | Outputs an attribute or constant if the input isn't null or empty.<br/>For example, if you want to output an attribute stored in an extension attribute if the employee ID for a given user isn't empty. To perform this function, you configure the following values:<br/>Parameter 1(input): user.employeeid<br/>Parameter 2 (output): user.extensionattribute1 |-| **Substring() - Fixed Length** (Preview)| Extracts parts of a string claim type, beginning at the character at the specified position, and returns the specified number of characters.<br/>SourceClaim - The claim source of the transform that should be executed.<br/>StartIndex - The zero-based starting character position of a substring in this instance.<br/>Length - The length in characters of the substring.<br/>For example:<br/>sourceClaim - PleaseExtractThisNow<br/>StartIndex - 6<br/>Length - 11<br/>Output: ExtractThis | -| **Substring() - EndOfString** (Preview) | Extracts parts of a string claim type, beginning at the character at the specified position, and returns the rest of the claim from the specified start index. <br/>SourceClaim - The claim source of the transform.<br/>StartIndex - The zero-based starting character position of a substring in this instance.<br/>For example:<br/>sourceClaim - PleaseExtractThisNow<br/>StartIndex - 6<br/>Output: ExtractThisNow | -| **RegexReplace()** (Preview) | RegexReplace() transformation accepts as input parameters:<br/>- Parameter 1: a user attribute as regex input<br/>- An option to trust the source as multivalued<br/>- Regex pattern<br/>- Replacement pattern. The replacement pattern may contain static text format along with a reference that points to regex output groups and more input parameters. | +| **Substring() - Fixed Length** | Extracts parts of a string claim type, beginning at the character at the specified position, and returns the specified number of characters.<br/>SourceClaim - The claim source of the transform that should be executed.<br/>StartIndex - The zero-based starting character position of a substring in this instance.<br/>Length - The length in characters of the substring.<br/>For example:<br/>sourceClaim - PleaseExtractThisNow<br/>StartIndex - 6<br/>Length - 11<br/>Output: ExtractThis | +| **Substring() - EndOfString** | Extracts parts of a string claim type, beginning at the character at the specified position, and returns the rest of the claim from the specified start index. 
<br/>SourceClaim - The claim source of the transform.<br/>StartIndex - The zero-based starting character position of a substring in this instance.<br/>For example:<br/>sourceClaim - PleaseExtractThisNow<br/>StartIndex - 6<br/>Output: ExtractThisNow | +| **RegexReplace()** | RegexReplace() transformation accepts as input parameters:<br/>- Parameter 1: a user attribute as regex input<br/>- An option to trust the source as multivalued<br/>- Regex pattern<br/>- Replacement pattern. The replacement pattern may contain static text format along with a reference that points to regex output groups and more input parameters. | If you need other transformations, submit your idea in the [feedback forum in Microsoft Entra ID](https://feedback.azure.com/d365community/forum/22920db1-ad25-ec11-b6e6-000d3a4f0789) under the *SaaS application* category. Authorization: Bearer {token} ``` ## Configure a custom signing key using PowerShell-Use PowerShell to [instantiate an MSAL Public Client Application](msal-net-initializing-client-applications.md#initializing-a-public-client-application-from-code) and use the [Authorization Code Grant](v2-oauth2-auth-code-flow.md) flow to obtain a delegated permission access token for Microsoft Graph. Use the access token to call Microsoft Graph and configure a custom signing key for the service principal. After you configure the custom signing key, your application code needs to [validate the token signing key](#validate-token-signing-key). +Use PowerShell to [instantiate an MSAL Public Client Application](/entr) flow to obtain a delegated permission access token for Microsoft Graph. Use the access token to call Microsoft Graph and configure a custom signing key for the service principal. After you configure the custom signing key, your application code needs to [validate the token signing key](#validate-token-signing-key). To run this script, you need: |
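The truncated row above describes acquiring a delegated Microsoft Graph access token from a public client application in PowerShell. The following is one hedged illustration only: it assumes the community MSAL.PS module and placeholder IDs, whereas the article's script loads the MSAL assemblies directly.

```PowerShell
# Sketch: acquire a delegated Microsoft Graph access token interactively.
# MSAL.PS is a community module and an assumption of this example, not an article requirement.
Install-Module MSAL.PS -Scope CurrentUser

# The tenant ID and client ID below are placeholders for your own values.
$token = Get-MsalToken -TenantId "<tenant-id>" -ClientId "<public-client-app-id>" `
    -Scopes "Application.ReadWrite.All" -Interactive

# The resulting bearer token is then used to call Microsoft Graph, for example:
$headers = @{ Authorization = "Bearer $($token.AccessToken)" }
Invoke-RestMethod -Headers $headers -Uri "https://graph.microsoft.com/v1.0/servicePrincipals?`$top=1"
```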
active-directory | Msal Acquire Cache Tokens | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-acquire-cache-tokens.md | When your client requests an access token, Microsoft Entra ID also returns an au ## Next steps Several of the platforms supported by MSAL have additional token cache-related information in the documentation for that platform's library. For example:-- [Get a token from the token cache using MSAL.NET](msal-net-acquire-token-silently.md)+- [Get a token from the token cache using MSAL.NET](/entra/msal/dotnet/acquiring-tokens/acquire-token-silently) - [Single sign-on with MSAL.js](msal-js-sso.md) - [Custom token cache serialization in MSAL for Python](/entra/msal/python/advanced/msal-python-token-cache-serialization) - [Custom token cache serialization in MSAL for Java](/entra/msal/java/advanced/msal-java-token-cache-serialization) |
active-directory | Msal Client Application Configuration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-client-application-configuration.md | The authority you specify in your code needs to be consistent with the **Support The authority can be: - A Microsoft Entra cloud authority.-- An Azure AD B2C authority. See [B2C specifics](msal-net-b2c-considerations.md).-- An Active Directory Federation Services (AD FS) authority. See [AD FS support](msal-net-adfs-support.md).+- An Azure AD B2C authority. See [B2C specifics](/entra/msal/dotnet/acquiring-tokens/desktop-mobile/social-identities). +- An Active Directory Federation Services (AD FS) authority. See [AD FS support](/entra/msal/dotnet/acquiring-tokens/desktop-mobile/adfs-support). Microsoft Entra cloud authorities have two parts: To help in debugging and authentication failure troubleshooting scenarios, the M :::row::: :::column:::- - [Logging in MSAL.NET](msal-logging-dotnet.md) + - [Logging in MSAL.NET](/entra/msal/dotnet/advanced/exceptions/msal-logging) - [Logging in MSAL for Android](msal-logging-android.md) - [Logging in MSAL.js](msal-logging-js.md) :::column-end::: To help in debugging and authentication failure troubleshooting scenarios, the M ## Next steps -Learn about [instantiating client applications by using MSAL.NET](msal-net-initializing-client-applications.md) and [instantiating client applications by using MSAL.js](msal-js-initializing-client-applications.md). +Learn about [instantiating client applications by using MSAL.NET](/entr). |
active-directory | Msal Client Applications | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-client-applications.md | In MSAL, the client ID, also called the _application ID_ or _app ID_, is passed For more information about application configuration and instantiating, see: - [Client application configuration options](msal-client-application-configuration.md)-- [Instantiating client applications by using MSAL.NET](msal-net-initializing-client-applications.md)+- [Instantiating client applications by using MSAL.NET](/entra/msal/dotnet/getting-started/initializing-client-applications) - [Instantiating client applications by using MSAL.js](msal-js-initializing-client-applications.md) |
active-directory | Msal Migration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-migration.md | ADAL to MSAL migration guide for different platforms are available in the follow - [Migrate to MSAL iOS and macOS](migrate-objc-adal-msal.md) - [Migrate to MSAL Java](/entra/msal/java/advanced/migrate-adal-msal-java) - [Migrate to MSAL.js](msal-compare-msal-js-and-adal-js.md)-- [Migrate to MSAL .NET](msal-net-migration.md)+- [Migrate to MSAL .NET](/entra/msal/dotnet/how-to/msal-net-migration) - [Migrate to MSAL Node](msal-node-migration.md) - [Migrate to MSAL Python](/entra/msal/python/advanced/migrate-python-adal-msal) |
active-directory | Msal Net System Browser Android Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-net-system-browser-android-considerations.md | If authentication fails (for example, if authentication launches with DuckDuckGo - **Mitigation**: Ask the user to enable a browser on their device. Recommend a browser that supports custom tabs. ## Next steps-For more information and code examples, see [Choosing between an embedded web browser and a system browser on Xamarin Android](https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/wiki/MSAL.NET-uses-web-browser#choosing-between-embedded-web-browser-or-system-browser-on-xamarinandroid) and [Embedded versus system web UI](msal-net-web-browsers.md#embedded-vs-system-web-ui). +For more information and code examples, see [Choosing between an embedded web browser and a system browser on Xamarin Android](https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/wiki/MSAL.NET-uses-web-browser#choosing-between-embedded-web-browser-or-system-browser-on-xamarinandroid) and [Embedded versus system web UI](/entra/msal/dotnet/acquiring-tokens/using-web-browsers#embedded-vs-system-web-ui). |
active-directory | Msal Net Use Brokers With Xamarin Apps | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-net-use-brokers-with-xamarin-apps.md | Here are a few tips on avoiding issues when you implement brokered authenticatio ## Next steps -Learn about [Considerations for using Universal Windows Platform with MSAL.NET](msal-net-uwp-considerations.md). +Learn about [Considerations for using Universal Windows Platform with MSAL.NET](/entra/msal/dotnet/acquiring-tokens/desktop-mobile/uwp). |
active-directory | Msal Net Xamarin Android Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/msal-net-xamarin-android-considerations.md | var authResult = AcquireTokenInteractive(scopes) .ExecuteAsync(); ``` -For more information, see [Use web browsers for MSAL.NET](msal-net-web-browsers.md) and [Xamarin Android system browser considerations](msal-net-system-browser-android-considerations.md). +For more information, see [Use web browsers for MSAL.NET](/entr). ## Troubleshooting |
active-directory | Quickstart Single Page App Javascript Sign In | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/quickstart-single-page-app-javascript-sign-in.md | -This quickstart uses a sample JavaScript (JS) single-page app (SPA) to show you how to sign in users by using the [authorization code flow](./v2-oauth2-auth-code-flow.md) with Proof Key for Code Exchange (PKCE) and call the Microsoft Graph API. The sample uses the [Microsoft Authentication Library for JavaScript](/javascript/api/@azure/msal-react) to handle authentication. +This quickstart uses a sample JavaScript (JS) single-page app (SPA) to show you how to sign in users by using the [authorization code flow](./v2-oauth2-auth-code-flow.md) with Proof Key for Code Exchange (PKCE) and call the Microsoft Graph API. The sample uses the [Microsoft Authentication Library for JavaScript](/javascript/api/%40azure/msal-react/) to handle authentication. ## Prerequisites |
active-directory | Quickstart V2 Aspnet Core Webapp Calls Graph | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/quickstart-v2-aspnet-core-webapp-calls-graph.md | -> This section gives an overview of the code required to sign in users and call the Microsoft Graph API on their behalf. This overview can be useful to understand how the code works, main arguments, and also if you want to add sign-in to an existing ASP.NET Core application and call Microsoft Graph. It uses [Microsoft.Identity.Web](microsoft-identity-web.md), which is a wrapper around [MSAL.NET](msal-overview.md). +> This section gives an overview of the code required to sign in users and call the Microsoft Graph API on their behalf. This overview can be useful to understand how the code works, main arguments, and also if you want to add sign-in to an existing ASP.NET Core application and call Microsoft Graph. It uses [Microsoft.Identity.Web](/entr). > > ### How the sample works > -> The line containing `.AddMicrosoftIdentityWebApp` adds the Microsoft identity platform authentication to your application. This is provided by [Microsoft.Identity.Web](microsoft-identity-web.md). It's then configured to sign in using the Microsoft identity platform based on the information in the `AzureAD` section of the *appsettings.json* configuration file: +> The line containing `.AddMicrosoftIdentityWebApp` adds the Microsoft identity platform authentication to your application. This is provided by [Microsoft.Identity.Web](/entra/msal/dotnet/microsoft-identity-web/). It's then configured to sign in using the Microsoft identity platform based on the information in the `AzureAD` section of the *appsettings.json* configuration file: > > | *appsettings.json* key | Description > | > ||-| |
active-directory | Reference Breaking Changes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/reference-breaking-changes.md | Check this article regularly to learn about: > [!TIP] > To be notified of updates to this page, add this URL to your RSS feed reader:<br/>`https://learn.microsoft.com/api/search/rss?search=%22Azure+Active+Directory+breaking+changes+reference%22&locale=en-us` +## October 2023 ++### Updated RemoteConnect UX Prompt ++**Effective date**: October 2023 ++**Endpoints impacted**: v2.0 and v1.0 ++**Protocol impacted**: RemoteConnect ++RemoteConnect is a cross-device flow that is used for Microsoft Authentication Broker and Microsoft Intune related scenarios involving [Primary Refresh Tokens](../devices/concept-primary-refresh-token.md). To help prevent phishing attacks, the RemoteConnect flow will be receiving updated UX language to call out that the remote device (the device which initiated the flow) will be able to access any applications used by your organization upon successful completion of the flow. ++The prompt that appears will look something like this: ++ ## June 2023 ### Omission of email claims with an unverified domain owner You can review the current text of the 50105 error and more on the error lookup **Change** For single tenant applications, adding or updating the AppId URI validates that the domain in the HTTPS scheme URI is listed in the verified domain list in the customer tenant or that the value uses the default scheme (`api://{appId}`) provided by Azure AD. This could prevent applications from adding an AppId URI if the domain isn't in the verified domain list or the value doesn't use the default scheme.-To find more information on verified domains, refer to the [custom domains documentation](../../active-directory/fundamentals/add-custom-domain.md). +To find more information on verified domains, refer to the [custom domains documentation](../fundamentals/add-custom-domain.md). The change doesn't affect existing applications using unverified domains in their AppID URI. It validates only new applications or when an existing application updates an identifier URI or adds a new one to the identifierUri collection. The new restrictions apply only to URIs added to an app's identifierUris collection after October 15, 2021. AppId URIs already in an application's identifierUris collection when the restriction takes effect on October 15, 2021 will continue to function even if you add new URIs to that collection. |
active-directory | Reference Claims Mapping Policy Type | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/reference-claims-mapping-policy-type.md | Based on the method chosen, a set of inputs and outputs is expected. Define the |-|-|--|-| | **Join** | string1, string2, separator | output claim | Joins input strings by using a separator in between. For example, string1:`foo@bar.com` , string2:`sandbox` , separator:`.` results in output claim:`foo@bar.com.sandbox`. | | **ExtractMailPrefix** | Email or UPN | extracted string | Extension attributes 1-15 or any other directory extensions, which store a UPN or email address value for the user. For example, `johndoe@contoso.com`. Extracts the local part of an email address. For example, mail:`foo@bar.com` results in output claim:`foo`. If no \@ sign is present, then the original input string is returned. |+| **ToLowercase()** | string | output string | Converts the characters of the selected attribute into lowercase characters. | +| **ToUppercase()** | string | output string | Converts the characters of the selected attribute into uppercase characters. | +| **RegexReplace()** | | | RegexReplace() transformation accepts as input parameters:<br/>- Parameter 1: a user attribute as regex input<br/>- An option to trust the source as multivalued<br/>- Regex pattern<br/>- Replacement pattern. The replacement pattern may contain static text format along with a reference that points to regex output groups and more input parameters. | - **InputClaims** - Used to pass the data from a claim schema entry to a transformation. It has three attributes: **ClaimTypeReferenceId**, **TransformationClaimType** and **TreatAsMultiValue**. - **ClaimTypeReferenceId** - Joined with the ID element of the claim schema entry to find the appropriate input claim. |
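To show how the transformation elements described in that row fit together, here is a hedged sketch of a claims mapping policy that uses **ExtractMailPrefix()**. The policy JSON, claim names, and cmdlet usage are illustrative assumptions, not text from the article.

```PowerShell
# Sketch: create a claims mapping policy that emits the mail prefix as a "shortname" JWT claim.
Connect-MgGraph -Scopes "Policy.ReadWrite.ApplicationConfiguration"

$definition = @'
{
  "ClaimsMappingPolicy": {
    "Version": 1,
    "IncludeBasicClaimSet": "true",
    "ClaimsSchema": [
      { "Source": "user", "ID": "mail" },
      { "Source": "transformation", "ID": "localPart", "TransformationId": "ExtractPrefix", "JwtClaimType": "shortname" }
    ],
    "ClaimsTransformation": [
      {
        "ID": "ExtractPrefix",
        "TransformationMethod": "ExtractMailPrefix",
        "InputClaims": [ { "ClaimTypeReferenceId": "mail", "TransformationClaimType": "mail" } ],
        "OutputClaims": [ { "ClaimTypeReferenceId": "localPart", "TransformationClaimType": "createdClaim" } ]
      }
    ]
  }
}
'@

New-MgPolicyClaimMappingPolicy -Definition @($definition) -DisplayName "ExtractMailPrefix example"
```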
active-directory | Reference Error Codes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/reference-error-codes.md | The `error` field has several possible values - review the protocol documentatio | Error | Description | |||-| AADSTS16000 | SelectUserAccount - This is an interrupt thrown by Microsoft Entra ID, which results in UI that allows the user to select from among multiple valid SSO sessions. This error is fairly common and may be returned to the application if `prompt=none` is specified. | +| AADSTS16000 | SelectUserAccount - This is an interrupt thrown by Microsoft Entra ID, which results in UI that allows the user to select from among multiple valid SSO sessions. This error is fairly common and might be returned to the application if `prompt=none` is specified. | | AADSTS16001 | UserAccountSelectionInvalid - You'll see this error if the user selects on a tile that the session select logic has rejected. When triggered, this error allows the user to recover by picking from an updated list of tiles/sessions, or by choosing another account. This error can occur because of a code defect or race condition. | | AADSTS16002 | AppSessionSelectionInvalid - The app-specified SID requirement wasn't met. | | AADSTS160021| AppSessionSelectionInvalidSessionNotExist - Application requested a user session which does not exist. | The `error` field has several possible values - review the protocol documentatio | AADSTS50000 | TokenIssuanceError - There's an issue with the sign-in service. [Open a support ticket](../fundamentals/how-to-get-support.md) to resolve this issue. | | AADSTS50001 | InvalidResource - The resource is disabled or doesn't exist. Check your app's code to ensure that you have specified the exact resource URL for the resource you're trying to access. | | AADSTS50002 | NotAllowedTenant - Sign-in failed because of a restricted proxy access on the tenant. If it's your own tenant policy, you can change your restricted tenant settings to fix this issue. |-| AADSTS500011 | InvalidResourceServicePrincipalNotFound - The resource principal named {name} was not found in the tenant named {tenant}. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You might have sent your authentication request to the wrong tenant. If you expect the app to be installed, you may need to provide administrator permissions to add it. Check with the developers of the resource and application to understand what the right setup for your tenant is. | +| AADSTS500011 | InvalidResourceServicePrincipalNotFound - The resource principal named {name} was not found in the tenant named {tenant}. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You might have sent your authentication request to the wrong tenant. If you expect the app to be installed, you might need to provide administrator permissions to add it. Check with the developers of the resource and application to understand what the right setup for your tenant is. | | AADSTS500021 | Access to '{tenant}' tenant is denied. AADSTS500021 indicates that the tenant restriction feature is configured and that the user is trying to access a tenant that isn't in the list of allowed tenants specified in the header `Restrict-Access-To-Tenant`. 
For more information, see [Use tenant restrictions to manage access to SaaS cloud applications](../manage-apps/tenant-restrictions.md).| | AADSTS500022 | Access to '{tenant}' tenant is denied. AADSTS500022 indicates that the tenant restriction feature is configured and that the user is trying to access a tenant that isn't in the list of allowed tenants specified in the header `Restrict-Access-To-Tenant`. For more information, see [Use tenant restrictions to manage access to SaaS cloud applications](../manage-apps/tenant-restrictions.md).| | AADSTS50003 | MissingSigningKey - Sign-in failed because of a missing signing key or certificate. This might be because there was no signing key configured in the app. To learn more, see the troubleshooting article for error [AADSTS50003](/troubleshoot/azure/active-directory/error-code-aadsts50003-cert-or-key-not-configured). If you still see issues, contact the app owner or an app admin. | The `error` field has several possible values - review the protocol documentatio | AADSTS50014 | GuestUserInPendingState - The user account doesn't exist in the directory. An application likely chose the wrong tenant to sign into, and the currently logged in user was prevented from doing so since they did not exist in your tenant. If this user should be able to log in, add them as a guest. For further information, please visit [add B2B users](/azure/active-directory/b2b/add-users-administrator). | | AADSTS50015 | ViralUserLegalAgeConsentRequiredState - The user requires legal age group consent. | | AADSTS50017 | CertificateValidationFailed - Certificate validation failed for one of the following reasons:<ul><li>Cannot find issuing certificate in trusted certificates list</li><li>Unable to find expected CrlSegment</li><li>Delta CRL distribution point is configured without a corresponding CRL distribution point</li><li>Unable to retrieve valid CRL segments because of a timeout issue</li><li>Unable to download CRL</li></ul>Contact the tenant admin. |-| AADSTS50020 | UserUnauthorized - Users are unauthorized to call this endpoint. User account '{email}' from identity provider '{idp}' does not exist in tenant '{tenant}' and cannot access the application '{appid}'({appName}) in that tenant. This account needs to be added as an external user in the tenant first. Sign out and sign in again with a different Microsoft Entra user account. If this user should be a member of the tenant, they should be invited via the [B2B system](/azure/active-directory/b2b/add-users-administrator). For additional information, visit [AADSTS50020](/troubleshoot/azure/active-directory/error-code-aadsts50020-user-account-identity-provider-does-not-exist). | 
+| AADSTS50020 | UserUnauthorized - Users are unauthorized to call this endpoint. User account '{email}' from identity provider '{idp}' does not exist in tenant '{tenant}' and cannot access the application '{appid}'({appName}) in that tenant. This account needs to be added as an external user in the tenant first. Sign out and sign in again with a different Microsoft Entra user account. If this user should be a member of the tenant, they should be invited via the [B2B system](../external-identities/add-users-administrator.md). For additional information, visit [AADSTS50020](/troubleshoot/azure/active-directory/error-code-aadsts50020-user-account-identity-provider-does-not-exist).
| | AADSTS500208 | The domain is not a valid login domain for the account type - This situation occurs when the user's account does not match the expected account type for the given tenant.. For instance, if the tenant is configured to allow only work or school accounts, and the user tries to sign in with a personal Microsoft account, they will receive this error. | AADSTS500212 | NotAllowedByOutboundPolicyTenant - The user's administrator has set an outbound access policy that doesn't allow access to the resource tenant. | | AADSTS500213 | NotAllowedByInboundPolicyTenant - The resource tenant's cross-tenant access policy doesn't allow this user to access this tenant. | The `error` field has several possible values - review the protocol documentatio | AADSTS50029 | Invalid URI - domain name contains invalid characters. Contact the tenant admin. | | AADSTS50032 | WeakRsaKey - Indicates the erroneous user attempt to use a weak RSA key. | | AADSTS50033 | RetryableError - Indicates a transient error not related to the database operations. |-| AADSTS50034 | UserAccountNotFound - To sign into this application, the account must be added to the directory. This error can occur because the user mis-typed their username, or isn't in the tenant. An application may have chosen the wrong tenant to sign into, and the currently logged in user was prevented from doing so since they did not exist in your tenant. If this user should be able to log in, add them as a guest. See docs here: [Add B2B users](../external-identities/add-users-administrator.md). | +| AADSTS50034 | UserAccountNotFound - To sign into this application, the account must be added to the directory. This error can occur because the user mis-typed their username, or isn't in the tenant. An application might have chosen the wrong tenant to sign into, and the currently logged in user was prevented from doing so since they did not exist in your tenant. If this user should be able to log in, add them as a guest. See docs here: [Add B2B users](../external-identities/add-users-administrator.md). | | AADSTS50042 | UnableToGeneratePairwiseIdentifierWithMissingSalt - The salt required to generate a pairwise identifier is missing in principle. Contact the tenant admin. | | AADSTS50043 | UnableToGeneratePairwiseIdentifierWithMultipleSalts | | AADSTS50048 | SubjectMismatchesIssuer - Subject mismatches Issuer claim in the client assertion. Contact the tenant admin. | | AADSTS50049 | NoSuchInstanceForDiscovery - Unknown or invalid instance. | | AADSTS50050 | MalformedDiscoveryRequest - The request is malformed. | | AADSTS50053 | This error can result from two different reasons: <br><ul><li>IdsLocked - The account is locked because the user tried to sign in too many times with an incorrect user ID or password. The user is blocked due to repeated sign-in attempts. See [Remediate risks and unblock users](../identity-protection/howto-identity-protection-remediate-unblock.md).</li><li>Or, sign-in was blocked because it came from an IP address with malicious activity.</li></ul> <br>To determine which failure reason caused this error, sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least an [Cloud Application Administrator](../roles/permissions-reference.md#cloud-application-administrator). Navigate to your Microsoft Entra tenant and then **Monitoring & health** -> **Sign-in logs**. 
Find the failed user sign-in with **Sign-in error code** 50053 and check the **Failure reason**.|-| AADSTS50055 | InvalidPasswordExpiredPassword - The password is expired. The user's password is expired, and therefore their login or session was ended. They will be offered the opportunity to reset it, or may ask an admin to reset it via [Reset a user's password using Microsoft Entra ID](../fundamentals/users-reset-password-azure-portal.md). | +| AADSTS50055 | InvalidPasswordExpiredPassword - The password is expired. The user's password is expired, and therefore their login or session was ended. They will be offered the opportunity to reset it, or can ask an admin to reset it via [Reset a user's password using Microsoft Entra ID](../fundamentals/users-reset-password-azure-portal.md). | | AADSTS50056 | Invalid or null password: password doesn't exist in the directory for this user. The user should be asked to enter their password again. | | AADSTS50057 | UserDisabled - The user account is disabled. The user object in Active Directory backing this account has been disabled. An admin can re-enable this account [through PowerShell](/powershell/module/activedirectory/enable-adaccount) |-| AADSTS50058 | UserInformationNotProvided - Session information isn't sufficient for single-sign-on. This means that a user isn't signed in. This is a common error that's expected when a user is unauthenticated and has not yet signed in.</br>If this error is encountered in an SSO context where the user has previously signed in, this means that the SSO session was either not found or invalid.</br>This error may be returned to the application if prompt=none is specified. | +| AADSTS50058 | UserInformationNotProvided - Session information isn't sufficient for single-sign-on. This means that a user isn't signed in. This is a common error that's expected when a user is unauthenticated and has not yet signed in.</br>If this error is encountered in an SSO context where the user has previously signed in, this means that the SSO session was either not found or invalid.</br>This error might be returned to the application if prompt=none is specified. | | AADSTS50059 | MissingTenantRealmAndNoUserInformationProvided - Tenant-identifying information was not found in either the request or implied by any provided credentials. The user can contact the tenant admin to help resolve the issue. | | AADSTS50061 | SignoutInvalidRequest - Unable to complete sign out. The request was invalid. | | AADSTS50064 | CredentialAuthenticationError - Credential validation on username or password has failed. | The `error` field has several possible values - review the protocol documentatio | AADSTS50071 | SignoutMessageExpired - The logout request has expired. | | AADSTS50072 | UserStrongAuthEnrollmentRequiredInterrupt - User needs to enroll for second factor authentication (interactive). | | AADSTS50074 | UserStrongAuthClientAuthNRequiredInterrupt - Strong authentication is required and the user did not pass the MFA challenge. |-| AADSTS50076 | UserStrongAuthClientAuthNRequired - Due to a configuration change made by the admin such as a Conditional Access policy, per-user enforcement, or because you moved to a new location, the user must use multi-factor authentication to access the resource. Retry with a new authorize request for the resource. 
| -| AADSTS50078 | UserStrongAuthExpired- Presented multi-factor authentication has expired due to policies configured by your administrator, you must refresh your multi-factor authentication to access '{resource}'.| -| AADSTS50079 | UserStrongAuthEnrollmentRequired - Due to a configuration change made by the admin such as a Conditional Access policy, per-user enforcement, or because the user moved to a new location, the user is required to use multi-factor authentication. Either a managed user needs to register security info to complete multi-factor authentication, or a federated user needs to get the multi-factor claim from the federated identity provider. | +| AADSTS50076 | UserStrongAuthClientAuthNRequired - Due to a configuration change made by the admin such as a Conditional Access policy, per-user enforcement, or because you moved to a new location, the user must use multifactor authentication to access the resource. Retry with a new authorize request for the resource. | +| AADSTS50078 | UserStrongAuthExpired- Presented multifactor authentication has expired due to policies configured by your administrator, you must refresh your multifactor authentication to access '{resource}'.| +| AADSTS50079 | UserStrongAuthEnrollmentRequired - Due to a configuration change made by the admin such as a Conditional Access policy, per-user enforcement, or because the user moved to a new location, the user is required to use multifactor authentication. Either a managed user needs to register security info to complete multifactor authentication, or a federated user needs to get the multifactor claim from the federated identity provider. | | AADSTS50085 | Refresh token needs social IDP login. Have user try signing-in again with username -password | | AADSTS50086 | SasNonRetryableError | | AADSTS50087 | SasRetryableError - A transient error has occurred during strong authentication. Please try again. | The `error` field has several possible values - review the protocol documentatio | AADSTS50178 | SessionControlNotSupportedForPassthroughUsers - Session control isn't supported for passthrough users. | | AADSTS50180 | WindowsIntegratedAuthMissing - Integrated Windows authentication is needed. Enable the tenant for Seamless SSO. | | AADSTS50187 | DeviceInformationNotProvided - The service failed to perform device authentication. |-| AADSTS50194 | Application '{appId}'({appName}) isn't configured as a multi-tenant application. Usage of the /common endpoint isn't supported for such applications created after '{time}'. Use a tenant-specific endpoint or configure the application to be multi-tenant. | +| AADSTS50194 | Application '{appId}'({appName}) isn't configured as a multitenant application. Usage of the /common endpoint isn't supported for such applications created after '{time}'. Use a tenant-specific endpoint or configure the application to be multitenant. | | AADSTS50196 | LoopDetected - A client loop has been detected. Check the app's logic to ensure that token caching is implemented, and that error conditions are handled correctly. The app has made too many of the same request in too short a period, indicating that it is in a faulty state or is abusively requesting tokens. | | AADSTS50197 | ConflictingIdentities - The user could not be found. Try signing in again. | | AADSTS50199 | CmsiInterrupt - For security reasons, user confirmation is required for this request. Interrupt is shown for all scheme redirects in mobile browsers. <br />No action required. 
The user was asked to confirm that this app is the application they intended to sign into. <br />This is a security feature that helps prevent spoofing attacks. This occurs because a system webview has been used to request a token for a native application. <br />To avoid this prompt, the redirect URI should be part of the following safe list: <br />http://<br />https://<br />chrome-extension:// (desktop Chrome browser only) | The `error` field has several possible values - review the protocol documentatio | AADSTS53002 | ApplicationUsedIsNotAnApprovedApp - The app used isn't an approved app for Conditional Access. User needs to use one of the apps from the list of approved apps to use in order to get access. | | AADSTS53003 | BlockedByConditionalAccess - Access has been blocked by Conditional Access policies. The access policy does not allow token issuance. If this is unexpected, see the Conditional Access policy that applied to this request or contact your administrator. For additional information, please visit [troubleshooting sign-in with Conditional Access](../conditional-access/troubleshoot-conditional-access.md). | | AADSTS530035 |BlockedBySecurityDefaults - Access has been blocked by security defaults. This is due to the request using legacy auth or being deemed unsafe by security defaults policies. For additional information, please visit [enforced security policies](../fundamentals/security-defaults.md#enforced-security-policies).|-| AADSTS53004 | ProofUpBlockedDueToRisk - User needs to complete the multi-factor authentication registration process before accessing this content. User should register for multi-factor authentication. | -| AADSTS53010 | ProofUpBlockedDueToSecurityInfoAcr - Cannot configure multi-factor authentication methods because the organization requires this information to be set from specific locations or devices. | +| AADSTS53004 | ProofUpBlockedDueToRisk - User needs to complete the multifactor authentication registration process before accessing this content. User should register for multifactor authentication. | +| AADSTS53010 | ProofUpBlockedDueToSecurityInfoAcr - Cannot configure multifactor authentication methods because the organization requires this information to be set from specific locations or devices. | | AADSTS53011 | User blocked due to risk on home tenant. | | AADSTS530034 | DelegatedAdminBlockedDueToSuspiciousActivity - A delegated administrator was blocked from accessing the tenant due to account risk in their home tenant. | | AADSTS54000 | MinorUserBlockedLegalAgeGroupRule | | AADSTS54005 | OAuth2 Authorization code was already redeemed, please retry with a new valid code or use an existing refresh token. | | AADSTS65001 | DelegationDoesNotExist - The user or administrator has not consented to use the application with ID X. Send an interactive authorization request for this user and resource. |-| AADSTS65002 | Consent between first party application '{applicationId}' and first party resource '{resourceId}' must be configured via preauthorization - applications owned and operated by Microsoft must get approval from the API owner before requesting tokens for that API. A developer in your tenant may be attempting to reuse an App ID owned by Microsoft. This error prevents them from impersonating a Microsoft application to call other APIs. 
They must move to another app ID they register.| +| AADSTS65002 | Consent between first party application '{applicationId}' and first party resource '{resourceId}' must be configured via preauthorization - applications owned and operated by Microsoft must get approval from the API owner before requesting tokens for that API. A developer in your tenant might be attempting to reuse an App ID owned by Microsoft. This error prevents them from impersonating a Microsoft application to call other APIs. They must move to another app ID they register.| | AADSTS65004 | UserDeclinedConsent - User declined to consent to access the app. Have the user retry the sign-in and consent to the app|-| AADSTS65005 | MisconfiguredApplication - The app required resource access list does not contain apps discoverable by the resource or The client app has requested access to resource, which was not specified in its required resource access list or Graph service returned bad request or resource not found. If the app supports SAML, you may have configured the app with the wrong Identifier (Entity). To learn more, see the troubleshooting article for error [AADSTS650056](/troubleshoot/azure/active-directory/error-code-aadsts650056-misconfigured-app). | +| AADSTS65005 | MisconfiguredApplication - The app required resource access list does not contain apps discoverable by the resource or The client app has requested access to resource, which was not specified in its required resource access list or Graph service returned bad request or resource not found. If the app supports SAML, you might have configured the app with the wrong Identifier (Entity). To learn more, see the troubleshooting article for error [AADSTS650056](/troubleshoot/azure/active-directory/error-code-aadsts650056-misconfigured-app). | | AADSTS650052 | The app needs access to a service `(\"{name}\")` that your organization `\"{organization}\"` has not subscribed to or enabled. Contact your IT Admin to review the configuration of your service subscriptions. | | AADSTS650054 | The application asked for permissions to access a resource that has been removed or is no longer available. Make sure that all resources the app is calling are present in the tenant you're operating in. | | AADSTS650056 | Misconfigured application. This could be due to one of the following: the client has not listed any permissions for '{name}' in the requested permissions in the client's application registration. Or, the admin has not consented in the tenant. Or, check the application identifier in the request to ensure it matches the configured client application identifier. Or, check the certificate in the request to ensure it's valid. Please contact your admin to fix the configuration or consent on behalf of the tenant. Client app ID: {ID}. Please contact your admin to fix the configuration or consent on behalf of the tenant.| | AADSTS650057 | Invalid resource. The client has requested access to a resource which isn't listed in the requested permissions in the client's application registration. Client app ID: {appId}({appName}). Resource value from request: {resource}. Resource app ID: {resourceAppId}. List of valid resources from app registration: {regList}. | | AADSTS67003 | ActorNotValidServiceIdentity |-| AADSTS70000 | InvalidGrant - Authentication failed. The refresh token isn't valid. Error may be due to the following reasons:<ul><li>Token binding header is empty</li><li>Token binding hash does not match</li></ul> | +| AADSTS70000 | InvalidGrant - Authentication failed. 
The refresh token isn't valid. Error might be due to the following reasons:<ul><li>Token binding header is empty</li><li>Token binding hash does not match</li></ul> | | AADSTS70001 | UnauthorizedClient - The application is disabled. To learn more, see the troubleshooting article for error [AADSTS70001](/troubleshoot/azure/active-directory/error-code-aadsts70001-app-not-found-in-directory). | | AADSTS700011 | UnauthorizedClientAppNotFoundInOrgIdTenant - Application with identifier {appIdentifier} was not found in the directory. A client application requested a token from your tenant, but the client app doesn't exist in your tenant, so the call failed. | | AADSTS70002 | InvalidClient - Error validating the credentials. The specified client_secret does not match the expected value for this client. Correct the client_secret and try again. For more info, see [Use the authorization code to request an access token](v2-oauth2-auth-code-flow.md#redeem-a-code-for-an-access-token). | The `error` field has several possible values - review the protocol documentatio | AADSTS9001023 |The grant type isn't supported over the /common or /consumers endpoints. Please use the /organizations or tenant-specific endpoint.| | AADSTS90012 | RequestTimeout - The requested has timed out. | | AADSTS90013 | InvalidUserInput - The input from the user isn't valid. |-| AADSTS90014 | MissingRequiredField - This error code may appear in various cases when an expected field isn't present in the credential. | +| AADSTS90014 | MissingRequiredField - This error code might appear in various cases when an expected field isn't present in the credential. | | AADSTS900144 | The request body must contain the following parameter: '{name}'. Developer error - the app is attempting to sign in without the necessary or correct authentication parameters.| | AADSTS90015 | QueryStringTooLong - The query string is too long. | | AADSTS90016 | MissingRequiredClaim - The access token isn't valid. The required claim is missing. | The `error` field has several possible values - review the protocol documentatio | AADSTS90033 | MsodsServiceUnavailable - The Microsoft Online Directory Service (MSODS) isn't available. | | AADSTS90036 | MsodsServiceUnretryableFailure - An unexpected, non-retryable error from the WCF service hosted by MSODS has occurred. [Open a support ticket](../fundamentals/how-to-get-support.md) to get more details on the error. | | AADSTS90038 | NationalCloudTenantRedirection - The specified tenant 'Y' belongs to the National Cloud 'X'. Current cloud instance 'Z' does not federate with X. A cloud redirect error is returned. |+| AADSTS900384 | JWT token failed signature validation. Actual message content is runtime specific, there are a variety of causes for this error. Please see the returned exception message for details.| | AADSTS90043 | NationalCloudAuthCodeRedirection - The feature is disabled. | | AADSTS900432 | Confidential Client isn't supported in Cross Cloud request.| | AADSTS90051 | InvalidNationalCloudId - The national cloud identifier contains an invalid cloud identifier. | |
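The error codes in the rows above usually reach client code through MSAL exceptions rather than raw HTTP responses. As a rough, hedged sketch (not part of the updated article; the `app` and `scopes` values are assumed to come from your own application setup), an MSAL.NET client can log the AADSTS details and fall back to an interactive request when required:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Identity.Client;

public static class TokenHelper
{
    // Sketch only: "app" and "scopes" are assumed to come from your own application setup.
    public static async Task<AuthenticationResult?> AcquireTokenAsync(
        IPublicClientApplication app, string[] scopes)
    {
        try
        {
            var accounts = await app.GetAccountsAsync();
            // Try silent acquisition first; conditions such as AADSTS50058, 50076, or 50079
            // surface as MsalUiRequiredException and need a new interactive request.
            return await app.AcquireTokenSilent(scopes, accounts.FirstOrDefault())
                            .ExecuteAsync();
        }
        catch (MsalUiRequiredException)
        {
            return await app.AcquireTokenInteractive(scopes).ExecuteAsync();
        }
        catch (MsalServiceException ex)
        {
            // The AADSTS code is included in ex.Message; the correlation ID identifies
            // the same request in the Microsoft Entra sign-in logs.
            Console.WriteLine($"Token request failed: {ex.ErrorCode} (correlation ID {ex.CorrelationId})");
            return null;
        }
    }
}
```

The correlation ID printed here is the value you would use to locate the matching entry in the sign-in logs when troubleshooting one of the codes above.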
active-directory | Sample V2 Code | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/sample-v2-code.md | The following samples show how to build applications for the Python language and > [!div class="mx-tdCol2BreakAll"] > | App type | Code sample(s) <br/> on GitHub |Auth<br/> libraries |Auth flow | > | -- | -- |-- |-- |-> | Web application | • [Sign in users](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/1-Authentication/sign-in) <br/> • [Sign in users (B2C)](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/1-Authentication/sign-in-b2c) <br/> • [Call Microsoft Graph](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/2-Authorization-I/call-graph) <br/> • [Deploy to Azure App Service](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/3-Deployment/deploy-to-azure-app-service)| [MSAL Python](/entra/msal/python) | Authorization code | +> | Web application | • [Sign in users](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/1-Authentication/sign-in) <br/> • [Sign in users (B2C)](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/1-Authentication/sign-in-b2c) <br/> • [Call Microsoft Graph](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/2-Authorization-I/call-graph) <br/> • [Deploy to Azure App Service](https://github.com/Azure-Samples/ms-identity-python-django-tutorial/tree/main/3-Deployment/deploy-to-azure-app-service)| [MSAL Python](/entra/msal/python/) | Authorization code | ### Kotlin The following samples show how to build applications with Windows Presentation F > | App type | Code sample(s) <br/> on GitHub |Auth<br/> libraries |Auth flow | > | -- | -- |-- |-- | > | Desktop | [Sign in users and call Microsoft Graph](https://github.com/Azure-Samples/active-directory-dotnet-native-aspnetcore-v2/tree/master/2.%20Web%20API%20now%20calls%20Microsoft%20Graph) | [MSAL.NET](/entra/msal/dotnet) | Authorization code with PKCE |-> | Desktop | • [Sign in users and call ASP.NET Core web API](https://github.com/Azure-Samples/active-directory-dotnet-native-aspnetcore-v2/tree/master/1.%20Desktop%20app%20calls%20Web%20API) <br/> • [Sign in users and call Microsoft Graph](https://github.com/azure-samples/active-directory-dotnet-desktop-msgraph-v2) | [MSAL.NET](/entra/msal/dotnet) | Authorization code with PKCE | +> | Desktop | • [Sign in users and call ASP.NET Core web API](https://github.com/Azure-Samples/active-directory-dotnet-native-aspnetcore-v2/tree/master/1.%20Desktop%20app%20calls%20Web%20API) <br/> • [Sign in users and call Microsoft Graph](https://github.com/azure-samples/active-directory-dotnet-desktop-msgraph-v2) | [MSAL.NET](/entra/msal/dotnet/) | Authorization code with PKCE | |
active-directory | Scenario Daemon App Configuration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/scenario-daemon-app-configuration.md | MSAL.NET has two methods to provide signed assertions to the confidential client - `.WithClientAssertion()` - `.WithClientClaims()` -When you use `WithClientAssertion`, provide a signed JWT. This advanced scenario is detailed in [Client assertions](msal-net-client-assertions.md). +When you use `WithClientAssertion`, provide a signed JWT. This advanced scenario is detailed in [Client assertions](/entra/msal/dotnet/acquiring-tokens/msal-net-client-assertions). ```csharp string signedClientAssertion = ComputeAssertion(); app = ConfidentialClientApplicationBuilder.Create(config.ClientId) .Build(); ``` -Again, for details, see [Client assertions](msal-net-client-assertions.md). +Again, for details, see [Client assertions](/entra/msal/dotnet/acquiring-tokens/msal-net-client-assertions). |
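To make the client-assertion change above concrete, here is a minimal MSAL.NET sketch; the client ID, tenant ID, and the pre-signed JWT are placeholders rather than values taken from the article:

```csharp
using Microsoft.Identity.Client;

// Placeholder values; the signed JWT is produced out of band (for example, from a certificate).
string clientId = "<application-client-id>";
string tenantId = "<tenant-id>";
string signedClientAssertion = "<jwt-signed-with-your-certificate>";

IConfidentialClientApplication app = ConfidentialClientApplicationBuilder
    .Create(clientId)
    .WithAuthority($"https://login.microsoftonline.com/{tenantId}")
    .WithClientAssertion(signedClientAssertion) // used in place of .WithClientSecret(...)
    .Build();

AuthenticationResult result = await app
    .AcquireTokenForClient(new[] { "https://graph.microsoft.com/.default" })
    .ExecuteAsync();
```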
active-directory | Scenario Daemon Production | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/scenario-daemon-production.md | You'll need to explain to your customers how to perform these operations. For mo - Reference documentation for: - Instantiating [ConfidentialClientApplication](/dotnet/api/microsoft.identity.client.confidentialclientapplicationbuilder).- - Calling [AcquireTokenForClient](/dotnet/api/microsoft.identity.client.acquiretokenforclientparameterbuilder?view=azure-dotnet&preserve-view=true). + - Calling [AcquireTokenForClient](/dotnet/api/microsoft.identity.client.acquiretokenforclientparameterbuilder?preserve-view=true&view=msal-dotnet-latest&viewFallbackFrom=azure-dotnet). - Other samples/tutorials: - [microsoft-identity-platform-console-daemon](https://github.com/Azure-Samples/microsoft-identity-platform-console-daemon) features a small .NET Core daemon console application that displays the users of a tenant querying Microsoft Graph. |
active-directory | Scenario Desktop Acquire Token Interactive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/scenario-desktop-acquire-token-interactive.md | catch(MsalUiRequiredException) On both desktop and mobile applications, it's important to specify the parent by using `.WithParentActivityOrWindow`. In many cases, it's a requirement and MSAL will throw exceptions. -For desktop applications, see [Parent window handles](/azure/active-directory/develop/scenario-desktop-acquire-token-wam#parent-window-handles). +For desktop applications, see [Parent window handles](./scenario-desktop-acquire-token-wam.md#parent-window-handles). For mobile applications, provide `Activity` (Android) or `UIViewController` (iOS). The structure defines the following constants: #### WithUseEmbeddedWebView -This method enables you to specify if you want to force the usage of an embedded WebView or the system WebView (when available). For more information, see [Usage of web browsers](msal-net-web-browsers.md). +This method enables you to specify if you want to force the usage of an embedded WebView or the system WebView (when available). For more information, see [Usage of web browsers](/entra/msal/dotnet/acquiring-tokens/using-web-browsers). ```csharp var result = await app.AcquireTokenInteractive(scopes) |
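A compact illustration of the two builder methods discussed in this entry. The window handle is passed in as a parameter because retrieving it differs per UI framework; nothing here is prescribed by the article:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Identity.Client;

public static class InteractiveSignIn
{
    // The window handle is supplied by the caller (WPF, WinForms, and so on).
    public static Task<AuthenticationResult> SignInAsync(
        IPublicClientApplication app, string[] scopes, IntPtr parentWindowHandle)
    {
        return app.AcquireTokenInteractive(scopes)
            .WithParentActivityOrWindow(parentWindowHandle) // anchors the auth UI to your window
            .WithUseEmbeddedWebView(false)                  // false = prefer the system browser
            .ExecuteAsync();
    }
}
```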
active-directory | Scenario Mobile App Configuration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/scenario-mobile-app-configuration.md | For considerations about the browsers on Android, see [Xamarin.Android-specific On UWP, you can use corporate networks. The following sections explain the tasks that you should complete in the corporate scenario. -For more information, see [UWP-specific considerations with MSAL.NET](msal-net-uwp-considerations.md). +For more information, see [UWP-specific considerations with MSAL.NET](/entra/msal/dotnet/acquiring-tokens/desktop-mobile/uwp). ## Configure the application to use the broker |
active-directory | Scenario Web App Call Api App Configuration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/scenario-web-app-call-api-app-configuration.md | The following image shows the various possibilities of *Microsoft.Identity.Web* :::image type="content" source="media/scenarios/microsoft-identity-web-startup-cs.svg" alt-text="Block diagram showing service configuration options in startup dot C S for calling a web API and specifying a token cache implementation"::: > [!NOTE]-> To fully understand the code examples here, be familiar with [ASP.NET Core fundamentals](/aspnet/core/fundamentals), and in particular with [dependency injection](/aspnet/core/fundamentals/dependency-injection) and [options](/aspnet/core/fundamentals/configuration/options). +> To fully understand the code examples here, be familiar with [ASP.NET Core fundamentals](/aspnet/core/fundamentals/), and in particular with [dependency injection](/aspnet/core/fundamentals/dependency-injection) and [options](/aspnet/core/fundamentals/configuration/options). Code examples in this article and the following one are extracted from the [ASP.NET Web app sample](https://github.com/Azure-Samples/ms-identity-aspnet-webapp-openidconnect). You might want to refer to that sample for full implementation details. See `app.py` for the full context of that code. Instead of a client secret, the confidential client application can also prove its identity by using a client certificate or a client assertion.-The use of client assertions is an advanced scenario, detailed in [Client assertions](msal-net-client-assertions.md). +The use of client assertions is an advanced scenario, detailed in [Client assertions](/entra/msal/dotnet/acquiring-tokens/msal-net-client-assertions). ## Token cache > [!IMPORTANT]-> The token-cache implementation for web apps or web APIs is different from the implementation for desktop applications, which is often [file based](msal-net-token-cache-serialization.md). +> The token-cache implementation for web apps or web APIs is different from the implementation for desktop applications, which is often [file based](/entra/msal/dotnet/how-to/token-cache-serialization). > For security and performance reasons, it's important to ensure that for web apps and web APIs there is one token cache per user account. You must serialize the token cache for each account. # [ASP.NET Core](#tab/aspnetcore) -The ASP.NET core tutorial uses dependency injection to let you decide the token cache implementation in the Startup.cs file for your application. Microsoft.Identity.Web comes with prebuilt token-cache serializers described in [Token cache serialization](msal-net-token-cache-serialization.md). An interesting possibility is to choose ASP.NET Core [distributed memory caches](/aspnet/core/performance/caching/distributed#distributed-memory-cache): +The ASP.NET core tutorial uses dependency injection to let you decide the token cache implementation in the Startup.cs file for your application. Microsoft.Identity.Web comes with prebuilt token-cache serializers described in [Token cache serialization](/entra/msal/dotnet/how-to/token-cache-serialization). 
An interesting possibility is to choose ASP.NET Core [distributed memory caches](/aspnet/core/performance/caching/distributed#distributed-memory-cache): ```csharp // Use a distributed token cache by adding: For details about the token-cache providers, see also Microsoft.Identity.Web's [ # [ASP.NET](#tab/aspnet) -The ASP.NET tutorial uses dependency injection to let you decide the token cache implementation in the *Startup.Auth.cs* file for your application. *Microsoft.Identity.Web* comes with prebuilt token-cache serializers described in [Token cache serialization](msal-net-token-cache-serialization.md). An interesting possibility is to choose ASP.NET Core [distributed memory caches](/aspnet/core/performance/caching/distributed#distributed-memory-cache): +The ASP.NET tutorial uses dependency injection to let you decide the token cache implementation in the *Startup.Auth.cs* file for your application. *Microsoft.Identity.Web* comes with prebuilt token-cache serializers described in [Token cache serialization](/entra/msal/dotnet/how-to/token-cache-serialization). An interesting possibility is to choose ASP.NET Core [distributed memory caches](/aspnet/core/performance/caching/distributed#distributed-memory-cache): ```csharp var services = owinTokenAcquirerFactory.Services; services.AddDistributedSqlServerCache(options => For details about the token-cache providers, see also the *Microsoft.Identity.Web* [Token cache serialization](https://aka.ms/ms-id-web/token-cache-serialization) article, and the [ASP.NET Core Web app tutorials | Token caches](https://github.com/Azure-Samples/active-directory-aspnetcore-webapp-openidconnect-v2/tree/master/2-WebApp-graph-user/2-2-TokenCache) phase of the web app's tutorial. -For details see [Token cache serialization for MSAL.NET](./msal-net-token-cache-serialization.md). +For details see [Token cache serialization for MSAL.NET](/entra/msal/dotnet/how-to/token-cache-serialization). # [Java](#tab/java) |
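A small ASP.NET Core sketch of the pattern these diffs describe, using the usual Microsoft.Identity.Web extension methods; the "AzureAd" section name and the initial scope are assumptions, not quotes from the article:

```csharp
using Microsoft.Identity.Web;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddMicrosoftIdentityWebAppAuthentication(builder.Configuration, "AzureAd")
    .EnableTokenAcquisitionToCallDownstreamApi(new[] { "user.read" })
    .AddDistributedTokenCaches(); // one serialized cache entry per user account

// Development-friendly backing store; swap for AddDistributedSqlServerCache or
// AddStackExchangeRedisCache in production.
builder.Services.AddDistributedMemoryCache();

var app = builder.Build();
app.Run();
```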
active-directory | Secure Group Access Control | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/secure-group-access-control.md | The following table presents several security best practices for security groups |--|| |**Ensure resource owner and group owner are the same principal**. Applications should build their own group management experience and create new groups to manage access. For example, an application can create groups with the `Group.Create` permission and add itself as the owner of the group. This way the application has control over its groups without being over privileged to modify other groups in the tenant.|When group owners and resource owners are different entities, group owners can add users to the group who aren't supposed to access the resource but can then access it unintentionally.| |**Build an implicit contract between the resource owner and group owner**. The resource owner and the group owner should align on the group purpose, policies, and members that can be added to the group to get access to the resource. This level of trust is non-technical and relies on human or business contract.|When group owners and resource owners have different intentions, the group owner may add users to the group the resource owner didn't intend on giving access to. This action can result in unnecessary and potentially risky access.|-|**Use private groups for access control**. Microsoft 365 groups are managed by the [visibility concept](/graph/api/resources/group?view=graph-rest-1.0#group-visibility-options&preserve-view=true). This property controls the join policy of the group and visibility of group resources. Security groups have join policies that either allow anyone to join or require owner approval. On-premises-synced groups can also be public or private. Users joining an on-premises-synced group can get access to cloud resource as well.|When you use a public group for access control, any member can join the group and get access to the resource. The risk of elevation of privilege exists when a public group is used to give access to an external resource.| +|**Use private groups for access control**. Microsoft 365 groups are managed by the [visibility concept](/graph/api/resources/group?view=graph-rest-1.0&preserve-view=true#group-visibility-options). This property controls the join policy of the group and visibility of group resources. Security groups have join policies that either allow anyone to join or require owner approval. On-premises-synced groups can also be public or private. Users joining an on-premises-synced group can get access to cloud resource as well.|When you use a public group for access control, any member can join the group and get access to the resource. The risk of elevation of privilege exists when a public group is used to give access to an external resource.| |**Group nesting**. When you use a group for access control and it has other groups as its members, members of the subgroups can get access to the resource. In this case, there are multiple group owners of the parent group and the subgroups.|Aligning with multiple group owners on the purpose of each group and how to add the right members to these groups is more complex and more prone to accidental grant of access. Limit the number of nested groups or don't use them at all if possible.| ## Next steps |
active-directory | Support Fido2 Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/support-fido2-authentication.md | -These configurations and best practices will help you avoid common scenarios that block [FIDO2 passwordless authentication](../../active-directory/authentication/concept-authentication-passwordless.md) from being available to users of your applications. +These configurations and best practices will help you avoid common scenarios that block [FIDO2 passwordless authentication](../authentication/concept-authentication-passwordless.md) from being available to users of your applications. ## General best practices ### Domain hints -Don't use a domain hint to bypass [home-realm discovery](../../active-directory/manage-apps/configure-authentication-for-federated-users-portal.md). This feature is meant to make sign-ins more streamlined, but the federated identity provider may not support passwordless authentication. +Don't use a domain hint to bypass [home-realm discovery](../manage-apps/configure-authentication-for-federated-users-portal.md). This feature is meant to make sign-ins more streamlined, but the federated identity provider may not support passwordless authentication. ### Requiring specific credentials The availability of FIDO2 passwordless authentication for applications that run ## Next steps -[Passwordless authentication options for Microsoft Entra ID](../../active-directory/authentication/concept-authentication-passwordless.md) +[Passwordless authentication options for Microsoft Entra ID](../authentication/concept-authentication-passwordless.md) |
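To illustrate the domain-hint guidance in MSAL.NET terms (a hedged sketch; the scopes and user principal name are placeholders): prefer a login hint, which keeps home-realm discovery, and with it FIDO2 sign-in, available:

```csharp
using System.Threading.Tasks;
using Microsoft.Identity.Client;

public static class SignInHints
{
    public static Task<AuthenticationResult> SignInAsync(
        IPublicClientApplication app, string[] scopes, string userPrincipalName)
    {
        // A login hint prefills the username but still lets Microsoft Entra run
        // home-realm discovery, so passwordless/FIDO2 options stay available.
        return app.AcquireTokenInteractive(scopes)
            .WithLoginHint(userPrincipalName)
            // Avoid forcing a realm, for example:
            // .WithExtraQueryParameters("domain_hint=contoso.com")
            // which can route the user to a federated IdP that lacks FIDO2 support.
            .ExecuteAsync();
    }
}
```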
active-directory | Troubleshoot Publisher Verification | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/troubleshoot-publisher-verification.md | Below are some common issues that may occur during the process. Your app registrations may have been created using a different user account in this tenant, a personal/consumer account, or in a different tenant. Ensure you're signed in with the correct account in the tenant where your app registrations were created. - **I'm getting an error related to multi-factor authentication. What should I do?** - Ensure [multi-factor authentication](../fundamentals/concept-fundamentals-mfa-get-started.md) is enabled and **required** for the user you're signing in with and for this scenario. For example, MFA could be: + Ensure [multi-factor authentication](../authentication/concept-mfa-licensing.md) is enabled and **required** for the user you're signing in with and for this scenario. For example, MFA could be: - Always required for the user you're signing in with. - [Required for Azure management](../conditional-access/howto-conditional-access-policy-azure-management.md). - [Required for the type of administrator](../conditional-access/howto-conditional-access-policy-admin-mfa.md) you're signing in with. Occurs when multi-factor authentication (MFA) hasn't been enabled and performed The error message displayed will be: "Due to a configuration change made by your administrator, or because you moved to a new location, you must use multi-factor authentication to proceed." **Remediation Steps**-1. Ensure [multi-factor authentication](../fundamentals/concept-fundamentals-mfa-get-started.md) is enabled and **required** for the user you're signing in with and for this scenario +1. Ensure [multi-factor authentication](../authentication/concept-mfa-licensing.md) is enabled and **required** for the user you're signing in with and for this scenario 1. Retry Publisher Verification ### UserUnableToAddPublisher |
active-directory | Tutorial V2 Windows Uwp | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/tutorial-v2-windows-uwp.md | In this tutorial: ## Prerequisites -* [Visual Studio 2019](https://visualstudio.microsoft.com/vs/) with the [Universal Windows Platform development](/windows/uwp/get-started/get-set-up) workload installed +* [Visual Studio 2019](https://visualstudio.microsoft.com/vs/) with the [Universal Windows Platform development](/windows/apps/windows-app-sdk/set-up-your-development-environment) workload installed ## How this guide works |
active-directory | Userinfo | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/userinfo.md | The claims shown in the response are all those that the UserInfo endpoint can re You can't add to or customize the information returned by the UserInfo endpoint. -To customize the information returned by the identity platform during authentication and authorization, use [claims mapping](active-directory-claims-mapping.md) and [optional claims](active-directory-optional-claims.md) to modify security token configuration. +To customize the information returned by the identity platform during authentication and authorization, use [claims mapping](./saml-claims-customization.md) and [optional claims](./optional-claims.md) to modify security token configuration. ## Next steps |
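Because the UserInfo endpoint is a plain OIDC endpoint, a bearer call is all that is needed; in this sketch the access token is a placeholder and must carry at least the `openid` scope:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

string accessToken = "<access-token-with-at-least-the-openid-scope>"; // placeholder

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

// OIDC UserInfo endpoint for the Microsoft identity platform.
string json = await http.GetStringAsync("https://graph.microsoft.com/oidc/userinfo");
Console.WriteLine(json); // fixed claim set (sub, name, email, ...); not customizable
```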
active-directory | V2 Admin Consent | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/v2-admin-consent.md | http://localhost/myapp/permissions | `scope` | The set of permissions that were granted access to, for the application.| | `admin_consent` | Will be set to `True`.| +> [!WARNING] +> Never use the **tenant ID** value of the `tenant` parameter to authenticate or authorize users. The tenant ID value can be updated and sent by bad actors to impersonate a response to your app. This can cause your application to be exposed to security incidents. + ### Error response ```none |
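To tie the new warning to code, here is a hedged sketch of building the admin consent URL and treating the returned `tenant` value as untrusted input; the client ID, state, and allow-list entries are placeholders:

```csharp
using System;
using System.Collections.Generic;

string clientId = "<application-client-id>";                                    // placeholder
string redirectUri = Uri.EscapeDataString("http://localhost/myapp/permissions");

// v2.0 admin consent endpoint; "common" lets an admin consent for their own tenant.
string adminConsentUrl =
    "https://login.microsoftonline.com/common/v2.0/adminconsent" +
    $"?client_id={clientId}" +
    "&scope=https://graph.microsoft.com/.default" +
    $"&redirect_uri={redirectUri}" +
    "&state=12345"; // arbitrary state value that you validate on the redirect

// Treat the `tenant` query parameter on the redirect as untrusted input: only continue
// for tenants you have already onboarded, and never use it as proof of identity.
var onboardedTenants = new HashSet<string> { "<known-tenant-id-1>", "<known-tenant-id-2>" };

bool IsExpectedConsentResponse(string tenantFromQuery, string adminConsentFromQuery) =>
    onboardedTenants.Contains(tenantFromQuery) &&
    string.Equals(adminConsentFromQuery, "True", StringComparison.OrdinalIgnoreCase);

Console.WriteLine(adminConsentUrl);
```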
active-directory | Whats New Docs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/develop/whats-new-docs.md | Welcome to what's new in the Microsoft identity platform documentation. This art - [Access tokens in the Microsoft identity platform](access-tokens.md) - Improve the explanations on how to validate a token - [Claims mapping policy type](reference-claims-mapping-policy-type.md) - Updates to Restricted Claims Set-- [Migrate confidential client applications from ADAL.NET to MSAL.NET](msal-net-migration-confidential-client.md) - Improving clarity in the content+- [Migrate confidential client applications from ADAL.NET to MSAL.NET](/entra/msal/dotnet/how-to/migrate-confidential-client) - Improving clarity in the content - [Single sign-on with MSAL.js](msal-js-sso.md) - Add guidance on using the loginHint claim for SSO - [Tutorial: Create a Blazor Server app that uses the Microsoft identity platform for authentication](tutorial-blazor-server.md) - Simplified and leverage the Microsoft Identity App Sync .NET tool |
active-directory | Concept Directory Join | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/concept-directory-join.md | The goal of Microsoft Entra joined devices is to simplify: Microsoft Entra join can be deployed by using any of the following methods: -- [Windows Autopilot](/windows/deployment/windows-autopilot/windows-10-autopilot)-- [Bulk deployment](/intune/windows-bulk-enroll)+- [Windows Autopilot](/autopilot/windows-autopilot) +- [Bulk deployment](/mem/intune/enrollment/windows-bulk-enroll) - [Self-service experience](device-join-out-of-box.md) ## Next steps |
active-directory | Howto Vm Sign In Azure Ad Linux | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/howto-vm-sign-in-azure-ad-linux.md | There are a few ways to open Cloud Shell: If you choose to install and use the Azure CLI locally, this article requires you to use version 2.22.1 or later. Run `az --version` to find the version. If you need to install or upgrade, see [Install the Azure CLI](/cli/azure/install-azure-cli). 1. Create a resource group by running [az group create](/cli/azure/group#az-group-create).-1. Create a VM by running [az vm create](/cli/azure/vm?preserve-view=true#az-vm-create&preserve-view=true). Use a supported distribution in a supported region. +1. Create a VM by running [az vm create](/cli/azure/vm?#az-vm-create). Use a supported distribution in a supported region. 1. Install the Microsoft Entra login VM extension by using [az vm extension set](/cli/azure/vm/extension#az-vm-extension-set). The following example deploys a VM and then installs the extension to enable Microsoft Entra login for a Linux VM. VM extensions are small applications that provide post-deployment configuration and automation tasks on Azure virtual machines. Customize the example as needed to support your testing requirements. |
active-directory | Hybrid Join Control | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/hybrid-join-control.md | If your Microsoft Entra ID is federated with AD FS, you first need to configure To register Windows down-level devices, organizations must install [Microsoft Workplace Join for non-Windows 10 computers](https://www.microsoft.com/download/details.aspx?id=53554) available on the Microsoft Download Center. -You can deploy the package by using a software distribution system like [Microsoft Configuration Manager](/configmgr/). The package supports the standard silent installation options with the quiet parameter. The current branch of Configuration Manager offers benefits over earlier versions, like the ability to track completed registrations. +You can deploy the package by using a software distribution system like [Microsoft Configuration Manager](/mem/configmgr/). The package supports the standard silent installation options with the quiet parameter. The current branch of Configuration Manager offers benefits over earlier versions, like the ability to track completed registrations. The installer creates a scheduled task on the system that runs in the user context. The task is triggered when the user signs in to Windows. The task silently joins the device with Microsoft Entra ID with the user credentials after authenticating with Microsoft Entra ID. |
active-directory | Manage Device Identities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/manage-device-identities.md | From there, you can go to **All devices** to: - Identify devices, including: - Devices that have been joined or registered in Microsoft Entra ID.- - Devices deployed via [Windows Autopilot](/windows/deployment/windows-autopilot/windows-autopilot). + - Devices deployed via [Windows Autopilot](/autopilot/windows-autopilot). - Printers that use [Universal Print](/universal-print/fundamentals/universal-print-getting-started). - Complete device identity management tasks like enable, disable, delete, and manage. - The management options for [Printers](/universal-print/fundamentals/) and [Windows Autopilot](/autopilot/windows-autopilot) are limited in Microsoft Entra ID. These devices must be managed from their respective admin interfaces. You must be assigned one of the following roles to manage device settings: - **Users may join devices to Microsoft Entra ID**: This setting enables you to select the users who can register their devices as Microsoft Entra joined devices. The default is **All**. > [!NOTE]- > The **Users may join devices to Microsoft Entra ID** setting is applicable only to Microsoft Entra join on Windows 10 or newer. This setting doesn't apply to Microsoft Entra hybrid joined devices, [Microsoft Entra joined VMs in Azure](./howto-vm-sign-in-azure-ad-windows.md#enable-azure-ad-login-for-a-windows-vm-in-azure), or Microsoft Entra joined devices that use [Windows Autopilot self-deployment mode](/mem/autopilot/self-deploying) because these methods work in a userless context. + > The **Users may join devices to Microsoft Entra ID** setting is applicable only to Microsoft Entra join on Windows 10 or newer. This setting doesn't apply to Microsoft Entra hybrid joined devices, [Microsoft Entra joined VMs in Azure](./howto-vm-sign-in-azure-ad-windows.md#enable-azure-ad-login-for-a-windows-vm-in-azure), or Microsoft Entra joined devices that use [Windows Autopilot self-deployment mode](/autopilot/self-deploying) because these methods work in a userless context. - **Users may register their devices with Microsoft Entra ID**: You need to configure this setting to allow users to register Windows 10 or newer personal, iOS, Android, and macOS devices with Microsoft Entra ID. If you select **None**, devices aren't allowed to register with Microsoft Entra ID. Enrollment with Microsoft Intune or mobile device management for Microsoft 365 requires registration. If you've configured either of these services, **ALL** is selected, and **NONE** is unavailable. - **Require multifactor authentication to register or join devices with Microsoft Entra ID**: |
active-directory | Plan Device Deployment | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/plan-device-deployment.md | Conditional Access <br>(Require Microsoft Entra hybrid joined devices) | | | ![C ## Microsoft Entra Registration -Registered devices are often managed with [Microsoft Intune](/mem/intune/enrollment/device-enrollment). Devices are enrolled in Intune in several ways, depending on the operating system. +Registered devices are often managed with [Microsoft Intune](/mem/intune/fundamentals/deployment-guide-enrollment). Devices are enrolled in Intune in several ways, depending on the operating system. Microsoft Entra registered devices provide support for Bring Your Own Devices (BYOD) and corporate owned devices to SSO to cloud resources. Access to resources is based on the Microsoft Entra [Conditional Access policies](../conditional-access/concept-conditional-access-grant.md) applied to the device and the user. Review supported and unsupported platforms for integrated devices: | Device management tools | Microsoft Entra registered | Microsoft Entra joined | Microsoft Entra hybrid joined | | | :: | :: | :: |-| [Mobile Device Management (MDM)](/windows/client-management/mdm/azure-active-directory-integration-with-mdm) <br>Example: Microsoft Intune | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | +| [Mobile Device Management (MDM)](/windows/client-management/azure-active-directory-integration-with-mdm) <br>Example: Microsoft Intune | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | | [Co-management with Microsoft Intune and Microsoft Configuration Manager](/mem/configmgr/comanage/overview) <br>(Windows 10 or newer) | | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | | [Group policy](/previous-versions/windows/it-pro/windows-server-2012-R2-and-2012/hh831791(v=ws.11))<br>(Windows only) | | | ![Checkmark for these values.](./media/plan-device-deployment/check.png) | |
active-directory | Troubleshoot Mac Sso Extension Plugin | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/devices/troubleshoot-mac-sso-extension-plugin.md | Finished SSO request. At this point in the authentication/authorization flow, the PRT has been bootstrapped and it should be visible in the macOS keychain access. See [Checking Keychain Access for PRT](#checking-keychain-access-for-prt). The **MSAL macOS sample** application uses the access token received from the Microsoft SSO Extension Broker to display the user's information. -Next, examine server-side [Microsoft Entra sign-in logs](../reports-monitoring/reference-basic-info-sign-in-logs.md#correlation-id) based on the correlation ID collected from the client-side SSO extension logs. For more information, see [Sign-in logs in Microsoft Entra ID](../reports-monitoring/concept-sign-ins.md). +Next, examine server-side [Microsoft Entra sign-in logs](../reports-monitoring/concept-sign-in-log-activity-details.md) based on the correlation ID collected from the client-side SSO extension logs. For more information, see [Sign-in logs in Microsoft Entra ID](../reports-monitoring/concept-sign-ins.md). <a name='view-azure-ad-sign-in-logs-by-correlation-id-filter'></a> |
active-directory | Domains Manage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/domains-manage.md | You must change or delete any such resource in your Microsoft Entra organization You can **ForceDelete** a domain name in the [Azure portal](https://portal.azure.com) or using [Microsoft Graph API](/graph/api/domain-forcedelete). These options use an asynchronous operation and update all references from the custom domain name like "user@contoso.com" to the initial default domain name such as "user@contoso.onmicrosoft.com." -To call **ForceDelete** in the Azure portal, you must ensure that there are fewer than 1000 references to the domain name, and any references where Exchange is the provisioning service must be updated or removed in the [Exchange Admin Center](https://outlook.office365.com/ecp/). This includes Exchange Mail-Enabled Security Groups and distributed lists. For more information, see [Removing mail-enabled security groups](/Exchange/recipients/mail-enabled-security-groups#Remove%20mail-enabled%20security%20groups&preserve-view=true). Also, the **ForceDelete** operation won't succeed if either of the following is true: +To call **ForceDelete** in the Azure portal, you must ensure that there are fewer than 1000 references to the domain name, and any references where Exchange is the provisioning service must be updated or removed in the [Exchange Admin Center](https://outlook.office365.com/ecp/). This includes Exchange Mail-Enabled Security Groups and distributed lists. For more information, see [Removing mail-enabled security groups](/Exchange/recipients/mail-enabled-security-groups#Remove%20mail-enabled%20security%20groups). Also, the **ForceDelete** operation won't succeed if either of the following is true: * You purchased a domain via Microsoft 365 domain subscription services * You are a partner administering on behalf of another customer organization If you find that any of the conditions haven't been met, manually clean up the Most management tasks for domain names in Microsoft Entra ID can also be completed using Microsoft PowerShell, or programmatically using the Microsoft Graph API. -* [Using PowerShell to manage domain names in Microsoft Entra ID](/powershell/module/azuread/#domains&preserve-view=true) +* [Using PowerShell to manage domain names in Microsoft Entra ID](/powershell/module/azuread/?preserve-view=true#domains) * [Domain resource type](/graph/api/resources/domain) ## Next steps -* [Add custom domain names](../fundamentals/add-custom-domain.md?context=azure%2factive-directory%2fusers-groups-roles%2fcontext%2fugr-context) -* [Remove Exchange mail-enabled security groups in Exchange Admin Center on a custom domain name in Microsoft Entra ID](/Exchange/recipients/mail-enabled-security-groups#Remove%20mail-enabled%20security%20groups&preserve-view=true) +* [Add custom domain names](../fundamentals/add-custom-domain.md?context=azure/active-directory/users-groups-roles/context/ugr-context) +* [Remove Exchange mail-enabled security groups in Exchange Admin Center on a custom domain name in Microsoft Entra ID](/exchange/recipients/mail-enabled-security-groups?preserve-view=true#Remove%20mail-enabled%20security%20groups) * [ForceDelete a custom domain name with Microsoft Graph API](/graph/api/domain-forcedelete) |
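For the Microsoft Graph path referenced above, ForceDelete is a single POST to the domain's `forceDelete` action; in this sketch the domain name is a placeholder and the token is assumed to carry `Domain.ReadWrite.All`:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

string accessToken = "<graph-access-token>"; // assumed to carry Domain.ReadWrite.All
string domainName = "contoso.com";           // placeholder custom domain name

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

// Optional body: also disable user accounts that still reference the domain while
// their sign-in names are rewritten to the initial .onmicrosoft.com domain.
var body = new StringContent("{\"disableUserAccounts\": true}", Encoding.UTF8, "application/json");

HttpResponseMessage response = await http.PostAsync(
    $"https://graph.microsoft.com/v1.0/domains/{domainName}/forceDelete", body);

Console.WriteLine(response.StatusCode); // 204 NoContent on success
```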
active-directory | Domains Verify Custom Subdomain | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/domains-verify-custom-subdomain.md | -In the Azure portal, when the parent domain is federated and the admin tries to verify a managed subdomain on the **Custom domain names** page, you'll get a 'Failed to add domain' error with the reason "One or more properties contains invalid values." If you try to add this subdomain from the Microsoft 365 admin center, you'll receive a similar error. For more information about the error, see [A child domain doesn't inherit parent domain changes in Office 365, Azure, or Intune](/office365/troubleshoot/administration/child-domain-fails-inherit-parent-domain-changes). +In the Azure portal, when the parent domain is federated and the admin tries to verify a managed subdomain on the **Custom domain names** page, you'll get a 'Failed to add domain' error with the reason "One or more properties contains invalid values." If you try to add this subdomain from the Microsoft 365 admin center, you'll receive a similar error. For more information about the error, see [A child domain doesn't inherit parent domain changes in Office 365, Azure, or Intune](/microsoft-365/troubleshoot/administration/child-domain-fails-inherit-parent-domain-changes). Because subdomains inherit the authentication type of the root domain by default, you must promote the subdomain to a root domain in Microsoft Entra ID using the Microsoft Graph so you can set the authentication type to your desired type. Invoking API with a federated verified subdomain with user references | POST | 4 ## Next steps -- [Add custom domain names](../fundamentals/add-custom-domain.md?context=azure%2factive-directory%2fusers-groups-roles%2fcontext%2fugr-context)+- [Add custom domain names](../fundamentals/add-custom-domain.md?context=azure/active-directory/users-groups-roles/context/ugr-context) - [Manage domain names](domains-manage.md) - [ForceDelete a custom domain name with Microsoft Graph API](/graph/api/domain-forcedelete) |
active-directory | Groups Dynamic Membership | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/groups-dynamic-membership.md | The following device attributes can be used. managementType | MDM (for mobile devices) | device.managementType -eq "MDM" memberOf | Any string value (valid group object ID) | device.memberof -any (group.objectId -in ['value']) objectId | a valid Microsoft Entra object ID | device.objectId -eq "76ad43c9-32c5-45e8-a272-7b58b58f596d"- profileType | a valid [profile type](/graph/api/resources/device?view=graph-rest-1.0#properties&preserve-view=true) in Microsoft Entra ID | device.profileType -eq "RegisteredDevice" + profileType | a valid [profile type](/graph/api/resources/device?view=graph-rest-1.0&preserve-view=true#properties) in Microsoft Entra ID | device.profileType -eq "RegisteredDevice" systemLabels | any string matching the Intune device property for tagging Modern Workplace devices | device.systemLabels -contains "M365Managed" <!-- docutune:enable --> The following device attributes can be used. > [!NOTE] > When using `deviceOwnership` to create Dynamic Groups for devices, you need to set the value equal to `Company`. On Intune the device ownership is represented instead as Corporate. For more information, see [OwnerTypes](/mem/intune/developer/reports-ref-devices#ownertypes) for more details. > When using `deviceTrustType` to create Dynamic Groups for devices, you need to set the value equal to `AzureAD` to represent Microsoft Entra joined devices, `ServerAD` to represent Microsoft Entra hybrid joined devices or `Workplace` to represent Microsoft Entra registered devices.-> When using `extensionAttribute1-15` to create Dynamic Groups for devices you need to set the value for `extensionAttribute1-15` on the device. Learn more on [how to write `extensionAttributes` on a Microsoft Entra device object](/graph/api/device-update?view=graph-rest-1.0&tabs=http#example-2--write-extensionattributes-on-a-device&preserve-view=true) +> When using `extensionAttribute1-15` to create Dynamic Groups for devices you need to set the value for `extensionAttribute1-15` on the device. Learn more on [how to write `extensionAttributes` on a Microsoft Entra device object](/graph/api/device-update?view=graph-rest-1.0&tabs=http&preserve-view=true#example-2--write-extensionattributes-on-a-device) ## Next steps |
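As an illustrative (not prescriptive) use of the device rule syntax above, a dynamic security group can be created through Microsoft Graph by supplying a `membershipRule`; the group name, mail nickname, and rule below are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

string accessToken = "<graph-access-token>"; // assumed to carry Group.ReadWrite.All

// A dynamic device group built from one of the attributes in the table above.
string payload = """
{
  "displayName": "M365 managed devices (dynamic)",
  "mailEnabled": false,
  "mailNickname": "m365ManagedDevices",
  "securityEnabled": true,
  "groupTypes": [ "DynamicMembership" ],
  "membershipRule": "device.systemLabels -contains \"M365Managed\"",
  "membershipRuleProcessingState": "On"
}
""";

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

HttpResponseMessage response = await http.PostAsync(
    "https://graph.microsoft.com/v1.0/groups",
    new StringContent(payload, Encoding.UTF8, "application/json"));

Console.WriteLine(response.StatusCode); // 201 Created on success
```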
active-directory | Groups Lifecycle | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/groups-lifecycle.md | If the group you're restoring contains documents, SharePoint sites, or other per ## How to retrieve Microsoft 365 group expiration date -In addition to Access Panel where users can view group details including expiration date and last renewed date, expiration date of a Microsoft 365 group can be retrieved from Microsoft Graph REST API Beta. expirationDateTime as a group property has been enabled in Microsoft Graph Beta. It can be retrieved with a GET request. For more details, please refer to [this example](/graph/api/group-get?view=graph-rest-beta#example&preserve-view=true). +In addition to Access Panel where users can view group details including expiration date and last renewed date, expiration date of a Microsoft 365 group can be retrieved from Microsoft Graph REST API Beta. expirationDateTime as a group property has been enabled in Microsoft Graph Beta. It can be retrieved with a GET request. For more details, please refer to [this example](/graph/api/group-get?view=graph-rest-beta&preserve-view=true#example). > [!NOTE] > In order to manage group memberships on Access Panel, "Restrict access to Groups in Access Panel" needs to be set to "No" in Microsoft Entra groups General Setting. |
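A quick way to read that property from the beta endpoint, as the entry describes; the group ID and token are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

string accessToken = "<graph-access-token>";          // e.g. delegated Group.Read.All
string groupId = "<microsoft-365-group-object-id>";   // placeholder

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

// expirationDateTime is surfaced on the beta group resource, as the entry above notes.
string json = await http.GetStringAsync(
    $"https://graph.microsoft.com/beta/groups/{groupId}?$select=displayName,expirationDateTime");

Console.WriteLine(json);
```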
active-directory | Groups Settings V2 Cmdlets | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/groups-settings-v2-cmdlets.md | $param = @{ mailNickname="Demo" } -New-MgGroup -BodyParameter $param +New-MgGroup @param ``` ## Update groups |
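The change above replaces `-BodyParameter $param` with splatting. As a point of reference, here is a minimal, hypothetical sketch of that pattern with a complete hashtable; the group values are illustrative and assume the Microsoft.Graph.Groups module.

```powershell
# Hypothetical sketch of the splatting pattern shown above: each hashtable key
# binds to the matching New-MgGroup parameter. Values are illustrative.
Connect-MgGraph -Scopes "Group.ReadWrite.All"

$param = @{
    DisplayName     = "Demo group"
    MailEnabled     = $false
    MailNickname    = "Demo"
    SecurityEnabled = $true
}

New-MgGroup @param
```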
active-directory | Users Close Account | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/enterprise-users/users-close-account.md | Users in an unmanaged organization are often created during self-service sign-up Before you can close your account, you should confirm the following items: -* Make sure you are a user of an unmanaged Microsoft Entra organization. You can't close your account if you belong to a managed organization. If you belong to a managed organization and want to close your account, you must contact your administrator. For information about how to determine whether you belong to an unmanaged organization, see [Delete the user from Unmanaged Tenant](/power-automate/gdpr-dsr-delete#delete-the-user-from-unmanaged-tenant). +* Make sure you are a user of an unmanaged Microsoft Entra organization. You can't close your account if you belong to a managed organization. If you belong to a managed organization and want to close your account, you must contact your administrator. For information about how to determine whether you belong to an unmanaged organization, see [Delete the user from Unmanaged Tenant](/power-automate/privacy-dsr-delete#delete-the-user-from-unmanaged-tenant). -* Save any data you want to keep. For information about how to submit an export request, see [Accessing and exporting system-generated logs for Unmanaged Tenants](/power-platform/admin/powerapps-gdpr-dsr-guide-systemlogs#accessing-and-exporting-system-generated-logs-for-unmanaged-tenants). +* Save any data you want to keep. For information about how to submit an export request, see [Accessing and exporting system-generated logs for Unmanaged Tenants](/power-platform/admin/powerapps-privacy-dsr-guide-systemlogs#accessing-and-exporting-system-generated-logs-for-unmanaged-tenants). > [!WARNING] > Closing your account is irreversible. When you close your account, all personal data will be removed. You will no longer have access to your account and data associated with your account. |
active-directory | How To Customize Branding Customers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/customers/how-to-customize-branding-customers.md | You can also create user flows programmatically using the Company Branding Graph By default, Microsoft offers a neutral branding for your tenant that can be personalized to suit your company's specific requirements. This default branding doesn't include any pre-existing Microsoft branding. In the event that the custom company branding fails to load, the sign-in page will automatically switch back to this neutral branding. Additionally, each custom branding property can be manually added to the custom sign-in page. -You can customize this neutral branding with a custom background image or color, favicon, layout, header, and footer. You can also customize the sign-in form and add custom text to different instances or upload [custom CSS](/azure/active-directory/fundamentals/reference-company-branding-css-template). +You can customize this neutral branding with a custom background image or color, favicon, layout, header, and footer. You can also customize the sign-in form and add custom text to different instances or upload [custom CSS](../../fundamentals/reference-company-branding-css-template.md). The following image displays the neutral default branding of the tenant. You can find the numbered branding elements and their corresponding descriptions after the image. :::image type="content" source="media/how-to-customize-branding-customers/ciam-neutral-branding.png" alt-text="Screenshot of the CIAM neutral branding." lightbox="media/how-to-customize-branding-customers/ciam-neutral-branding.png"::: The following image displays the neutral default branding of the tenant. You can ## How to customize the default sign-in experience -Before you customize any settings, the neutral default branding will appear in your sign-in and sign-up pages. You can customize this default experience with a custom background image or color, favicon, layout, header, and footer. You can also upload a [custom CSS](/azure/active-directory/fundamentals/reference-company-branding-css-template). +Before you customize any settings, the neutral default branding will appear in your sign-in and sign-up pages. You can customize this default experience with a custom background image or color, favicon, layout, header, and footer. You can also upload a [custom CSS](../../fundamentals/reference-company-branding-css-template.md). -1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least a [Global Administrator](/azure/active-directory/roles/permissions-reference#global-administrator). +1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com) as at least a [Global Administrator](../../roles/permissions-reference.md#global-administrator). 1. If you have access to multiple tenants, use the **Directories + subscriptions** filter :::image type="icon" source="media/common/portal-directory-subscription-filter.png" border="false"::: in the top menu to switch to the customer tenant you created earlier. 1. Browse to **Company Branding** > **Default sign-in** > **Edit**. Your tenant name replaces the Microsoft banner logo in the neutral default sign- :::image type="content" source="media/how-to-customize-branding-customers/tenant-name.png" alt-text="Screenshot of the tenant name." lightbox="media/how-to-customize-branding-customers/tenant-name.png"::: -1. 
Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com/) as at least a [Global Administrator](/azure/active-directory/roles/permissions-reference#global-administrator). +1. Sign in to the [Microsoft Entra admin center](https://entra.microsoft.com/) as at least a [Global Administrator](../../roles/permissions-reference.md#global-administrator). 1. If you have access to multiple tenants, use the **Directories + subscriptions** filter :::image type="icon" source="media/common/portal-directory-subscription-filter.png" border="false"::: in the top menu to switch to the customer tenant you created earlier. 1. In the search bar, type and select **Properties**. 1. Edit the **Name** field. You can use the Microsoft Graph API to customize a few items programmatically. F ## Next steps In this article, we learned how to customize the look and feel of the customer sign-in and sign-up experience. To learn more about customizing the language of the tenant, see the [Language customization](how-to-customize-languages-customers.md) article.-For an understanding of the differences in workforce tenant branding, see the article [How to customize branding for your workforce](/azure/active-directory/fundamentals/how-to-customize-branding). +For an understanding of the differences in workforce tenant branding, see the article [How to customize branding for your workforce](../../fundamentals/how-to-customize-branding.md). |
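The entry above notes that a few branding items can also be set programmatically through Microsoft Graph. As a loosely held sketch (not taken from the source article), the default company branding object can be patched with string properties from the `organizationalBranding` resource; the tenant ID and text values below are placeholders, and the call assumes the default branding object already exists.

```powershell
# Hedged sketch: update two string properties of the default company branding.
# Assumes Organization.ReadWrite.All; the organization (tenant) ID and values
# are placeholders, and property names come from organizationalBranding.
Connect-MgGraph -Scopes "Organization.ReadWrite.All"

$orgId = "00000000-0000-0000-0000-000000000000"
$body = @{
    signInPageText   = "Welcome to Contoso"
    usernameHintText = "someone@contoso.com"
} | ConvertTo-Json

Invoke-MgGraphRequest -Method PATCH `
    -Uri "https://graph.microsoft.com/v1.0/organization/$orgId/branding" `
    -Body $body -ContentType "application/json"
```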
active-directory | How To Web App Node Sign In Call Api Sign In Acquire Access Token | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/customers/how-to-web-app-node-sign-in-call-api-sign-in-acquire-access-token.md | The `/signin`, `/signout` and `/redirect` routes are defined in the *routes/auth - Initiates sign-in flow by triggering the first leg of auth code flow. - - Initializes a [confidential client application](../../../active-directory/develop/msal-client-applications.md) instance by using `msalConfig` MSAL configuration object. + - Initializes a [confidential client application](../../develop/msal-client-applications.md) instance by using `msalConfig` MSAL configuration object. ```javascript const msalInstance = this.getMsalInstance(this.config.msalConfig); |
active-directory | How To Web App Node Use Certificate | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/customers/how-to-web-app-node-use-certificate.md | -Microsoft Entra ID for customers supports two types of authentication for [confidential client applications](../../../active-directory/develop/msal-client-applications.md); password-based authentication (such as client secret) and certificate-based authentication. For a higher level of security, we recommend using a certificate (instead of a client secret) as a credential in your confidential client applications. +Microsoft Entra ID for customers supports two types of authentication for [confidential client applications](../../develop/msal-client-applications.md); password-based authentication (such as client secret) and certificate-based authentication. For a higher level of security, we recommend using a certificate (instead of a client secret) as a credential in your confidential client applications. In production, you should purchase a certificate signed by a well-known certificate authority, and use [Azure Key Vault](https://azure.microsoft.com/products/key-vault/) to manage certificate access and lifetime for you. However, for testing purposes, you can create a self-signed certificate and configure your apps to authenticate with it. |
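For the testing scenario mentioned above, the following is a hedged sketch of creating a self-signed certificate with the Windows PowerShell PKI module; the subject name, key settings, validity period, and file path are illustrative assumptions.

```powershell
# Hedged sketch (Windows PowerShell PKI module): create a self-signed certificate
# for local testing only. Subject, key settings, and paths are illustrative.
$cert = New-SelfSignedCertificate -Subject "CN=ciam-test-web-app" `
    -CertStoreLocation "Cert:\CurrentUser\My" `
    -KeyExportPolicy Exportable `
    -KeySpec Signature `
    -KeyLength 2048 `
    -NotAfter (Get-Date).AddMonths(12)

# Export the public certificate (.cer) to upload to the app registration.
Export-Certificate -Cert $cert -FilePath ".\ciam-test-web-app.cer"
```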
active-directory | Tutorial Web App Node Sign In Sign Out | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/customers/tutorial-web-app-node-sign-in-sign-out.md | The `/` route is the entry point to the application. It renders the *views/index - It initiates sign-in flow by triggering the first leg of auth code flow. - - It initializes a [confidential client application](../../../active-directory/develop/msal-client-applications.md) instance by using MSAL configuration object, `msalConfig`, that you created earlier. + - It initializes a [confidential client application](../../develop/msal-client-applications.md) instance by using MSAL configuration object, `msalConfig`, that you created earlier. ```javascript const msalInstance = this.getMsalInstance(this.config.msalConfig); |
active-directory | Whats New Docs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/customers/whats-new-docs.md | Welcome to what's new in Microsoft Entra ID for customers documentation. This ar ## September 2023 -This month, we renamed Azure Active Directory (Azure AD) to Microsoft Entra ID. For more information about the rebranding, see the [New name for Microsoft Entra ID](/azure/active-directory/fundamentals/new-name) article. +This month, we renamed Azure Active Directory (Azure AD) to Microsoft Entra ID. For more information about the rebranding, see the [New name for Microsoft Entra ID](../../fundamentals/new-name.md) article. ### Updated articles |
active-directory | External Identities Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/external-identities-overview.md | A multi-tenant organization is an organization that has more than one instance o - [What is Microsoft Entra B2B collaboration?](what-is-b2b.md) - [What is Microsoft Entra B2B direct connect?](b2b-direct-connect-overview.md) - [About Azure AD B2C](/azure/active-directory-b2c/overview)-- [About Microsoft Entra multi-tenant organizations](../../active-directory/multi-tenant-organizations/overview.md)+- [About Microsoft Entra multi-tenant organizations](../multi-tenant-organizations/overview.md) |
active-directory | Hybrid On Premises To Cloud | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/hybrid-on-premises-to-cloud.md | -If you create accounts for your external partners in your on-premises directory (for example, you create an account with a sign-in name of "msullivan" for an external user named Maria Sullivan in your partners.contoso.com domain), you can now sync these accounts to the cloud. Specifically, you can use [Microsoft Entra Connect](/azure/active-directory/hybrid/connect/whatis-azure-ad-connect) to sync the partner accounts to the cloud, which creates a user account with UserType = Guest. This enables your partner users to access cloud resources using the same credentials as their local accounts, without giving them more access than they need. For more information about converting local guest accounts, see [Convert local guest accounts to Microsoft Entra B2B guest accounts](/azure/active-directory/architecture/10-secure-local-guest). +If you create accounts for your external partners in your on-premises directory (for example, you create an account with a sign-in name of "msullivan" for an external user named Maria Sullivan in your partners.contoso.com domain), you can now sync these accounts to the cloud. Specifically, you can use [Microsoft Entra Connect](../hybrid/connect/whatis-azure-ad-connect.md) to sync the partner accounts to the cloud, which creates a user account with UserType = Guest. This enables your partner users to access cloud resources using the same credentials as their local accounts, without giving them more access than they need. For more information about converting local guest accounts, see [Convert local guest accounts to Microsoft Entra B2B guest accounts](../architecture/10-secure-local-guest.md). > [!NOTE] > See also how to [invite internal users to B2B collaboration](invite-internal-users.md). With this feature, you can invite internal guest users to use B2B collaboration, regardless of whether you've synced their accounts from your on-premises directory to the cloud. Once the user accepts the invitation to use B2B collaboration, they'll be able to use their own identities and credentials to sign in to the resources you want them to access. You won't need to maintain passwords or manage account lifecycles. |
active-directory | Tenant Restrictions V2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/tenant-restrictions-v2.md | There are three options for enforcing tenant restrictions v2 for clients: ### Option 1: Universal tenant restrictions v2 as part of Microsoft Entra Global Secure Access (preview) -Universal tenant restrictions v2 as part of [Microsoft Entra Global Secure Access](/azure/global-secure-access/overview-what-is-global-secure-access) is recommended because it provides authentication and data plane protection for all devices and platforms. This option provides more protection against sophisticated attempts to bypass authentication. For example, attackers might try to allow anonymous access to a malicious tenant's apps, such as anonymous meeting join in Teams. Or, attackers might attempt to import to your organizational device an access token lifted from a device in the malicious tenant. Universal tenant restrictions v2 prevents these attacks by sending tenant restrictions v2 signals on the authentication plane (Microsoft Entra ID and Microsoft Account) and data plane (Microsoft cloud applications). +Universal tenant restrictions v2 as part of [Microsoft Entra Global Secure Access](/entra/global-secure-access/overview-what-is-global-secure-access) is recommended because it provides authentication and data plane protection for all devices and platforms. This option provides more protection against sophisticated attempts to bypass authentication. For example, attackers might try to allow anonymous access to a malicious tenant's apps, such as anonymous meeting join in Teams. Or, attackers might attempt to import to your organizational device an access token lifted from a device in the malicious tenant. Universal tenant restrictions v2 prevents these attacks by sending tenant restrictions v2 signals on the authentication plane (Microsoft Entra ID and Microsoft Account) and data plane (Microsoft cloud applications). ### Option 2: Set up tenant restrictions v2 on your corporate proxy |
active-directory | Tutorial Bulk Invite | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/tutorial-bulk-invite.md | Check to see that the guest users you added exist in the directory either in the ### View guest users with PowerShell -To view guest users with PowerShell, you'll need the [`Microsoft.Graph.Users` PowerShell module](/powershell/module/microsoft.graph.users/?view=graph-powershell-beta&preserve-view=true). Then sign in using the `Connect-MgGraph` command with an admin account to consent to the required scopes: +To view guest users with PowerShell, you'll need the [`Microsoft.Graph.Users` PowerShell module](/powershell/module/microsoft.graph.users/?view=graph-powershell-1.0&viewFallbackFrom=graph-powershell-beta&preserve-view=true). Then sign in using the `Connect-MgGraph` command with an admin account to consent to the required scopes: ```powershell Connect-MgGraph -Scopes "User.Read.All" ``` |
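Continuing the PowerShell steps quoted above, here is a hedged follow-on sketch that lists guest users after connecting; the advanced-query parameters are included because filtering on `userType` typically requires them.

```powershell
# Hedged sketch: list guest users after Connect-MgGraph. Filtering on userType
# is an advanced query, so ConsistencyLevel/CountVariable are supplied.
Get-MgUser -Filter "userType eq 'Guest'" -All `
    -ConsistencyLevel eventual -CountVariable guestCount |
    Select-Object DisplayName, Mail, UserType
```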
active-directory | What Is B2b | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/what-is-b2b.md | B2B collaboration is enabled by default, but comprehensive admin settings let yo - Use [external collaboration settings](external-collaboration-settings-configure.md) to define who can invite external users, allow or block B2B specific domains, and set restrictions on guest user access to your directory. -- Use [Microsoft cloud settings](cross-cloud-settings.md) to establish mutual B2B collaboration between the Microsoft Azure global cloud and [Microsoft Azure Government](/azure/azure-government/) or [Microsoft Azure operated by 21Vianet](/azure/china).+- Use [Microsoft cloud settings](cross-cloud-settings.md) to establish mutual B2B collaboration between the Microsoft Azure global cloud and [Microsoft Azure Government](/azure/azure-government/) or [Microsoft Azure operated by 21Vianet](/azure/china/). ## Easily invite guest users from the Azure portal |
active-directory | Whats New Docs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/external-identities/whats-new-docs.md | Welcome to what's new in Azure Active Directory External Identities documentatio ## September 2023 -This month, we renamed Azure Active Directory (Azure AD) to Microsoft Entra ID. For more information about the rebranding, see the [New name for Azure Active Directory](/azure/active-directory/fundamentals/new-name) article. +This month, we renamed Azure Active Directory (Azure AD) to Microsoft Entra ID. For more information about the rebranding, see the [New name for Azure Active Directory](../fundamentals/new-name.md) article. ### Updated articles |
active-directory | Compare | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/compare.md | Most IT administrators are familiar with Active Directory Domain Services concep |:-|:-|:-| |**Users**||| |Provisioning: users | Organizations create internal users manually or use an in-house or automated provisioning system, such as the Microsoft Identity Manager, to integrate with an HR system.|Existing AD organizations use [Microsoft Entra Connect](../hybrid/connect/how-to-connect-sync-whatis.md) to sync identities to the cloud.</br> Microsoft Entra ID adds support to automatically create users from [cloud HR systems](../app-provisioning/what-is-hr-driven-provisioning.md). </br>Microsoft Entra ID can provision identities in [SCIM enabled](../app-provisioning/use-scim-to-provision-users-and-groups.md) SaaS apps to automatically provide apps with the necessary details to allow access for users. |-|Provisioning: external identities| Organizations create external users manually as regular users in a dedicated external AD forest, resulting in administration overhead to manage the lifecycle of external identities (guest users)| Microsoft Entra ID provides a special class of identity to support external identities. [Microsoft Entra B2B](/azure/active-directory/b2b/) will manage the link to the external user identity to make sure they are valid. | +|Provisioning: external identities| Organizations create external users manually as regular users in a dedicated external AD forest, resulting in administration overhead to manage the lifecycle of external identities (guest users)| Microsoft Entra ID provides a special class of identity to support external identities. [Microsoft Entra B2B](../external-identities/index.yml) will manage the link to the external user identity to make sure they are valid. | | Entitlement management and groups| Administrators make users members of groups. App and resource owners then give groups access to apps or resources.| [Groups](./how-to-manage-groups.md) are also available in Microsoft Entra ID and administrators can also use groups to grant permissions to resources. In Microsoft Entra ID, administrators can assign membership to groups manually or use a query to dynamically include users to a group. </br> Administrators can use [Entitlement management](../governance/entitlement-management-overview.md) in Microsoft Entra ID to give users access to a collection of apps and resources using workflows and, if necessary, time-based criteria. | | Admin management|Organizations will use a combination of domains, organizational units, and groups in AD to delegate administrative rights to manage the directory and resources it controls.| Microsoft Entra ID provides [built-in roles](./how-subscriptions-associated-directory.md) with its Microsoft Entra role-based access control (Microsoft Entra RBAC) system, with limited support for [creating custom roles](../roles/custom-overview.md) to delegate privileged access to the identity system, the apps, and resources it controls.</br>Managing roles can be enhanced with [Privileged Identity Management (PIM)](../privileged-identity-management/pim-configure.md) to provide just-in-time, time-restricted, or workflow-based access to privileged roles. | | Credential management| Credentials in Active Directory are based on passwords, certificate authentication, and smartcard authentication. 
Passwords are managed using password policies that are based on password length, expiry, and complexity.|Microsoft Entra ID uses intelligent [password protection](../authentication/concept-password-ban-bad.md) for cloud and on-premises. Protection includes smart lockout plus blocking common and custom password phrases and substitutions. </br>Microsoft Entra ID significantly boosts security [through Multi-factor authentication](../authentication/concept-mfa-howitworks.md) and [passwordless](../authentication/concept-authentication-passwordless.md) technologies, like FIDO2. </br>Microsoft Entra ID reduces support costs by providing users a [self-service password reset](../authentication/concept-sspr-howitworks.md) system. | Most IT administrators are familiar with Active Directory Domain Services concep | Mid-tier/Daemon services|Services running in on-premises environments normally use AD service accounts or group Managed Service Accounts (gMSA) to run. These apps will then inherit the permissions of the service account.| Microsoft Entra ID provides [managed identities](../managed-identities-azure-resources/index.yml) to run other workloads in the cloud. The lifecycle of these identities is managed by Microsoft Entra ID and is tied to the resource provider and it can't be used for other purposes to gain backdoor access.| | **Devices**||| | Mobile|Active Directory doesn't natively support mobile devices without third-party solutions.| MicrosoftΓÇÖs mobile device management solution, Microsoft Intune, is integrated with Microsoft Entra ID. Microsoft Intune provides device state information to the identity system to evaluate during authentication. |-| Windows desktops|Active Directory provides the ability to domain join Windows devices to manage them using Group Policy, System Center Configuration Manager, or other third-party solutions.|Windows devices can be [joined to Microsoft Entra ID](../devices/index.yml). Conditional Access can check if a device is Microsoft Entra joined as part of the authentication process. Windows devices can also be managed with [Microsoft Intune](/intune/what-is-intune). In this case, Conditional Access, will consider whether a device is compliant (for example, up-to-date security patches and virus signatures) before allowing access to the apps.| +| Windows desktops|Active Directory provides the ability to domain join Windows devices to manage them using Group Policy, System Center Configuration Manager, or other third-party solutions.|Windows devices can be [joined to Microsoft Entra ID](../devices/index.yml). Conditional Access can check if a device is Microsoft Entra joined as part of the authentication process. Windows devices can also be managed with [Microsoft Intune](/mem/intune/fundamentals/what-is-intune). In this case, Conditional Access, will consider whether a device is compliant (for example, up-to-date security patches and virus signatures) before allowing access to the apps.| | Windows servers| Active Directory provides strong management capabilities for on-premises Windows servers using Group Policy or other management solutions.| Windows servers virtual machines in Azure can be managed with [Microsoft Entra Domain Services](../../active-directory-domain-services/index.yml). 
[Managed identities](../managed-identities-azure-resources/index.yml) can be used when VMs need access to the identity system directory or resources.| | Linux/Unix workloads|Active Directory doesn't natively support non-Windows without third-party solutions, although Linux machines can be configured to authenticate with Active Directory as a Kerberos realm.|Linux/Unix VMs can use [managed identities](../managed-identities-azure-resources/index.yml) to access the identity system or resources. Some organizations, migrate these workloads to cloud container technologies, which can also use managed identities.| ## Next steps - [What is Microsoft Entra ID?](./whatis.md)-- [Compare self-managed Active Directory Domain Services, Microsoft Entra ID, and managed Microsoft Entra Domain Services](../../active-directory-domain-services/compare-identity-solutions.md)+- [Compare self-managed Active Directory Domain Services, Microsoft Entra ID, and managed Microsoft Entra Domain Services](/entra/identity/domain-services/compare-identity-solutions) - [Frequently asked questions about Microsoft Entra ID](./active-directory-faq.yml) - [What's new in Microsoft Entra ID?](./whats-new.md) |
active-directory | Concept Secure Remote Workers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/concept-secure-remote-workers.md | The following table is intended to highlight the key actions for the following l ## Next steps - For detailed deployment guidance for individual features of Microsoft Entra ID, review the [Microsoft Entra ID project deployment plans](../architecture/deployment-plans.md).-- Organizations can use [identity secure score](identity-secure-score.md) to track their progress against other Microsoft recommendations.+- Organizations can use [identity secure score](../reports-monitoring/concept-identity-secure-score.md) to track their progress against other Microsoft recommendations. |
active-directory | How To Manage Groups | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/how-to-manage-groups.md | To create a basic group and add members: ### Turn off group welcome email -A welcome notification is sent to all users when they're added to a new Microsoft 365 group, regardless of the membership type. When an attribute of a user or device changes, all dynamic group rules in the organization are processed for potential membership changes. Users who are added then also receive the welcome notification. You can turn off this behavior in [Exchange PowerShell](/powershell/module/exchange/users-and-groups/Set-UnifiedGroup). +A welcome notification is sent to all users when they're added to a new Microsoft 365 group, regardless of the membership type. When an attribute of a user or device changes, all dynamic group rules in the organization are processed for potential membership changes. Users who are added then also receive the welcome notification. You can turn off this behavior in [Exchange PowerShell](/powershell/module/exchange/set-unifiedgroup). ## Add or remove members and owners |
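For the Exchange PowerShell option referenced above, a minimal sketch (assuming the ExchangeOnlineManagement module and an illustrative group name):

```powershell
# Minimal sketch (ExchangeOnlineManagement module): turn off the welcome message
# for an existing Microsoft 365 group. The group identity is illustrative.
Connect-ExchangeOnline
Set-UnifiedGroup -Identity "Demo group" -UnifiedGroupWelcomeMessageEnabled:$false
```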
active-directory | How To View Support Access Request Logs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/how-to-view-support-access-request-logs.md | There are three activities that can be associated with an automated or system-in ## Next steps - [Manage Microsoft Support access requests](how-to-manage-support-access-requests.md)-- [Learn about audit logs](../../active-directory/reports-monitoring/concept-audit-logs.md)+- [Learn about audit logs](../reports-monitoring/concept-audit-logs.md) |
active-directory | Introduction Identity Access Management | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/introduction-identity-access-management.md | These are the most well-known and commonly used authentication and authorization OAuth is an open-standards identity management protocol that provides secure access for websites, mobile apps, and Internet of Things and other devices. It uses tokens that are encrypted in transit and eliminates the need to share credentials. OAuth 2.0, the latest release of OAuth, is a popular framework used by major social media platforms and consumer services, from Facebook and LinkedIn to Google, PayPal, and Netflix. To learn more, read about [OAuth 2.0 protocol](/azure/active-directory/develop/active-directory-v2-protocols). #### OpenID Connect (OIDC) -With the release of the OpenID Connect (which uses public-key encryption), OpenID became a widely adopted authentication layer for OAuth. Like SAML, OpenID Connect (OIDC) is widely used for single sign-on (SSO), but OIDC uses REST/JSON instead of XML. OIDC was designed to work with both native and mobile apps by using REST/JSON protocols. The primary use case for SAML, however, is web-based apps. To learn more, read about [OpenID Connect protocol](/azure/active-directory/develop/active-directory-v2-protocols). +With the release of the OpenID Connect (which uses public-key encryption), OpenID became a widely adopted authentication layer for OAuth. Like SAML, OpenID Connect (OIDC) is widely used for single sign-on (SSO), but OIDC uses REST/JSON instead of XML. OIDC was designed to work with both native and mobile apps by using REST/JSON protocols. The primary use case for SAML, however, is web-based apps. To learn more, read about [OpenID Connect protocol](../develop/v2-protocols.md). #### JSON web tokens (JWTs) |
active-directory | New Name | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/new-name.md | The following table lists terminology that is not impacted by the Azure AD renam | Active Directory <br/><br/>• Windows Server Active Directory <br/>• Active Directory Federation Services (AD FS) <br/>• Active Directory Domain Services (AD DS) <br/>• Active Directory <br/>• Any Active Directory feature(s) | Windows Server Active Directory, commonly known as Active Directory, and related features and services associated with Active Directory aren't branded with Microsoft Entra. | | Authentication library <br/><br/>• Azure AD Authentication Library (ADAL) <br/>• Microsoft Authentication Library (MSAL) | Azure Active Directory Authentication Library (ADAL) is deprecated. While existing apps that use ADAL continue to work, Microsoft will no longer release security fixes on ADAL. Migrate applications to the Microsoft Authentication Library (MSAL) to avoid putting your app's security at risk. <br/><br/>[Microsoft Authentication Library (MSAL)](../develop/msal-overview.md) - Provides security tokens from the Microsoft identity platform to authenticate users and access secured web APIs to provide secure access to Microsoft Graph, other Microsoft APIs, third-party web APIs, or your own web API. | | B2C <br/><br/>• Azure Active Directory B2C <br/>• Azure AD B2C | [Azure Active Directory B2C](/azure/active-directory-b2c) isn't being renamed. We're continuing to invest in security, availability, and reliability in Azure AD B2C and our next-generation solution for external identities, [Microsoft Entra External ID](../external-identities/index.yml). |-| Graph <br/><br/>• Azure Active Directory Graph <br/>• Azure AD Graph <br/>• Microsoft Graph | Azure Active Directory (Azure AD) Graph is deprecated. Going forward, further investment in Azure AD Graph won't be made, and Azure AD Graph APIs have no SLA or maintenance commitment beyond security-related fixes. Investments in new features and functionalities will only be made in Microsoft Graph.<br/><br/>[Microsoft Graph](/graph) - Grants programmatic access to organization, user, and application data stored in Microsoft Entra ID. | +| Graph <br/><br/>• Azure Active Directory Graph <br/>• Azure AD Graph <br/>• Microsoft Graph | Azure Active Directory (Azure AD) Graph is deprecated. Going forward, further investment in Azure AD Graph won't be made, and Azure AD Graph APIs have no SLA or maintenance commitment beyond security-related fixes. Investments in new features and functionalities will only be made in Microsoft Graph.<br/><br/>[Microsoft Graph](/graph/) - Grants programmatic access to organization, user, and application data stored in Microsoft Entra ID. | | PowerShell <br/><br/>• Azure Active Directory PowerShell <br/>• Azure AD PowerShell <br/>• Microsoft Graph PowerShell | Azure AD PowerShell for Graph is planned for deprecation on March 30, 2024. For more info on the deprecation plans, see the deprecation update. We encourage you to migrate to Microsoft Graph PowerShell, which is the recommended module for interacting with Azure AD. <br/><br/>[Microsoft Graph PowerShell](/powershell/microsoftgraph/overview) - Acts as an API wrapper for the Microsoft Graph APIs and helps administer every Microsoft Entra ID feature that has an API in Microsoft Graph. 
| | Accounts <br/><br/>• Microsoft account <br/>• Work or school account | For end user sign-ins and account experiences, follow guidance for work and school accounts in [Sign in with Microsoft branding guidelines](../develop/howto-add-branding-in-apps.md). | | Microsoft identity platform | The Microsoft identity platform encompasses all our identity and access developer assets. It continues to provide the resources to help you build applications that your users and customers can sign in to using their Microsoft identities or social accounts. | Only official product names are capitalized, plus Conditional Access and My * ap - [How to: Rename Azure AD](how-to-rename-azure-ad.md) - [Stay up-to-date with what's new in Microsoft Entra ID (formerly Azure AD)](./whats-new.md) - [Get started using Microsoft Entra ID at the Microsoft Entra admin center](https://entra.microsoft.com/)-- [Learn more about the Microsoft Entra family with content from Microsoft Learn](/entra)+- [Learn more about the Microsoft Entra family with content from Microsoft Learn](/entra/) <!-- docutune:ignore "Azure Active Directory" "Azure AD" "AAD" "Entra ID" "Cloud Knox" "Identity Governance" --> |
active-directory | What Is Deprecated | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/what-is-deprecated.md | Use the following table to learn about changes including deprecations, retiremen |||:| |[Azure AD Authentication Library (ADAL)](https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-march-2023-train/ba-p/2967448)|Retirement|Jun 30, 2023| |[My Apps improvements](https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-march-2023-train/ba-p/2967448)|Feature change|Jun 30, 2023|-|[Microsoft Authenticator Lite for Outlook mobile](../../active-directory/authentication/how-to-mfa-authenticator-lite.md)|Feature change|Jun 9, 2023| +|[Microsoft Authenticator Lite for Outlook mobile](../authentication/how-to-mfa-authenticator-lite.md)|Feature change|Jun 9, 2023| |[My Groups experience](https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-march-2023-train/ba-p/2967448)|Feature change|May 2023| |[My Apps browser extension](https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-march-2023-train/ba-p/2967448)|Feature change|May 2023| |Microsoft Authenticator app [Number matching](../authentication/how-to-mfa-number-match.md)|Feature change|May 8, 2023| Use the definitions in this section help clarify the state, availability, and su * **End-of-life** - engineering investments have ended, and the feature is unavailable to any customer ## Next steps-[What's new in Microsoft Entra ID?](../../active-directory/fundamentals/whats-new.md) +[What's new in Microsoft Entra ID?](../fundamentals/whats-new.md) ## Resources * [Microsoft Entra change announcement blog](https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/microsoft-entra-change-announcements-november-2022-train/ba-p/2967452) |
active-directory | Whats New Archive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/whats-new-archive.md | For more information, see: [Block users from viewing their BitLocker keys (previ **Service category:** Identity Protection **Product capability:** Identity Security & Protection -Identity Protection risk detections (alerts) are now also available in Microsoft 365 Defender to provide a unified investigation experience for security professionals. For more information, see: [Investigate alerts in Microsoft 365 Defender](/microsoft-365/security/defender/investigate-alerts?view=o365-worldwide#alert-sources&preserve-view=true) +Identity Protection risk detections (alerts) are now also available in Microsoft 365 Defender to provide a unified investigation experience for security professionals. For more information, see: [Investigate alerts in Microsoft 365 Defender](/microsoft-365/security/defender/investigate-alerts?view=o365-worldwide&preserve-view=true#alert-sources) To learn more about trusts and how to deploy your own, visit [How trust relation In July 2022 we've added the following 28 new applications in our App gallery with Federation support: -[Lunni Ticket Service](https://ticket.lunni.io/login), [Spring Health](https://benefits.springhealth.com/care), [Sorbet](https://lite.sorbetapp.com/login), [Planview ID](../saas-apps/planview-id-tutorial.md), [Karbonalpha](https://saas.karbonalpha.com/settings/api), [Headspace](../saas-apps/headspace-tutorial.md), [SeekOut](../saas-apps/seekout-tutorial.md), [Stackby](../saas-apps/stackby-tutorial.md), [Infrascale Cloud Backup](../saas-apps/infrascale-cloud-backup-tutorial.md), [Keystone](../saas-apps/keystone-tutorial.md), [LMS・教育管理システム Leaf](../saas-apps/lms-and-education-management-system-leaf-tutorial.md), [ZDiscovery](../saas-apps/zdiscovery-tutorial.md), [ラインズeライブラリアドバンス (Lines eLibrary Advance)](../saas-apps/lines-elibrary-advance-tutorial.md), [Rootly](../saas-apps/rootly-tutorial.md), [Articulate 360](../saas-apps/articulate360-tutorial.md), [Rise.com](../saas-apps/risecom-tutorial.md), [SevOne Network Monitoring System (NMS)](../saas-apps/sevone-network-monitoring-system-tutorial.md), [PGM](https://ups-pgm.4gfactor.com/azure/), [TouchRight Software](https://app.touchrightsoftware.com/), [Tendium](../saas-apps/tendium-tutorial.md), [Training Platform](../saas-apps/training-platform-tutorial.md), [Znapio](https://app.znapio.com/), [Preset](../saas-apps/preset-tutorial.md), [itslearning MS Teams sync](https://itslearning.com/global/), [Veza](../saas-apps/veza-tutorial.md), +[Lunni Ticket Service](https://ticket.lunni.io/login), [Spring Health](https://benefits.springhealth.com/care), [Sorbet](https://lite.sorbetapp.com/login), [Planview ID](../saas-apps/planview-admin-tutorial.md), [Karbonalpha](https://saas.karbonalpha.com/settings/api), [Headspace](../saas-apps/headspace-tutorial.md), [SeekOut](../saas-apps/seekout-tutorial.md), [Stackby](../saas-apps/stackby-tutorial.md), [Infrascale Cloud Backup](../saas-apps/infrascale-cloud-backup-tutorial.md), [Keystone](../saas-apps/keystone-tutorial.md), [LMS・教育管理システム Leaf](../saas-apps/lms-and-education-management-system-leaf-tutorial.md), [ZDiscovery](../saas-apps/zdiscovery-tutorial.md), [ラインズeライブラリアドバンス (Lines eLibrary Advance)](../saas-apps/lines-elibrary-advance-tutorial.md), [Rootly](../saas-apps/rootly-tutorial.md), [Articulate 360](../saas-apps/articulate360-tutorial.md), [Rise.com](../saas-apps/risecom-tutorial.md), [SevOne Network 
Monitoring System (NMS)](../saas-apps/sevone-network-monitoring-system-tutorial.md), [PGM](https://ups-pgm.4gfactor.com/azure/), [TouchRight Software](https://app.touchrightsoftware.com/), [Tendium](../saas-apps/tendium-tutorial.md), [Training Platform](../saas-apps/training-platform-tutorial.md), [Znapio](https://app.znapio.com/), [Preset](../saas-apps/preset-tutorial.md), [itslearning MS Teams sync](https://itslearning.com/global/), [Veza](../saas-apps/veza-tutorial.md), You can also find the documentation of all the applications from here https://aka.ms/AppsTutorial, Pick a group of up to five members and provision them into your third-party appl **Product capability:** Identity Security & Protection -We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta#federatedidpmfabehavior-values&preserve-view=true). +We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta&preserve-view=true#federatedidpmfabehavior-values). We highly recommend enabling this new protection when using Azure AD Multi-Factor Authentication as your multi factor authentication for your federated users. To learn more about the protection and how to enable it, visit [Enable protection to prevent by-passing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD](/windows-server/identity/ad-fs/deployment/best-practices-securing-ad-fs#enable-protection-to-prevent-by-passing-of-cloud-azure-ad-multi-factor-authentication-when-federated-with-azure-ad). For listing your application in the Azure AD app gallery, see the details here h -We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0#federatedidpmfabehavior-values&preserve-view=true). +We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. 
When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-1.0&preserve-view=true#federatedidpmfabehavior-values). We highly recommend enabling this new protection when using Azure AD Multi-Factor Authentication as your multi factor authentication for your federated users. To learn more about the protection and how to enable it, visit [Enable protection to prevent by-passing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD](/windows-server/identity/ad-fs/deployment/best-practices-securing-ad-fs#enable-protection-to-prevent-by-passing-of-cloud-azure-ad-multi-factor-authentication-when-federated-with-azure-ad). |
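The `federatedIdpMfaBehavior` setting mentioned above can be set through Microsoft Graph. The following is a hedged sketch against the `internalDomainFederation` resource; the domain name is a placeholder and Domain.ReadWrite.All is an assumed permission requirement.

```powershell
# Hedged sketch: enable the protection by setting federatedIdpMfaBehavior on a
# federated domain's internalDomainFederation object. Domain name is a placeholder.
Connect-MgGraph -Scopes "Domain.ReadWrite.All"

$domainId = "contoso.com"
$config = Invoke-MgGraphRequest -Method GET `
    -Uri "https://graph.microsoft.com/v1.0/domains/$domainId/federationConfiguration"
$configId = $config.value[0].id

$body = @{ federatedIdpMfaBehavior = "rejectMfaByFederatedIdp" } | ConvertTo-Json
Invoke-MgGraphRequest -Method PATCH `
    -Uri "https://graph.microsoft.com/v1.0/domains/$domainId/federationConfiguration/$configId" `
    -Body $body -ContentType "application/json"
```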
active-directory | Whats New Sovereign Clouds Archive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/whats-new-sovereign-clouds-archive.md | With Continuous access evaluation (CAE), critical security events and policies a **Product capability:** Identity Security & Protection -We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta#federatedidpmfabehavior-values&preserve-view=true). +We're delighted to announce a new security protection that prevents bypassing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD. When enabled for a federated domain in your Azure AD tenant, it ensures that a compromised federated account can't bypass Azure AD Multi-Factor Authentication by imitating that a multi factor authentication has already been performed by the identity provider. The protection can be enabled via new security setting, [federatedIdpMfaBehavior](/graph/api/resources/internaldomainfederation?view=graph-rest-beta&preserve-view=true#federatedidpmfabehavior-values). We highly recommend enabling this new protection when using Azure AD Multi-Factor Authentication as your multi factor authentication for your federated users. To learn more about the protection and how to enable it, visit [Enable protection to prevent by-passing of cloud Azure AD Multi-Factor Authentication when federated with Azure AD](/windows-server/identity/ad-fs/deployment/best-practices-securing-ad-fs#enable-protection-to-prevent-by-passing-of-cloud-azure-ad-multi-factor-authentication-when-federated-with-azure-ad). |
active-directory | Whats New | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/fundamentals/whats-new.md | TRv2 allows organizations to enable safe and productive cross-company collaborat TRv2 uses the cross-tenant access policy, and offers both authentication and data plane protection. It enforces policies during user authentication, and on data plane access with Exchange Online, SharePoint Online, Teams, and MSGraph.  While the data plane support with Windows GPO and Global Secure Access is still in public preview, authentication plane support with proxy is now generally available. -Visit https://aka.ms/tenant-restrictions-enforcement for more information on tenant restriction V2 and Global Secure Access client-side tagging for TRv2 at [Universal tenant restrictions](/azure/global-secure-access/how-to-universal-tenant-restrictions). +Visit https://aka.ms/tenant-restrictions-enforcement for more information on tenant restriction V2 and Global Secure Access client-side tagging for TRv2 at [Universal tenant restrictions](/entra/global-secure-access/how-to-universal-tenant-restrictions). For more information, see: [Require an app protection policy on Windows devices In July 2023 we've added the following 10 new applications in our App gallery with Federation support: -[Gainsight SAML](../saas-apps/gainsight-saml-tutorial.md), [Dataddo](https://www.dataddo.com/), [Puzzel](https://www.puzzel.com/), [Worthix App](../saas-apps/worthix-app-tutorial.md), [iOps360 IdConnect](https://iops360.com/iops360-id-connect-azuread-single-sign-on/), [Airbase](../saas-apps/airbase-tutorial.md), [Couchbase Capella - SSO](../saas-apps/couchbase-capella-sso-tutorial.md), [SSO for Jama Connect®](../saas-apps/sso-for-jama-connect-tutorial.md), [mediment (メディメント)](https://mediment.jp/), [Netskope Cloud Exchange Administration Console](../saas-apps/netskope-cloud-exchange-administration-console-tutorial.md), [Uber](../saas-apps/uber-tutorial.md), [Plenda](https://app.plenda.nl/), [Deem Mobile](../saas-apps/deem-mobile-tutorial.md), [40SEAS](https://www.40seas.com/), [Vivantio](https://www.vivantio.com/), [AppTweak](https://www.apptweak.com/), [Vbrick Rev Cloud](../saas-apps/vbrick-rev-cloud-tutorial.md), [OptiTurn](../saas-apps/optiturn-tutorial.md), [Application Experience with Mist](https://www.mist.com/), [クラウド勤怠管理システムKING OF TIME](../saas-apps/cloud-attendance-management-system-king-of-time-tutorial.md), [Connect1](../saas-apps/connect1-tutorial.md), [DB Education Portal for Schools](../saas-apps/db-education-portal-for-schools-tutorial.md), [SURFconext](../saas-apps/surfconext-tutorial.md), [Chengliye Smart SMS Platform](../saas-apps/chengliye-smart-sms-platform-tutorial.md), [CivicEye SSO](../saas-apps/civic-eye-sso-tutorial.md), [Colloquial](../saas-apps/colloquial-tutorial.md), [BigPanda](../saas-apps/bigpanda-tutorial.md), [Foreman](https://foreman.mn/) +[Gainsight SAML](../saas-apps/gainsight-tutorial.md), [Dataddo](https://www.dataddo.com/), [Puzzel](https://www.puzzel.com/), [Worthix App](../saas-apps/worthix-app-tutorial.md), [iOps360 IdConnect](https://iops360.com/iops360-id-connect-azuread-single-sign-on/), [Airbase](../saas-apps/airbase-tutorial.md), [Couchbase Capella - SSO](../saas-apps/couchbase-capella-sso-tutorial.md), [SSO for Jama Connect®](../saas-apps/sso-for-jama-connect-tutorial.md), [mediment (メディメント)](https://mediment.jp/), [Netskope Cloud Exchange Administration Console](../saas-apps/netskope-cloud-exchange-administration-console-tutorial.md), 
[Uber](../saas-apps/uber-tutorial.md), [Plenda](https://app.plenda.nl/), [Deem Mobile](../saas-apps/deem-mobile-tutorial.md), [40SEAS](https://www.40seas.com/), [Vivantio](https://www.vivantio.com/), [AppTweak](https://www.apptweak.com/), [Vbrick Rev Cloud](../saas-apps/vbrick-rev-cloud-tutorial.md), [OptiTurn](../saas-apps/optiturn-tutorial.md), [Application Experience with Mist](https://www.mist.com/), [クラウド勤怠管理システムKING OF TIME](../saas-apps/cloud-attendance-management-system-king-of-time-tutorial.md), [Connect1](../saas-apps/connect1-tutorial.md), [DB Education Portal for Schools](../saas-apps/db-education-portal-for-schools-tutorial.md), [SURFconext](../saas-apps/surfconext-tutorial.md), [Chengliye Smart SMS Platform](../saas-apps/chengliye-smart-sms-platform-tutorial.md), [CivicEye SSO](../saas-apps/civic-eye-sso-tutorial.md), [Colloquial](../saas-apps/colloquial-tutorial.md), [BigPanda](../saas-apps/bigpanda-tutorial.md), [Foreman](https://foreman.mn/) You can also find the documentation of all the applications from here https://aka.ms/AppsTutorial. |
active-directory | Apps | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/apps.md | -| HR | [SuccessFactors - User Provisioning](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) | -| HR | [Workday - User Provisioning](../../active-directory/saas-apps/workday-inbound-cloud-only-tutorial.md)| -|[LDAP directory](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md)| OpenLDAP<br>Microsoft Active Directory Lightweight Directory Services<br>389 Directory Server<br>Apache Directory Server<br>IBM Tivoli DS<br>Isode Directory<br>NetIQ eDirectory<br>Novell eDirectory<br>Open DJ<br>Open DS<br>Oracle (previously Sun ONE) Directory Server Enterprise Edition<br>RadiantOne Virtual Directory Server (VDS) | -| [SQL database](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md)| Microsoft SQL Server and Azure SQL<br>IBM DB2 10.x<br>IBM DB2 9.x<br>Oracle 10g and 11g<br>Oracle 12c and 18c<br>MySQL 5.x| -| Cloud platform| [AWS IAM Identity Center](../../active-directory/saas-apps/aws-single-sign-on-provisioning-tutorial.md) | -| Cloud platform| [Google Cloud Platform - User Provisioning](../../active-directory/saas-apps/g-suite-provisioning-tutorial.md) | -| Business applications|[SAP Cloud Identity Platform - Provisioning](../../active-directory/saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) | -| CRM| [Salesforce - User Provisioning](../../active-directory/saas-apps/salesforce-provisioning-tutorial.md) | -| ITSM| [ServiceNow](../../active-directory/saas-apps/servicenow-provisioning-tutorial.md)| +| HR | [SuccessFactors - User Provisioning](../saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) | +| HR | [Workday - User Provisioning](../saas-apps/workday-inbound-cloud-only-tutorial.md)| +|[LDAP directory](../app-provisioning/on-premises-ldap-connector-configure.md)| OpenLDAP<br>Microsoft Active Directory Lightweight Directory Services<br>389 Directory Server<br>Apache Directory Server<br>IBM Tivoli DS<br>Isode Directory<br>NetIQ eDirectory<br>Novell eDirectory<br>Open DJ<br>Open DS<br>Oracle (previously Sun ONE) Directory Server Enterprise Edition<br>RadiantOne Virtual Directory Server (VDS) | +| [SQL database](../app-provisioning/tutorial-ecma-sql-connector.md)| Microsoft SQL Server and Azure SQL<br>IBM DB2 10.x<br>IBM DB2 9.x<br>Oracle 10g and 11g<br>Oracle 12c and 18c<br>MySQL 5.x| +| Cloud platform| [AWS IAM Identity Center](../saas-apps/aws-single-sign-on-provisioning-tutorial.md) | +| Cloud platform| [Google Cloud Platform - User Provisioning](../saas-apps/g-suite-provisioning-tutorial.md) | +| Business applications|[SAP Cloud Identity Platform - Provisioning](../saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) | +| CRM| [Salesforce - User Provisioning](../saas-apps/salesforce-provisioning-tutorial.md) | +| ITSM| [ServiceNow](../saas-apps/servicenow-provisioning-tutorial.md)| <a name='entra-identity-governance-integrations'></a> ## Microsoft Entra ID Governance integrations-The list below provides key integrations between Microsoft Entra ID Governance and various applications, including both provisioning and SSO integrations. For a full list of applications that Microsoft Entra ID integrates with specifically for SSO, see [here](../../active-directory/saas-apps/tutorial-list.md). 
+The list below provides key integrations between Microsoft Entra ID Governance and various applications, including both provisioning and SSO integrations. For a full list of applications that Microsoft Entra ID integrates with specifically for SSO, see [here](../saas-apps/tutorial-list.md). Microsoft Entra ID Governance can be integrated with many other applications, using standards such as OpenID Connect, SAML, SCIM, SQL and LDAP. If you're using a SaaS application which isn't listed, then [ask the SaaS vendor to onboard](../manage-apps/v2-howto-app-gallery-listing.md). For integration with other applications, see [integrating applications with Microsoft Entra ID](identity-governance-applications-integrate.md). | Application | Automated provisioning | Single Sign On (SSO)| | : | :-: | :-: |-| 389 directory server ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [4me](../../active-directory/saas-apps/4me-provisioning-tutorial.md) | ΓùÅ | ΓùÅ| -| [8x8](../../active-directory/saas-apps/8x8-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [15five](../../active-directory/saas-apps/15five-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Acunetix 360](../../active-directory/saas-apps/acunetix-360-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Adobe Identity Management](../../active-directory/saas-apps/adobe-identity-management-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Adobe Identity Management (OIDC)](../../active-directory/saas-apps/adobe-identity-management-provisioning-oidc-tutorial.md) | ΓùÅ | ΓùÅ | -| [Airbase](../../active-directory/saas-apps/airbase-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Aha!](../../active-directory/saas-apps/aha-tutorial.md) | | ΓùÅ | -| [Airstack](../../active-directory/saas-apps/airstack-provisioning-tutorial.md) | ΓùÅ | | -| [Akamai Enterprise Application Access](../../active-directory/saas-apps/akamai-enterprise-application-access-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Airtable](../../active-directory/saas-apps/airtable-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Albert](../../active-directory/saas-apps/albert-provisioning-tutorial.md) | ΓùÅ | | -| [AlertMedia](../../active-directory/saas-apps/alertmedia-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Alexis HR](../../active-directory/saas-apps/alexishr-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Alinto Protect (renamed Cleanmail)](../../active-directory/saas-apps/alinto-protect-provisioning-tutorial.md) | ΓùÅ | | -| [Alvao](../../active-directory/saas-apps/alvao-provisioning-tutorial.md) | ΓùÅ | | -| [Amazon Web Services (AWS) - Role Provisioning](../../active-directory/saas-apps/amazon-web-service-tutorial.md) | ΓùÅ | ΓùÅ | -| Apache Directory Server ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Appaegis Isolation Access Cloud](../../active-directory/saas-apps/appaegis-isolation-access-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Apple School Manager](../../active-directory/saas-apps/apple-school-manager-provision-tutorial.md) | ΓùÅ | | -| [Apple Business Manager](../../active-directory/saas-apps/apple-business-manager-provision-tutorial.md) | ΓùÅ | | -| [Ardoq](../../active-directory/saas-apps/ardoq-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Asana](../../active-directory/saas-apps/asana-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [AskSpoke](../../active-directory/saas-apps/askspoke-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| 
[Atea](../../active-directory/saas-apps/atea-provisioning-tutorial.md) | ΓùÅ | | -| [Atlassian Cloud](../../active-directory/saas-apps/atlassian-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Atmos](../../active-directory/saas-apps/atmos-provisioning-tutorial.md) | ΓùÅ | | -| [AuditBoard](../../active-directory/saas-apps/auditboard-provisioning-tutorial.md) | ΓùÅ | | -| [Autodesk SSO](../../active-directory/saas-apps/autodesk-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| 389 directory server ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [4me](../saas-apps/4me-provisioning-tutorial.md) | ΓùÅ | ΓùÅ| +| [8x8](../saas-apps/8x8-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [15five](../saas-apps/15five-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Acunetix 360](../saas-apps/acunetix-360-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Adobe Identity Management](../saas-apps/adobe-identity-management-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Adobe Identity Management (OIDC)](../saas-apps/adobe-identity-management-provisioning-oidc-tutorial.md) | ΓùÅ | ΓùÅ | +| [Airbase](../saas-apps/airbase-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Aha!](../saas-apps/aha-tutorial.md) | | ΓùÅ | +| [Airstack](../saas-apps/airstack-provisioning-tutorial.md) | ΓùÅ | | +| [Akamai Enterprise Application Access](../saas-apps/akamai-enterprise-application-access-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Airtable](../saas-apps/airtable-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Albert](../saas-apps/albert-provisioning-tutorial.md) | ΓùÅ | | +| [AlertMedia](../saas-apps/alertmedia-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Alexis HR](../saas-apps/alexishr-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Alinto Protect (renamed Cleanmail)](../saas-apps/alinto-protect-provisioning-tutorial.md) | ΓùÅ | | +| [Alvao](../saas-apps/alvao-provisioning-tutorial.md) | ΓùÅ | | +| [Amazon Business](../saas-apps/amazon-business-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Amazon Web Services (AWS) - Role Provisioning](../saas-apps/amazon-web-service-tutorial.md) | ΓùÅ | ΓùÅ | +| Apache Directory Server ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Appaegis Isolation Access Cloud](../saas-apps/appaegis-isolation-access-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Apple School Manager](../saas-apps/apple-school-manager-provision-tutorial.md) | ΓùÅ | | +| [Apple Business Manager](../saas-apps/apple-business-manager-provision-tutorial.md) | ΓùÅ | | +| [Ardoq](../saas-apps/ardoq-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Asana](../saas-apps/asana-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [AskSpoke](../saas-apps/askspoke-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Atea](../saas-apps/atea-provisioning-tutorial.md) | ΓùÅ | | +| [Atlassian Cloud](../saas-apps/atlassian-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Atmos](../saas-apps/atmos-provisioning-tutorial.md) | ΓùÅ | | +| [AuditBoard](../saas-apps/auditboard-provisioning-tutorial.md) | ΓùÅ | | +| [Autodesk SSO](../saas-apps/autodesk-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | [Azure Databricks SCIM Connector](/azure/databricks/administration-guide/users-groups/scim/aad) | ΓùÅ | |-| [AWS IAM Identity Center](../../active-directory/saas-apps/aws-single-sign-on-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Axiad Cloud](../../active-directory/saas-apps/axiad-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BambooHR](../../active-directory/saas-apps/bamboo-hr-tutorial.md) | | ΓùÅ | -| [BenQ 
IAM](../../active-directory/saas-apps/benq-iam-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Bentley - Automatic User Provisioning](../../active-directory/saas-apps/bentley-automatic-user-provisioning-tutorial.md) | ΓùÅ | | -| [Better Stack](../../active-directory/saas-apps/better-stack-provisioning-tutorial.md) | ΓùÅ | | -| [BIC Cloud Design](../../active-directory/saas-apps/bic-cloud-design-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BIS](../../active-directory/saas-apps/bis-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BitaBIZ](../../active-directory/saas-apps/bitabiz-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Bizagi Studio for Digital Process Automation](../../active-directory/saas-apps/bizagi-studio-for-digital-process-automation-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BLDNG APP](../../active-directory/saas-apps/bldng-app-provisioning-tutorial.md) | ΓùÅ | | -| [Blink](../../active-directory/saas-apps/blink-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Blinq](../../active-directory/saas-apps/blinq-provisioning-tutorial.md) | ΓùÅ | | -| [BlogIn](../../active-directory/saas-apps/blogin-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BlueJeans](../../active-directory/saas-apps/bluejeans-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Bonusly](../../active-directory/saas-apps/bonusly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Box](../../active-directory/saas-apps/box-userprovisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Boxcryptor](../../active-directory/saas-apps/boxcryptor-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Bpanda](../../active-directory/saas-apps/bpanda-provisioning-tutorial.md) | ΓùÅ | | -| [Brivo Onair Identity Connector](../../active-directory/saas-apps/brivo-onair-identity-connector-provisioning-tutorial.md) | ΓùÅ | | -| [Britive](../../active-directory/saas-apps/britive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BrowserStack Single Sign-on](../../active-directory/saas-apps/browserstack-single-sign-on-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [BullseyeTDP](../../active-directory/saas-apps/bullseyetdp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cato Networks Provisioning](../../active-directory/saas-apps/cato-networks-provisioning-tutorial.md) | ΓùÅ | | -| [Cerner Central](../../active-directory/saas-apps/cernercentral-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cerby](../../active-directory/saas-apps/cerby-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Chaos](../../active-directory/saas-apps/chaos-provisioning-tutorial.md) | ΓùÅ | | -| [Chatwork](../../active-directory/saas-apps/chatwork-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [CheckProof](../../active-directory/saas-apps/checkproof-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cinode](../../active-directory/saas-apps/cinode-provisioning-tutorial.md) | ΓùÅ | | -| [Cisco Umbrella User Management](../../active-directory/saas-apps/cisco-umbrella-user-management-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cisco Webex](../../active-directory/saas-apps/cisco-webex-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Clarizen One](../../active-directory/saas-apps/clarizen-one-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cleanmail Swiss](../../active-directory/saas-apps/cleanmail-swiss-provisioning-tutorial.md) | ΓùÅ | | -| [Clebex](../../active-directory/saas-apps/clebex-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cloud Academy SSO](../../active-directory/saas-apps/cloud-academy-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Coda](../../active-directory/saas-apps/coda-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| 
[Code42](../../active-directory/saas-apps/code42-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cofense Recipient Sync](../../active-directory/saas-apps/cofense-provision-tutorial.md) | ΓùÅ | | -| [Comeet Recruiting Software](../../active-directory/saas-apps/comeet-recruiting-software-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Connecter](../../active-directory/saas-apps/connecter-provisioning-tutorial.md) | ΓùÅ | | -| [Contentful](../../active-directory/saas-apps/contentful-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Concur](../../active-directory/saas-apps/concur-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Cornerstone OnDemand](../../active-directory/saas-apps/cornerstone-ondemand-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [CybSafe](../../active-directory/saas-apps/cybsafe-provisioning-tutorial.md) | ΓùÅ | | -| [Dagster Cloud](../../active-directory/saas-apps/dagster-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Datadog](../../active-directory/saas-apps/datadog-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Documo](../../active-directory/saas-apps/documo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [DocuSign](../../active-directory/saas-apps/docusign-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Dropbox Business](../../active-directory/saas-apps/dropboxforbusiness-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Dialpad](../../active-directory/saas-apps/dialpad-provisioning-tutorial.md) | ΓùÅ | | -| [DigiCert](../../active-directory/saas-apps/digicert-tutorial.md) | | ΓùÅ | -| [Directprint.io](../../active-directory/saas-apps/directprint-io-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Druva](../../active-directory/saas-apps/druva-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Dynamic Signal](../../active-directory/saas-apps/dynamic-signal-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Embed Signage](../../active-directory/saas-apps/embed-signage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Envoy](../../active-directory/saas-apps/envoy-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Eletive](../../active-directory/saas-apps/eletive-provisioning-tutorial.md) | ΓùÅ | | -| [Elium](../../active-directory/saas-apps/elium-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Exium](../../active-directory/saas-apps/exium-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Evercate](../../active-directory/saas-apps/evercate-provisioning-tutorial.md) | ΓùÅ | | -| [Facebook Work Accounts](../../active-directory/saas-apps/facebook-work-accounts-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Federated Directory](../../active-directory/saas-apps/federated-directory-provisioning-tutorial.md) | ΓùÅ | | -| [Figma](../../active-directory/saas-apps/figma-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Flock](../../active-directory/saas-apps/flock-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Foodee](../../active-directory/saas-apps/foodee-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Fortes Change Cloud](../../active-directory/saas-apps/fortes-change-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Frankli.io](../../active-directory/saas-apps/frankli-io-provisioning-tutorial.md) | ΓùÅ | | -| [Freshservice Provisioning](../../active-directory/saas-apps/freshservice-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Funnel Leasing](../../active-directory/saas-apps/funnel-leasing-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Fuze](../../active-directory/saas-apps/fuze-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [G Suite](../../active-directory/saas-apps/g-suite-provisioning-tutorial.md) | ΓùÅ | | -| [Genesys Cloud for 
Azure](../../active-directory/saas-apps/purecloud-by-genesys-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [getAbstract](../../active-directory/saas-apps/getabstract-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GHAE](../../active-directory/saas-apps/ghae-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GitHub](../../active-directory/saas-apps/github-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GitHub AE](../../active-directory/saas-apps/github-ae-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GitHub Enterprise Managed User](../../active-directory/saas-apps/github-enterprise-managed-user-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GitHub Enterprise Managed User (OIDC)](../../active-directory/saas-apps/github-enterprise-managed-user-oidc-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [GoToMeeting](../../active-directory/saas-apps/citrixgotomeeting-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Global Relay Identity Sync](../../active-directory/saas-apps/global-relay-identity-sync-provisioning-tutorial.md) | ΓùÅ | | -| [Gong](../../active-directory/saas-apps/gong-provisioning-tutorial.md) | ΓùÅ | | -| [GoLinks](../../active-directory/saas-apps/golinks-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Grammarly](../../active-directory/saas-apps/grammarly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Group Talk](../../active-directory/saas-apps/grouptalk-provisioning-tutorial.md) | ΓùÅ | | -| [Gtmhub](../../active-directory/saas-apps/gtmhub-provisioning-tutorial.md) | ΓùÅ | | -| [H5mag](../../active-directory/saas-apps/h5mag-provisioning-tutorial.md) | ΓùÅ | | -| [Harness](../../active-directory/saas-apps/harness-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [AWS IAM Identity Center](../saas-apps/aws-single-sign-on-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Axiad Cloud](../saas-apps/axiad-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BambooHR](../saas-apps/bamboo-hr-tutorial.md) | | ΓùÅ | +| [BenQ IAM](../saas-apps/benq-iam-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Bentley - Automatic User Provisioning](../saas-apps/bentley-automatic-user-provisioning-tutorial.md) | ΓùÅ | | +| [Better Stack](../saas-apps/better-stack-provisioning-tutorial.md) | ΓùÅ | | +| [BIC Cloud Design](../saas-apps/bic-cloud-design-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BIS](../saas-apps/bis-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BitaBIZ](../saas-apps/bitabiz-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Bizagi Studio for Digital Process Automation](../saas-apps/bizagi-studio-for-digital-process-automation-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BLDNG APP](../saas-apps/bldng-app-provisioning-tutorial.md) | ΓùÅ | | +| [Blink](../saas-apps/blink-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Blinq](../saas-apps/blinq-provisioning-tutorial.md) | ΓùÅ | | +| [BlogIn](../saas-apps/blogin-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BlueJeans](../saas-apps/bluejeans-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Bonusly](../saas-apps/bonusly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Box](../saas-apps/box-userprovisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Boxcryptor](../saas-apps/boxcryptor-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Bpanda](../saas-apps/bpanda-provisioning-tutorial.md) | ΓùÅ | | +| [Brivo Onair Identity Connector](../saas-apps/brivo-onair-identity-connector-provisioning-tutorial.md) | ΓùÅ | | +| [Britive](../saas-apps/britive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [BrowserStack Single Sign-on](../saas-apps/browserstack-single-sign-on-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| 
[BullseyeTDP](../saas-apps/bullseyetdp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Bustle B2B Transport Systems](../saas-apps/bustle-b2b-transport-systems-provisioning-tutorial.md) | ΓùÅ | | +| [Canva](../saas-apps/canva-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cato Networks Provisioning](../saas-apps/cato-networks-provisioning-tutorial.md) | ΓùÅ | | +| [Cerner Central](../saas-apps/cernercentral-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cerby](../saas-apps/cerby-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Chaos](../saas-apps/chaos-provisioning-tutorial.md) | ΓùÅ | | +| [Chatwork](../saas-apps/chatwork-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [CheckProof](../saas-apps/checkproof-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cinode](../saas-apps/cinode-provisioning-tutorial.md) | ΓùÅ | | +| [Cisco Umbrella User Management](../saas-apps/cisco-umbrella-user-management-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cisco Webex](../saas-apps/cisco-webex-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Clarizen One](../saas-apps/clarizen-one-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cleanmail Swiss](../saas-apps/cleanmail-swiss-provisioning-tutorial.md) | ΓùÅ | | +| [Clebex](../saas-apps/clebex-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cloud Academy SSO](../saas-apps/cloud-academy-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Coda](../saas-apps/coda-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Code42](../saas-apps/code42-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cofense Recipient Sync](../saas-apps/cofense-provision-tutorial.md) | ΓùÅ | | +| [Colloquial](../saas-apps/colloquial-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Comeet Recruiting Software](../saas-apps/comeet-recruiting-software-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Connecter](../saas-apps/connecter-provisioning-tutorial.md) | ΓùÅ | | +| [Contentful](../saas-apps/contentful-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Concur](../saas-apps/concur-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cornerstone OnDemand](../saas-apps/cornerstone-ondemand-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Cybozu](../saas-apps/cybozu-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [CybSafe](../saas-apps/cybsafe-provisioning-tutorial.md) | ΓùÅ | | +| [Dagster Cloud](../saas-apps/dagster-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Datadog](../saas-apps/datadog-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Documo](../saas-apps/documo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [DocuSign](../saas-apps/docusign-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Dropbox Business](../saas-apps/dropboxforbusiness-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Dialpad](../saas-apps/dialpad-provisioning-tutorial.md) | ΓùÅ | | +| [Diffchecker](../saas-apps/diffchecker-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [DigiCert](../saas-apps/digicert-tutorial.md) | | ΓùÅ | +| [Directprint.io](../saas-apps/directprint-io-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Druva](../saas-apps/druva-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Dynamic Signal](../saas-apps/dynamic-signal-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Embed Signage](../saas-apps/embed-signage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Envoy](../saas-apps/envoy-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Eletive](../saas-apps/eletive-provisioning-tutorial.md) | ΓùÅ | | +| [Elium](../saas-apps/elium-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Exium](../saas-apps/exium-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Evercate](../saas-apps/evercate-provisioning-tutorial.md) | ΓùÅ | | +| [Facebook Work 
Accounts](../saas-apps/facebook-work-accounts-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Federated Directory](../saas-apps/federated-directory-provisioning-tutorial.md) | ΓùÅ | | +| [Figma](../saas-apps/figma-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Flock](../saas-apps/flock-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Foodee](../saas-apps/foodee-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Forcepoint Cloud Security Gateway - User Authentication](../saas-apps/forcepoint-cloud-security-gateway-tutorial.md) | ΓùÅ | ΓùÅ | +| [Fortes Change Cloud](../saas-apps/fortes-change-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Frankli.io](../saas-apps/frankli-io-provisioning-tutorial.md) | ΓùÅ | | +| [Freshservice Provisioning](../saas-apps/freshservice-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Funnel Leasing](../saas-apps/funnel-leasing-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Fuze](../saas-apps/fuze-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [G Suite](../saas-apps/g-suite-provisioning-tutorial.md) | ΓùÅ | | +| [Genesys Cloud for Azure](../saas-apps/purecloud-by-genesys-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [getAbstract](../saas-apps/getabstract-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GHAE](../saas-apps/ghae-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GitHub](../saas-apps/github-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GitHub AE](../saas-apps/github-ae-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GitHub Enterprise Managed User](../saas-apps/github-enterprise-managed-user-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GitHub Enterprise Managed User (OIDC)](../saas-apps/github-enterprise-managed-user-oidc-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [GoToMeeting](../saas-apps/citrixgotomeeting-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Global Relay Identity Sync](../saas-apps/global-relay-identity-sync-provisioning-tutorial.md) | ΓùÅ | | +| [Gong](../saas-apps/gong-provisioning-tutorial.md) | ΓùÅ | | +| [GoLinks](../saas-apps/golinks-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Grammarly](../saas-apps/grammarly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Group Talk](../saas-apps/grouptalk-provisioning-tutorial.md) | ΓùÅ | | +| [Gtmhub](../saas-apps/gtmhub-provisioning-tutorial.md) | ΓùÅ | | +| [H5mag](../saas-apps/h5mag-provisioning-tutorial.md) | ΓùÅ | | +| [Harness](../saas-apps/harness-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | HCL Domino | ΓùÅ | |-| [Headspace](../../active-directory/saas-apps/headspace-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [HelloID](../../active-directory/saas-apps/helloid-provisioning-tutorial.md) | ΓùÅ | | -| [Holmes Cloud](../../active-directory/saas-apps/holmes-cloud-provisioning-tutorial.md) | ΓùÅ | | -| [Hootsuite](../../active-directory/saas-apps/hootsuite-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Hoxhunt](../../active-directory/saas-apps/hoxhunt-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Howspace](../../active-directory/saas-apps/howspace-provisioning-tutorial.md) | ΓùÅ | | -| [Humbol](../../active-directory/saas-apps/humbol-provisioning-tutorial.md) | ΓùÅ | | -| IBM DB2 ([SQL connector](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | -| IBM Tivoli Directory Server ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Ideo](../../active-directory/saas-apps/ideo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Ideagen Cloud](../../active-directory/saas-apps/ideagen-cloud-provisioning-tutorial.md) | ΓùÅ | | -| [Infor 
CloudSuite](../../active-directory/saas-apps/infor-cloudsuite-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [InformaCast](../../active-directory/saas-apps/informacast-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [iPass SmartConnect](../../active-directory/saas-apps/ipass-smartconnect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Iris Intranet](../../active-directory/saas-apps/iris-intranet-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Insight4GRC](../../active-directory/saas-apps/insight4grc-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Insite LMS](../../active-directory/saas-apps/insite-lms-provisioning-tutorial.md) | ΓùÅ | | -| [introDus Pre and Onboarding Platform](../../active-directory/saas-apps/introdus-pre-and-onboarding-platform-provisioning-tutorial.md) | ΓùÅ | | -| [Invision](../../active-directory/saas-apps/invision-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [InviteDesk](../../active-directory/saas-apps/invitedesk-provisioning-tutorial.md) | ΓùÅ | | -| Isode directory server ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Jive](../../active-directory/saas-apps/jive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Jostle](../../active-directory/saas-apps/jostle-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Joyn FSM](../../active-directory/saas-apps/joyn-fsm-provisioning-tutorial.md) | ΓùÅ | | -| [Juno Journey](../../active-directory/saas-apps/juno-journey-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Keeper Password Manager & Digital Vault](../../active-directory/saas-apps/keeper-password-manager-digitalvault-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Keepabl](../../active-directory/saas-apps/keepabl-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Kintone](../../active-directory/saas-apps/kintone-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Kisi Phsyical Security](../../active-directory/saas-apps/kisi-physical-security-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Klaxoon](../../active-directory/saas-apps/klaxoon-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Klaxoon SAML](../../active-directory/saas-apps/klaxoon-saml-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Kno2fy](../../active-directory/saas-apps/kno2fy-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [KnowBe4 Security Awareness Training](../../active-directory/saas-apps/knowbe4-security-awareness-training-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Kpifire](../../active-directory/saas-apps/kpifire-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [KPN Grip](../../active-directory/saas-apps/kpn-grip-provisioning-tutorial.md) | ΓùÅ | | -| [LanSchool Air](../../active-directory/saas-apps/lanschool-air-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Headspace](../saas-apps/headspace-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [HelloID](../saas-apps/helloid-provisioning-tutorial.md) | ΓùÅ | | +| [Holmes Cloud](../saas-apps/holmes-cloud-provisioning-tutorial.md) | ΓùÅ | | +| [Hootsuite](../saas-apps/hootsuite-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Hoxhunt](../saas-apps/hoxhunt-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Howspace](../saas-apps/howspace-provisioning-tutorial.md) | ΓùÅ | | +| [Humbol](../saas-apps/humbol-provisioning-tutorial.md) | ΓùÅ | | +| [Hypervault](../saas-apps/hypervault-provisioning-tutorial.md) | ΓùÅ | | +| IBM DB2 ([SQL connector](../app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | +| IBM Tivoli Directory Server ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Ideo](../saas-apps/ideo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | 
+| [Ideagen Cloud](../saas-apps/ideagen-cloud-provisioning-tutorial.md) | ΓùÅ | | +| [Infor CloudSuite](../saas-apps/infor-cloudsuite-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [InformaCast](../saas-apps/informacast-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [iPass SmartConnect](../saas-apps/ipass-smartconnect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Iris Intranet](../saas-apps/iris-intranet-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Insight4GRC](../saas-apps/insight4grc-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Insite LMS](../saas-apps/insite-lms-provisioning-tutorial.md) | ΓùÅ | | +| [introDus Pre and Onboarding Platform](../saas-apps/introdus-pre-and-onboarding-platform-provisioning-tutorial.md) | ΓùÅ | | +| [Invision](../saas-apps/invision-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [InviteDesk](../saas-apps/invitedesk-provisioning-tutorial.md) | ΓùÅ | | +| Isode directory server ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Jive](../saas-apps/jive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Jostle](../saas-apps/jostle-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Joyn FSM](../saas-apps/joyn-fsm-provisioning-tutorial.md) | ΓùÅ | | +| [Juno Journey](../saas-apps/juno-journey-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Keeper Password Manager & Digital Vault](../saas-apps/keeper-password-manager-digitalvault-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Keepabl](../saas-apps/keepabl-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Kintone](../saas-apps/kintone-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Kisi Phsyical Security](../saas-apps/kisi-physical-security-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Klaxoon](../saas-apps/klaxoon-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Klaxoon SAML](../saas-apps/klaxoon-saml-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Kno2fy](../saas-apps/kno2fy-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [KnowBe4 Security Awareness Training](../saas-apps/knowbe4-security-awareness-training-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Kpifire](../saas-apps/kpifire-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [KPN Grip](../saas-apps/kpn-grip-provisioning-tutorial.md) | ΓùÅ | | +| [LanSchool Air](../saas-apps/lanschool-air-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | [LawVu](../..//active-directory/saas-apps/lawvu-provisioning-tutorial.md) | ΓùÅ | ΓùÅ |-| [LDAP](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) | ΓùÅ | | -| [LimbleCMMS](../../active-directory/saas-apps/limblecmms-provisioning-tutorial.md) | ΓùÅ | | -| [LinkedIn Elevate](../../active-directory/saas-apps/linkedinelevate-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [LinkedIn Sales Navigator](../../active-directory/saas-apps/linkedinsalesnavigator-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Lucid (All Products)](../../active-directory/saas-apps/lucid-all-products-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Lucidchart](../../active-directory/saas-apps/lucidchart-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [LUSID](../../active-directory/saas-apps/LUSID-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Leapsome](../../active-directory/saas-apps/leapsome-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [LogicGate](../../active-directory/saas-apps/logicgate-provisioning-tutorial.md) | ΓùÅ | | -| [Looop](../../active-directory/saas-apps/looop-provisioning-tutorial.md) | ΓùÅ | | -| [LogMeIn](../../active-directory/saas-apps/logmein-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Maptician](../../active-directory/saas-apps/maptician-provisioning-tutorial.md) | 
ΓùÅ | ΓùÅ | -| [Markit Procurement Service](../../active-directory/saas-apps/markit-procurement-service-provisioning-tutorial.md) | ΓùÅ | | -| [MediusFlow](../../active-directory/saas-apps/mediusflow-provisioning-tutorial.md) | ΓùÅ | | -| [MerchLogix](../../active-directory/saas-apps/merchlogix-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Meta Networks Connector](../../active-directory/saas-apps/meta-networks-connector-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| MicroFocus Novell eDirectory ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [LDAP](../app-provisioning/on-premises-ldap-connector-configure.md) | ΓùÅ | | +| [LimbleCMMS](../saas-apps/limblecmms-provisioning-tutorial.md) | ΓùÅ | | +| [LinkedIn Elevate](../saas-apps/linkedinelevate-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [LinkedIn Sales Navigator](../saas-apps/linkedinsalesnavigator-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Litmos](../saas-apps/litmos-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Lucid (All Products)](../saas-apps/lucid-all-products-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Lucidchart](../saas-apps/lucidchart-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [LUSID](../saas-apps/LUSID-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Leapsome](../saas-apps/leapsome-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [LogicGate](../saas-apps/logicgate-provisioning-tutorial.md) | ΓùÅ | | +| [Looop](../saas-apps/looop-provisioning-tutorial.md) | ΓùÅ | | +| [LogMeIn](../saas-apps/logmein-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [M-Files](../saas-apps/m-files-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Maptician](../saas-apps/maptician-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Markit Procurement Service](../saas-apps/markit-procurement-service-provisioning-tutorial.md) | ΓùÅ | | +| [MediusFlow](../saas-apps/mediusflow-provisioning-tutorial.md) | ΓùÅ | | +| [MerchLogix](../saas-apps/merchlogix-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Meta Networks Connector](../saas-apps/meta-networks-connector-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| MicroFocus Novell eDirectory ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | | Microsoft 365 | ΓùÅ | ΓùÅ | | Microsoft Active Directory Domain Services | | ΓùÅ | | Microsoft Azure | ΓùÅ | ΓùÅ | | [Microsoft Entra Domain Services](/entra/identity/domain-services/synchronization) | ΓùÅ | ΓùÅ |-| Microsoft Azure SQL ([SQL connector](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | -| Microsoft Lightweight Directory Server (ADAM) ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| Microsoft Azure SQL ([SQL connector](../app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | +| Microsoft Lightweight Directory Server (ADAM) ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | | Microsoft SharePoint Server (SharePoint) | ΓùÅ | |-| Microsoft SQL Server ([SQL connector](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | -| [Mixpanel](../../active-directory/saas-apps/mixpanel-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Mindtickle](../../active-directory/saas-apps/mindtickle-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Miro](../../active-directory/saas-apps/miro-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Monday.com](../../active-directory/saas-apps/mondaycom-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [MongoDB 
Atlas](../../active-directory/saas-apps/mongodb-cloud-tutorial.md) | | ΓùÅ | -| [Moqups](../../active-directory/saas-apps/moqups-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Mural Identity](../../active-directory/saas-apps/mural-identity-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [MX3 Diagnostics](../../active-directory/saas-apps/mx3-diagnostics-connector-provisioning-tutorial.md) | ΓùÅ | | -| [myPolicies](../../active-directory/saas-apps/mypolicies-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| MySQL ([SQL connector](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | -| NetIQ eDirectory ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Netpresenter Next](../../active-directory/saas-apps/netpresenter-provisioning-tutorial.md) | ΓùÅ | | -| [Netskope User Authentication](../../active-directory/saas-apps/netskope-administrator-console-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Netsparker Enterprise](../../active-directory/saas-apps/netsparker-enterprise-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [New Relic by Organization](../../active-directory/saas-apps/new-relic-by-organization-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [NordPass](../../active-directory/saas-apps/nordpass-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Notion](../../active-directory/saas-apps/notion-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| Novell eDirectory ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Office Space Software](../../active-directory/saas-apps/officespace-software-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Olfeo SAAS](../../active-directory/saas-apps/olfeo-saas-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| Open DJ ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| Open DS ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [OpenForms](../../active-directory/saas-apps/openforms-provisioning-tutorial.md) | ΓùÅ | | -| [OpenLDAP](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) | ΓùÅ | | -| [OpenText Directory Services](../../active-directory/saas-apps/open-text-directory-services-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Oracle Cloud Infrastructure Console](../../active-directory/saas-apps/oracle-cloud-infrastructure-console-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| Oracle Database ([SQL connector](../../active-directory/app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | +| Microsoft SQL Server ([SQL connector](../app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | +| [Mixpanel](../saas-apps/mixpanel-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Mindtickle](../saas-apps/mindtickle-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Miro](../saas-apps/miro-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Monday.com](../saas-apps/mondaycom-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [MongoDB Atlas](../saas-apps/mongodb-cloud-tutorial.md) | | ΓùÅ | +| [Moqups](../saas-apps/moqups-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Mural Identity](../saas-apps/mural-identity-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [MX3 Diagnostics](../saas-apps/mx3-diagnostics-connector-provisioning-tutorial.md) | ΓùÅ | | +| [myPolicies](../saas-apps/mypolicies-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| MySQL ([SQL connector](../app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | +| NetIQ eDirectory ([LDAP 
connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Netpresenter Next](../saas-apps/netpresenter-provisioning-tutorial.md) | ΓùÅ | | +| [Netskope User Authentication](../saas-apps/netskope-administrator-console-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Netsparker Enterprise](../saas-apps/netsparker-enterprise-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [New Relic by Organization](../saas-apps/new-relic-by-organization-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [NordPass](../saas-apps/nordpass-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Notion](../saas-apps/notion-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| Novell eDirectory ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Office Space Software](../saas-apps/officespace-software-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Olfeo SAAS](../saas-apps/olfeo-saas-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Oneflow](../saas-apps/oneflow-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| Open DJ ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| Open DS ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [OpenForms](../saas-apps/openforms-provisioning-tutorial.md) | ΓùÅ | | +| [OpenLDAP](../app-provisioning/on-premises-ldap-connector-configure.md) | ΓùÅ | | +| [OpenText Directory Services](../saas-apps/open-text-directory-services-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Oracle Cloud Infrastructure Console](../saas-apps/oracle-cloud-infrastructure-console-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| Oracle Database ([SQL connector](../app-provisioning/tutorial-ecma-sql-connector.md) ) | ΓùÅ | | | Oracle E-Business Suite | ΓùÅ | ΓùÅ |-| [Oracle Fusion ERP](../../active-directory/saas-apps/oracle-fusion-erp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [O'Reilly Learning Platform](../../active-directory/saas-apps/oreilly-learning-platform-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Oracle Fusion ERP](../saas-apps/oracle-fusion-erp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [O'Reilly Learning Platform](../saas-apps/oreilly-learning-platform-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | Oracle Internet Directory | ΓùÅ | | | Oracle PeopleSoft ERP | ΓùÅ | ΓùÅ |-| Oracle SunONE Directory Server ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [PagerDuty](../../active-directory/saas-apps/pagerduty-tutorial.md) | | ΓùÅ | -| [Palo Alto Networks Cloud Identity Engine - Cloud Authentication Service](../../active-directory/saas-apps/palo-alto-networks-cloud-identity-engine-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Palo Alto Networks SCIM Connector](../../active-directory/saas-apps/palo-alto-networks-scim-connector-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [PaperCut Cloud Print Management](../../active-directory/saas-apps/papercut-cloud-print-management-provisioning-tutorial.md) | ΓùÅ | | -| [Parsable](../../active-directory/saas-apps/parsable-provisioning-tutorial.md) | ΓùÅ | | -| [Peripass](../../active-directory/saas-apps/peripass-provisioning-tutorial.md) | ΓùÅ | | -| [Pingboard](../../active-directory/saas-apps/pingboard-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Plandisc](../../active-directory/saas-apps/plandisc-provisioning-tutorial.md) | ΓùÅ | | -| [Playvox](../../active-directory/saas-apps/playvox-provisioning-tutorial.md) | ΓùÅ | | -| [Preciate](../../active-directory/saas-apps/preciate-provisioning-tutorial.md) | ΓùÅ | | -| 
[PrinterLogic SaaS](../../active-directory/saas-apps/printer-logic-saas-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Priority Matrix](../../active-directory/saas-apps/priority-matrix-provisioning-tutorial.md) | ΓùÅ | | -| [ProdPad](../../active-directory/saas-apps/prodpad-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Promapp](../../active-directory/saas-apps/promapp-provisioning-tutorial.md) | ΓùÅ | | -| [Proxyclick](../../active-directory/saas-apps/proxyclick-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Peakon](../../active-directory/saas-apps/peakon-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Proware](../../active-directory/saas-apps/proware-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| RadiantOne Virtual Directory Server (VDS) ([LDAP connector](../../active-directory/app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | -| [Real Links](../../active-directory/saas-apps/real-links-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Reward Gateway](../../active-directory/saas-apps/reward-gateway-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [RFPIO](../../active-directory/saas-apps/rfpio-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Rhombus Systems](../../active-directory/saas-apps/rhombus-systems-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Ring Central](../../active-directory/saas-apps/ringcentral-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Robin](../../active-directory/saas-apps/robin-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Rollbar](../../active-directory/saas-apps/rollbar-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Rouse Sales](../../active-directory/saas-apps/rouse-sales-provisioning-tutorial.md) | ΓùÅ | | -| [Salesforce](../../active-directory/saas-apps/salesforce-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SafeGuard Cyber](../../active-directory/saas-apps/safeguard-cyber-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Salesforce Sandbox](../../active-directory/saas-apps/salesforce-sandbox-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Samanage](../../active-directory/saas-apps/samanage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| Oracle SunONE Directory Server ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [PagerDuty](../saas-apps/pagerduty-tutorial.md) | | ΓùÅ | +| [Palo Alto Networks Cloud Identity Engine - Cloud Authentication Service](../saas-apps/palo-alto-networks-cloud-identity-engine-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Palo Alto Networks SCIM Connector](../saas-apps/palo-alto-networks-scim-connector-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [PaperCut Cloud Print Management](../saas-apps/papercut-cloud-print-management-provisioning-tutorial.md) | ΓùÅ | | +| [Parsable](../saas-apps/parsable-provisioning-tutorial.md) | ΓùÅ | | +| [Peripass](../saas-apps/peripass-provisioning-tutorial.md) | ΓùÅ | | +| [Pingboard](../saas-apps/pingboard-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Plandisc](../saas-apps/plandisc-provisioning-tutorial.md) | ΓùÅ | | +| [Playvox](../saas-apps/playvox-provisioning-tutorial.md) | ΓùÅ | | +| [Postman](../saas-apps/postman-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Preciate](../saas-apps/preciate-provisioning-tutorial.md) | ΓùÅ | | +| [PrinterLogic SaaS](../saas-apps/printer-logic-saas-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Priority Matrix](../saas-apps/priority-matrix-provisioning-tutorial.md) | ΓùÅ | | +| [ProdPad](../saas-apps/prodpad-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Promapp](../saas-apps/promapp-provisioning-tutorial.md) | ΓùÅ | | +| 
[Proxyclick](../saas-apps/proxyclick-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Peakon](../saas-apps/peakon-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Proware](../saas-apps/proware-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| RadiantOne Virtual Directory Server (VDS) ([LDAP connector](../app-provisioning/on-premises-ldap-connector-configure.md) ) | ΓùÅ | | +| [Real Links](../saas-apps/real-links-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Recnice](../saas-apps/recnice-provisioning-tutorial.md) | ΓùÅ | | +| [Reward Gateway](../saas-apps/reward-gateway-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [RFPIO](../saas-apps/rfpio-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Rhombus Systems](../saas-apps/rhombus-systems-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Ring Central](../saas-apps/ringcentral-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Robin](../saas-apps/robin-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Rollbar](../saas-apps/rollbar-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Rouse Sales](../saas-apps/rouse-sales-provisioning-tutorial.md) | ΓùÅ | | +| [Salesforce](../saas-apps/salesforce-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SafeGuard Cyber](../saas-apps/safeguard-cyber-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Salesforce Sandbox](../saas-apps/salesforce-sandbox-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Samanage](../saas-apps/samanage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | SAML-based apps | | ΓùÅ |-| [SAP Analytics Cloud](../../active-directory/saas-apps/sap-analytics-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SAP Cloud Platform](../../active-directory/saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SAP R/3 and ERP](../../active-directory/app-provisioning/on-premises-sap-connector-configure.md) | ΓùÅ | | -| [SAP HANA](../../active-directory/saas-apps/saphana-tutorial.md) | ΓùÅ | ΓùÅ | -| [SAP SuccessFactors to Active Directory](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SAP SuccessFactors to Microsoft Entra ID](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) | ΓùÅ | ΓùÅ | -| [SAP SuccessFactors Writeback](../../active-directory/saas-apps/sap-successfactors-writeback-tutorial.md) | ΓùÅ | ΓùÅ | -| [SchoolStream ASA](../../active-directory/saas-apps/schoolstream-asa-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP Analytics Cloud](../saas-apps/sap-analytics-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP Cloud Platform](../saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP R/3 and ERP](../app-provisioning/on-premises-sap-connector-configure.md) | ΓùÅ | | +| [SAP HANA](../saas-apps/saphana-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP SuccessFactors to Active Directory](../saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP SuccessFactors to Microsoft Entra ID](../saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) | ΓùÅ | ΓùÅ | +| [SAP SuccessFactors Writeback](../saas-apps/sap-successfactors-writeback-tutorial.md) | ΓùÅ | ΓùÅ | +| [SchoolStream ASA](../saas-apps/schoolstream-asa-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | | [SCIM-based apps in the cloud](../app-provisioning/use-scim-to-provision-users-and-groups.md) | ΓùÅ | | | [SCIM-based apps on-premises](../app-provisioning/on-premises-scim-provisioning.md) | ΓùÅ | |-| [Secure Deliver](../../active-directory/saas-apps/secure-deliver-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | 
-| [SecureLogin](../../active-directory/saas-apps/secure-login-provisioning-tutorial.md) | ΓùÅ | | -| [Sentry](../../active-directory/saas-apps/sentry-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [ServiceNow](../../active-directory/saas-apps/servicenow-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Segment](../../active-directory/saas-apps/segment-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Shopify Plus](../../active-directory/saas-apps/shopify-plus-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Sigma Computing](../../active-directory/saas-apps/sigma-computing-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Signagelive](../../active-directory/saas-apps/signagelive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Slack](../../active-directory/saas-apps/slack-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Smartfile](../../active-directory/saas-apps/smartfile-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Smartsheet](../../active-directory/saas-apps/smartsheet-provisioning-tutorial.md) | ΓùÅ | | -| [Smallstep SSH](../../active-directory/saas-apps/smallstep-ssh-provisioning-tutorial.md) | ΓùÅ | | -| [Snowflake](../../active-directory/saas-apps/snowflake-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Soloinsight - CloudGate SSO](../../active-directory/saas-apps/soloinsight-cloudgate-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SoSafe](../../active-directory/saas-apps/sosafe-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [SpaceIQ](../../active-directory/saas-apps/spaceiq-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Splashtop](../../active-directory/saas-apps/splashtop-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [StarLeaf](../../active-directory/saas-apps/starleaf-provisioning-tutorial.md) | ΓùÅ | | -| [Storegate](../../active-directory/saas-apps/storegate-provisioning-tutorial.md) | ΓùÅ | | -| [SurveyMonkey Enterprise](../../active-directory/saas-apps/surveymonkey-enterprise-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Swit](../../active-directory/saas-apps/swit-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Symantec Web Security Service (WSS)](../../active-directory/saas-apps/symantec-web-security-service.md) | ΓùÅ | ΓùÅ | -| [Tableau Cloud](../../active-directory/saas-apps/tableau-online-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Tailscale](../../active-directory/saas-apps/tailscale-provisioning-tutorial.md) | ΓùÅ | | -| [Talentech](../../active-directory/saas-apps/talentech-provisioning-tutorial.md) | ΓùÅ | | -| [Tanium SSO](../../active-directory/saas-apps/tanium-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Tap App Security](../../active-directory/saas-apps/tap-app-security-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Taskize Connect](../../active-directory/saas-apps/taskize-connect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Teamgo](../../active-directory/saas-apps/teamgo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [TeamViewer](../../active-directory/saas-apps/teamviewer-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [TerraTrue](../../active-directory/saas-apps/terratrue-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [ThousandEyes](../../active-directory/saas-apps/thousandeyes-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Tic-Tac Mobile](../../active-directory/saas-apps/tic-tac-mobile-provisioning-tutorial.md) | ΓùÅ | | -| [TimeClock 365](../../active-directory/saas-apps/timeclock-365-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [TimeClock 365 SAML](../../active-directory/saas-apps/timeclock-365-saml-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Templafy SAML2](../../active-directory/saas-apps/templafy-saml-2-provisioning-tutorial.md) 
| ΓùÅ | ΓùÅ | -| [Templafy OpenID Connect](../../active-directory/saas-apps/templafy-openid-connect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [TheOrgWiki](../../active-directory/saas-apps/theorgwiki-provisioning-tutorial.md) | ΓùÅ | | -| [Thrive LXP](../../active-directory/saas-apps/thrive-lxp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Torii](../../active-directory/saas-apps/torii-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [TravelPerk](../../active-directory/saas-apps/travelperk-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Tribeloo](../../active-directory/saas-apps/tribeloo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Twingate](../../active-directory/saas-apps/twingate-provisioning-tutorial.md) | ΓùÅ | | -| [Uber](../../active-directory/saas-apps/uber-provisioning-tutorial.md) | ΓùÅ | | -| [UNIFI](../../active-directory/saas-apps/unifi-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [uniFlow Online](../../active-directory/saas-apps/uniflow-online-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [uni-tel ) | ΓùÅ | | -| [Vault Platform](../../active-directory/saas-apps/vault-platform-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Vbrick Rev Cloud](../../active-directory/saas-apps/vbrick-rev-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [V-Client](../../active-directory/saas-apps/v-client-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Velpic](../../active-directory/saas-apps/velpic-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Visibly](../../active-directory/saas-apps/visibly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Visitly](../../active-directory/saas-apps/visitly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Vonage](../../active-directory/saas-apps/vonage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [WATS](../../active-directory/saas-apps/wats-provisioning-tutorial.md) | ΓùÅ | | -| [Webroot Security Awareness Training](../../active-directory/saas-apps/webroot-security-awareness-training-provisioning-tutorial.md) | ΓùÅ | | -| [WEDO](../../active-directory/saas-apps/wedo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Whimsical](../../active-directory/saas-apps/whimsical-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workday to Active Directory](../../active-directory/saas-apps/workday-inbound-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workday to Microsoft Entra ID](../../active-directory/saas-apps/workday-inbound-cloud-only-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workday Writeback](../../active-directory/saas-apps/workday-writeback-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workteam](../../active-directory/saas-apps/workteam-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workplace by Facebook](../../active-directory/saas-apps/workplace-by-facebook-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Workgrid](../../active-directory/saas-apps/workgrid-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Wrike](../../active-directory/saas-apps/wrike-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Xledger](../../active-directory/saas-apps/xledger-provisioning-tutorial.md) | ΓùÅ | | -| [Yellowbox](../../active-directory/saas-apps/yellowbox-provisioning-tutorial.md) | ΓùÅ | | -| [Zapier](../../active-directory/saas-apps/zapier-provisioning-tutorial.md) | ΓùÅ | | -| [Zendesk](../../active-directory/saas-apps/zendesk-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zenya](../../active-directory/saas-apps/zenya-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zero](../../active-directory/saas-apps/zero-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zip](../../active-directory/saas-apps/zip-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| 
[Zoom](../../active-directory/saas-apps/zoom-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler](../../active-directory/saas-apps/zscaler-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler Beta](../../active-directory/saas-apps/zscaler-beta-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler One](../../active-directory/saas-apps/zscaler-one-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler Private Access](../../active-directory/saas-apps/zscaler-private-access-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler Two](../../active-directory/saas-apps/zscaler-two-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler Three](../../active-directory/saas-apps/zscaler-three-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | -| [Zscaler ZSCloud](../../active-directory/saas-apps/zscaler-zscloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [ScreenSteps](../saas-apps/screensteps-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Secure Deliver](../saas-apps/secure-deliver-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SecureLogin](../saas-apps/secure-login-provisioning-tutorial.md) | ΓùÅ | | +| [Sentry](../saas-apps/sentry-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [ServiceNow](../saas-apps/servicenow-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Segment](../saas-apps/segment-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Shopify Plus](../saas-apps/shopify-plus-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Sigma Computing](../saas-apps/sigma-computing-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Signagelive](../saas-apps/signagelive-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Slack](../saas-apps/slack-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Smartfile](../saas-apps/smartfile-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Smartsheet](../saas-apps/smartsheet-provisioning-tutorial.md) | ΓùÅ | | +| [Smallstep SSH](../saas-apps/smallstep-ssh-provisioning-tutorial.md) | ΓùÅ | | +| [Snowflake](../saas-apps/snowflake-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Soloinsight - CloudGate SSO](../saas-apps/soloinsight-cloudgate-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SoSafe](../saas-apps/sosafe-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [SpaceIQ](../saas-apps/spaceiq-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Splashtop](../saas-apps/splashtop-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [StarLeaf](../saas-apps/starleaf-provisioning-tutorial.md) | ΓùÅ | | +| [Storegate](../saas-apps/storegate-provisioning-tutorial.md) | ΓùÅ | | +| [SurveyMonkey Enterprise](../saas-apps/surveymonkey-enterprise-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Swit](../saas-apps/swit-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Symantec Web Security Service (WSS)](../saas-apps/symantec-web-security-service.md) | ΓùÅ | ΓùÅ | +| [Tableau Cloud](../saas-apps/tableau-online-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Tailscale](../saas-apps/tailscale-provisioning-tutorial.md) | ΓùÅ | | +| [Talentech](../saas-apps/talentech-provisioning-tutorial.md) | ΓùÅ | | +| [Tanium SSO](../saas-apps/tanium-sso-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Tap App Security](../saas-apps/tap-app-security-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Taskize Connect](../saas-apps/taskize-connect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Teamgo](../saas-apps/teamgo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [TeamViewer](../saas-apps/teamviewer-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [TerraTrue](../saas-apps/terratrue-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [ThousandEyes](../saas-apps/thousandeyes-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Tic-Tac 
Mobile](../saas-apps/tic-tac-mobile-provisioning-tutorial.md) | ΓùÅ | | +| [TimeClock 365](../saas-apps/timeclock-365-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [TimeClock 365 SAML](../saas-apps/timeclock-365-saml-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Templafy SAML2](../saas-apps/templafy-saml-2-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Templafy OpenID Connect](../saas-apps/templafy-openid-connect-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [TheOrgWiki](../saas-apps/theorgwiki-provisioning-tutorial.md) | ΓùÅ | | +| [Thrive LXP](../saas-apps/thrive-lxp-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Torii](../saas-apps/torii-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [TravelPerk](../saas-apps/travelperk-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Tribeloo](../saas-apps/tribeloo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Twingate](../saas-apps/twingate-provisioning-tutorial.md) | ΓùÅ | | +| [Uber](../saas-apps/uber-provisioning-tutorial.md) | ΓùÅ | | +| [UNIFI](../saas-apps/unifi-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [uniFlow Online](../saas-apps/uniflow-online-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [uni-tel ) | ΓùÅ | | +| [Vault Platform](../saas-apps/vault-platform-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Vbrick Rev Cloud](../saas-apps/vbrick-rev-cloud-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [V-Client](../saas-apps/v-client-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Velpic](../saas-apps/velpic-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Visibly](../saas-apps/visibly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Visitly](../saas-apps/visitly-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [VMware](../saas-apps/vmware-identity-service-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Vonage](../saas-apps/vonage-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [WATS](../saas-apps/wats-provisioning-tutorial.md) | ΓùÅ | | +| [Webroot Security Awareness Training](../saas-apps/webroot-security-awareness-training-provisioning-tutorial.md) | ΓùÅ | | +| [WEDO](../saas-apps/wedo-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Whimsical](../saas-apps/whimsical-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workday to Active Directory](../saas-apps/workday-inbound-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workday to Microsoft Entra ID](../saas-apps/workday-inbound-cloud-only-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workday Writeback](../saas-apps/workday-writeback-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workteam](../saas-apps/workteam-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workplace by Facebook](../saas-apps/workplace-by-facebook-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Workgrid](../saas-apps/workgrid-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Wrike](../saas-apps/wrike-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Xledger](../saas-apps/xledger-provisioning-tutorial.md) | ΓùÅ | | +| [XM Fax and XM SendSecure](../saas-apps/xm-fax-and-xm-send-secure-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Yellowbox](../saas-apps/yellowbox-provisioning-tutorial.md) | ΓùÅ | | +| [Zapier](../saas-apps/zapier-provisioning-tutorial.md) | ΓùÅ | | +| [Zendesk](../saas-apps/zendesk-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zenya](../saas-apps/zenya-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zero](../saas-apps/zero-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zip](../saas-apps/zip-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zoho One](../saas-apps/zoho-one-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zoom](../saas-apps/zoom-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zscaler](../saas-apps/zscaler-provisioning-tutorial.md) | ΓùÅ | ΓùÅ | +| [Zscaler 
Beta](../saas-apps/zscaler-beta-provisioning-tutorial.md) | ● | ● | +| [Zscaler One](../saas-apps/zscaler-one-provisioning-tutorial.md) | ● | ● | +| [Zscaler Private Access](../saas-apps/zscaler-private-access-provisioning-tutorial.md) | ● | ● | +| [Zscaler Two](../saas-apps/zscaler-two-provisioning-tutorial.md) | ● | ● | +| [Zscaler Three](../saas-apps/zscaler-three-provisioning-tutorial.md) | ● | ● | +| [Zscaler ZSCloud](../saas-apps/zscaler-zscloud-provisioning-tutorial.md) | ● | ● | ## Partner driven integrations-There is also a healthy partner ecosystem, further expanding the breadth and depth of integrations available with Microsoft Entra ID Governance. Explore the [partner integrations](../../active-directory/app-provisioning/partner-driven-integrations.md) available, including connectors for: +There is also a healthy partner ecosystem, further expanding the breadth and depth of integrations available with Microsoft Entra ID Governance. Explore the [partner integrations](../app-provisioning/partner-driven-integrations.md) available, including connectors for: * Epic * Cerner * IBM RACF There is also a healthy partner ecosystem, further expanding the breadth and dep ## Next steps -To learn more about application provisioning, see [What is application provisioning](../../active-directory/app-provisioning/user-provisioning.md). +To learn more about application provisioning, see [What is application provisioning](../app-provisioning/user-provisioning.md). |
active-directory | Deploy Access Reviews | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/deploy-access-reviews.md | Access reviews activities are recorded and available from the [Microsoft Entra a | | Apply decision | | Date range| Seven days | -For more advanced queries and analysis of access reviews, and to track changes and completion of reviews, export your Microsoft Entra audit logs to [Azure Log Analytics](../reports-monitoring/quickstart-azure-monitor-route-logs-to-storage-account.md) or Azure Event Hubs. When audit logs are stored in Log Analytics, you can use the [powerful analytics language](../reports-monitoring/howto-analyze-activity-logs-log-analytics.md) and build your own dashboards. +For more advanced queries and analysis of access reviews, and to track changes and completion of reviews, export your Microsoft Entra audit logs to [Azure Log Analytics](../reports-monitoring/howto-archive-logs-to-storage-account.md) or Azure Event Hubs. When audit logs are stored in Log Analytics, you can use the [powerful analytics language](../reports-monitoring/howto-analyze-activity-logs-log-analytics.md) and build your own dashboards. ## Next steps |
active-directory | Entitlement Management Logs And Reporting | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/entitlement-management-logs-and-reporting.md | Microsoft Entra ID stores audit events for up to 30 days in the audit log. Howev Before you use the Azure Monitor workbooks, you must configure Microsoft Entra ID to send a copy of its audit logs to Azure Monitor. -Archiving Microsoft Entra audit logs requires you to have Azure Monitor in an Azure subscription. You can read more about the prerequisites and estimated costs of using Azure Monitor in [Microsoft Entra activity logs in Azure Monitor](../reports-monitoring/concept-activity-logs-azure-monitor.md). +Archiving Microsoft Entra audit logs requires you to have Azure Monitor in an Azure subscription. You can read more about the prerequisites and estimated costs of using Azure Monitor in [Microsoft Entra activity logs in Azure Monitor](../reports-monitoring/concept-log-monitoring-integration-options-considerations.md). **Prerequisite role**: Global Administrator Archiving Microsoft Entra audit logs requires you to have Azure Monitor in an Az 1. Check if there's already a setting to send the audit logs to that workspace. -1. If there isn't already a setting, select **Add diagnostic setting**. Use the instructions in [Integrate Microsoft Entra logs with Azure Monitor logs](../reports-monitoring/howto-integrate-activity-logs-with-log-analytics.md) to send the Microsoft Entra audit log to the Azure Monitor workspace. +1. If there isn't already a setting, select **Add diagnostic setting**. Use the instructions in [Integrate Microsoft Entra logs with Azure Monitor logs](../reports-monitoring/howto-integrate-activity-logs-with-azure-monitor-logs.md) to send the Microsoft Entra audit log to the Azure Monitor workspace. ![Diagnostics settings pane](./media/entitlement-management-logs-and-reporting/audit-log-diagnostics-settings.png) If you would like to know the oldest and newest audit events held in Azure Monit AuditLogs | where TimeGenerated > ago(3653d) | summarize OldestAuditEvent=min(TimeGenerated), NewestAuditEvent=max(TimeGenerated) by Type ``` -For more information on the columns that are stored for audit events in Azure Monitor, see [Interpret the Microsoft Entra audit logs schema in Azure Monitor](../reports-monitoring/overview-reports.md). +For more information on the columns that are stored for audit events in Azure Monitor, see [Interpret the Microsoft Entra audit logs schema in Azure Monitor](../reports-monitoring/overview-monitoring-health.md). ## Create custom Azure Monitor queries using Azure PowerShell $subs | ft You can reauthenticate and associate your PowerShell session to that subscription using a command such as `Connect-AzAccount -Subscription $subs[0].id`. To learn more about how to authenticate to Azure from PowerShell, including non-interactively, see [Sign in with Azure PowerShell](/powershell/azure/authenticate-azureps). -If you have multiple Log Analytics workspaces in that subscription, then the cmdlet [Get-AzOperationalInsightsWorkspace](/powershell/module/Az.OperationalInsights/Get-AzOperationalInsightsWorkspace) returns the list of workspaces. Then you can find the one that has the Microsoft Entra logs. The `CustomerId` field returned by this cmdlet is the same as the value of the "Workspace ID" displayed in the Microsoft Entra admin center in the Log Analytics workspace overview. 
+If you have multiple Log Analytics workspaces in that subscription, then the cmdlet [Get-AzOperationalInsightsWorkspace](/powershell/module/az.operationalinsights/get-azoperationalinsightsworkspace) returns the list of workspaces. Then you can find the one that has the Microsoft Entra logs. The `CustomerId` field returned by this cmdlet is the same as the value of the "Workspace ID" displayed in the Microsoft Entra admin center in the Log Analytics workspace overview. ```powershell $wks = Get-AzOperationalInsightsWorkspace $wks | ft CustomerId, Name ``` ### Send the query to the Log Analytics workspace-Finally, once you have a workspace identified, you can use [Invoke-AzOperationalInsightsQuery](/powershell/module/az.operationalinsights/Invoke-AzOperationalInsightsQuery) to send a Kusto query to that workspace. These queries are written in [Kusto query language](/azure/kusto/query/). +Finally, once you have a workspace identified, you can use [Invoke-AzOperationalInsightsQuery](/powershell/module/az.operationalinsights/invoke-azoperationalinsightsquery) to send a Kusto query to that workspace. These queries are written in [Kusto query language](/azure/data-explorer/kusto/query/). For example, you can retrieve the date range of the audit event records from the Log Analytics workspace, with PowerShell cmdlets to send a query like: |
active-directory | Identity Governance Applications Deploy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/identity-governance-applications-deploy.md | Microsoft Entra ID with Azure Monitor provides several reports to help you under * An administrator, or a catalog owner, can [retrieve the list of users who have access package assignments](entitlement-management-access-package-assignments.md), via the Microsoft Entra admin center, Graph or PowerShell. * You can also send the audit logs to Azure Monitor and view a history of [changes to the access package](entitlement-management-logs-and-reporting.md#view-events-for-an-access-package), in the Microsoft Entra admin center, or via PowerShell.-* You can view the last 30 days of sign-ins to an application in the [sign-ins report](../reports-monitoring/reference-basic-info-sign-in-logs.md) in the Microsoft Entra admin center, or via [Graph](/graph/api/signin-list?view=graph-rest-1.0&tabs=http&preserve-view=true). -* You can also send the [sign in logs to Azure Monitor](../reports-monitoring/concept-activity-logs-azure-monitor.md) to archive sign in activity for up to two years. +* You can view the last 30 days of sign-ins to an application in the [sign-ins report](../reports-monitoring/concept-sign-in-log-activity-details.md) in the Microsoft Entra admin center, or via [Graph](/graph/api/signin-list?view=graph-rest-1.0&tabs=http&preserve-view=true). +* You can also send the [sign in logs to Azure Monitor](../reports-monitoring/concept-log-monitoring-integration-options-considerations.md) to archive sign in activity for up to two years. ## Monitor to adjust entitlement management policies and access as needed |
active-directory | Identity Governance Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/identity-governance-overview.md | In addition to the features listed above, additional Microsoft Entra features fr |Policy and role management|Admin can define Conditional Access policies for run-time access to applications. Resource owners can define policies for user's access via access packages.|[Conditional Access](../conditional-access/overview.md) and [Entitlement management](entitlement-management-overview.md) policies| |Access certification|Admins can enable recurring access recertification for: SaaS apps, on-premises apps, cloud group memberships, Microsoft Entra ID or Azure Resource role assignments. Automatically remove resource access, block guest access and delete guest accounts.|[Access reviews](access-reviews-overview.md), also surfaced in [PIM](../privileged-identity-management/pim-create-roles-and-resource-roles-review.md)| |Fulfillment and provisioning|Automatic provisioning and deprovisioning into Microsoft Entra connected apps, including via SCIM, LDAP, SQL and into SharePoint Online sites. |[user provisioning](../app-provisioning/user-provisioning.md)|-|Reporting and analytics|Admins can retrieve audit logs of recent user provisioning and sign on activity. Integration with Azure Monitor and 'who has access' via access packages.|[Microsoft Entra reports](../reports-monitoring/overview-reports.md) and [monitoring](../reports-monitoring/overview-monitoring.md)| +|Reporting and analytics|Admins can retrieve audit logs of recent user provisioning and sign on activity. Integration with Azure Monitor and 'who has access' via access packages.|[Microsoft Entra reports](../reports-monitoring/overview-monitoring-health.md) and [monitoring](../reports-monitoring/overview-monitoring-health.md)| |Privileged access|Just-in-time and scheduled access, alerting, approval workflows for Microsoft Entra roles (including custom roles) and Azure Resource roles.|[Microsoft Entra PIM](../privileged-identity-management/pim-configure.md)| |Auditing|Admins can be alerted of creation of admin accounts.|[Microsoft Entra PIM alerts](../privileged-identity-management/pim-how-to-configure-security-alerts.md)| |
active-directory | Lifecycle Workflows Deployment | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/lifecycle-workflows-deployment.md | The following information is important information about your organization and t Before you begin planning a Lifecycle Workflow deployment, you should become familiar with the parts of a workflow and the terminology around Lifecycle Workflows. -The [Understanding Lifecycle Workflows](understanding-lifecycle-workflows.md) document uses the portal to explain the parts of a workflow. The [Developer API reference Lifecycle Workflows](lifecycle-workflows-developer-reference.md) document uses a Graph example to explain the parts of a workflow. +The [Understanding Lifecycle Workflows](understanding-lifecycle-workflows.md) document uses the portal to explain the parts of a workflow. The [Developer API reference Lifecycle Workflows](/graph/api/resources/identitygovernance-workflow) document uses a Graph example to explain the parts of a workflow. You can use this document to become familiar with the parts of a workflow prior to deploying them. |
active-directory | Sap | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/sap.md | SAP likely runs critical functions, such as HR and ERP, for your business. At th ### SuccessFactors -Customers who use SAP SuccessFactors can easily bring identities into [Microsoft Entra ID](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) or [on-premises Active Directory](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) by using native connectors. The connectors support the following scenarios: +Customers who use SAP SuccessFactors can easily bring identities into [Microsoft Entra ID](../saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) or [on-premises Active Directory](../saas-apps/sap-successfactors-inbound-provisioning-tutorial.md) by using native connectors. The connectors support the following scenarios: -* **Hiring new employees**: When a new employee is added to SuccessFactors, a user account is automatically created in Microsoft Entra ID and optionally Microsoft 365 and [other software as a service (SaaS) applications that Microsoft Entra ID supports](../../active-directory/app-provisioning/user-provisioning.md). This process includes write-back of the email address to SuccessFactors. +* **Hiring new employees**: When a new employee is added to SuccessFactors, a user account is automatically created in Microsoft Entra ID and optionally Microsoft 365 and [other software as a service (SaaS) applications that Microsoft Entra ID supports](../app-provisioning/user-provisioning.md). This process includes write-back of the email address to SuccessFactors. * **Employee attribute and profile updates**: When an employee record is updated in SuccessFactors (such as name, title, or manager), the employee's user account is automatically updated in Microsoft Entra ID and optionally Microsoft 365 and other SaaS applications that Microsoft Entra ID supports. * **Employee terminations**: When an employee is terminated in SuccessFactors, the employee's user account is automatically disabled in Microsoft Entra ID and optionally Microsoft 365 and other SaaS applications that Microsoft Entra ID supports. * **Employee rehires**: When an employee is rehired in SuccessFactors, the employee's old account can be automatically reactivated or re-provisioned (depending on your preference) to Microsoft Entra ID and optionally Microsoft 365 and other SaaS applications that Microsoft Entra ID supports. After you set up provisioning for your SAP applications, you can enable SSO for After your users are in Microsoft Entra ID, you can provision accounts into the various SaaS and on-premises SAP applications that they need access to. You have three ways to accomplish this: -* Use the enterprise application in Microsoft Entra ID to configure both single sign-on (SSO) and provisioning to SAP applications such as [SAP Analytics Cloud](../../active-directory/saas-apps/sap-analytics-cloud-provisioning-tutorial.md). With this option, you can apply a consistent set of governance processes across all your applications. -* Use the [SAP Identity Authentication Service (IAS)](../../active-directory/saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) enterprise application in Microsoft Entra ID to provision identities into SAP IAS. 
After you bring all the identities into SAP IAS, you can use SAP IPS to provision the accounts from there into your applications when required. +* Use the enterprise application in Microsoft Entra ID to configure both single sign-on (SSO) and provisioning to SAP applications such as [SAP Analytics Cloud](../saas-apps/sap-analytics-cloud-provisioning-tutorial.md). With this option, you can apply a consistent set of governance processes across all your applications. +* Use the [SAP Identity Authentication Service (IAS)](../saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) enterprise application in Microsoft Entra ID to provision identities into SAP IAS. After you bring all the identities into SAP IAS, you can use SAP IPS to provision the accounts from there into your applications when required. * Use the [SAP IPS](https://help.sap.com/docs/IDENTITY_PROVISIONING/f48e822d6d484fa5ade7dda78b64d9f5/f2b2df8a273642a1bf801e99ecc4a043.html) integration to directly export identities from Microsoft Entra ID into your [applications](https://help.sap.com/docs/IDENTITY_PROVISIONING/f48e822d6d484fa5ade7dda78b64d9f5/ab3f641552464c79b94d10b9205fd721.html). When you're using SAP IPS to pull users into your applications, all provisioning configuration is managed in SAP directly. You can still use the enterprise application in Microsoft Entra ID to manage SSO and use [Microsoft Entra ID as the corporate identity provider](https://help.sap.com/docs/IDENTITY_AUTHENTICATION/6d6d63354d1242d185ab4830fc04feb1/058c7b14209f4f2d8de039da4330a1c1.html). ### Provision identities into on-premises SAP systems that SAP IPS doesn't support Customers who have yet to transition from applications such as SAP R/3 and SAP ERP Central Component (SAP ECC) to SAP S/4HANA can still rely on the Microsoft Entra provisioning service to provision user accounts. Within SAP R/3 and SAP ECC, you expose the necessary Business Application Programming Interfaces (BAPIs) for creating, updating, and deleting users. Within Microsoft Entra ID, you have two options: -* Use the lightweight Microsoft Entra provisioning agent and [web services connector](/azure/active-directory/app-provisioning/on-premises-web-services-connector) to [provision users into apps such as SAP ECC](/azure/active-directory/app-provisioning/on-premises-sap-connector-configure). +* Use the lightweight Microsoft Entra provisioning agent and [web services connector](../app-provisioning/on-premises-web-services-connector.md) to [provision users into apps such as SAP ECC](../app-provisioning/on-premises-sap-connector-configure.md). * In scenarios where you need to do more complex group and role management, use [Microsoft Identity Manager](/microsoft-identity-manager/reference/microsoft-identity-manager-2016-ma-ws) to manage access to your legacy SAP applications. ## Trigger custom workflows With separation-of-duties checks in Microsoft Entra ID [entitlement management]( ## Next steps -* [Bring identities from SAP SuccessFactors into Microsoft Entra ID](../../active-directory/saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) -* [Provision accounts in SAP IAS](../../active-directory/saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) +* [Bring identities from SAP SuccessFactors into Microsoft Entra ID](../saas-apps/sap-successfactors-inbound-provisioning-cloud-only-tutorial.md) +* [Provision accounts in SAP IAS](../saas-apps/sap-cloud-platform-identity-authentication-provisioning-tutorial.md) |
active-directory | Tutorial Prepare User Accounts | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/tutorial-prepare-user-accounts.md | Once your user(s) has been successfully created in Microsoft Entra ID, you may p ## Additional steps for pre-hire scenario -There are some additional steps that you should be aware of when testing either the [On-boarding users to your organization using Lifecycle workflows with the Microsoft Entra Admin Center](tutorial-onboard-custom-workflow-portal.md) tutorial or the [On-boarding users to your organization using Lifecycle workflows with Microsoft Graph](tutorial-onboard-custom-workflow-graph.md) tutorial. +There are some additional steps that you should be aware of when testing either the [On-boarding users to your organization using Lifecycle workflows with the Microsoft Entra Admin Center](tutorial-onboard-custom-workflow-portal.md) tutorial or the [On-boarding users to your organization using Lifecycle workflows with Microsoft Graph](/graph/tutorial-lifecycle-workflows-onboard-custom-workflow) tutorial. ### Edit the users attributes using the Microsoft Entra admin center The manager attribute is used for email notification tasks. It's used by the li :::image type="content" source="media/tutorial-lifecycle-workflows/graph-get-manager.png" alt-text="Screenshot of getting a manager in Graph explorer." lightbox="media/tutorial-lifecycle-workflows/graph-get-manager.png"::: -For more information about updating manager information for a user in Graph API, see [assign manager](/graph/api/user-post-manager?view=graph-rest-1.0&tabs=http&preserve-view=true) documentation. You can also set this attribute in the Azure Admin center. For more information, see [add or change profile information](../fundamentals/how-to-manage-user-profile-info.md?context=azure%2factive-directory%2fusers-groups-roles%2fcontext%2fugr-context). +For more information about updating manager information for a user in Graph API, see [assign manager](/graph/api/user-post-manager?view=graph-rest-1.0&tabs=http&preserve-view=true) documentation. You can also set this attribute in the Azure Admin center. For more information, see [add or change profile information](../fundamentals/how-to-manage-user-profile-info.md?context=azure/active-directory/users-groups-roles/context/ugr-context). ### Enabling the Temporary Access Pass (TAP) A user with groups and Teams memberships is required before you begin the tutori ## Next steps - [On-boarding users to your organization using Lifecycle workflows with the Microsoft Entra admin center](tutorial-onboard-custom-workflow-portal.md)-- [On-boarding users to your organization using Lifecycle workflows with Microsoft Graph](tutorial-onboard-custom-workflow-graph.md)+- [On-boarding users to your organization using Lifecycle workflows with Microsoft Graph](/graph/tutorial-lifecycle-workflows-onboard-custom-workflow) - [Tutorial: Off-boarding users from your organization using Lifecycle workflows with The Microsoft Entra Admin Center](tutorial-offboard-custom-workflow-portal.md)-- [Tutorial: Off-boarding users from your organization using Lifecycle workflows with Microsoft Graph](tutorial-offboard-custom-workflow-graph.md)+- [Tutorial: Off-boarding users from your organization using Lifecycle workflows with Microsoft Graph](/graph/tutorial-lifecycle-workflows-offboard-custom-workflow) |
active-directory | Understanding Lifecycle Workflows | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/understanding-lifecycle-workflows.md | The following document provides an overview of a workflow created using Lifecycl For a full list of supported delegated and application permissions required to use Lifecycle Workflows, see: [Lifecycle workflows permissions](/graph/permissions-reference#lifecycle-workflows-permissions). -For delegated scenarios, the admin needs one of the following [Microsoft Entra roles](/azure/active-directory/users-groups-roles/directory-assign-admin-roles#available-roles): +For delegated scenarios, the admin needs one of the following [Microsoft Entra roles](../roles/permissions-reference.md): - Global administrator - Global reader |
active-directory | What Is Provisioning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/governance/what-is-provisioning.md | For more information, see [What is HR driven provisioning?](../app-provisioning/ In Microsoft Entra ID, the term **[app provisioning](../app-provisioning/user-provisioning.md)** refers to automatically creating copies of user identities in the applications that users need access to, for applications that have their own data store, distinct from Microsoft Entra ID or Active Directory. In addition to creating user identities, app provisioning includes the maintenance and removal of user identities from those apps, as the user's status or roles change. Common scenarios include provisioning a Microsoft Entra user into applications like [Dropbox](../saas-apps/dropboxforbusiness-provisioning-tutorial.md), [Salesforce](../saas-apps/salesforce-provisioning-tutorial.md), [ServiceNow](../saas-apps/servicenow-provisioning-tutorial.md), as each of these applications have their own user repository distinct from Microsoft Entra ID. -Microsoft Entra ID also supports provisioning users into applications hosted on-premises or in a virtual machine, without having to open up any firewalls. If your application supports [SCIM](https://aka.ms/scimoverview), or you've built a SCIM gateway to connect to your legacy application, you can use the Microsoft Entra provisioning agent to [directly connect](/azure/active-directory/app-provisioning/on-premises-scim-provisioning) with your application and automate provisioning and deprovisioning. If you have legacy applications that don't support SCIM and rely on an [LDAP](/azure/active-directory/app-provisioning/on-premises-ldap-connector-configure) user store or a [SQL](/azure/active-directory/app-provisioning/on-premises-sql-connector-configure) database, or that have a [SOAP or REST API](../app-provisioning/on-premises-web-services-connector.md), Microsoft Entra ID can support those as well. +Microsoft Entra ID also supports provisioning users into applications hosted on-premises or in a virtual machine, without having to open up any firewalls. If your application supports [SCIM](https://aka.ms/scimoverview), or you've built a SCIM gateway to connect to your legacy application, you can use the Microsoft Entra provisioning agent to [directly connect](../app-provisioning/on-premises-scim-provisioning.md) with your application and automate provisioning and deprovisioning. If you have legacy applications that don't support SCIM and rely on an [LDAP](../app-provisioning/on-premises-ldap-connector-configure.md) user store or a [SQL](../app-provisioning/on-premises-sql-connector-configure.md) database, or that have a [SOAP or REST API](../app-provisioning/on-premises-web-services-connector.md), Microsoft Entra ID can support those as well. For more information, see [What is app provisioning?](../app-provisioning/user-provisioning.md) |
active-directory | Custom Attribute Mapping | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/cloud-sync/custom-attribute-mapping.md | If you have extended Active Directory to include custom attributes, you can add To discover and map attributes, click **Add attribute mapping**. The attributes will automatically be discovered and will be available in the drop-down under **source attribute**. Fill in the type of mapping you want and click **Apply**. [![Custom attribute mapping](media/custom-attribute-mapping/schema-1.png)](media/custom-attribute-mapping/schema-1.png#lightbox) -For information on new attributes that are added and updated in Microsoft Entra ID see the [user resource type](/graph/api/resources/user?view=graph-rest-1.0#properties&preserve-view=true) and consider subscribing to [change notifications](/graph/webhooks). +For information on new attributes that are added and updated in Microsoft Entra ID see the [user resource type](/graph/api/resources/user?view=graph-rest-1.0&preserve-view=true#properties) and consider subscribing to [change notifications](/graph/webhooks). For more information on extension attributes, see [Syncing extension attributes for Microsoft Entra Application Provisioning](../../app-provisioning/user-provisioning-sync-attributes-for-mapping.md) |
active-directory | How To Install | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/cloud-sync/how-to-install.md | To use *password writeback* and enable the self-service password reset (SSPR) se Set-AADCloudSyncPasswordWritebackConfiguration -Enable $true -Credential $(Get-Credential) ``` -For more information about using password writeback with Microsoft Entra Cloud Sync, see [Tutorial: Enable cloud sync self-service password reset writeback to an on-premises environment (preview)](../../../active-directory/authentication/tutorial-enable-cloud-sync-sspr-writeback.md). +For more information about using password writeback with Microsoft Entra Cloud Sync, see [Tutorial: Enable cloud sync self-service password reset writeback to an on-premises environment (preview)](../../authentication/tutorial-enable-cloud-sync-sspr-writeback.md). ## Install an agent in the US government cloud |
active-directory | Choose Ad Authn | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/connect/choose-ad-authn.md | The following diagrams outline the high-level architecture components required f |Is there a health monitoring solution?|Not required|Agent status provided by the [[Microsoft Entra admin center](https://entra.microsoft.com)](tshoot-connect-pass-through-authentication.md)|[Microsoft Entra Connect Health](how-to-connect-health-adfs.md)| |Do users get single sign-on to cloud resources from domain-joined devices within the company network?|Yes with [Microsoft Entra joined devices](../../devices/concept-directory-join.md), [Microsoft Entra hybrid joined devices](../../devices/how-to-hybrid-join.md), the [Microsoft Enterprise SSO plug-in for Apple devices](../../develop/apple-sso-plugin.md), or [Seamless SSO](how-to-connect-sso.md)|Yes with [Microsoft Entra joined devices](../../devices/concept-directory-join.md), [Microsoft Entra hybrid joined devices](../../devices/how-to-hybrid-join.md), the [Microsoft Enterprise SSO plug-in for Apple devices](../../develop/apple-sso-plugin.md), or [Seamless SSO](how-to-connect-sso.md)|Yes| |What sign-in types are supported?|UserPrincipalName + password<br><br>Windows-Integrated Authentication by using [Seamless SSO](how-to-connect-sso.md)<br><br>[Alternate login ID](how-to-connect-install-custom.md)<br><br>[Microsoft Entra joined Devices](../../devices/concept-directory-join.md)<br><br>[Microsoft Entra hybrid joined devices](../../devices/how-to-hybrid-join.md)<br><br>[Certificate and smart card authentication](../../authentication/concept-certificate-based-authentication-smartcard.md)|UserPrincipalName + password<br><br>Windows-Integrated Authentication by using [Seamless SSO](how-to-connect-sso.md)<br><br>[Alternate login ID](how-to-connect-pta-faq.yml)<br><br>[Microsoft Entra joined Devices](../../devices/concept-directory-join.md)<br><br>[Microsoft Entra hybrid joined devices](../../devices/how-to-hybrid-join.md)<br><br>[Certificate and smart card authentication](../../authentication/concept-certificate-based-authentication-smartcard.md)|UserPrincipalName + password<br><br>sAMAccountName + password<br><br>Windows-Integrated Authentication<br><br>[Certificate and smart card authentication](/windows-server/identity/ad-fs/operations/configure-user-certificate-authentication)<br><br>[Alternate login ID](/windows-server/identity/ad-fs/operations/configuring-alternate-login-id)|-|Is Windows Hello for Business supported?|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-trust)|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-trust)<br><br>*Both require Windows Server 2016 Domain functional level*|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-kerberos-trust)<br><br>[Certificate trust model](/windows/security/identity-protection/hello-for-business/hello-key-trust-adfs)| -|What are the multifactor authentication options?|[Microsoft Entra multifactor authentication](/azure/multi-factor-authentication/)<br><br>[Custom Controls with Conditional 
Access*](../../conditional-access/controls.md)|[Microsoft Entra multifactor authentication](/azure/multi-factor-authentication/)<br><br>[Custom Controls with Conditional Access*](../../conditional-access/controls.md)|[Microsoft Entra multifactor authentication](../../authentication/index.yml)<br><br>[Third-party MFA](/windows-server/identity/ad-fs/operations/configure-additional-authentication-methods-for-ad-fs)<br><br>[Custom Controls with Conditional Access*](../../conditional-access/controls.md)| +|Is Windows Hello for Business supported?|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-trust)|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-kerberos-trust)<br><br>*Both require Windows Server 2016 Domain functional level*|[Key trust model](/windows/security/identity-protection/hello-for-business/hello-identity-verification)<br><br>[Hybrid Cloud Trust](/windows/security/identity-protection/hello-for-business/hello-hybrid-cloud-kerberos-trust)<br><br>[Certificate trust model](/windows/security/identity-protection/hello-for-business/hello-key-trust-adfs)| +|What are the multifactor authentication options?|[Microsoft Entra multifactor authentication](/azure/multi-factor-authentication/)<br><br>[Custom Controls with Conditional Access*](../../conditional-access/controls.md)|[Microsoft Entra multifactor authentication](../../authentication/index.yml)<br><br>[Custom Controls with Conditional Access*](../../conditional-access/controls.md)|[Microsoft Entra multifactor authentication](../../authentication/index.yml)<br><br>[Third-party MFA](/windows-server/identity/ad-fs/operations/configure-additional-authentication-methods-for-ad-fs)<br><br>[Custom Controls with Conditional Access*](../../conditional-access/controls.md)| |What user account states are supported?|Disabled accounts<br>(up to 30-minute delay)|Disabled accounts<br><br>Account locked out<br><br>Account expired<br><br>Password expired<br><br>Sign-in hours|Disabled accounts<br><br>Account locked out<br><br>Account expired<br><br>Password expired<br><br>Sign-in hours| |What are the Conditional Access options?|[Microsoft Entra Conditional Access, with Microsoft Entra ID P1 or P2](../../conditional-access/overview.md)|[Microsoft Entra Conditional Access, with Microsoft Entra ID P1 or P2](../../conditional-access/overview.md)|[Microsoft Entra Conditional Access, with Microsoft Entra ID P1 or P2](../../conditional-access/overview.md)<br><br>[AD FS claim rules](https://adfshelp.microsoft.com/AadTrustClaims/ClaimsGenerator)| |Is blocking legacy protocols supported?|[Yes](../../conditional-access/overview.md)|[Yes](../../conditional-access/overview.md)|[Yes](/windows-server/identity/ad-fs/operations/access-control-policies-w2k12)| |
active-directory | Four Steps | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/connect/four-steps.md | To learn more, go read [Monitor AD FS using Microsoft Entra Connect Health](./ho ### Create custom dashboards for your leadership and your day to day -Organizations that don't have a SIEM solution can use [Azure Monitor workbooks for Microsoft Entra ID](/azure/active-directory/reports-monitoring/howto-use-workbooks). The integration contains pre-built workbooks and templates to help you understand how your users adopt and use Microsoft Entra features, which allows you to gain insights into all the activities within your directory. You can also create your own workbooks and share with your leadership team to report on day-to-day activities. Workbooks are a great way to monitor your business and see all of your most important metrics at a glance. +Organizations that don't have a SIEM solution can use [Azure Monitor workbooks for Microsoft Entra ID](../../reports-monitoring/howto-use-workbooks.md). The integration contains pre-built workbooks and templates to help you understand how your users adopt and use Microsoft Entra features, which allows you to gain insights into all the activities within your directory. You can also create your own workbooks and share with your leadership team to report on day-to-day activities. Workbooks are a great way to monitor your business and see all of your most important metrics at a glance. ### Understand your support call drivers |
active-directory | How To Connect Install Prerequisites | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/connect/how-to-connect-install-prerequisites.md | To read more about securing your Active Directory environment, see [Best practic ### Harden your Microsoft Entra Connect server We recommend that you harden your Microsoft Entra Connect server to decrease the security attack surface for this critical component of your IT environment. Following these recommendations will help to mitigate some security risks to your organization. -- We recommend hardening the Microsoft Entra Connect server as a Control Plane (formerly Tier 0) asset by following the guidance provided in [Secure Privileged Access](/security/privileged-access-workstations/overview) and [Active Directory administrative tier model](/windows-server/identity/securing-privileged-access/securing-privileged-access-reference-material).+- We recommend hardening the Microsoft Entra Connect server as a Control Plane (formerly Tier 0) asset by following the guidance provided in [Secure Privileged Access](/security/privileged-access-workstations/overview) and [Active Directory administrative tier model](/security/privileged-access-workstations/privileged-access-access-model). - Restrict administrative access to the Microsoft Entra Connect server to only domain administrators or other tightly controlled security groups. - Create a [dedicated account for all personnel with privileged access](/windows-server/identity/securing-privileged-access/securing-privileged-access). Administrators shouldn't be browsing the web, checking their email, and doing day-to-day productivity tasks with highly privileged accounts. - Follow the guidance provided in [Securing privileged access](/security/privileged-access-workstations/overview). When you use Microsoft Entra Connect to deploy AD FS or the Web Application Prox * Ensure the Windows Remote Management/WS-Management (WinRM) service is running via the Services snap-in. * In an elevated PowerShell command window, use the command `Enable-PSRemoting -force`. * On the machine on which the wizard is running (if the target machine is non-domain joined or is an untrusted domain):- * In an elevated PowerShell command window, use the command `Set-Item WSMan:\localhost\Client\TrustedHosts -Value <DMZServerFQDN> -Force -Concatenate`. + * In an elevated PowerShell command window, use the command `Set-Item WSMan:\localhost\Client\TrustedHosts -Value "<DMZServerFQDN>" -Force -Concatenate`. * In the server * Add a DMZ WAP host to a machine pool. In the server manager, select **Manage** > **Add Servers**, and then use the **DNS** tab. * On the **Server Manager All Servers** tab, right-click the WAP server, and select **Manage As**. Enter local (not domain) credentials for the WAP machine. |
active-directory | How To Connect Sso | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/connect/how-to-connect-sso.md | For more information on how SSO works with Windows 10 using PRT, see: [Primary R - Microsoft 365 Win32 clients (Outlook, Word, Excel, and others) with versions 16.0.8730.xxxx and above are supported using a non-interactive flow. For OneDrive, you'll have to activate the [OneDrive silent config feature](https://techcommunity.microsoft.com/t5/Microsoft-OneDrive-Blog/Previews-for-Silent-Sync-Account-Configuration-and-Bandwidth/ba-p/120894) for a silent sign-on experience. - It can be enabled via Microsoft Entra Connect. - It's a free feature, and you don't need any paid editions of Microsoft Entra ID to use it.-- It's supported on web browser-based clients and Office clients that support [modern authentication](/office365/enterprise/modern-auth-for-office-2013-and-2016) on platforms and browsers capable of Kerberos authentication:+- It's supported on web browser-based clients and Office clients that support [modern authentication](/microsoft-365/enterprise/modern-auth-for-office-2013-and-2016) on platforms and browsers capable of Kerberos authentication: | OS\Browser |Internet Explorer|Microsoft Edge\*\*\*\*|Google Chrome|Mozilla Firefox|Safari| | | | | | | -- |
active-directory | Migrate From Federation To Cloud Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/hybrid/connect/migrate-from-federation-to-cloud-authentication.md | Install [Microsoft Entra Connect](https://www.microsoft.com/download/details.asp ### Document current federation settings -To find your current federation settings, run [Get-MgDomainFederationConfiguration](/powershell/module/microsoft.graph.identity.directorymanagement/get-mgdomainfederationconfiguration?view=graph-powershell-beta&preserve-view=true). +To find your current federation settings, run [Get-MgDomainFederationConfiguration](/powershell/module/microsoft.graph.identity.directorymanagement/get-mgdomainfederationconfiguration?view=graph-powershell-1.0&viewFallbackFrom=graph-powershell-beta&preserve-view=true). ```powershell Get-MgDomainFederationConfiguration -DomainID yourdomain.com |
active-directory | F5 Bigip Deployment Guide | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/manage-apps/f5-bigip-deployment-guide.md | Confirm you can connect to the BIG-IP VM web config and sign in with the credent To connect to the CLI: -- [Azure Bastion service](../../bastion/bastion-overview.md): Connect to VMs in a VNet, from any location+- [Azure Bastion service](/azure/bastion/bastion-overview): Connect to VMs in a VNet, from any location - SSH client, such as PowerShell with the just-in-time (JIT) approach - Serial Console: In the portal, in the VM menu, Support and troubleshooting section. It doesn't support file transfers. - From the internet: Configure the BIG-IP primary IP with a public IP. Add an NSG rule to allow SSH traffic. Restrict your trusted IP source. When the BIG-IP system is provisioned, we recommend a full configuration backup. 10. Save the user configuration set (UCS) archive locally. 11. Select **Download**. -You can create a backup of the entire system disk using [Azure snapshots](../../virtual-machines/windows/snapshot-copy-managed-disk.md). This tool provides contingency for testing between TMOS versions, or rolling back to a fresh system. +You can create a backup of the entire system disk using [Azure snapshots](/azure/virtual-machines/snapshot-copy-managed-disk). This tool provides contingency for testing between TMOS versions, or rolling back to a fresh system. ```PowerShell # Install modules |
active-directory | Manage Consent Requests | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/manage-apps/manage-consent-requests.md | -Microsoft recommends that you [restrict user consent](../../active-directory/manage-apps/configure-user-consent.md) to allow users to consent only for apps from verified publishers, and only for permissions that you select. For apps that don't meet these criteria, the decision-making process is centralized with your organization's security and identity administrator team. +Microsoft recommends that you [restrict user consent](../manage-apps/configure-user-consent.md) to allow users to consent only for apps from verified publishers, and only for permissions that you select. For apps that don't meet these criteria, the decision-making process is centralized with your organization's security and identity administrator team. After you've disabled or restricted user consent, you have several important steps to take to help keep your organization secure as you continue to allow business-critical applications to be used. These steps are crucial to minimize impact on your organization's support team and IT administrators, and to help prevent the use of unmanaged accounts in third-party applications. |
active-directory | Prevent Domain Hints With Home Realm Discovery | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/manage-apps/prevent-domain-hints-with-home-realm-discovery.md | After step 4 is complete all users, except those in `guestHandlingDomain.com`, c ## Configuring policy through Graph Explorer -Manage the [Home Realm Discovery policy](/graph/api/resources/homeRealmDiscoveryPolicy) using [Microsoft Graph](https://developer.microsoft.com/graph/graph-explorer). +Manage the [Home Realm Discovery policy](/graph/api/resources/homerealmdiscoverypolicy) using [Microsoft Graph](https://developer.microsoft.com/graph/graph-explorer). 1. Sign in to Microsoft Graph explorer with one of the roles listed in the prerequisite section. 1. Grant the `Policy.ReadWrite.ApplicationConfiguration` permission. |
active-directory | How Manage User Assigned Managed Identities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/how-manage-user-assigned-managed-identities.md | In this article, you create a user-assigned managed identity by using Azure Reso You can't list and delete a user-assigned managed identity by using a Resource Manager template. See the following articles to create and list a user-assigned managed identity: -- [List user-assigned managed identity](./how-to-manage-ua-identity-cli.md#list-user-assigned-managed-identities)-- [Delete user-assigned managed identity](./how-to-manage-ua-identity-cli.md#delete-a-user-assigned-managed-identity)+- [List user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-azcli#list-user-assigned-managed-identities) +- [Delete user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-azcli#delete-a-user-assigned-managed-identity) ## Template creation and editing Resource Manager templates help you deploy new or modified resources defined by an Azure resource group. Several options are available for template editing and deployment, both local and portal-based. You can: |
active-directory | Managed Identities Status | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/managed-identities-status.md | The following Azure services support managed identities for Azure resources: | Azure Load Testing | [Use managed identities for Azure Load Testing](/azure/load-testing/how-to-use-a-managed-identity) | | Azure Logic Apps | [Authenticate access to Azure resources using managed identities in Azure Logic Apps](/azure/logic-apps/create-managed-service-identity) | | Azure Log Analytics cluster | [Azure Monitor customer-managed key](/azure/azure-monitor/logs/customer-managed-keys)-| Azure Machine Learning Services | [Use Managed identities with Azure Machine Learning](../../machine-learning/how-to-use-managed-identities.md?tabs=python) | +| Azure Machine Learning Services | [Use Managed identities with Azure Machine Learning](/azure/machine-learning/how-to-identity-based-service-authentication?tabs=python) | | Azure Managed Disk | [Use the Azure portal to enable server-side encryption with customer-managed keys for managed disks](/azure/virtual-machines/disks-enable-customer-managed-keys-portal) | | Azure Media services | [Managed identities](/azure/media-services/latest/concept-managed-identities) | | Azure Monitor | [Azure Monitor customer-managed key](/azure/azure-monitor/logs/customer-managed-keys?tabs=portal) | | Azure Policy | [Remediate non-compliant resources with Azure Policy](/azure/governance/policy/how-to/remediate-resources) |-| Microsoft Purview | [Credentials for source authentication in Microsoft Purview](../../purview/manage-credentials.md) | +| Microsoft Purview | [Credentials for source authentication in Microsoft Purview](/purview/manage-credentials) | | Azure Resource Mover | [Move resources across regions (from resource group)](/azure/resource-mover/move-region-within-resource-group) | Azure Site Recovery | [Replicate machines with private endpoints](/azure/site-recovery/azure-to-azure-how-to-enable-replication-private-endpoints#enable-the-managed-identity-for-the-vault) | | Azure Search | [Set up an indexer connection to a data source using a managed identity](/azure/search/search-howto-managed-identities-data-sources) | The following Azure services support managed identities for Azure resources: | Azure Stack Edge | [Manage Azure Stack Edge secrets using Azure Key Vault](/azure/databox-online/azure-stack-edge-gpu-activation-key-vault#recover-managed-identity-access) | Azure Static Web Apps | [Securing authentication secrets in Azure Key Vault](/azure/static-web-apps/key-vault-secrets) | Azure Stream Analytics | [Authenticate Stream Analytics to Azure Data Lake Storage Gen1 using managed identities](/azure/stream-analytics/stream-analytics-managed-identities-adls) |-| Azure Synapse | [Azure Synapse workspace managed identity](../../synapse-analytics/security/synapse-workspace-managed-identity.md) | +| Azure Synapse | [Azure Synapse workspace managed identity](/azure/data-factory/data-factory-service-identity) | | Azure VM image builder | [Configure Azure Image Builder Service permissions using Azure CLI](/azure/virtual-machines/linux/image-builder-permissions-cli#using-managed-identity-for-azure-storage-access)| | Azure Virtual Machine Scale Sets | [Configure managed identities on virtual machine scale set - Azure CLI](qs-configure-cli-windows-vmss.md) | | Azure Virtual Machines | [Secure and use policies on virtual machines in 
Azure](/azure/virtual-machines/windows/security-policy#managed-identities-for-azure-resources) | |
active-directory | Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/overview.md | There are two types of managed identities: - You authorize the managed identity to have access to one or more services. - The name of the system-assigned service principal is always the same as the name of the Azure resource it is created for. For a deployment slot, the name of its system-assigned identity is ```<app-name>/slots/<slot-name>```. -- **User-assigned**. You may also create a managed identity as a standalone Azure resource. You can [create a user-assigned managed identity](./how-to-manage-ua-identity-portal.md) and assign it to one or more Azure Resources. When you enable a user-assigned managed identity:+- **User-assigned**. You may also create a managed identity as a standalone Azure resource. You can [create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-azp) and assign it to one or more Azure Resources. When you enable a user-assigned managed identity: - A service principal of a special type is created in Microsoft Entra ID for the identity. The service principal is managed separately from the resources that use it. - User-assigned identities can be used by multiple resources. - You authorize the managed identity to have access to one or more services. Resources that support system assigned managed identities allow you to: If you choose a user assigned managed identity instead: -- You can [create, read, update, and delete](./how-to-manage-ua-identity-portal.md) the identities.+- You can [create, read, update, and delete](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-azp) the identities. - You can use RBAC role assignments to [grant permissions](howto-assign-access-portal.md). - User assigned managed identities can be used on more than one resource. - CRUD operations are available for review in [Azure Activity logs](/azure/azure-monitor/essentials/activity-log). |
active-directory | Qs Configure Rest Vm | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/qs-configure-rest-vm.md | To assign a user-assigned identity to a VM, your account needs the [Virtual Mach az account get-access-token ``` -4. Create a user-assigned managed identity using the instructions found here: [Create a user-assigned managed identity](how-to-manage-ua-identity-rest.md#create-a-user-assigned-managed-identity). +4. Create a user-assigned managed identity using the instructions found here: [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest#create-a-user-assigned-managed-identity). 5. Create a VM using CURL to call the Azure Resource Manager REST endpoint. The following example creates a VM named *myVM* in the resource group *myResourceGroup* with a user-assigned managed identity `ID1`, as identified in the request body by the value `"identity":{"type":"UserAssigned"}`. Replace `<ACCESS TOKEN>` with the value you received in the previous step when you requested a Bearer access token and the `<SUBSCRIPTION ID>` value as appropriate for your environment. To assign a user-assigned identity to a VM, your account needs the [Virtual Mach az account get-access-token ``` -2. Create a user-assigned managed identity using the instructions found here, [Create a user-assigned managed identity](how-to-manage-ua-identity-rest.md#create-a-user-assigned-managed-identity). +2. Create a user-assigned managed identity using the instructions found here, [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest#create-a-user-assigned-managed-identity). 3. To ensure you don't delete existing user or system-assigned managed identities that are assigned to the VM, you need to list the identity types assigned to the VM by using the following CURL command. If you have managed identities assigned to the virtual machine scale set, they are listed under in the `identity` value. PATCH https://management.azure.com/subscriptions/<SUBSCRIPTION ID>/resourceGroup For information on how to create, list, or delete user-assigned managed identities using REST see: -- [Create, list, or delete a user-assigned managed identities using REST API calls](how-to-manage-ua-identity-rest.md)+- [Create, list, or delete a user-assigned managed identities using REST API calls](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest) |
active-directory | Qs Configure Rest Vmss | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/qs-configure-rest-vmss.md | In this section, you learn how to add and remove user-assigned managed identity az account get-access-token ``` -4. Create a user-assigned managed identity using the instructions found here: [Create a user-assigned managed identity](how-to-manage-ua-identity-rest.md#create-a-user-assigned-managed-identity). +4. Create a user-assigned managed identity using the instructions found here: [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest#create-a-user-assigned-managed-identity). 5. Create a virtual machine scale set using CURL to call the Azure Resource Manager REST endpoint. The following example creates a virtual machine scale set named *myVMSS* in the resource group *myResourceGroup* with a user-assigned managed identity `ID1`, as identified in the request body by the value `"identity":{"type":"UserAssigned"}`. Replace `<ACCESS TOKEN>` with the value you received in the previous step when you requested a Bearer access token and the `<SUBSCRIPTION ID>` value as appropriate for your environment. In this section, you learn how to add and remove user-assigned managed identity az account get-access-token ``` -2. Create a user-assigned managed identity using the instructions found here, [Create a user-assigned managed identity](how-to-manage-ua-identity-rest.md#create-a-user-assigned-managed-identity). +2. Create a user-assigned managed identity using the instructions found here, [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest#create-a-user-assigned-managed-identity). 3. To ensure you don't delete existing user or system-assigned managed identities that are assigned to the virtual machine scale set, you need to list the identity types assigned to the virtual machine scale set by using the following CURL command. If you have managed identities assigned to the virtual machine scale set, they are listed in the `identity` value. PATCH https://management.azure.com/subscriptions/<SUBSCRIPTION ID>/resourceGroup For information on how to create, list, or delete user-assigned managed identities using REST see: -- [Create, list, or delete a user-assigned managed identity using REST API calls](how-to-manage-ua-identity-rest.md)+- [Create, list, or delete a user-assigned managed identity using REST API calls](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-rest) |
active-directory | Qs Configure Template Windows Vm | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/qs-configure-template-windows-vm.md | The following example shows you how to remove a system-assigned managed identity In this section, you assign a user-assigned managed identity to an Azure VM using Azure Resource Manager template. > [!NOTE]-> To create a user-assigned managed identity using an Azure Resource Manager Template, see [Create a user-assigned managed identity](how-to-manage-ua-identity-arm.md#create-a-user-assigned-managed-identity). +> To create a user-assigned managed identity using an Azure Resource Manager Template, see [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-arm#create-a-user-assigned-managed-identity). ### Assign a user-assigned managed identity to an Azure VM |
active-directory | Qs Configure Template Windows Vmss | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/qs-configure-template-windows-vmss.md | If you have a virtual machine scale set that no longer needs a system-assigned m In this section, you assign a user-assigned managed identity to a virtual machine scale set using Azure Resource Manager template. > [!Note]-> To create a user-assigned managed identity using an Azure Resource Manager Template, see [Create a user-assigned managed identity](how-to-manage-ua-identity-arm.md#create-a-user-assigned-managed-identity). +> To create a user-assigned managed identity using an Azure Resource Manager Template, see [Create a user-assigned managed identity](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-arm#create-a-user-assigned-managed-identity). ### Assign a user-assigned managed identity to a virtual machine scale set |
active-directory | Tutorial Linux Vm Access Arm | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/tutorial-linux-vm-access-arm.md | In this quickstart, you learned how to use a system-assigned managed identity to > [!div class="nextstepaction"] >[Azure Resource Manager](/azure/azure-resource-manager/management/overview)->[Create, list or delete a user-assigned managed identity using Azure PowerShell](how-to-manage-ua-identity-powershell.md) +>[Create, list or delete a user-assigned managed identity using Azure PowerShell](./how-manage-user-assigned-managed-identities.md?pivots=identity-mi-methods-powershell) |
active-directory | Tutorial Linux Vm Access Storage Access Key | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/tutorial-linux-vm-access-storage-access-key.md | For detailed steps, see [Assign Azure roles using the Azure portal](/azure/role- For the remainder of the tutorial, we will work from the VM we created earlier. -To complete these steps, you will need an SSH client. If you are using Windows, you can use the SSH client in the [Windows Subsystem for Linux](/windows/wsl/install-win10). If you need assistance configuring your SSH client's keys, see [How to Use SSH keys with Windows on Azure](/azure/virtual-machines/linux/ssh-from-windows), or [How to create and use an SSH public and private key pair for Linux VMs in Azure](/azure/virtual-machines/linux/mac-create-ssh-keys). +To complete these steps, you will need an SSH client. If you are using Windows, you can use the SSH client in the [Windows Subsystem for Linux](/windows/wsl/install). If you need assistance configuring your SSH client's keys, see [How to Use SSH keys with Windows on Azure](/azure/virtual-machines/linux/ssh-from-windows), or [How to create and use an SSH public and private key pair for Linux VMs in Azure](/azure/virtual-machines/linux/mac-create-ssh-keys). 1. In the Azure portal, navigate to **Virtual Machines**, go to your Linux virtual machine, then from the **Overview** page click **Connect** at the top. Copy the string to connect to your VM. 2. Connect to your VM using your SSH client. |
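Once connected over SSH, the tutorial's pattern is to request a token from inside the VM. A minimal sketch using the Azure Instance Metadata Service endpoint; that the VM has a managed identity and that Azure Resource Manager is the target resource are assumptions matching the tutorial's scenario.

```bash
# Minimal sketch: from inside the VM (over SSH), ask the Azure Instance Metadata Service
# for an access token scoped to Azure Resource Manager.
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F"
# The JSON response includes an access_token field to present as a Bearer token in later calls.
```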
active-directory | Tutorial Windows Vm Access Storage Sas | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/managed-identities-azure-resources/tutorial-windows-vm-access-storage-sas.md | For this request we'll use the following HTTP request parameters to create the SAS } ``` -These parameters are included in the POST body of the request for the SAS credential. For more information on the parameters for creating a SAS credential, see the [List Service SAS REST reference](/rest/api/storagerp/storageaccounts/listservicesas). +These parameters are included in the POST body of the request for the SAS credential. For more information on the parameters for creating a SAS credential, see the [List Service SAS REST reference](/rest/api/storagerp/storage-accounts/list-service-sas). First, convert the parameters to JSON, then call the storage `listServiceSas` endpoint to create the SAS credential: |
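For orientation, here is a hedged curl sketch of that `listServiceSas` call. The storage account, container, expiry, and `api-version` values are placeholders, and `<ACCESS TOKEN>` is the ARM token obtained earlier in the tutorial.

```bash
# Minimal sketch: POST to listServiceSas to obtain a service SAS for a blob container.
# Account, container, expiry, and api-version are placeholders for illustration.
curl -X POST \
  "https://management.azure.com/subscriptions/<SUBSCRIPTION ID>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/<STORAGE ACCOUNT>/listServiceSas/?api-version=2017-06-01" \
  -H "Authorization: Bearer <ACCESS TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "canonicalizedResource": "/blob/<STORAGE ACCOUNT>/<CONTAINER>",
        "signedResource": "c",
        "signedPermission": "rcw",
        "signedProtocol": "https",
        "signedExpiry": "2024-12-31T00:00:00Z"
      }'
# The response contains a serviceSasToken value to append to blob URLs.
```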
active-directory | Concept Pim For Groups | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/concept-pim-for-groups.md | Role-assignable groups benefit from extra protections comparing to non-role-assi To learn more about Microsoft Entra built-in roles and their permissions, see [Microsoft Entra built-in roles](../roles/permissions-reference.md). -Microsoft Entra role-assignable group feature is not part of Microsoft Entra Privileged Identity Management (Microsoft Entra PIM). For more information on licensing, see [Microsoft Entra ID Governance licensing fundamentals](../../active-directory/governance/licensing-fundamentals.md) . +Microsoft Entra role-assignable group feature is not part of Microsoft Entra Privileged Identity Management (Microsoft Entra PIM). For more information on licensing, see [Microsoft Entra ID Governance licensing fundamentals](../governance/licensing-fundamentals.md) . ## Relationship between role-assignable groups and PIM for Groups |
active-directory | Groups Assign Member Owner | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/groups-assign-member-owner.md | -In Azure Active Directory, formerly known as Microsoft Entra ID, you can use Privileged Identity Management (PIM) to manage just-in-time membership in the group or just-in-time ownership of the group. +In Microsoft Entra ID, formerly known as Azure Active Directory, you can use Privileged Identity Management (PIM) to manage just-in-time membership in the group or just-in-time ownership of the group. When a membership or ownership is assigned, the assignment: When a membership or ownership is assigned, the assignment: - Can't be removed within five minutes of it being assigned >[!NOTE]->Every user who is eligible for membership in or ownership of a PIM for Groups must have a Microsoft Entra Premuim P2 or Microsoft Entra ID Governance license. For more information, see [License requirements to use Privileged Identity Management](subscription-requirements.md). +>Every user who is eligible for membership in or ownership of a PIM for Groups must have a Microsoft Entra Premium P2 or Microsoft Entra ID Governance license. For more information, see [License requirements to use Privileged Identity Management](../governance/licensing-fundamentals.md). ## Assign an owner or member of a group |
active-directory | Pim Apis | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-apis.md | Find more details about APIs that allow to manage assignments in the documentati - [PIM for Microsoft Entra roles API reference](/graph/api/resources/privilegedidentitymanagementv3-overview) - [PIM for Azure resource roles API reference](/rest/api/authorization/privileged-role-eligibility-rest-sample) - [PIM for Groups API reference](/graph/api/resources/privilegedidentitymanagement-for-groups-api-overview)-- [PIM Alerts for Microsoft Entra roles API reference](/graph/api/resources/privilegedidentitymanagementv3-overview?view=graph-rest-beta#building-blocks-of-the-pim-alerts-apis&preserve-view=true)+- [PIM Alerts for Microsoft Entra roles API reference](/graph/api/resources/privilegedidentitymanagementv3-overview?view=graph-rest-beta&preserve-view=true#building-blocks-of-the-pim-alerts-apis) - [PIM Alerts for Azure Resources API reference](/rest/api/authorization/role-management-alert-rest-sample) |
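As a quick orientation to the Graph surface referenced above, the sketch below lists PIM role eligibility schedule requests for Microsoft Entra roles. It assumes `<GRAPH TOKEN>` already carries the role-management read permissions the API reference calls for.

```bash
# Minimal sketch: list PIM role eligibility schedule requests for Microsoft Entra roles.
# <GRAPH TOKEN> is assumed to carry the role-management read permissions noted in the API docs.
curl -s \
  "https://graph.microsoft.com/v1.0/roleManagement/directory/roleEligibilityScheduleRequests" \
  -H "Authorization: Bearer <GRAPH TOKEN>"
```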
active-directory | Pim Configure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-configure.md | When you use B2B collaboration, you can invite an external user to your organiza ## Next steps -- [License requirements to use Privileged Identity Management](subscription-requirements.md)+- [License requirements to use Privileged Identity Management](../governance/licensing-fundamentals.md) - [Securing privileged access for hybrid and cloud deployments in Microsoft Entra ID](../roles/security-planning.md?toc=/azure/active-directory/privileged-identity-management/toc.json) - [Deploy Privileged Identity Management](pim-deployment-plan.md) |
active-directory | Pim Create Roles And Resource Roles Review | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-create-roles-and-resource-roles-review.md | The need for access to privileged Azure resource and Microsoft Entra roles by em [!INCLUDE [entra-id-license-pim.md](../../../includes/entra-id-license-pim.md)] -For more information about licenses for PIM, refer to [License requirements to use Privileged Identity Management](subscription-requirements.md). +For more information about licenses for PIM, refer to [License requirements to use Privileged Identity Management](../governance/licensing-fundamentals.md). To create access reviews for Azure resources, you must be assigned to the [Owner](/azure/role-based-access-control/built-in-roles#owner) or the [User Access Administrator](/azure/role-based-access-control/built-in-roles#user-access-administrator) role for the Azure resources. To create access reviews for Microsoft Entra roles, you must be assigned to the [Global Administrator](../roles/permissions-reference.md#global-administrator) or the [Privileged Role Administrator](../roles/permissions-reference.md#privileged-role-administrator) role. |
active-directory | Pim Getting Started | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-getting-started.md | To use Privileged Identity Management, you must have one of the following licens - [!INCLUDE [entra-id-license-pim.md](../../../includes/entra-id-license-pim.md)] -For more information, see [License requirements to use Privileged Identity Management](subscription-requirements.md). +For more information, see [License requirements to use Privileged Identity Management](../governance/licensing-fundamentals.md). > [!Note] > When a user who is active in a privileged role in a Microsoft Entra organization with a Premium P2 license goes to **Roles and administrators** in Microsoft Entra ID and selects a role (or even just visits Privileged Identity Management): |
active-directory | Pim Resource Roles Assign Roles | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-resource-roles-assign-roles.md | Follow these steps to make a user eligible for an Azure resource role. ## Assign a role using ARM API -Privileged Identity Management supports Azure Resource Manager (ARM) API commands to manage Azure resource roles, as documented in the [PIM ARM API reference](/rest/api/authorization/roleeligibilityschedulerequests). For the permissions required to use the PIM API, see [Understand the Privileged Identity Management APIs](pim-apis.md). +Privileged Identity Management supports Azure Resource Manager (ARM) API commands to manage Azure resource roles, as documented in the [PIM ARM API reference](/rest/api/authorization/role-eligibility-schedule-requests). For the permissions required to use the PIM API, see [Understand the Privileged Identity Management APIs](pim-apis.md). The following example is a sample HTTP request to create an eligible assignment for an Azure role. |
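To make the ARM call concrete, here is a minimal sketch of the PUT that creates an eligible assignment at subscription scope. The GUIDs, schedule values, and `api-version` are illustrative placeholders rather than the article's exact sample.

```bash
# Minimal sketch: create an eligible Azure role assignment at subscription scope.
# GUIDs, dates, and the api-version are illustrative placeholders, not the article's sample values.
curl -X PUT \
  "https://management.azure.com/subscriptions/<SUBSCRIPTION ID>/providers/Microsoft.Authorization/roleEligibilityScheduleRequests/<NEW GUID>?api-version=2020-10-01" \
  -H "Authorization: Bearer <ACCESS TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "properties": {
          "principalId": "<PRINCIPAL OBJECT ID>",
          "roleDefinitionId": "/subscriptions/<SUBSCRIPTION ID>/providers/Microsoft.Authorization/roleDefinitions/<ROLE DEFINITION GUID>",
          "requestType": "AdminAssign",
          "scheduleInfo": {
            "startDateTime": "2024-01-01T00:00:00Z",
            "expiration": { "type": "AfterDuration", "duration": "P365D" }
          }
        }
      }'
```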
active-directory | Pim Troubleshoot | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/privileged-identity-management/pim-troubleshoot.md | Assign the User Access Administrator role to the Privileged identity Management ## Next steps -- [License requirements to use Privileged Identity Management](subscription-requirements.md)+- [License requirements to use Privileged Identity Management](../governance/licensing-fundamentals.md) - [Securing privileged access for hybrid and cloud deployments in Microsoft Entra ID](../roles/security-planning.md?toc=/azure/active-directory/privileged-identity-management/toc.json) - [Deploy Privileged Identity Management](pim-deployment-plan.md) |
active-directory | Concept Diagnostic Settings Logs Options | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/concept-diagnostic-settings-logs-options.md | The `AuditLogs` report capture changes to applications, groups, users, and licen ### Sign-in logs -The `SignInLogs` send the interactive sign-in logs, which are logs generated by your users signing in. Sign-in logs are generated by users providing their username and password on a Microsoft Entra sign-in screen or passing an MFA challenge. For more information, see [Interactive user sign-ins](concept-all-sign-ins.md#interactive-user-sign-ins). +The `SignInLogs` send the interactive sign-in logs, which are logs generated by your users signing in. Sign-in logs are generated by users providing their username and password on a Microsoft Entra sign-in screen or passing an MFA challenge. For more information, see [Interactive user sign-ins](./concept-sign-ins.md#interactive-user-sign-ins). ### Non-interactive sign-in logs -The `NonInteractiveUserSIgnInLogs` are sign-ins done on behalf of a user, such as by a client app. The device or client uses a token or code to authenticate or access a resource on behalf of a user. For more information, see [Non-interactive user sign-ins](concept-all-sign-ins.md#non-interactive-user-sign-ins). +The `NonInteractiveUserSIgnInLogs` are sign-ins done on behalf of a user, such as by a client app. The device or client uses a token or code to authenticate or access a resource on behalf of a user. For more information, see [Non-interactive user sign-ins](./concept-sign-ins.md#non-interactive-user-sign-ins). ### Service principal sign-in logs -If you need to review sign-in activity for apps or service principals, the `ServicePrincipalSignInLogs` may be a good option. In these scenarios, certificates or client secrets are used for authentication. For more information, see [Service principal sign-ins](concept-all-sign-ins.md#service-principal-sign-ins). +If you need to review sign-in activity for apps or service principals, the `ServicePrincipalSignInLogs` may be a good option. In these scenarios, certificates or client secrets are used for authentication. For more information, see [Service principal sign-ins](./concept-sign-ins.md#service-principal-sign-ins). ### Managed identity sign-in logs -The `ManagedIdentitySignInLogs` provide similar insights as the service principal sign-in logs, but for managed identities, where Azure manages the secrets. For more information, see [Managed identity sign-ins](concept-all-sign-ins.md#managed-identity-for-azure-resources-sign-ins). +The `ManagedIdentitySignInLogs` provide similar insights as the service principal sign-in logs, but for managed identities, where Azure manages the secrets. For more information, see [Managed identity sign-ins](./concept-sign-ins.md#managed-identity-sign-ins). ### Provisioning logs The `NetworkAccessTrafficLogs` logs are associated with Microsoft Entra Internet ## Next steps -- [Learn about the sign-in logs](concept-all-sign-ins.md)+- [Learn about the sign-in logs](./concept-sign-ins.md) - [Explore how to access the activity logs](howto-access-activity-logs.md) |
active-directory | Concept Log Monitoring Integration Options Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/concept-log-monitoring-integration-options-considerations.md | With the data sample captured, multiply accordingly to find out how large the fi To get an idea of how much a log integration could cost for your organization, you can enable an integration for a day or two. Use this option if your budget allows for the temporary increase. -To enable a log integration, follow the steps in the [Integrate activity logs with Azure Monitor logs](howto-integrate-activity-logs-with-log-analytics.md) article. If possible, create a new resource group for the logs and endpoint you want to try out. Having a devoted resource group makes it easy to view the cost analysis and then delete it when you're done. +To enable a log integration, follow the steps in the [Integrate activity logs with Azure Monitor logs](./howto-integrate-activity-logs-with-azure-monitor-logs.md) article. If possible, create a new resource group for the logs and endpoint you want to try out. Having a devoted resource group makes it easy to view the cost analysis and then delete it when you're done. With the integration enabled, navigate to **Azure portal** > **Cost Management** > **Cost analysis**. There are several ways to analyze costs. This [Cost Management quickstart](/azure/cost-management-billing/costs/quick-acm-cost-analysis) should help you get started. The figures in the following screenshot are used for example purposes and are not intended to reflect actual amounts. Once you have an estimate for the GB/day that will be sent to an endpoint, enter ## Next steps * [Create a storage account](/azure/storage/common/storage-account-create)-* [Archive activity logs to a storage account](quickstart-azure-monitor-route-logs-to-storage-account.md) -* [Route activity logs to an event hub](./tutorial-azure-monitor-stream-logs-to-event-hub.md) -* [Integrate activity logs with Azure Monitor](howto-integrate-activity-logs-with-log-analytics.md) +* [Archive activity logs to a storage account](./howto-archive-logs-to-storage-account.md) +* [Route activity logs to an event hub](./howto-stream-logs-to-event-hub.md) +* [Integrate activity logs with Azure Monitor](./howto-integrate-activity-logs-with-azure-monitor-logs.md) |
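The extrapolation the article describes is simple arithmetic; the sketch below shows one way to run it, using invented sample numbers (45 MB over 3 hours, an example per-GB price) purely for illustration.

```bash
# Minimal sketch: extrapolate a short sample to GB/day, GB/month, and a rough monthly cost.
# 45 MB over 3 hours and $2.76 per GB are invented example values, not real measurements.
awk 'BEGIN {
  sample_mb = 45; sample_hours = 3; price_per_gb = 2.76;
  gb_day = sample_mb / sample_hours * 24 / 1024;
  printf "~%.2f GB/day, ~%.1f GB/month, ~$%.2f/month\n", gb_day, gb_day * 30, gb_day * 30 * price_per_gb;
}'
```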
active-directory | Concept Sign Ins | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/concept-sign-ins.md | -Reviewing sign-in errors and patterns provides valuable insight into how your users access applications and services. The sign-in logs provided by Microsoft Entra ID are a powerful type of [activity log](overview-reports.md) that you can analyze. This article explains how to access and utilize the sign-in logs. +Reviewing sign-in errors and patterns provides valuable insight into how your users access applications and services. The sign-in logs provided by Microsoft Entra ID are a powerful type of [activity log](./overview-monitoring-health.md) that you can analyze. This article explains how to access and utilize the sign-in logs. The preview view of the sign-in logs includes interactive and non-interactive user sign-ins as well as service principal and managed identity sign-ins. You can still view the classic sign-in logs, which only include interactive sign-ins. There are several reports available in **Usage & insights**. Some of these repor ### Microsoft 365 activity logs -You can view Microsoft 365 activity logs from the [Microsoft 365 admin center](/office365/admin/admin-overview/about-the-admin-center). Microsoft 365 activity and Microsoft Entra activity logs share a significant number of directory resources. Only the Microsoft 365 admin center provides a full view of the Microsoft 365 activity logs. +You can view Microsoft 365 activity logs from the [Microsoft 365 admin center](/microsoft-365/admin/admin-overview/admin-center-overview). Microsoft 365 activity and Microsoft Entra activity logs share a significant number of directory resources. Only the Microsoft 365 admin center provides a full view of the Microsoft 365 activity logs. You can access the Microsoft 365 activity logs programmatically by using the [Office 365 Management APIs](/office/office-365-management-api/office-365-management-apis-overview). ## Next steps -- [Basic info in the Microsoft Entra sign-in logs](reference-basic-info-sign-in-logs.md)+- [Basic info in the Microsoft Entra sign-in logs](./concept-sign-in-log-activity-details.md) - [How to download logs in Microsoft Entra ID](howto-download-logs.md) |
active-directory | Howto Access Activity Logs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/howto-access-activity-logs.md | The SIEM tools you can integrate with your event hub can provide analysis and mo 1. Browse to **Identity** > **Monitoring & health** > **Diagnostic settings**. 1. Choose the logs you want to stream, select the **Stream to an event hub** option, and complete the fields. - [Set up an Event Hubs namespace and an event hub](/azure/event-hubs/event-hubs-create)- - [Learn more about streaming activity logs to an event hub](tutorial-azure-monitor-stream-logs-to-event-hub.md) + - [Learn more about streaming activity logs to an event hub](./howto-stream-logs-to-event-hub.md) Your independent security vendor should provide you with instructions on how to ingest data from Azure Event Hubs into their tool. Integrating Microsoft Entra logs with Azure Monitor logs provides a centralized 1. Browse to **Identity** > **Monitoring & health** > **Diagnostic settings**. 1. Choose the logs you want to stream, select the **Send to Log Analytics workspace** option, and complete the fields. 1. Browse to **Identity** > **Monitoring & health** > **Log Analytics** and begin querying the data.- - [Integrate Microsoft Entra logs with Azure Monitor logs](howto-integrate-activity-logs-with-log-analytics.md) + - [Integrate Microsoft Entra logs with Azure Monitor logs](./howto-integrate-activity-logs-with-azure-monitor-logs.md) - [Learn how to query using Log Analytics](howto-analyze-activity-logs-log-analytics.md) ## Monitor events with Microsoft Sentinel Use the following basic steps to access the reports in the Microsoft Entra admin 1. Browse to **Identity** > **Monitoring & health** > **Audit logs**/**Sign-in logs**/**Provisioning logs**. 1. Adjust the filter according to your needs.- - [Learn how to filter activity logs](quickstart-filter-audit-log.md) + - [Learn how to filter activity logs](./howto-customize-filter-logs.md) - [Explore the Microsoft Entra audit log categories and activities](reference-audit-activities.md) - - [Learn about basic info in the Microsoft Entra sign-in logs](reference-basic-info-sign-in-logs.md) + - [Learn about basic info in the Microsoft Entra sign-in logs](./concept-sign-in-log-activity-details.md) <a name='azure-ad-identity-protection-reports'></a> The right solution for your long-term storage depends on your budget and what yo - Download logs for manual storage - Integrate logs with Azure Monitor logs -[Azure Storage](/azure/storage/common/storage-introduction) is the right solution if you aren't planning on querying your data often. For more information, see [Archive directory logs to a storage account](quickstart-azure-monitor-route-logs-to-storage-account.md). +[Azure Storage](/azure/storage/common/storage-introduction) is the right solution if you aren't planning on querying your data often. For more information, see [Archive directory logs to a storage account](./howto-archive-logs-to-storage-account.md). -If you plan to query the logs often to run reports or perform analysis on the stored logs, you should [integrate your data with Azure Monitor logs](howto-integrate-activity-logs-with-log-analytics.md). +If you plan to query the logs often to run reports or perform analysis on the stored logs, you should [integrate your data with Azure Monitor logs](./howto-integrate-activity-logs-with-azure-monitor-logs.md). 
If your budget is tight, and you need a cheap method to create a long-term backup of your activity logs, you can [manually download your logs](howto-download-logs.md). The user interface of the activity logs in the portal provides you with an option to download the data as **JSON** or **CSV**. One trade off of the manual download is that it requires more manual interaction. If you're looking for a more professional solution, use either Azure Storage or Azure Monitor. Use the following basic steps to archive or download your activity logs. ## Next steps -- [Stream logs to an event hub](tutorial-azure-monitor-stream-logs-to-event-hub.md)-- [Archive logs to a storage account](quickstart-azure-monitor-route-logs-to-storage-account.md)-- [Integrate logs with Azure Monitor logs](howto-integrate-activity-logs-with-log-analytics.md)+- [Stream logs to an event hub](./howto-stream-logs-to-event-hub.md) +- [Archive logs to a storage account](./howto-archive-logs-to-storage-account.md) +- [Integrate logs with Azure Monitor logs](./howto-integrate-activity-logs-with-azure-monitor-logs.md) |
active-directory | Howto Analyze Activity Logs Log Analytics | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/howto-analyze-activity-logs-log-analytics.md | -After you [integrate Microsoft Entra activity logs with Azure Monitor logs](howto-integrate-activity-logs-with-log-analytics.md), you can use the power of Log Analytics and Azure Monitor logs to gain insights into your environment. +After you [integrate Microsoft Entra activity logs with Azure Monitor logs](./howto-integrate-activity-logs-with-azure-monitor-logs.md), you can use the power of Log Analytics and Azure Monitor logs to gain insights into your environment. * Compare your Microsoft Entra sign-in logs against security logs published by Microsoft Defender for Cloud. For more information on Microsoft Entra built-in roles, see [Microsoft Entra bui ## Access Log Analytics -To view the Microsoft Entra ID Log Analytics, you must already be sending your activity logs from Microsoft Entra ID to a Log Analytics workspace. This process is covered in the [How to integrate activity logs with Azure Monitor](howto-integrate-activity-logs-with-log-analytics.md) article. +To view the Microsoft Entra ID Log Analytics, you must already be sending your activity logs from Microsoft Entra ID to a Log Analytics workspace. This process is covered in the [How to integrate activity logs with Azure Monitor](./howto-integrate-activity-logs-with-azure-monitor-logs.md) article. [!INCLUDE [portal updates](~/articles/active-directory/includes/portal-update.md)] |
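Once the logs are flowing into the workspace, the same queries can be run outside the portal. A minimal sketch with the Azure CLI (requires the `log-analytics` extension; the workspace GUID is a placeholder):

```bash
# Minimal sketch: summarize yesterday's sign-ins by result type straight from the workspace.
# Requires the Azure CLI "log-analytics" extension; the workspace GUID is a placeholder.
az monitor log-analytics query \
  --workspace "<WORKSPACE GUID>" \
  --analytics-query "SigninLogs | where TimeGenerated > ago(1d) | summarize count() by ResultType" \
  --output table
```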
active-directory | Howto Download Logs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/howto-download-logs.md | The following screenshot shows the download window from the audit and sign-in lo The following screenshot shows menu options for the provisioning log download process. ![Screenshot of the provisioning log download button options.](./media/howto-download-logs/provisioning-logs-download.png) -If your tenant has enabled the [sign-in logs preview](concept-all-sign-ins.md), more options are available after selecting **Download**. The sign-in logs preview include interactive and non-interactive user sign-ins, service principal sign-ins, and managed identity sign-ins. +If your tenant has enabled the [sign-in logs preview](./concept-sign-ins.md), more options are available after selecting **Download**. The sign-in logs preview include interactive and non-interactive user sign-ins, service principal sign-ins, and managed identity sign-ins. ![Screenshot of the download options for the sign-in logs preview.](media/howto-download-logs/sign-in-preview-download-options.png) ## Next steps -- [Integrate Microsoft Entra logs with Azure Monitor](howto-integrate-activity-logs-with-log-analytics.md)+- [Integrate Microsoft Entra logs with Azure Monitor](./howto-integrate-activity-logs-with-azure-monitor-logs.md) - [Access Microsoft Entra logs using the Graph API](quickstart-access-log-with-graph-api.md) |
active-directory | Howto Manage Inactive User Accounts | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/howto-manage-inactive-user-accounts.md | The following details relate to the `lastSignInDateTime` property. - Each interactive sign-in attempt results in an update of the underlying data store. Typically, sign-ins show up in the related sign-in report within 6 hours. -- To generate a `lastSignInDateTime` timestamp, you must attempt a sign-in. Either a failed or successful sign-in attempt, as long as it's recorded in the [Microsoft Entra sign-in logs](concept-all-sign-ins.md), generates a `lastSignInDateTime` timestamp. The value of the `lastSignInDateTime` property may be blank if:+- To generate a `lastSignInDateTime` timestamp, you must attempt a sign-in. Either a failed or successful sign-in attempt, as long as it's recorded in the [Microsoft Entra sign-in logs](./concept-sign-ins.md), generates a `lastSignInDateTime` timestamp. The value of the `lastSignInDateTime` property may be blank if: - The last attempted sign-in of a user took place before April 2020. - The affected user account was never used for a sign-in attempt. |
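The `lastSignInDateTime` value discussed above is exposed through the `signInActivity` object in Microsoft Graph. A hedged sketch, assuming `<GRAPH TOKEN>` has the audit-log read permission the article requires:

```bash
# Minimal sketch: read signInActivity (which carries lastSignInDateTime) for users.
# <GRAPH TOKEN> is assumed to have AuditLog.Read.All or an equivalent permission.
curl -s \
  "https://graph.microsoft.com/v1.0/users?\$select=displayName,userPrincipalName,signInActivity" \
  -H "Authorization: Bearer <GRAPH TOKEN>"
```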
active-directory | Howto Use Recommendations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/howto-use-recommendations.md | GET /directory/recommendations/{recommendationId}/impactedResources ## Next steps - [Review the Microsoft Entra recommendations overview](overview-recommendations.md)-- [Learn about Service Health notifications](overview-service-health-notifications.md)+- [Learn about Service Health notifications](/azure/service-health/service-health-portal-update) |
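The `impactedResources` call shown above can be exercised with any HTTP client. A minimal sketch against the beta endpoint implied by that path; the `<RECOMMENDATION ID>` value and the required Graph permission are assumptions.

```bash
# Minimal sketch: list recommendations, then pull the impacted resources for one of them.
# The beta path mirrors the GET shown above; the ID value is a placeholder.
curl -s "https://graph.microsoft.com/beta/directory/recommendations" \
  -H "Authorization: Bearer <GRAPH TOKEN>"

curl -s "https://graph.microsoft.com/beta/directory/recommendations/<RECOMMENDATION ID>/impactedResources" \
  -H "Authorization: Bearer <GRAPH TOKEN>"
```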
active-directory | Overview Monitoring Health | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/overview-monitoring-health.md | Activity logs help you understand the behavior of users in your organization. Th - [**Audit logs**](concept-audit-logs.md) include the history of every task performed in your tenant. -- [**Sign-in logs**](concept-all-sign-ins.md) capture the sign-in attempts of your users and client applications.+- [**Sign-in logs**](./concept-sign-ins.md) capture the sign-in attempts of your users and client applications. - [**Provisioning logs**](concept-provisioning-logs.md) provide information around users provisioned in your tenant through a third party service. Reviewing Microsoft Entra activity logs is the first step in maintaining and imp Monitoring Microsoft Entra activity logs requires routing the log data to a monitoring and analysis solution. Endpoints include Azure Monitor logs, Microsoft Sentinel, or a third-party Security Information and Event Management (SIEM) tool. - [Stream logs to an event hub to integrate with third-party SIEM tools.](howto-stream-logs-to-event-hub.md)-- [Integrate logs with Azure Monitor logs.](howto-integrate-activity-logs-with-log-analytics.md)+- [Integrate logs with Azure Monitor logs.](./howto-integrate-activity-logs-with-azure-monitor-logs.md) - [Analyze logs with Azure Monitor logs and Log Analytics.](howto-analyze-activity-logs-log-analytics.md) |
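Routing the activity logs to these endpoints is done with a tenant-level diagnostic setting. The sketch below sends audit and sign-in logs to a Log Analytics workspace via the ARM provider for Microsoft Entra diagnostic settings; the provider path, `api-version`, and category names are assumptions to verify against the linked how-to articles.

```bash
# Minimal sketch: create a tenant-level diagnostic setting that routes audit and sign-in logs
# to a Log Analytics workspace. Provider path, api-version, and categories are assumptions
# to confirm against the linked how-to articles.
curl -X PUT \
  "https://management.azure.com/providers/microsoft.aadiam/diagnosticSettings/SendToWorkspace?api-version=2017-04-01" \
  -H "Authorization: Bearer <ACCESS TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "properties": {
          "workspaceId": "/subscriptions/<SUBSCRIPTION ID>/resourceGroups/myResourceGroup/providers/Microsoft.OperationalInsights/workspaces/myWorkspace",
          "logs": [
            { "category": "AuditLogs",  "enabled": true },
            { "category": "SignInLogs", "enabled": true }
          ]
        }
      }'
```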
active-directory | Overview Workbooks | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/overview-workbooks.md | Public workbook templates are built, updated, and deprecated to reflect the need ## Next steps -- Learn [how to use Azure Workbooks for Microsoft Entra ID](howto-use-azure-monitor-workbooks.md)+- Learn [how to use Azure Workbooks for Microsoft Entra ID](./howto-use-workbooks.md) - [Create your own workbook](/azure/azure-monitor/visualize/workbooks-create-workbook) - Create a [Log Analytics workspace](/azure/azure-monitor/logs/quick-create-workspace) |
active-directory | Reference Powershell Reporting | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/reference-powershell-reporting.md | The following image shows an example for this command. ## Next steps -- [Microsoft Entra reports overview](overview-reports.md).+- [Microsoft Entra reports overview](./overview-monitoring-health.md). - [Audit logs report](concept-audit-logs.md). - [Programmatic access to Microsoft Entra reports](./howto-configure-prerequisites-for-reporting-api.md) |
active-directory | Reference Reports Data Retention | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/reference-reports-data-retention.md | In this article, you learn about the data retention policies for the different a | Microsoft Entra Edition | Collection Start | | :-- | :-- | | Microsoft Entra ID P1 <br /> Microsoft Entra ID P2 <br /> Microsoft Entra Workload ID Premium | When you sign up for a subscription |-| Microsoft Entra ID Free| The first time you open [Microsoft Entra ID](https://portal.azure.com/#blade/Microsoft_AAD_IAM/ActiveDirectoryMenuBlade/Overview) or use the [reporting APIs](./overview-reports.md) | +| Microsoft Entra ID Free| The first time you open [Microsoft Entra ID](https://portal.azure.com/#blade/Microsoft_AAD_IAM/ActiveDirectoryMenuBlade/Overview) or use the [reporting APIs](./overview-monitoring-health.md) | If you already have activities data with your free license, then you can see it immediately on upgrade. If you don't have any data, then it will take up to three days for the data to show up in the reports after you upgrade to a premium license. For security signals, the collection process starts when you opt-in to use the **Identity Protection Center**. If you already have activities data with your free license, then you can see it | Sign-ins | Seven days | 30 days | 30 days | | Microsoft Entra multifactor authentication usage | 30 days | 30 days | 30 days | -You can retain the audit and sign-in activity data for longer than the default retention period outlined in the previous table by routing it to an Azure storage account using Azure Monitor. For more information, see [Archive Microsoft Entra logs to an Azure storage account](quickstart-azure-monitor-route-logs-to-storage-account.md). +You can retain the audit and sign-in activity data for longer than the default retention period outlined in the previous table by routing it to an Azure storage account using Azure Monitor. For more information, see [Archive Microsoft Entra logs to an Azure storage account](./howto-archive-logs-to-storage-account.md). **Security signals** You can retain the audit and sign-in activity data for longer than the default r ## Next steps -- [Stream logs to an event hub](tutorial-azure-monitor-stream-logs-to-event-hub.md)+- [Stream logs to an event hub](./howto-stream-logs-to-event-hub.md) - [Learn how to download Microsoft Entra logs](howto-download-logs.md) |
active-directory | Workbook Legacy Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/workbook-legacy-authentication.md | This workbook supports multiple filters: - To learn more about identity protection, see [What is identity protection](../identity-protection/overview-identity-protection.md). -- For more information about Microsoft Entra workbooks, see [How to use Microsoft Entra workbooks](howto-use-azure-monitor-workbooks.md).+- For more information about Microsoft Entra workbooks, see [How to use Microsoft Entra workbooks](./howto-use-workbooks.md). |
active-directory | Workbook Risk Analysis | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/reports-monitoring/workbook-risk-analysis.md | Risky Users: - To learn more about identity protection, see [What is identity protection](../identity-protection/overview-identity-protection.md). -- For more information about Microsoft Entra workbooks, see [How to use Microsoft Entra workbooks](howto-use-azure-monitor-workbooks.md).+- For more information about Microsoft Entra workbooks, see [How to use Microsoft Entra workbooks](./howto-use-workbooks.md). |
active-directory | Custom Create | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/roles/custom-create.md | Like built-in roles, custom roles are assigned by default at the default organiz - Feel free to share with us on the [Microsoft Entra administrative roles forum](https://feedback.azure.com/d365community/forum/22920db1-ad25-ec11-b6e6-000d3a4f0789). - For more about role permissions, see [Microsoft Entra built-in roles](permissions-reference.md).-- For default user permissions, see a [comparison of default guest and member user permissions](../fundamentals/users-default-permissions.md?context=azure%2factive-directory%2froles%2fcontext%2fugr-context).+- For default user permissions, see a [comparison of default guest and member user permissions](../fundamentals/users-default-permissions.md?context=azure/active-directory/roles/context/ugr-context). |
active-directory | M365 Workload Docs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/roles/m365-workload-docs.md | All products in Microsoft 365 can be managed with administrative roles in Micros > [!div class="mx-tableFixed"] > | Microsoft 365 service | Role content | API content | > | - | | -- |-> | Admin roles in Office 365 and Microsoft 365 business plans | [Microsoft 365 admin roles](/office365/admin/add-users/about-admin-roles) | Not available | +> | Admin roles in Office 365 and Microsoft 365 business plans | [Microsoft 365 admin roles](/microsoft-365/admin/add-users/about-admin-roles) | Not available | > | Microsoft Entra ID and Microsoft Entra ID Protection| [Microsoft Entra built-in roles](permissions-reference.md) | [Graph API](/graph/api/overview)<br>[Fetch role assignments](/graph/api/directoryrole-list) |-> | Exchange Online| [Exchange role-based access control](/exchange/understanding-role-based-access-control-exchange-2013-help) | [PowerShell for Exchange](/powershell/module/exchange/role-based-access-control/add-managementroleentry)<br>[Fetch role assignments](/powershell/module/exchange/role-based-access-control/get-rolegroup) | +> | Exchange Online| [Exchange role-based access control](/exchange/understanding-role-based-access-control-exchange-2013-help) | [PowerShell for Exchange](/powershell/module/exchange/add-managementroleentry)<br>[Fetch role assignments](/powershell/module/exchange/get-rolegroup) | > | SharePoint Online | [Microsoft Entra built-in roles](permissions-reference.md)<br>Also [About the SharePoint admin role in Microsoft 365](/sharepoint/sharepoint-admin-role) | [Graph API](/graph/api/overview)<br>[Fetch role assignments](/graph/api/directoryrole-list) | > | Teams/Skype for Business | [Microsoft Entra built-in roles](permissions-reference.md) | [Graph API](/graph/api/overview)<br>[Fetch role assignments](/graph/api/directoryrole-list) | > | Security & Compliance Center (Office 365 Advanced Threat Protection, Exchange Online Protection, Information Protection) | [Office 365 admin roles](/microsoft-365/security/office-365-security/scc-permissions) | [Exchange PowerShell](/powershell/module/exchange/add-managementroleentry)<br>[Fetch role assignments](/powershell/module/exchange/get-rolegroup) | |
active-directory | Security Planning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/roles/security-planning.md | Securing privileged access requires changes to: * Processes, administrative practices, and knowledge management * Technical components such as host defenses, account protections, and identity management -Secure your privileged access in a way that is managed and reported in the Microsoft services you care about. If you have on-premises administrator accounts, see the guidance for on-premises and hybrid privileged access in Active Directory at [Securing Privileged Access](/windows-server/identity/securing-privileged-access/securing-privileged-access). +Secure your privileged access in a way that is managed and reported in the Microsoft services you care about. If you have on-premises administrator accounts, see the guidance for on-premises and hybrid privileged access in Active Directory at [Securing Privileged Access](/security/privileged-access-workstations/overview). > [!NOTE] > The guidance in this article refers primarily to features of Microsoft Entra ID that are included in Microsoft Entra ID P1 and P2. Microsoft Entra ID P2 is included in the EMS E5 suite and Microsoft 365 E5 suite. This guidance assumes your organization already has Microsoft Entra ID P2 licenses purchased for your users. If you do not have these licenses, some of the guidance might not apply to your organization. Also, throughout this article, the term Global Administrator means the same thing as "company administrator" or "tenant administrator." The increase in "bring your own device" and work from home policies and the grow * Identify the users who have administrative roles and the services where they can manage. * Use Microsoft Entra PIM to find out which users in your organization have administrator access to Microsoft Entra ID.-* Beyond the roles defined in Microsoft Entra ID, Microsoft 365 comes with a set of administrator roles that you can assign to users in your organization. Each administrator role maps to common business functions, and gives people in your organization permissions to do specific tasks in the [Microsoft 365 admin center](https://admin.microsoft.com). Use the Microsoft 365 admin center to find out which users in your organization have administrator access to Microsoft 365, including via roles not managed in Microsoft Entra ID. For more information, see [About Microsoft 365 administrator roles](https://support.office.com/article/About-Office-365-admin-roles-da585eea-f576-4f55-a1e0-87090b6aaa9d) and [Security practices for Office 365](/office365/servicedescriptions/office-365-platform-service-description/office-365-securitycompliance-center). +* Beyond the roles defined in Microsoft Entra ID, Microsoft 365 comes with a set of administrator roles that you can assign to users in your organization. Each administrator role maps to common business functions, and gives people in your organization permissions to do specific tasks in the [Microsoft 365 admin center](https://admin.microsoft.com). Use the Microsoft 365 admin center to find out which users in your organization have administrator access to Microsoft 365, including via roles not managed in Microsoft Entra ID. 
For more information, see [About Microsoft 365 administrator roles](https://support.office.com/article/About-Office-365-admin-roles-da585eea-f576-4f55-a1e0-87090b6aaa9d) and [Security practices for Office 365](/office365/servicedescriptions/microsoft-365-service-descriptions/microsoft-365-tenantlevel-services-licensing-guidance/microsoft-365-security-compliance-licensing-guidance). * Do the inventory in services your organization relies on, such as Azure, Intune, or Dynamics 365. * Ensure that your accounts that are used for administration purposes: |
active-directory | Amazon Web Service Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/saas-apps/amazon-web-service-tutorial.md | Follow these steps to enable Microsoft Entra SSO. | | | | | RoleSessionName | user.userprincipalname | `https://aws.amazon.com/SAML/Attributes` | | Role | user.assignedroles | `https://aws.amazon.com/SAML/Attributes` |- | SessionDuration | "provide a value between 900 seconds (15 minutes) to 43200 seconds (12 hours)" | `https://aws.amazon.com/SAML/Attributes` | + | SessionDuration | user.sessionduration | `https://aws.amazon.com/SAML/Attributes` | > [!NOTE] > AWS expects roles for users assigned to the application. Please set up these roles in Microsoft Entra ID so that users can be assigned the appropriate roles. To understand how to configure roles in Microsoft Entra ID, see [here](../develop/howto-add-app-roles-in-azure-ad-apps.md#app-roles-ui) |
active-directory | Configure Cmmc Level 2 Access Control | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/configure-cmmc-level-2-access-control.md | The following table provides a list of practice statement and objectives, and Mi | AC.L2-3.1.9<br><br>**Practice statement:** Provide privacy and security notices consistent with applicable CUI rules.<br><br>**Objectives:**<br>Determine if:<br>[a.] privacy and security notices required by CUI-specified rules are identified, consistent, and associated with the specific CUI category; and<br>[b.] privacy and security notices are displayed. | With Microsoft Entra ID, you can deliver notification or banner messages for all apps that require and record acknowledgment before granting access. You can granularly target these terms of use policies to specific users (Member or Guest). You can also customize them per application via Conditional Access policies.<br><br>**Conditional Access** <br>[What is Conditional Access in Microsoft Entra ID?](../conditional-access/overview.md)<br><br>**Terms of use**<br>[Microsoft Entra terms of use](../conditional-access/terms-of-use.md)<br>[View report of who has accepted and declined](../conditional-access/terms-of-use.md) | | AC.L2-3.1.10<br><br>**Practice statement:** Use session lock with pattern-hiding displays to prevent access and viewing of data after a period of inactivity.<br><br>**Objectives:**<br>Determine if:<br>[a.] the period of inactivity after which the system initiates a session lock is defined;<br>[b.] access to the system and viewing of data is prevented by initiating a session lock after the defined period of inactivity; and<br>[c.] previously visible information is concealed via a pattern-hiding display after the defined period of inactivity. | Implement device lock by using a Conditional Access policy to restrict access to compliant or Microsoft Entra hybrid joined devices. Configure policy settings on the device to enforce device lock at the OS level with MDM solutions such as Intune. Microsoft Intune, Configuration Manager, or group policy objects can also be considered in hybrid deployments. For unmanaged devices, configure the Sign-In Frequency setting to force users to reauthenticate.<br>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br>[Grant controls in Conditional Access policy - Require Microsoft Entra hybrid joined device](../conditional-access/concept-conditional-access-grant.md)<br>[User sign-in frequency](../conditional-access/howto-conditional-access-session-lifetime.md)<br><br>Configure devices for maximum minutes of inactivity until the screen locks ([Android](/mem/intune/configuration/device-restrictions-android), [iOS](/mem/intune/configuration/device-restrictions-ios), [Windows 10](/mem/intune/configuration/device-restrictions-windows-10)).| | AC.L2-3.1.11<br><br>**Practice statement:** Terminate (automatically) a user session after a defined condition.<br><br>**Objectives:**<br>Determine if:<br>[a.] conditions requiring a user session to terminate are defined; and<br>[b.] a user session is automatically terminated after any of the defined conditions occur. | Enable Continuous Access Evaluation (CAE) for all supported applications. For application that don't support CAE, or for conditions not applicable to CAE, implement policies in Microsoft Defender for Cloud Apps to automatically terminate sessions when conditions occur. 
Additionally, configure Microsoft Entra ID Protection to evaluate user and sign-in Risk. Use Conditional Access with Identity protection to allow user to automatically remediate risk.<br>[Continuous access evaluation in Microsoft Entra ID](../conditional-access/concept-continuous-access-evaluation.md)<br>[Control cloud app usage by creating policies](/defender-cloud-apps/control-cloud-apps-with-policies)<br>[What is Microsoft Entra ID Protection?](../identity-protection/overview-identity-protection.md)-|AC.L2-3.1.12<br><br>**Practice statement:** Monitor and control remote access sessions.<br><br>**Objectives:**<br>Determine if:<br>[a.] remote access sessions are permitted;<br>[b.] the types of permitted remote access are identified;<br>[c.] remote access sessions are controlled; and<br>[d.] remote access sessions are monitored. | In today's world, users access cloud-based applications almost exclusively remotely from unknown or untrusted networks. It's critical to securing this pattern of access to adopt zero trust principals. To meet these controls requirements in a modern cloud world we must verify each access request explicitly, implement least privilege and assume breach.<br><br>Configure named locations to delineate internal vs external networks. Configure Conditional Access app control to route access via Microsoft Defender for Cloud Apps.
Configure Defender for Cloud Apps to control and monitor all sessions.<br>[Zero Trust Deployment Guide for Microsoft Entra ID](https://www.microsoft.com/security/blog/2020/04/30/zero-trust-deployment-guide-azure-active-directory/)<br>[Location condition in Microsoft Entra Conditional Access](../conditional-access/location-condition.md)<br>[Deploy Cloud App Security Conditional Access App Control for Microsoft Entra apps](/defender-cloud-apps/proxy-deployment-aad)<br>[What is Microsoft Defender for Cloud Apps?](/defender-cloud-apps/what-is-defender-for-cloud-apps)<br>[Monitor alerts raised in Microsoft Defender for Cloud Apps](/microsoft-365/security/defender/investigate-alerts) | | AC.L2-3.1.13<br><br>**Practice statement:** Employ cryptographic mechanisms to protect the confidentiality of remote access sessions.<br><br>**Objectives:**<br>Determine if:<br>[a.] cryptographic mechanisms to protect the confidentiality of remote access sessions are identified; and<br>[b.] cryptographic mechanisms to protect the confidentiality of remote access sessions are implemented. | All Microsoft Entra customer-facing web services are secured with the Transport Layer Security (TLS) protocol and are implemented using FIPS-validated cryptography.<br>[Microsoft Entra Data Security Considerations (microsoft.com)](https://azure.microsoft.com/resources/azure-active-directory-data-security-considerations/) |-| AC.L2-3.1.14<br><br>**Practice statement:** Route remote access via managed access control points.<br><br>**Objectives:**<br>Determine if:<br>[a.] managed access control points are identified and implemented; and<br>[b.] remote access is routed through managed network access control points. | Configure named locations to delineate internal vs external networks. Configure Conditional Access app control to route access via Microsoft Defender for Cloud Apps. Configure Defender for Cloud Apps to control and monitor all sessions. Secure devices used by privileged accounts as part of the privileged access story.<br>[Location condition in Microsoft Entra Conditional Access](../conditional-access/location-condition.md)<br>[Session controls in Conditional Access policy](../conditional-access/concept-conditional-access-session.md)<br>[Securing privileged access overview](/security/compass/overview) | +| AC.L2-3.1.14<br><br>**Practice statement:** Route remote access via managed access control points.<br><br>**Objectives:**<br>Determine if:<br>[a.] managed access control points are identified and implemented; and<br>[b.] remote access is routed through managed network access control points. | Configure named locations to delineate internal vs external networks. Configure Conditional Access app control to route access via Microsoft Defender for Cloud Apps. Configure Defender for Cloud Apps to control and monitor all sessions. Secure devices used by privileged accounts as part of the privileged access story.<br>[Location condition in Microsoft Entra Conditional Access](../conditional-access/location-condition.md)<br>[Session controls in Conditional Access policy](../conditional-access/concept-conditional-access-session.md)<br>[Securing privileged access overview](/security/privileged-access-workstations/overview) | | AC.L2-3.1.15<br><br>**Practice statement:** Authorize remote execution of privileged commands and remote access to security-relevant information.<br><br>**Objectives:**<br>Determine if:<br>[a.] privileged commands authorized for remote execution are identified;<br>[b.] 
security-relevant information authorized to be accessed remotely is identified;<br>[c.] the execution of the identified privileged commands via remote access is authorized; and<br>[d.] access to the identified security-relevant information via remote access is authorized. | Conditional Access is the Zero Trust control plane to target policies for access to your apps when combined with authentication context. You can apply different policies in those apps. Secure devices used by privileged accounts as part of the privileged access story. Configure Conditional Access policies to require the use of these secured devices by privileged users when performing privileged commands.<br>[Cloud apps, actions, and authentication context in Conditional Access policy](../conditional-access/concept-conditional-access-cloud-apps.md)<br>[Securing privileged access overview](/security/privileged-access-workstations/overview)<br>[Filter for devices as a condition in Conditional Access policy](../conditional-access/concept-condition-filters-for-devices.md) | | AC.L2-3.1.18<br><br>**Practice statement:** Control connection of mobile devices.<br><br>**Objectives:**<br>Determine if:<br>[a.] mobile devices that process, store, or transmit CUI are identified;<br>[b.] mobile device connections are authorized; and<br>[c.] mobile device connections are monitored and logged. | Configure device management policies via MDM (such as Microsoft Intune), Configuration Manager, or group policy objects (GPO) to enforce mobile device configuration and connection profile. Configure Conditional Access policies to enforce device compliance.<br><br>**Conditional Access**<br>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br>[Require Microsoft Entra hybrid joined device](../conditional-access/concept-conditional-access-grant.md)<br><br>**InTune**<br>[Device compliance policies in Microsoft Intune](/mem/intune/protect/device-compliance-get-started)<br>[What is app management in Microsoft Intune?](/mem/intune/apps/app-management) | | AC.L2-3.1.19<br><br>**Practice statement:** Encrypt CUI on mobile devices and mobile computing platforms.<br><br>**Objectives:**<br>Determine if:<br>[a.] mobile devices and mobile computing platforms that process, store, or transmit CUI are identified; and<br>[b.] encryption is employed to protect CUI on identified mobile devices and mobile computing platforms. | **Managed Device**<br>Configure Conditional Access policies to enforce compliant or Microsoft Entra hybrid joined device and to ensure managed devices are configured appropriately via device management solution to encrypt CUI.<br><br>**Unmanaged Device**<br>Configure Conditional Access policies to require app protection policies.<br>[Grant controls in Conditional Access policy - Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br>[Grant controls in Conditional Access policy - Require Microsoft Entra hybrid joined device](../conditional-access/concept-conditional-access-grant.md)<br>[Grant controls in Conditional Access policy - Require app protection policy](../conditional-access/concept-conditional-access-grant.md) | |
active-directory | Configure Cmmc Level 2 Additional Controls | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/configure-cmmc-level-2-additional-controls.md | The following table provides a list of practice statement and objectives, and Mi | CMMC practice statement and objectives | Microsoft Entra guidance and recommendations | | - | - |-| SC.L2-3.13.3<br><br>**Practice statement:** Separate user functionality form system management functionality. <br><br>**Objectives:**<br>Determine if:<br>[a.] user functionality is identified;<br>[b.] system management functionality is identified; and<br>[c.] user functionality is separated from system management functionality. | Maintain separate user accounts in Microsoft Entra ID for everyday productivity use and administrative or system/privileged management. Privileged accounts should be cloud-only or managed accounts and not synchronized from on-premises to protect the cloud environment from on-premises compromise. System/privileged access should only be permitted from a security hardened privileged access workstation (PAW). Configure Conditional Access device filters to restrict access to administrative applications from PAWs that are enabled using Azure Virtual Desktops.<br>[Why are privileged access devices important](/security/compass/privileged-access-devices)<br>[Device Roles and Profiles](/security/compass/privileged-access-devices)<br>[Filter for devices as a condition in Conditional Access policy](../conditional-access/concept-condition-filters-for-devices.md)<br>[Azure Virtual Desktop](https://azure.microsoft.com/products/virtual-desktop/) | +| SC.L2-3.13.3<br><br>**Practice statement:** Separate user functionality form system management functionality. <br><br>**Objectives:**<br>Determine if:<br>[a.] user functionality is identified;<br>[b.] system management functionality is identified; and<br>[c.] user functionality is separated from system management functionality. | Maintain separate user accounts in Microsoft Entra ID for everyday productivity use and administrative or system/privileged management. Privileged accounts should be cloud-only or managed accounts and not synchronized from on-premises to protect the cloud environment from on-premises compromise. System/privileged access should only be permitted from a security hardened privileged access workstation (PAW). Configure Conditional Access device filters to restrict access to administrative applications from PAWs that are enabled using Azure Virtual Desktops.<br>[Why are privileged access devices important](/security/compass/privileged-access-devices)<br>[Device Roles and Profiles](/security/privileged-access-workstations/privileged-access-devices)<br>[Filter for devices as a condition in Conditional Access policy](../conditional-access/concept-condition-filters-for-devices.md)<br>[Azure Virtual Desktop](https://azure.microsoft.com/products/virtual-desktop/) | | SC.L2-3.13.4<br><br>**Practice statement:** Prevent unauthorized and unintended information transfer via shared system resources.<br><br>**Objectives:**<br>Determine if:<br>[a.] unauthorized and unintended information transfer via shared system resources is prevented. | Configure device management policies via MDM (such as Microsoft Intune), Configuration Manager, or group policy objects (GPO) to ensure devices are compliant with system hardening procedures. 
Include compliance with company policy regarding software patches to prevent attackers from exploiting flaws.<br><br>Configure Conditional Access policies to enforce device compliance.<br><br>**Conditional Access**<br>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br>[Require Microsoft Entra hybrid joined device](../conditional-access/concept-conditional-access-grant.md)<br><br>**InTune**<br>[Device compliance policies in Microsoft Intune](/mem/intune/protect/device-compliance-get-started) | | SC.L2-3.13.13<br><br>**Practice statement:** Control and monitor the use of mobile code.<br><br>**Objectives:**<br>Determine if:<br>[a.] use of mobile code is controlled; and<br>[b.] use of mobile code is monitored. | Configure device management policies via MDM (such as Microsoft Intune), Configuration Manager, or group policy objects (GPO) to disable the use of mobile code. Where use of mobile code is required monitor the use with endpoint security such as Microsoft Defender for Endpoint.<br><br>Configure Conditional Access policies to enforce device compliance.<br><br>**Conditional Access**<br>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br>[Require Microsoft Entra hybrid joined device](../conditional-access/concept-conditional-access-grant.md)<br><br>**InTune**<br>[Device compliance policies in Microsoft Intune](/mem/intune/protect/device-compliance-get-started)<br><br>**Defender for Endpoint**<br>[Microsoft Defender for Endpoint](/microsoft-365/security/defender-endpoint/microsoft-defender-endpoint?view=o365-worldwide&preserve-view=true) | |
active-directory | Fedramp Access Controls | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/fedramp-access-controls.md | Each row in the following table provides prescriptive guidance to help you devel | **AC-6(7)**<p><p>**The organization:**<br>**(a.)** Reviews [*FedRAMP Assignment: at a minimum, annually*] the privileges assigned to [*FedRAMP Assignment: all users with privileges*] to validate the need for such privileges; and<br>**(b.)** Reassigns or removes privileges, if necessary, to correctly reflect organizational mission/business needs. | **Review and validate all users with privileged access every year. Ensure privileges are reassigned (or removed if necessary) to align with organizational mission and business requirements.**<p>Use Microsoft Entra entitlement management with access reviews for privileged users to verify if privileged access is required. <p>Access reviews<br><li>[What is Microsoft Entra entitlement management?](../governance/entitlement-management-overview.md)<br><li>[Create an access review of Microsoft Entra roles in Privileged Identity Management](../privileged-identity-management/pim-create-roles-and-resource-roles-review.md)<br><li>[Review access of an access package in Microsoft Entra entitlement management](../governance/entitlement-management-access-reviews-review-access.md) | | **AC-7 Unsuccessful Login Attempts**<p><p>**The organization:**<br>**(a.)** Enforces a limit of [*FedRAMP Assignment: not more than three (3)*] consecutive invalid logon attempts by a user during a [*FedRAMP Assignment: fifteen (15) minutes*]; and<br>**(b.)** Automatically [Selection: locks the account/node for a [*FedRAMP Assignment: minimum of three (3) hours or until unlocked by an administrator]; delays next logon prompt according to [Assignment: organization-defined delay algorithm*]] when the maximum number of unsuccessful attempts is exceeded. | **Enforce a limit of no more than three consecutive failed login attempts on customer-deployed resources within a 15-minute period. Lock the account for a minimum of three hours or until unlocked by an administrator.**<p>Enable custom smart lockout settings. Configure lockout threshold and lockout duration in seconds to implement these requirements. <p>Smart lockout<br><li>[Protect user accounts from attacks with Microsoft Entra smart lockout](../authentication/howto-password-smart-lockout.md)<br><li>[Manage Microsoft Entra smart lockout values](../authentication/howto-password-smart-lockout.md) | | **AC-8 System Use Notification**<p><p>**The information system:**<br>**(a.)** Displays to users [*Assignment: organization-defined system use notification message or banner (FedRAMP Assignment: see additional Requirements and Guidance)*] before granting access to the system that provides privacy and security notices consistent with applicable federal laws, Executive Orders, directives, policies, regulations, standards, and guidance and states that:<br>(1.) Users are accessing a U.S. Government information system;<br>(2.) Information system usage may be monitored, recorded, and subject to audit;<br>(3.) Unauthorized use of the information system is prohibited and subject to criminal and civil penalties; and<br>(4.) 
Use of the information system indicates consent to monitoring and recording;<p><p>**(b.)** Retains the notification message or banner on the screen until users acknowledge the usage conditions and take explicit actions to log on to or further access the information system; and<p><p>**(c.)** For publicly accessible systems:<br>(1.) Displays system use information [*Assignment: organization-defined conditions (FedRAMP Assignment: see additional Requirements and Guidance)*], before granting further access;<br>(2.) Displays references, if any, to monitoring, recording, or auditing that are consistent with privacy accommodations for such systems that generally prohibit those activities; and<br>(3.) Includes a description of the authorized uses of the system.<p><p>**AC-8 Additional FedRAMP Requirements and Guidance:**<br>**Requirement:** The service provider shall determine elements of the cloud environment that require the System Use Notification control. The elements of the cloud environment that require System Use Notification are approved and accepted by the JAB/AO.<br>**Requirement:** The service provider shall determine how System Use Notification is going to be verified and provide appropriate periodicity of the check. The System Use Notification verification and periodicity are approved and accepted by the JAB/AO.<br>**Guidance:** If performed as part of a Configuration Baseline check, then the % of items requiring setting that are checked and that pass (or fail) check can be provided.<br>**Requirement:** If not performed as part of a Configuration Baseline check, then there must be documented agreement on how to provide results of verification and the necessary periodicity of the verification by the service provider. The documented agreement on how to provide verification of the results are approved and accepted by the JAB/AO. | **Display and require user acknowledgment of privacy and security notices before granting access to information systems.**<p>With Microsoft Entra ID, you can deliver notification or banner messages for all apps that require and record acknowledgment before granting access. You can granularly target these terms of use policies to specific users (Member or Guest). You can also customize them per application via Conditional Access policies.<p>Terms of use<br><li>[Microsoft Entra terms of use](../conditional-access/terms-of-use.md)<br><li>[View report of who has accepted and declined](../conditional-access/terms-of-use.md) |-| **AC-10 Concurrent Session Control**<br>The information system limits the number of concurrent sessions for each [*Assignment: organization-defined account and/or account type*] to [*FedRAMP Assignment: three (3) sessions for privileged access and two (2) sessions for non-privileged access*].|**Limit concurrent sessions to three sessions for privileged access and two for nonprivileged access.** <p>Currently, users connect from multiple devices, sometimes simultaneously. Limiting concurrent sessions leads to a degraded user experience and provides limited security value. A better approach to address the intent behind this control is to adopt a zero-trust security posture. Conditions are explicitly validated before a session is created and continually validated throughout the life of a session. <p>In addition, use the following compensating controls. <p>Use Conditional Access policies to restrict access to compliant devices. 
Configure policy settings on the device to enforce user sign-in restrictions at the OS level with MDM solutions such as Intune. Endpoint Manager or group policy objects can also be considered in hybrid deployments.<p> Use Privileged Identity Management to further restrict and control privileged accounts. <p> Configure smart account lockout for invalid sign-in attempts.<p>**Implementation guidance** <p>Zero trust<br><li> [Securing identity with Zero Trust](/security/zero-trust/identity)<br><li>[Continuous access evaluation in Microsoft Entra ID](../conditional-access/concept-continuous-access-evaluation.md)<p>Conditional Access<br><li>[What is Conditional Access in Microsoft Entra ID?](../conditional-access/overview.md)<br><li>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br><li>[User sign-in frequency](../conditional-access/howto-conditional-access-session-lifetime.md)<p>Device policies<br><li>[Other smart card Group Policy settings and registry keys](/windows/security/identity-protection/smart-cards/smart-card-group-policy-and-registry-settings)<br><li>[Microsoft Endpoint Manager overview](/mem/endpoint-manager-overview)<p>Resources<br><li>[What is Microsoft Entra Privileged Identity Management?](../privileged-identity-management/pim-configure.md)<br><li>[Protect user accounts from attacks with Microsoft Entra smart lockout](../authentication/howto-password-smart-lockout.md)<p>See AC-12 for more session reevaluation and risk mitigation guidance. | +| **AC-10 Concurrent Session Control**<br>The information system limits the number of concurrent sessions for each [*Assignment: organization-defined account and/or account type*] to [*FedRAMP Assignment: three (3) sessions for privileged access and two (2) sessions for non-privileged access*].|**Limit concurrent sessions to three sessions for privileged access and two for nonprivileged access.** <p>Currently, users connect from multiple devices, sometimes simultaneously. Limiting concurrent sessions leads to a degraded user experience and provides limited security value. A better approach to address the intent behind this control is to adopt a zero-trust security posture. Conditions are explicitly validated before a session is created and continually validated throughout the life of a session. <p>In addition, use the following compensating controls. <p>Use Conditional Access policies to restrict access to compliant devices. Configure policy settings on the device to enforce user sign-in restrictions at the OS level with MDM solutions such as Intune. Endpoint Manager or group policy objects can also be considered in hybrid deployments.<p> Use Privileged Identity Management to further restrict and control privileged accounts. 
<p> Configure smart account lockout for invalid sign-in attempts.<p>**Implementation guidance** <p>Zero trust<br><li> [Securing identity with Zero Trust](/security/zero-trust/deploy/identity)<br><li>[Continuous access evaluation in Microsoft Entra ID](../conditional-access/concept-continuous-access-evaluation.md)<p>Conditional Access<br><li>[What is Conditional Access in Microsoft Entra ID?](../conditional-access/overview.md)<br><li>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br><li>[User sign-in frequency](../conditional-access/howto-conditional-access-session-lifetime.md)<p>Device policies<br><li>[Other smart card Group Policy settings and registry keys](/windows/security/identity-protection/smart-cards/smart-card-group-policy-and-registry-settings)<br><li>[Microsoft Endpoint Manager overview](/mem/endpoint-manager-overview)<p>Resources<br><li>[What is Microsoft Entra Privileged Identity Management?](../privileged-identity-management/pim-configure.md)<br><li>[Protect user accounts from attacks with Microsoft Entra smart lockout](../authentication/howto-password-smart-lockout.md)<p>See AC-12 for more session reevaluation and risk mitigation guidance. | | **AC-11 Session Lock**<br>**The information system:**<br>**(a)** Prevents further access to the system by initiating a session lock after [*FedRAMP Assignment: fifteen (15) minutes*] of inactivity or upon receiving a request from a user; and<br>**(b)** Retains the session lock until the user reestablishes access using established identification and authentication procedures.<p><p>**AC-11(1)**<br>The information system conceals, via the session lock, information previously visible on the display with a publicly viewable image. | **Implement a session lock after a 15-minute period of inactivity or upon receiving a request from a user. Retain the session lock until the user reauthenticates. Conceal previously visible information when a session lock is initiated.**<p> Implement device lock by using a Conditional Access policy to restrict access to compliant devices. Configure policy settings on the device to enforce device lock at the OS level with MDM solutions such as Intune. Endpoint Manager or group policy objects can also be considered in hybrid deployments. For unmanaged devices, configure the Sign-In Frequency setting to force users to reauthenticate.<p>Conditional Access<br><li>[Require device to be marked as compliant](../conditional-access/concept-conditional-access-grant.md)<br><li>[User sign-in frequency](../conditional-access/howto-conditional-access-session-lifetime.md)<p>MDM policy<br><li>Configure devices for maximum minutes of inactivity until the screen locks ([Android](/mem/intune/configuration/device-restrictions-android), [iOS](/mem/intune/configuration/device-restrictions-ios), [Windows 10](/mem/intune/configuration/device-restrictions-windows-10)). | | **AC-12 Session Termination**<br>The information system automatically terminates a user session after [*Assignment: organization-defined conditions or trigger events requiring session disconnect*].| **Automatically terminate user sessions when organizational defined conditions or trigger events occur.**<p>Implement automatic user session reevaluation with Microsoft Entra features such as risk-based Conditional Access and continuous access evaluation. 
You can implement inactivity conditions at a device level as described in AC-11.<p>Resources<br><li>[Sign-in risk-based Conditional Access](../conditional-access/howto-conditional-access-policy-risk.md)<br><li>[User risk-based Conditional Access](../conditional-access/howto-conditional-access-policy-risk-user.md)<br><li>[Continuous access evaluation](../conditional-access/concept-continuous-access-evaluation.md) | **AC-12(1)**<br>**The information system:**<br>**(a.)** Provides a logout capability for user-initiated communications sessions whenever authentication is used to gain access to [Assignment: organization-defined information resources]; and<br>**(b.)** Displays an explicit logout message to users indicating the reliable termination of authenticated communications sessions.<p><p>**AC-8 Additional FedRAMP Requirements and Guidance:**<br>**Guidance:** Testing for logout functionality (OTG-SESS-006) [Testing for logout functionality](https://owasp.org/www-project-web-security-testing-guide/latest/4-Web_Application_Security_Testing/06-Session_Management_Testing/06-Testing_for_Logout_Functionality) | **Provide a logout capability for all sessions and display an explicit logout message.** <p>All Microsoft Entra ID surfaced web interfaces provide a logout capability for user-initiated communications sessions. When SAML applications are integrated with Microsoft Entra ID, implement single sign-out. <p>Logout capability<br><li>When the user selects [Sign-out everywhere](https://aka.ms/mysignins), all current issued tokens are revoked. <p>Display message<br>Microsoft Entra ID automatically displays a message after user-initiated logout.<br><p>![Screenshot that shows an access control message.](medi) | |
active-directory | Hipaa Audit Controls | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/hipaa-audit-controls.md | The following content provides the safeguard controls guidance from HIPAA. Find | Recommendation | Action | | - | - |-| Scan environment for ePHI data | [Microsoft Purview](../../purview/overview.md) can be enabled in audit mode to scan what ePHI is sitting in the data estate, and the resources that are being used to store that data. This information helps in establishing data classification and labeling the sensitivity of the data.</br>In addition, using [Content Explorer](/microsoft-365/compliance/data-classification-content-explorer) provides visibility into where the sensitive data is located. This information helps start the labeling journey from manually applying labeling or labeling recommendations on the client-side to service-side autolabeling. | +| Scan environment for ePHI data | [Microsoft Purview](/purview/governance-solutions-overview) can be enabled in audit mode to scan what ePHI is sitting in the data estate, and the resources that are being used to store that data. This information helps in establishing data classification and labeling the sensitivity of the data.</br>In addition, using [Content Explorer](/purview/data-classification-content-explorer) provides visibility into where the sensitive data is located. This information helps start the labeling journey from manually applying labeling or labeling recommendations on the client-side to service-side autolabeling. | | Enable Priva to safeguard Microsoft 365 data | [Microsoft Priva](/privacy/priva/priva-overview) evaluates ePHI data stored in Microsoft 365, scanning for and evaluating sensitive information. | |Enable Azure Security benchmark |[Microsoft cloud security benchmark](/security/benchmark/azure/introduction) provides control for data protection across Azure services and provides a baseline for implementation for services that store ePHI. Audit mode provides those recommendations and remediation steps to secure the environment. | | Enable Defender Vulnerability Management | [Microsoft Defender Vulnerability management](/azure/defender-for-cloud/remediate-vulnerability-findings-vm) is a built-in module in **Microsoft Defender for Endpoint**. The module helps you identify and discover vulnerabilities and misconfigurations in real time. The module also helps you prioritize the findings, presenting them in a dashboard and in reports across devices, VMs, and databases. | |
active-directory | Memo 22 09 Multi Factor Authentication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/memo-22-09-multi-factor-authentication.md | Some agencies are modernizing their authentication credentials. There are multip * See, [Overview of Microsoft Entra certificate-based authentication](../authentication/concept-certificate-based-authentication.md) * **Windows Hello for Business** has phishing-resistant multifactor authentication * See, [Windows Hello for Business Deployment Overview](/windows/security/identity-protection/hello-for-business/hello-deployment-guide)- * See, [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/hello-overview) + * See, [Windows Hello for Business](/windows/security/identity-protection/hello-for-business/) ### Protection from external phishing |
active-directory | Nist About Authenticator Assurance Levels | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/nist-about-authenticator-assurance-levels.md | The standard includes AAL requirements for the following categories: In general, AAL1 isn't recommended because it accepts password-only solutions, the most easily compromised authentication. For more information, see the blog post, [Your Pa$$word doesn't matter](https://techcommunity.microsoft.com/t5/azure-active-directory-identity/your-pa-word-doesn-t-matter/ba-p/731984). -While NIST doesn't require verifier impersonation (credential phishing) resistance until AAL3, we advise you to address this threat at all levels. You can select authenticators that provide verifier impersonation resistance, such as requiring devices are joined to Microsoft Entra ID or hybrid Microsoft Entra ID. If you're using Office 365, you can use Office 365 Advanced Threat Protection, and its [anti-phishing policies](/microsoft-365/security/office-365-security/set-up-anti-phishing-policies). +While NIST doesn't require verifier impersonation (credential phishing) resistance until AAL3, we advise you to address this threat at all levels. You can select authenticators that provide verifier impersonation resistance, such as requiring devices are joined to Microsoft Entra ID or hybrid Microsoft Entra ID. If you're using Office 365, you can use Office 365 Advanced Threat Protection, and its [anti-phishing policies](/microsoft-365/security/office-365-security/anti-phishing-policies-about). As you evaluate the needed NIST AAL for your organization, consider whether your entire organization must meet NIST standards. If there are specific user groups and resources that can be segregated, you can apply NIST AAL configurations to those user groups and resources. |
active-directory | Nist Authenticator Assurance Level 3 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/nist-authenticator-assurance-level-3.md | Authenticators are required to be: Microsoft Entra joined and Microsoft Entra hybrid joined devices meet this requirement when: -* You run [Windows in a FIPS-140 approved mode](/windows/security/threat-protection/fips-140-validation) +* You run [Windows in a FIPS-140 approved mode](/windows/security/security-foundations/certification/fips-140-validation) * On a machine with a TPM that's FIPS 140 Level 1 Overall, or higher, with FIPS 140 Level 3 Physical Security |
active-directory | Pci Requirement 8 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/standards/pci-requirement-8.md | For more information about Microsoft Entra authentication methods that meet PCI |**8.3.1** All user access to system components for users and administrators is authenticated via at least one of the following authentication factors: </br> Something you know, such as a password or passphrase. </br> Something you have, such as a token device or smart card. </br> Something you are, such as a biometric element.|[Microsoft Entra ID requires passwordless methods to meet the PCI requirements](https://microsoft.sharepoint-df.com/:w:/t/MicrosoftTechnicalContributorProgram-PCIDSSDocumentation/ETlhHVraW_NPsMGM-mFZlfgB4OPry8BxGizhQ4qItfGCFw?e=glcZ8y) </br> See holistic passwordless deployment. [Plan a passwordless authentication deployment in Microsoft Entra ID](../authentication/howto-authentication-passwordless-deployment.md)| |**8.3.2** Strong cryptography is used to render all authentication factors unreadable during transmission and storage on all system components.|Cryptography used by Microsoft Entra ID is compliant with [PCI definition of Strong Cryptography](https://www.pcisecuritystandards.org/glossary/#glossary-s). [Microsoft Entra Data protection considerations](../fundamentals/data-protection-considerations.md)| |**8.3.3** User identity is verified before modifying any authentication factor.|Microsoft Entra ID requires users to authenticate to update their authentication methods using self-service, such as mysecurityinfo portal and the self-service password reset (SSPR) portal. [Set up security info from a sign-in page](https://support.microsoft.com/en-us/topic/28180870-c256-4ebf-8bd7-5335571bf9a8) </br> [Common Conditional Access policy: Securing security info registration](../conditional-access/howto-conditional-access-policy-registration.md) </br> [Microsoft Entra self-service password reset](../authentication/concept-sspr-howitworks.md) </br> Administrators with privileged roles can modify authentication factors: Global, Password, User, Authentication, and Privileged Authentication. [Least privileged roles by task in Microsoft Entra ID](../roles/delegate-by-task.md). Microsoft recommends you enable JIT access and governance, for privileged access using [Microsoft Entra Privileged Identity Management](../privileged-identity-management/pim-configure.md)|-|**8.3.4** Invalid authentication attempts are limited by: </br> Locking out the user ID after not more than 10 attempts. </br> Setting the lockout duration to a minimum of 30 minutes or until the user's identity is confirmed.|Deploy Windows Hello for Business for Windows devices that support hardware Trusted Platform Modules (TPM) 2.0 or higher. </br> For Windows Hello for Business, lockout relates to the device. The gesture, PIN, or biometric, unlocks access to the local TPM. Administrators configure the lockout behavior with GPO or Intune policies. [TPM Group Policy settings](/windows/security/information-protection/tpm/trusted-platform-module-services-group-policy-settings) </br> [Manage Windows Hello for Business on devices at the time devices enroll with Intune](/mem/intune/protect/windows-hello) </br> [TPM fundamentals](/windows/security/information-protection/tpm/tpm-fundamentals) </br> Windows Hello for Business works for on-premises authentication to Active Directory and cloud resources on Microsoft Entra ID. 
</br> For FIDO2 security keys, brute-force protection is related to the key. The gesture, PIN or biometric, unlocks access to the local key storage. Administrators configure Microsoft Entra ID to allow registration of FIDO2 security keys from manufacturers that align to PCI requirements. [Enable passwordless security key sign-in](../authentication/howto-authentication-passwordless-security-key.md) </br></br> **Microsoft Authenticator App** </br> To mitigate brute force attacks using Microsoft Authenticator app passwordless sign in, enable number matching and more context. </br> Microsoft Entra ID generates a random number in the authentication flow. The user types it in the authenticator app. The mobile app authentication prompt shows the location, the request IP address, and the request application. [How to use number matching in MFA notifications](../authentication/how-to-mfa-number-match.md) </br> [How to use additional context in Microsoft Authenticator notifications](../authentication/how-to-mfa-additional-context.md)| +|**8.3.4** Invalid authentication attempts are limited by: </br> Locking out the user ID after not more than 10 attempts. </br> Setting the lockout duration to a minimum of 30 minutes or until the user's identity is confirmed.|Deploy Windows Hello for Business for Windows devices that support hardware Trusted Platform Modules (TPM) 2.0 or higher. </br> For Windows Hello for Business, lockout relates to the device. The gesture, PIN, or biometric, unlocks access to the local TPM. Administrators configure the lockout behavior with GPO or Intune policies. [TPM Group Policy settings](/windows/security/hardware-security/tpm/trusted-platform-module-services-group-policy-settings) </br> [Manage Windows Hello for Business on devices at the time devices enroll with Intune](/mem/intune/protect/windows-hello) </br> [TPM fundamentals](/windows/security/hardware-security/tpm/tpm-fundamentals) </br> Windows Hello for Business works for on-premises authentication to Active Directory and cloud resources on Microsoft Entra ID. </br> For FIDO2 security keys, brute-force protection is related to the key. The gesture, PIN or biometric, unlocks access to the local key storage. Administrators configure Microsoft Entra ID to allow registration of FIDO2 security keys from manufacturers that align to PCI requirements. [Enable passwordless security key sign-in](../authentication/howto-authentication-passwordless-security-key.md) </br></br> **Microsoft Authenticator App** </br> To mitigate brute force attacks using Microsoft Authenticator app passwordless sign in, enable number matching and more context. </br> Microsoft Entra ID generates a random number in the authentication flow. The user types it in the authenticator app. The mobile app authentication prompt shows the location, the request IP address, and the request application. [How to use number matching in MFA notifications](../authentication/how-to-mfa-number-match.md) </br> [How to use additional context in Microsoft Authenticator notifications](../authentication/how-to-mfa-additional-context.md)| |**8.3.5** If passwords/passphrases are used as authentication factors to meet Requirement 8.3.1, they're set and reset for each user as follows: </br> Set to a unique value for first-time use and upon reset. 
</br> Forced to be changed immediately after the first use.|Not applicable to Microsoft Entra ID.| |**8.3.6** If passwords/passphrases are used as authentication factors to meet Requirement 8.3.1, they meet the following minimum level of complexity: </br> A minimum length of 12 characters (or IF the system doesn't support 12 characters, a minimum length of eight characters). </br> Contain both numeric and alphabetic characters.|Not applicable to Microsoft Entra ID.| |**8.3.7** Individuals aren't allowed to submit a new password/passphrase that is the same as any of the last four passwords/passphrases used.|Not applicable to Microsoft Entra ID.| |
active-directory | Admin Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/admin-api.md | The API is protected through Microsoft Entra ID and uses OAuth2 bearer tokens. T ### User bearer tokens -The app registration needs to have the API Permission for `Verifiable Credentials Service Admin` and then when acquiring the access token the app should use scope `6a8b4b39-c021-437c-b060-5a14a3fd65f3/full_access`. The access token must be for a user with the [global administrator](../../active-directory/roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../../active-directory/roles/permissions-reference.md#authentication-policy-administrator) role. A user with role [global reader](../../active-directory/roles/permissions-reference.md#global-reader) can perform read-only API calls. +The app registration needs to have the API Permission for `Verifiable Credentials Service Admin` and then when acquiring the access token the app should use scope `6a8b4b39-c021-437c-b060-5a14a3fd65f3/full_access`. The access token must be for a user with the [global administrator](../roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../roles/permissions-reference.md#authentication-policy-administrator) role. A user with role [global reader](../roles/permissions-reference.md#global-reader) can perform read-only API calls. ### Application bearer tokens The `Verifiable Credentials Service Admin` service supports the following applic | VerifiableCredential.Credential.Revoke | Permission to [revoke a previously issued credential](how-to-issuer-revoke.md) | | VerifiableCredential.Network.Read | Permission to read entries from the [Verified ID Network](vc-network-api.md) | -The app registration needs to have the API Permission for `Verifiable Credentials Service Admin` and permissions required from the above table. When acquiring the access token, via the [client credentials flow](../../active-directory/develop/v2-oauth2-client-creds-grant-flow.md), the app should use scope `6a8b4b39-c021-437c-b060-5a14a3fd65f3/.default`. +The app registration needs to have the API Permission for `Verifiable Credentials Service Admin` and permissions required from the above table. When acquiring the access token, via the [client credentials flow](../develop/v2-oauth2-client-creds-grant-flow.md), the app should use scope `6a8b4b39-c021-437c-b060-5a14a3fd65f3/.default`. ## Onboarding |
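To make the token acquisition described in this Admin API entry concrete, here is a minimal Python sketch (not part of the article) that uses MSAL's client credentials flow with the application scope quoted above; the tenant ID, client ID, and client secret are placeholders for your own app registration, and the required API permissions must already be granted and admin-consented.

```python
# Minimal sketch (not part of the article): acquire an application bearer token for the
# Verified ID Admin API with the OAuth 2.0 client credentials flow, using MSAL for Python
# (pip install msal). The tenant ID, client ID, and client secret are placeholders.
import msal

TENANT_ID = "<your-tenant-id>"          # placeholder
CLIENT_ID = "<your-app-client-id>"      # placeholder
CLIENT_SECRET = "<your-client-secret>"  # placeholder

# Application-token scope from the text above; delegated user tokens would instead use
# the 6a8b4b39-c021-437c-b060-5a14a3fd65f3/full_access scope through an interactive flow.
SCOPES = ["6a8b4b39-c021-437c-b060-5a14a3fd65f3/.default"]

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)

result = app.acquire_token_for_client(scopes=SCOPES)
if "access_token" in result:
    # Send the token as an Authorization: Bearer <token> header on Admin API calls.
    print("Token acquired.")
else:
    print("Token request failed:", result.get("error"), result.get("error_description"))
```

The same client credentials pattern applies to the Request Service API in the next entry; only the scope changes.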
active-directory | Get Started Request Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/get-started-request-api.md | To get an access token, your app must be registered with the Microsoft identity ### Get an access token -Use the [OAuth 2.0 client credentials grant flow](../../active-directory/develop/v2-oauth2-client-creds-grant-flow.md) to acquire the access token by using the Microsoft identity platform. Use a trusted library for this purpose. In this tutorial, we use the Microsoft Authentication Library [MSAL](../../active-directory/develop/msal-overview.md). MSAL simplifies adding authentication and authorization to an app that can call a secure web API. +Use the [OAuth 2.0 client credentials grant flow](../develop/v2-oauth2-client-creds-grant-flow.md) to acquire the access token by using the Microsoft identity platform. Use a trusted library for this purpose. In this tutorial, we use the Microsoft Authentication Library [MSAL](../develop/msal-overview.md). MSAL simplifies adding authentication and authorization to an app that can call a secure web API. # [HTTP](#tab/http) |
active-directory | Verifiable Credentials Configure Tenant Quick | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/verifiable-credentials-configure-tenant-quick.md | Specifically, you learn how to: ## Prerequisites -- Ensure that you have the [global administrator](../../active-directory/roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../../active-directory/roles/permissions-reference.md#authentication-policy-administrator) permission for the directory you want to configure. If you're not the global administrator, you need the [application administrator](../../active-directory/roles/permissions-reference.md#application-administrator) permission to complete the app registration including granting admin consent.+- Ensure that you have the [global administrator](../roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../roles/permissions-reference.md#authentication-policy-administrator) permission for the directory you want to configure. If you're not the global administrator, you need the [application administrator](../roles/permissions-reference.md#application-administrator) permission to complete the app registration including granting admin consent. - Ensure that you have a custom domain registered for the Microsoft Entra tenant. If you don't have one registered, the setup defaults to the manual setup experience. ## Set up Verified ID |
active-directory | Verifiable Credentials Configure Tenant | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/verifiable-credentials-configure-tenant.md | The following diagram illustrates the Verified ID architecture and the component ## Prerequisites - You need an Azure tenant with an active subscription. If you don't have an Azure subscription, [create one for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).-- Ensure that you have the [global administrator](../../active-directory/roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../../active-directory/roles/permissions-reference.md#authentication-policy-administrator) permission for the directory you want to configure. If you're not the global administrator, you need the [application administrator](../../active-directory/roles/permissions-reference.md#application-administrator) permission to complete the app registration including granting admin consent.+- Ensure that you have the [global administrator](../roles/permissions-reference.md#global-administrator) or the [authentication policy administrator](../roles/permissions-reference.md#authentication-policy-administrator) permission for the directory you want to configure. If you're not the global administrator, you need the [application administrator](../roles/permissions-reference.md#application-administrator) permission to complete the app registration including granting admin consent. - Ensure that you have the [contributor](/azure/role-based-access-control/built-in-roles#contributor) role for the Azure subscription or the resource group where you are deploying Azure Key Vault. ## Create a key vault |
active-directory | Verifiable Credentials Faq | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/verifiable-credentials-faq.md | The tutorials for deploying and running the [samples](verifiable-credentials-con - Dotnet - [Publish to App Service](/azure/app-service/quickstart-dotnetcore?tabs=net60&pivots=development-environment-vs#2-publish-your-web-app) - Node - [Deploy to App Service](/azure/app-service/quickstart-nodejs?tabs=linux&pivots=development-environment-vscode#deploy-to-azure)-- Java - [Deploy to App Service](../../app-service/quickstart-java.md?tabs=javase&pivots=platform-linux-development-environment-maven#4deploy-the-app). You need to add the maven plugin for Azure App Service to the sample.+- Java - [Deploy to App Service](/azure/app-service/quickstart-java?tabs=javase&pivots=platform-linux-development-environment-maven#4deploy-the-app). You need to add the maven plugin for Azure App Service to the sample. - Python - [Deploy using Visual Studio Code](/azure/app-service/quickstart-python?tabs=flask%2Cwindows%2Cazure-cli%2Cvscode-deploy%2Cdeploy-instructions-azportal%2Cterminal-bash%2Cdeploy-instructions-zip-azcli#3deploy-your-application-code-to-azure) Regardless of which language sample you are using, it picks up the Azure App Service hostname `https://something.azurewebsites.net` and uses it as the public endpoint. You don't need to configure anything extra to make it work. If you make changes to the code or configuration, you need to redeploy the sample to Azure App Service. Troubleshooting/debugging will not be as easy as running the sample on your local machine, where traces in the console window show you errors, but you can achieve almost the same by using the [Log Stream](/azure/app-service/troubleshoot-diagnostic-logs#stream-logs). |
active-directory | Whats New | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/verifiable-credentials/whats-new.md | Instructions for setting up place of work verification on LinkedIn available [he ## February 2023 -- *Public preview* - Entitlement Management customers can now create access packages that leverage Microsoft Entra Verified ID [learn more](../../active-directory/governance/entitlement-management-verified-id-settings.md)+- *Public preview* - Entitlement Management customers can now create access packages that leverage Microsoft Entra Verified ID [learn more](../governance/entitlement-management-verified-id-settings.md) - The Request Service API can now do revocation check for verifiable credentials presented that was issued with [StatusList2021](https://w3c.github.io/vc-status-list-2021/) or the [RevocationList2020](https://w3c-ccg.github.io/vc-status-rl-2020/) status list types. Instructions for setting up place of work verification on LinkedIn available [he ## November 2022 -- Microsoft Entra Verified ID now reports events in the [audit log](../../active-directory/reports-monitoring/concept-audit-logs.md). Only management changes made via the Admin API are currently logged. Issuance or presentations of verifiable credentials aren't reported in the audit log. The log entries have a service name of `Verified ID` and the activity will be `Create authority`, `Update contract`, etc. +- Microsoft Entra Verified ID now reports events in the [audit log](../reports-monitoring/concept-audit-logs.md). Only management changes made via the Admin API are currently logged. Issuance or presentations of verifiable credentials aren't reported in the audit log. The log entries have a service name of `Verified ID` and the activity will be `Create authority`, `Update contract`, etc. ## September 2022 |
active-directory | Workload Identity Federation Create Trust | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/active-directory/workload-identities/workload-identity-federation-create-trust.md | az rest -m DELETE -u 'https://graph.microsoft.com/applications/f6475511-fd81-49 - To learn how to use workload identity federation for GitHub Actions, see [Configure a GitHub Actions workflow to get an access token](/azure/developer/github/connect-from-azure). - Read the [GitHub Actions documentation](https://docs.github.com/actions/deployment/security-hardening-your-deployments/configuring-openid-connect-in-azure) to learn more about configuring your GitHub Actions workflow to get an access token from Microsoft identity provider and access Azure resources. - For more information, read about how Microsoft Entra ID uses the [OAuth 2.0 client credentials grant](../develop/v2-oauth2-client-creds-grant-flow.md#third-case-access-token-request-with-a-federated-credential) and a client assertion issued by another IdP to get a token.-- For information about the required format of JWTs created by external identity providers, read about the [assertion format](/azure/active-directory/develop/active-directory-certificate-credentials#assertion-format).+- For information about the required format of JWTs created by external identity providers, read about the [assertion format](../develop/certificate-credentials.md#assertion-format). |
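As a sketch of the token request these links describe (the third case of the client credentials grant, where a JWT from the external identity provider is presented as a client assertion), the Python snippet below posts directly to the Microsoft Entra token endpoint; the tenant, client ID, scope, and external token are placeholders, and the external JWT must match the issuer, subject, and audience configured on the federated identity credential.

```python
# Minimal sketch (not from the article): the client credentials grant with a federated
# credential, where the JWT issued by the external identity provider is sent as a
# client assertion instead of a client secret. All values are placeholders.
import requests

TENANT_ID = "<your-tenant-id>"              # placeholder
CLIENT_ID = "<your-app-client-id>"          # placeholder
EXTERNAL_TOKEN = "<jwt-from-external-idp>"  # placeholder; must satisfy the configured federated credential

response = requests.post(
    f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "scope": "https://graph.microsoft.com/.default",  # example scope; use the resource you need
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": EXTERNAL_TOKEN,
    },
    timeout=30,
)
response.raise_for_status()
access_token = response.json()["access_token"]
```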
advisor | Advisor Get Started | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/advisor/advisor-get-started.md | |
ai-services | Use Your Data | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/concepts/use-your-data.md | There is an [upload limit](../quotas-limits.md), and there are some caveats abou There are three different sources of data that you can use with Azure OpenAI on your data. * Blobs in an Azure storage container that you provide-* Local files uploaded using the Azure OpenAI Studio -* URLs/web addresses. +* Local files uploaded using the Azure OpenAI Studio Once data is ingested, an [Azure Cognitive Search](/azure/search/search-what-is-azure-search) index in your search resource gets created to integrate the information with Azure OpenAI models. Once data is ingested, an [Azure Cognitive Search](/azure/search/search-what-is- Using the Azure OpenAI Studio, you can upload files from your machine. The service then stores the files to an Azure storage container and performs ingestion from the container. -**Data ingestion from URLs** --A crawling component first crawls the provided URL and stores its contents to an Azure Storage Container. The service then performs ingestion from the container. - ### Troubleshooting failed ingestion jobs To troubleshoot a failed job, always look out for errors or warnings specified either in the API response or Azure OpenAI studio. Here are some of the common errors and warnings: After you approve the request in your search service, you can start using the [c ### Storage accounts -Storage accounts in virtual networks, firewalls, and private endpoints are currently not supported by Azure OpenAI on your data. +Storage accounts in virtual networks, firewalls, and private endpoints are supported by Azure OpenAI on your data. To use a storage account in a private network: ++1. Ensure you have the system assigned managed identity principal enabled for your Azure OpenAI and Azure Cognitive Search resources. + 1. Using the Azure portal, navigate to your resource, and select **Identity** from the navigation menu on the left side of the screen. + 1. Set **Status** to **On**. + 1. Perform these steps for both of your Azure OpenAI and Azure Cognitive Search resources. ++ :::image type="content" source="../media/use-your-data/managed-identity.png" alt-text="A screenshot showing managed identity settings in the Azure portal." lightbox="../media/use-your-data/managed-identity.png"::: ++1. Navigate back to your storage account. Select **Access Control (IAM)** for your resource. Select **Add**, then **Add role assignment**. In the window that appears, add the **Storage Data Contributor** role to the storage resource for your Azure OpenAI and search resource's managed identity. + 1. Assign access to **Managed Identity**. + 1. If you have multiple search resources, Perform this step for each search resource. ++ :::image type="content" source="../media/use-your-data/add-role-assignment.png" alt-text="A screenshot showing the role assignment option in the Azure portal." lightbox="../media/use-your-data/add-role-assignment.png"::: ++1. If your storage account hasn't already been network restricted, go to networking tab and select **Enabled from selected virtual networks and IP addresses**. ++ :::image type="content" source="../media/use-your-data/enable-virtual-network.png" alt-text="A screenshot showing the option for enabling virtual networks in the Azure portal." lightbox="../media/use-your-data/enable-virtual-network.png"::: ## Azure Role-based access controls (Azure RBAC) |
ai-services | Use Your Data Quickstart | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/use-your-data-quickstart.md | zone_pivot_groups: openai-use-your-data ::: zone-end ++[Reference](https://platform.openai.com/docs/api-reference?lang=python) | [Source code](https://github.com/openai/openai-python) | [Package (pypi)](https://pypi.org/project/openai/) | [Samples](https://github.com/openai/openai-cookbook/) ++The links above reference the OpenAI API for Python. There is no Azure-specific OpenAI Python SDK. [Learn how to switch between the OpenAI services and Azure OpenAI services](/azure/ai-services/openai/how-to/switching-endpoints). ++++[Reference](https://pkg.go.dev/github.com/Azure/azure-sdk-for-go) | [Source code](https://github.com/Azure/azure-sdk-for-go) | [Package (Go)](https://pkg.go.dev/github.com/azure/azure-dev) | [Samples](https://github.com/azure-samples/azure-sdk-for-go-samples) ++ In this quickstart you can use your own data with Azure OpenAI models. Using Azure OpenAI's models on your data can provide you with a powerful conversational AI platform that enables faster and more accurate communication. In this quickstart you can use your own data with Azure OpenAI models. Using Azu - Your chat model can use version `gpt-35-turbo (0301)`, `gpt-35-turbo-16k`, `gpt-4`, and `gpt-4-32k`. You can view or change your model version in [Azure OpenAI Studio](./how-to/working-with-models.md#model-updates). -- Be sure that you are assigned at least the [Cognitive Services Contributor](./how-to/role-based-access-control.md#cognitive-services-contributor) role for the Azure OpenAI resource. +- Be sure that you are assigned at least the [Cognitive Services Contributor](./how-to/role-based-access-control.md#cognitive-services-contributor) role for the Azure OpenAI resource. ::: zone pivot="programming-language-javascript" In this quickstart you can use your own data with Azure OpenAI models. Using Azu ::: zone-end ++++++ ::: zone pivot="rest-api" [!INCLUDE [REST API quickstart](includes/use-your-data-rest.md)] |
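Before choosing a language pivot, some readers find a raw request helpful. The sketch below is an illustrative, unofficial example of a single "chat on your data" call against the preview extensions endpoint; the resource names, keys, deployment, index, `api-version`, and the exact `dataSources` payload shape are assumptions based on the preview REST examples and may differ in your environment, so rely on the REST pivot in this quickstart for the authoritative contract.

```python
# Illustrative sketch only: one "chat on your data" request against the preview
# extensions endpoint using the requests package. Every endpoint, key, deployment,
# and index value is a placeholder, and the api-version and dataSources payload shape
# are assumptions based on the preview REST examples; they may differ in your environment.
import requests

AOAI_ENDPOINT = "https://<your-resource>.openai.azure.com"    # placeholder
AOAI_KEY = "<azure-openai-api-key>"                           # placeholder
DEPLOYMENT = "<chat-model-deployment-name>"                   # placeholder
SEARCH_ENDPOINT = "https://<your-search>.search.windows.net"  # placeholder
SEARCH_KEY = "<cognitive-search-key>"                         # placeholder
SEARCH_INDEX = "<index-created-by-ingestion>"                 # placeholder

url = (
    f"{AOAI_ENDPOINT}/openai/deployments/{DEPLOYMENT}"
    "/extensions/chat/completions?api-version=2023-06-01-preview"
)
payload = {
    "dataSources": [
        {
            "type": "AzureCognitiveSearch",
            "parameters": {
                "endpoint": SEARCH_ENDPOINT,
                "key": SEARCH_KEY,
                "indexName": SEARCH_INDEX,
            },
        }
    ],
    "messages": [
        {"role": "user", "content": "What do my uploaded documents say about data retention?"}
    ],
}

response = requests.post(url, headers={"api-key": AOAI_KEY}, json=payload, timeout=60)
response.raise_for_status()
# The response contains the grounded assistant reply plus citation metadata; its exact
# shape varies by api-version, so inspect the JSON rather than hard-coding a path.
print(response.json())
```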
aks | Load Balancer Standard | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/aks/load-balancer-standard.md | The following annotations are supported for Kubernetes services with type `LoadB > [!NOTE] > `service.beta.kubernetes.io/azure-load-balancer-disable-tcp-reset` was deprecated in Kubernetes 1.18 and removed in 1.20. +### Customize the load balancer health probe +| Annotation | Value | Description | +| - | -- | -- | +| `service.beta.kubernetes.io/azure-load-balancer-health-probe-interval` | Health probe interval | | +| `service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe` | The minimum number of unhealthy responses of health probe | | +| `service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path` | Request path of the health probe | | +| `service.beta.kubernetes.io/port_{port}_no_lb_rule` | true/false | {port} is the port number in the service. When it is set to true, no lb rule and health probe rule for this port will be generated. health check service should not be exposed to the public internet(e.g. istio/envoy health check service)| +| `service.beta.kubernetes.io/port_{port}_no_probe_rule` | true/false | {port} is the port number in the service. When it is set to true, no health probe rule for this port will be generated. | +| `service.beta.kubernetes.io/port_{port}_health-probe_protocol` | Health probe protocol | {port} is the port number in the service. Explicit protocol for the health probe for the service port {port}, overriding port.appProtocol if set.| +| `service.beta.kubernetes.io/port_{port}_health-probe_port` | port number or port name in service manifest | {port} is the port number in the service. Explicit port for the health probe for the service port {port}, overriding the default value. | +| `service.beta.kubernetes.io/port_{port}_health-probe_interval` | Health probe interval | {port} is port number of service. | +| `service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe` | The minimum number of unhealthy responses of health probe | {port} is port number of service. | +| `service.beta.kubernetes.io/port_{port}_health-probe_request-path` | Request path of the health probe | {port} is port number of service. | ++As documented [here](../load-balancer/load-balancer-custom-probe-overview.md), Tcp, Http and Https are three protocols supported by load balancer service. ++Currently, the default protocol of the health probe varies among services with different transport protocols, app protocols, annotations and external traffic policies. ++1. for local services, HTTP and /healthz would be used. The health probe will query NodeHealthPort rather than actual backend service +1. for cluster TCP services, TCP would be used. +1. for cluster UDP services, no health probes. ++> [!NOTE] +> For local services with PLS integration and PLS proxy protocol enabled, the default HTTP+/healthz health probe does not work. Thus health probe can be customized the same way as cluster services to support this scenario. ++Since v1.20, service annotation `service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path` is introduced to determine the health probe behavior. ++* For clusters <=1.23, `spec.ports.appProtocol` would only be used as probe protocol when `service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path` is also set. 
+* For clusters >1.24, `spec.ports.appProtocol` would be used as probe protocol and `/` would be used as default probe request path (`service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path` could be used to change to a different request path). ++Note that the request path would be ignored when using TCP or the `spec.ports.appProtocol` is empty. More specifically: ++| loadbalancer sku | `externalTrafficPolicy` | spec.ports.Protocol | spec.ports.AppProtocol | `service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path` | LB Probe Protocol | LB Probe Request Path | +| - | -- | - | - | -- | | | +| standard | local | any | any | any | http | `/healthz` | +| standard | cluster | udp | any | any | null | null | +| standard | cluster | tcp | | (ignored) | tcp | null | +| standard | cluster | tcp | tcp | (ignored) | tcp | null | +| standard | cluster | tcp | http/https | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or `/`(>=1.24) | +| standard | cluster | tcp | http/https | `/custom-path` | http/https | `/custom-path` | +| standard | cluster | tcp | unsupported protocol | `/custom-path` | tcp | null | +| basic | local | any | any | any | http | `/healthz` | +| basic | cluster | tcp | | (ignored) | tcp | null | +| basic | cluster | tcp | tcp | (ignored) | tcp | null | +| basic | cluster | tcp | http | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or `/`(>=1.24) | +| basic | cluster | tcp | http | `/custom-path` | http | `/custom-path` | +| basic | cluster | tcp | unsupported protocol | `/custom-path` | tcp | null | ++Since v1.21, two service annotations `service.beta.kubernetes.io/azure-load-balancer-health-probe-interval` and `load-balancer-health-probe-num-of-probe` are introduced, which customize the configuration of health probe. If `service.beta.kubernetes.io/azure-load-balancer-health-probe-interval` is not set, Default value of 5 is applied. If `load-balancer-health-probe-num-of-probe` is not set, Default value of 2 is applied. And total probe should be less than 120 seconds. +++### Custom Load Balancer health probe for port +Different ports in a service may require different health probe configurations. This could be because of service design (such as a single health endpoint controlling multiple ports), or Kubernetes features like the [MixedProtocolLBService](https://kubernetes.io/docs/concepts/services-networking/service/#load-balancers-with-mixed-protocol-types). ++The following annotations can be used to customize probe configuration per service port. ++| port specific annotation | global probe annotation | Usage | +| - | | - | +| service.beta.kubernetes.io/port_{port}_no_lb_rule | N/A (no equivalent globally) | if set true, no lb rules and probe rules will be generated | +| service.beta.kubernetes.io/port_{port}_no_probe_rule | N/A (no equivalent globally) | if set true, no probe rules will be generated | +| service.beta.kubernetes.io/port_{port}_health-probe_protocol | N/A (no equivalent globally) | Set the health probe protocol for this service port (e.g. Http, Https, Tcp) | +| service.beta.kubernetes.io/port_{port}_health-probe_port | N/A (no equivalent globally) | Sets the health probe port for this service port (e.g. 15021) | +| service.beta.kubernetes.io/port_{port}_health-probe_request-path | service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | For Http or Https, sets the health probe request path. 
Defaults to / | +| service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe | service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe | Number of consecutive probe failures before the port is considered unhealthy | +| service.beta.kubernetes.io/port_{port}_health-probe_interval | service.beta.kubernetes.io/azure-load-balancer-health-probe-interval | The amount of time between probe attempts | ++For following manifest, probe rule for port httpsserver is different from the one for httpserver because annoations for port httpsserver are specified. ++```yaml +apiVersion: v1 +kind: Service +metadata: + name: appservice + annotations: + service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "5" + service.beta.kubernetes.io/port_443_health-probe_num-of-probe: "4" +spec: + type: LoadBalancer + selector: + app: server + ports: + - name: httpserver + protocol: TCP + port: 80 + targetPort: 30102 + - name: httpsserver + protocol: TCP + appProtocol: HTTPS + port: 443 + targetPort: 30104 +``` ++In this manifest, the https ports use a different node port, an HTTP readiness check at port 10256 on /healthz(healthz endpoint of kube-proxy). +```yaml +apiVersion: v1 +kind: Service +metadata: + name: istio + annotations: + service.beta.kubernetes.io/azure-load-balancer-internal: "true" + service.beta.kubernetes.io/port_443_health-probe_protocol: "http" + service.beta.kubernetes.io/port_443_health-probe_port: "10256" + service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz" +spec: + ports: + - name: https + protocol: TCP + port: 443 + targetPort: 8443 + nodePort: 30104 + appProtocol: https + selector: + app: istio-ingressgateway + gateway: istio-ingressgateway + istio: ingressgateway + type: LoadBalancer + sessionAffinity: None + externalTrafficPolicy: Local + ipFamilies: + - IPv4 + ipFamilyPolicy: SingleStack + allocateLoadBalancerNodePorts: true + internalTrafficPolicy: Cluster +``` ++In this manifest, the https ports use a different health probe endpoint, an HTTP readiness check at port 30000 on /healthz/ready. +```yaml +apiVersion: v1 +kind: Service +metadata: + name: istio + annotations: + service.beta.kubernetes.io/azure-load-balancer-internal: "true" + service.beta.kubernetes.io/port_443_health-probe_protocol: "http" + service.beta.kubernetes.io/port_443_health-probe_port: "30000" + service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz/ready" +spec: + ports: + - name: https + protocol: TCP + port: 443 + targetPort: 8443 + appProtocol: https + selector: + app: istio-ingressgateway + gateway: istio-ingressgateway + istio: ingressgateway + type: LoadBalancer + sessionAffinity: None + externalTrafficPolicy: Local + ipFamilies: + - IPv4 + ipFamilyPolicy: SingleStack + allocateLoadBalancerNodePorts: true + internalTrafficPolicy: Cluster +``` + ## Troubleshooting SNAT If you know that you're starting many outbound TCP or UDP connections to the same destination IP address and port, and you observe failing outbound connections or support notifies you that you're exhausting SNAT ports (preallocated ephemeral ports used by PAT), you have several general mitigation options. Review these options and decide what's best for your scenario. It's possible that one or more can help manage your scenario. For detailed information, review the [outbound connections troubleshooting guide](../load-balancer/troubleshoot-outbound-connection.md). |
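If you create Services programmatically rather than from YAML manifests, the same load balancer health probe annotations shown above apply. The following is a minimal sketch, assuming the official `kubernetes` Python client and an existing kubeconfig context for your AKS cluster; the selector, ports, and target ports mirror the first manifest above and are illustrative only.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes your AKS cluster is the current kubeconfig context

# Global probe count of 5, overridden to 4 for service port 443
# (same intent as the first YAML manifest above).
service = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="appservice",
        annotations={
            "service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe": "5",
            "service.beta.kubernetes.io/port_443_health-probe_num-of-probe": "4",
        },
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "server"},
        ports=[
            client.V1ServicePort(name="httpserver", protocol="TCP", port=80, target_port=30102),
            client.V1ServicePort(name="httpsserver", protocol="TCP", app_protocol="HTTPS",
                                 port=443, target_port=30104),
        ],
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```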
aks | Node Auto Repair | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/aks/node-auto-repair.md | AKS engineers investigate alternative remediations if auto-repair is unsuccessfu ## Node auto-drain -[Scheduled events][scheduled-events] can occur on the underlying VMs in any of your node pools. For [spot node pools][spot-node-pools], scheduled events may cause a *preempt* node event for the node. Certain node events, such as *preempt*, cause AKS node auto-drain to attempt a cordon and drain of the affected node. This process enables rescheduling for any affected workloads on that node. You might notice the node receives a taint with `"remediator.aks.microsoft.com/unschedulable"`, because of `"kubernetes.azure.com/scalesetpriority: spot"`. +[Scheduled events][scheduled-events] can occur on the underlying VMs in any of your node pools. For [spot node pools][spot-node-pools], scheduled events may cause a *preempt* node event for the node. Certain node events, such as *preempt*, cause AKS node auto-drain to attempt a cordon and drain of the affected node. This process enables rescheduling for any affected workloads on that node. You might notice the node receives a taint with `"remediator.kubernetes.azure.com/unschedulable"`, because of `"kubernetes.azure.com/scalesetpriority: spot"`. The following table shows the node events and actions they cause for AKS node auto-drain: |
api-management | Api Management Gateways Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/api-management-gateways-overview.md | The following table compares features available in the managed gateway versus th | API threat detection with [Defender for APIs](protect-with-defender-for-apis.md) | ✔️ | ❌ | ❌ | <sup>1</sup> Depends on how the gateway is deployed, but is the responsibility of the customer.<br/>-<sup>2</sup> Connectivity to the self-hosted gateway v2 [configuration endpoint](self-hosted-gateway-overview.md#fqdn-dependencies) requires DNS resolution of the default endpoint hostname; custom domain name is currently not supported.<br/> -<sup>3</sup> Requires configuration of local CA certificates.<br/> +<sup>2</sup> Connectivity to the self-hosted gateway v2 [configuration endpoint](self-hosted-gateway-overview.md#fqdn-dependencies) requires DNS resolution of the endpoint hostname.<br/> ### Backend APIs For estimated maximum gateway throughput in the API Management service tiers, se * In environments such as [Kubernetes](how-to-self-hosted-gateway-on-kubernetes-in-production.md), add multiple gateway replicas to handle expected usage. * Optionally [configure autoscaling](how-to-self-hosted-gateway-on-kubernetes-in-production.md#autoscaling) to meet traffic demands. -## Next steps +## Related content - Learn more about [API Management in a Hybrid and multicloud World](https://aka.ms/hybrid-and-multi-cloud-api-management) - Learn more about using the [capacity metric](api-management-capacity.md) for scaling decisions |
api-management | Api Version Retirement Sep 2023 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/breaking-changes/api-version-retirement-sep-2023.md | After 30 September 2023, if you prefer not to update your tools, scripts, and pr * **ARM, Bicep, or Terraform templates** - Update the template to use API version 2021-08-01 or later. -* **Azure CLI** - Run `az version` to check your version. If you're running version 2.38.0 or later, no action is required. Use the `az upgrade` command to upgrade the Azure CLI if necessary. For more information, see [How to update the Azure CLI](/cli/azure/update-azure-cli). +* **Azure CLI** - Run `az version` to check your version. If you're running version 2.42.0 or later, no action is required. Use the `az upgrade` command to upgrade the Azure CLI if necessary. For more information, see [How to update the Azure CLI](/cli/azure/update-azure-cli). * **Azure PowerShell** - Run `Get-Module -ListAvailable -Name Az` to check your version. If you're running version 8.1.0 or later, no action is required. Use `Update-Module -Name Az -Repository PSGallery` to update the module if necessary. For more information, see [Install the Azure Az PowerShell module](/powershell/azure/install-azure-powershell). |
api-management | How To Self Hosted Gateway On Kubernetes In Production | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/how-to-self-hosted-gateway-on-kubernetes-in-production.md | Starting with version 2.1.5 or above, the self-hosted gateway provides observabi - [API Inspector](api-management-howto-api-inspector.md) will show additional steps when HTTP(S) proxy is being used and its related interactions. - Verbose logs are provided to indicate the request proxy behavior. +> [!NOTE] +> Due to a known issue with HTTP proxies that use basic authentication, certificate revocation list (CRL) validation is not supported. To learn how to configure it appropriately, see the [Self-Hosted Gateway settings reference](self-hosted-gateway-settings-reference.md). + > [!Warning] > Ensure that the [infrastructure requirements](self-hosted-gateway-overview.md#fqdn-dependencies) have been met and that the self-hosted gateway can still connect to them, or certain functionality will not work properly. |
api-management | Self Hosted Gateway Settings Reference | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/self-hosted-gateway-settings-reference.md | This guidance helps you provide the required information to define how to authen | certificates.local.ca.enabled | Indication whether or not the self-hosted gateway should use local CA certificates that are mounted. It's required to run the self-hosted gateway as root or with user ID 1001. | No | `false` | v2.0+ | | net.server.tls.ciphers.allowed-suites | Comma-separated list of ciphers to use for TLS connection between API client and the self-hosted gateway. | No | `TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384,TLS_DHE_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA` | v2.0+ | | net.client.tls.ciphers.allowed-suites | Comma-separated list of ciphers to use for TLS connection between the self-hosted gateway and the backend. | No | `TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384,TLS_DHE_RSA_WITH_AES_256_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_DHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_DHE_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA` | v2.0+ |+| security.certificate-revocation.validation.enabled | Provides capability to turn certificate revocation list validation on/off | No | `false` | v2.3.6+ | ## Sovereign clouds |
api-management | V2 Service Tiers Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/v2-service-tiers-overview.md | A: A Standard v2 service instance can be integrated with a VNet to provide secur A: No, such a deployment is only supported in the Premium tier. +### Q: Is a Premium v2 tier planned? ++A: Yes, a Premium v2 preview is planned and will be announced separately. + ## Related content -* Learn more about the API Management [tiers](api-management-features.md). +* Learn more about the API Management [tiers](api-management-features.md). |
api-management | Virtual Network Reference | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/api-management/virtual-network-reference.md | When an API Management service instance is hosted in a VNet, the ports in the fo | * / 6381 - 6383 | Inbound & Outbound | TCP | VirtualNetwork / VirtualNetwork | Access internal Azure Cache for Redis service for [caching](api-management-caching-policies.md) policies between machines (optional) | External & Internal | | * / 4290 | Inbound & Outbound | UDP | VirtualNetwork / VirtualNetwork | Sync Counters for [Rate Limit](rate-limit-policy.md) policies between machines (optional) | External & Internal | | * / 6390 | Inbound | TCP | AzureLoadBalancer / VirtualNetwork | **Azure Infrastructure Load Balancer** | External & Internal |+| * / 443 | Inbound | TCP | AzureTrafficManager / VirtualNetwork | **Azure Traffic Manager** routing for multi-region deployment | External | ### [stv1](#tab/stv1) When an API Management service instance is hosted in a VNet, the ports in the fo | * / 6381 - 6383 | Inbound & Outbound | TCP | VirtualNetwork / VirtualNetwork | Access internal Azure Cache for Redis service for [caching](api-management-caching-policies.md) policies between machines (optional) | External & Internal | | * / 4290 | Inbound & Outbound | UDP | VirtualNetwork / VirtualNetwork | Sync Counters for [Rate Limit](rate-limit-policy.md) policies between machines (optional) | External & Internal | | * / * | Inbound | TCP | AzureLoadBalancer / VirtualNetwork | **Azure Infrastructure Load Balancer** (required for Premium SKU, optional for other SKUs) | External & Internal |+| * / 443 | Inbound | TCP | AzureTrafficManager / VirtualNetwork | **Azure Traffic Manager** routing for multi-region deployment | External only | |
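If you manage the network security group on the API Management subnet through the Azure SDK rather than the portal, the new Traffic Manager rule above can be added with a sketch like the following. It assumes the `azure-mgmt-network` and `azure-identity` packages; the subscription ID, resource group, NSG name, and priority are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SecurityRule

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

rule = SecurityRule(
    description="Allow Azure Traffic Manager routing for multi-region API Management",
    protocol="Tcp",
    source_address_prefix="AzureTrafficManager",  # service tag from the table above
    source_port_range="*",
    destination_address_prefix="VirtualNetwork",
    destination_port_range="443",
    access="Allow",
    priority=2071,  # pick any unused priority in your NSG
    direction="Inbound",
)

poller = client.security_rules.begin_create_or_update(
    "<resource-group>", "<nsg-name>", "AllowAzureTrafficManagerInbound", rule
)
print(poller.result().provisioning_state)
```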
app-service | Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/app-service/environment/overview.md | App Service Environment v3 is available in the following regions: | -- | :--: | :-: | :-: | | | App Service Environment v3 | App Service Environment v3 | App Service Environment v1/v2 | | US DoD Central | ✅ | | ✅ |-| US DoD East | | | ✅ | +| US DoD East | ✅ | | ✅ | | US Gov Arizona | ✅ | | ✅ | | US Gov Iowa | | | ✅ | | US Gov Texas | ✅ | | ✅ | |
attestation | Attestation Token Examples | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/attestation/attestation-token-examples.md | -Attestation policy is used to process the attestation evidence and determine whether Azure Attestation will issue an attestation token. Attestation token generation can be controlled with custom policies. Below are some examples of an attestation policy. +Attestation policy is used to process the attestation evidence and determines whether Azure Attestation issues an attestation token. Attestation token generation can be controlled with custom policies. Here are some examples of an attestation token. ## Sample JWT generated for SGX attestation Attestation policy is used to process the attestation evidence and determine whe }.[Signature] ``` -Some of the claims used above are considered deprecated but are fully supported. It is recommended that all future code and tooling use the non-deprecated claim names. See [claims issued by Azure Attestation](claim-sets.md) for more information. +Some of the claims used here are considered deprecated but are fully supported. It is recommended that all future code and tooling use the non-deprecated claim names. For more information, see [claims issued by Azure Attestation](claim-sets.md). -The below claims will appear only in the attestation token generated for Intel® Xeon® Scalable processor-based server platforms. The claims will not appear if the SGX enclave is not configured with [Key Separation and Sharing Support](https://github.com/openenclave/openenclave/issues/3054) +The below claims appear only in the attestation token generated for Intel® Xeon® Scalable processor-based server platforms. The claims do not appear if the SGX enclave is not configured with [Key Separation and Sharing Support](https://github.com/openenclave/openenclave/issues/3054) **x-ms-sgx-config-id** The below claims will appear only in the attestation token generated for Intel® } ``` +## Sample JWT generated for TDX attestation ++The definitions of below claims are available in [Azure Attestation TDX EAT profile](trust-domain-extensions-eat-profile.md) ++``` +{ + "attester_tcb_status": "UpToDate", + "dbgstat": "disabled", + "eat_profile": "https://aka.ms/maa-eat-profile-tdxvm", + "exp": 1697706287, + "iat": 1697677487, + "intuse": "generic", + "iss": "https://maasand001.eus.attest.azure.net", + "jti": "5f65006d573bc1c04f67820348c20f5d8da72ddbbd4d6c03da8de9f11b5cf29b", + "nbf": 1697677487, + "tdx_mrconfigid": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_mrowner": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_mrownerconfig": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_mrseam": "2fd279c16164a93dd5bf373d834328d46008c2b693af9ebb865b08b2ced320c9a89b4869a9fab60fbe9d0c5a5363c656", + "tdx_mrsignerseam": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_mrtd": "5be56d418d33661a6c21da77c9503a07e430b35eb92a0bd042a6b3c4e79b3c82bb1c594e770d0d129a0724669f1e953f", + "tdx_report_data": "93c6db49f2318387bcebdad0275e206725d948f9000d900344aa44abaef145960000000000000000000000000000000000000000000000000000000000000000", + "tdx_rtmr0": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_rtmr1": 
"000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_rtmr2": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_rtmr3": "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", + "tdx_seam_attributes": "0000000000000000", + "tdx_seamsvn": 3, + "tdx_td_attributes": "0000000000000000", + "tdx_td_attributes_debug": false, + "tdx_td_attributes_key_locker": false, + "tdx_td_attributes_perfmon": false, + "tdx_td_attributes_protection_keys": false, + "tdx_td_attributes_septve_disable": false, + "tdx_tee_tcb_svn": "03000600000000000000000000000000", + "tdx_xfam": "e718060000000000", + "x-ms-attestation-type": "tdxvm", + "x-ms-compliance-status": "azure-compliant-cvm", + "x-ms-policy-hash": "B56nbp5slhw66peoRYkpdq1WykMkEworvdol08hnMXE", + "x-ms-runtime": { + "test-claim-name": "test-claim-value" + }, + "x-ms-ver": "1.0" +} +``` + ## Next steps - [View examples of an attestation policy](policy-examples.md) |
attestation | Trust Domain Extensions Eat Profile | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/attestation/trust-domain-extensions-eat-profile.md | + + Title: Azure Attestation EAT profile for TDX +description: Azure Attestation EAT profile for TDX ++++ Last updated : 10/18/2023+++++# Azure Attestation EAT profile for Intel® Trust Domain Extensions (TDX) ++This profile outlines claims for an [Intel® Trust Domain Extensions (TDX)](https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/overview.html) attestation result generated as an Entity Attestation Token (EAT) by Azure Attestation. ++The profile includes claims from the IETF [JWT](https://datatracker.ietf.org/doc/html/rfc7519) specification, the [EAT](https://datatracker.ietf.org/doc/html/draft-ietf-rats-eat-21)) specification, Intel's TDX specification and Microsoft specific claims. ++## JWT claims ++The complete definitions of the following claims are available in the JWT specification. ++**iat** - The "iat" (issued at) claim identifies the time at which the JWT was issued. ++**exp** - The "exp" (expiration time) claim identifies the expiration time on or after which the JWT MUST NOT be accepted for processing. ++**iss** - The "iss" (issuer) claim identifies the principal that issued the JWT. ++**jti** - The "jti" (JWT ID) claim provides a unique identifier for the JWT. ++**nbf** - The "nbf" (not before) claim identifies the time before which the JWT MUST NOT be accepted for processing. ++## EAT claims ++The complete definitions of the following claims are available in the EAT specification. ++**eat_profile** - The "eat_profile" claim identifies an EAT profile by either a URL or an OID. ++**dbgstat** - The "dbgstat" claim applies to entity-wide or submodule-wide debug facilities of the entity like [JTAG] and diagnostic hardware built into chips. ++**intuse** - The "intuse" claim provides an indication to an EAT consumer about the intended usage of the token. ++## TDX claims ++The complete definitions of the claims are available in the section A.3.2 TD Quote Body of [Intel® TDX DCAP Quoting Library API](https://download.01.org/intel-sgx/latest/dcap-latest/linux/docs/Intel_TDX_DCAP_Quoting_Library_API.pdf) specification. ++**tdx_mrsignerseam** - A 96-character hexadecimal string that represents a byte array of length 48 containing the measurement of the TDX module signer. ++**tdx_mrseam** - A 96-character hexadecimal string that represents a byte array of length 48 containing the measurement of the Intel TDX module. ++**tdx_mrtd** - A 96-character hexadecimal string that represents a byte array of length 48 containing the measurement of the initial contents of the TDX. ++**tdx_rtmr0** - A 96-character hexadecimal string that represents a byte array of length 48 containing the runtime extendable measurement register. ++**tdx_rtmr1** - A 96-character hexadecimal string that represents a byte array of length 48 containing the runtime extendable measurement register. ++**tdx_rtmr2** - A 96-character hexadecimal string that represents a byte array of length 48 containing the runtime extendable measurement register. ++**tdx_rtmr3** - A 96-character hexadecimal string that represents a byte array of length 48 containing the runtime extendable measurement register. ++**tdx_mrconfigid** - A 96-character hexadecimal string that represents a byte array of length 48 containing the software-defined ID for non-owner-defined configuration of the TDX, e.g., runtime or Operating System (OS) configuration. 
++**tdx_mrowner** - A 96-character hexadecimal string that represents a byte array of length 48 containing the software-defined ID for the TDX's owner. ++**tdx_mrownerconfig** - A 96-character hexadecimal string that represents a byte array of length 48 containing the software-defined ID for owner-defined configuration of the TDX, e.g., specific to the workload rather than the runtime or OS. ++**tdx_report_data** - A 128-character hexadecimal string that represents a byte array of length 64. In this context, the TDX has the flexibility to include 64 bytes of custom data in a TDX Report. For instance, this space can be used to hold a nonce, a public key, or a hash of a larger block of data. ++**tdx_seam_attributes** - A 16 character hexadecimal string that represents a byte array of length 8 containing additional configuration of the TDX module. ++**tdx_tee_tcb_svn** - A 32 character hexadecimal string that represents a byte array of length 16 describing the Trusted Computing Base (TCB) Security Version Numbers (SVNs) of TDX. ++**tdx_xfam** - A 16 character hexadecimal string that represents a byte array of length 8 containing a mask of CPU extended features that the TDX is allowed to use. ++**tdx_seamsvn** - A number that represents the Intel TDX module SVN. The complete definition of the claim is available in section 3.1 SEAM_SIGSTRUCT: INTEL® TDX MODULE SIGNATURE STRUCTURE of [Intel® TDX Loader Interface Specification](https://cdrdv2.intel.com/v1/dl/getContent/733584) ++**tdx_td_attributes** - A 16 character hexadecimal string that represents a byte array of length 8. These are the attributes associated with the Trust Domain (TD). The complete definitions of the claims mentioned below are available in the section A.3.4. TD Attributes of [Intel® TDX DCAP Quoting Library API](https://download.01.org/intel-sgx/latest/dcap-latest/linux/docs/Intel_TDX_DCAP_Quoting_Library_API.pdf) specification. ++**tdx_td_attributes_debug** - A boolean value that indicates whether the TD runs in TD debug mode (set to 1) or not (set to 0). In TD debug mode, the CPU state and private memory are accessible by the host VMM. ++**tdx_td_attributes_key_locker** - A boolean value that indicates whether the TD is allowed to use Key Locker. ++**tdx_td_attributes_perfmon** - A boolean value that indicates whether the TD is allowed to use Perfmon and PERF_METRICS capabilities. ++**tdx_td_attributes_protection_keys** - A boolean value that indicates whether the TD is allowed to use Supervisor Protection Keys. ++**tdx_td_attributes_septve_disable** - A boolean value that determines whether to disable EPT violation conversion to #VE on TD access of PENDING pages. ++## Attester claims ++**attester_tcb_status** - A string value that represents the TCB level status of the platform being evaluated. See tcbStatus in [Intel® Trusted Services API Management Developer Portal](https://api.portal.trustedservices.intel.com/documentation). ++## Microsoft specific claims ++**x-ms-attestation-type** - A string value that represents the attestation type. ++**x-ms-policy-hash** - Hash of Azure Attestation evaluation policy computed as BASE64URL(SHA256(UTF8(BASE64URL(UTF8(policy text))))). ++**x-ms-runtime** - JSON object containing "claims" that are defined and generated within the attested environment. This is a specialization of the “enclave held data” concept, where the “enclave held data” is specifically formatted as a UTF-8 encoding of well formed JSON. ++**x-ms-ver** - JWT schema version (expected to be "1.0") +} |
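As a relying party, you would typically verify the token signature and then evaluate a handful of these claims before trusting the trust domain. The following is a hedged sketch of such checks, assuming `PyJWT`; signature validation against the provider's signing keys is omitted here but required in practice, and the expected values come from the profile definitions and the sample token above.

```python
import jwt  # PyJWT

def check_tdx_claims(token: str) -> None:
    """Minimal relying-party checks based on the TDX EAT profile claims."""
    claims = jwt.decode(token, options={"verify_signature": False})  # verify signature in production

    # Profile identifier and attestation type documented above.
    assert claims["eat_profile"] == "https://aka.ms/maa-eat-profile-tdxvm"
    assert claims["x-ms-attestation-type"] == "tdxvm"

    # Reject debug-enabled trust domains and out-of-date TCBs.
    assert claims["dbgstat"] == "disabled"
    assert claims["tdx_td_attributes_debug"] is False
    assert claims["attester_tcb_status"] == "UpToDate"

    # tdx_mrtd is a 48-byte measurement rendered as 96 hex characters;
    # compare it against the measurement you expect for your workload.
    assert len(claims["tdx_mrtd"]) == 96
```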
azure-functions | Durable Functions Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-functions/durable/durable-functions-overview.md | app.get("raiseEventToOrchestration", async function (request, context) { # [Python](#tab/v1-model) ```python-import azure.functions as func import azure.durable_functions as df -myApp = df.DFApp(http_auth_level=func.AuthLevel.ANONYMOUS) -# An HTTP-Triggered Function with a Durable Functions Client binding -@myApp.route(route="orchestrators/{functionName}") -@myApp.durable_client_input(client_name="client") -async def main(client): +async def main(client: str): + durable_client = df.DurableOrchestrationClient(client) is_approved = True- await client.raise_event(instance_id, "ApprovalEvent", is_approved) + await durable_client.raise_event(instance_id, "ApprovalEvent", is_approved) ``` |
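For context, the event raised by the client code above is consumed on the orchestrator side with `wait_for_external_event`. The following is a minimal orchestrator sketch for the v1 programming model; the `ProcessApproval` activity name is hypothetical.

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Blocks until the client raises "ApprovalEvent" for this orchestration instance.
    is_approved = yield context.wait_for_external_event("ApprovalEvent")
    if is_approved:
        yield context.call_activity("ProcessApproval", "approved")  # hypothetical activity
    return is_approved

main = df.Orchestrator.create(orchestrator_function)
```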
azure-functions | Functions Reference Node | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-functions/functions-reference-node.md | The `HttpRequest` object has the following properties: | **`params`** | `Record<string, string>` | Route parameter keys and values. | | **`user`** | `HttpRequestUser | null` | Object representing logged-in user, either through Functions authentication, SWA Authentication, or null when no such user is logged in. | | **`body`** | `Buffer | string | any` | If the media type is "application/octet-stream" or "multipart/*", `body` is a Buffer. If the value is a JSON parse-able string, `body` is the parsed object. Otherwise, `body` is a string. |-| **`rawBody`** | `Buffer | string` | If the media type is "application/octet-stream" or "multipart/*", `rawBody` is a Buffer. Otherwise, `rawBody` is a string. The only difference between `body` and `rawBody` is that `rawBody` doesn't JSON parse a string body. | +| **`rawBody`** | `string` | The body as a string. Despite the name, this property doesn't return a Buffer. | | **`bufferBody`** | `Buffer` | The body as a buffer. | ::: zone-end |
azure-monitor | Agent Windows | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/agent-windows.md | The following steps install and configure the Log Analytics agent in Azure and A 6. On the **Azure Log Analytics** page, perform the following: 1. Paste the **Workspace ID** and **Workspace Key (Primary Key)** that you copied earlier. If the computer should report to a Log Analytics workspace in Azure Government cloud, select **Azure US Government** from the **Azure Cloud** drop-down list. 2. If the computer needs to communicate through a proxy server to the Log Analytics service, click **Advanced** and provide the URL and port number of the proxy server. If your proxy server requires authentication, type the username and password to authenticate with the proxy server and then click **Next**. -7. Click **Next** once you have completed providing the necessary configuration settings.<br><br> :::image type="content" source="media/agent-windows/log-analytics-mma-setup-laworkspace.png" lightbox="media/agent-windows/log-analytics-mma-setup-laworkspace.png" alt-text="paste Workspace ID and Primary Key":::<br><br> +7. Click **Next** once you have completed providing the necessary configuration settings. + <!-- convertborder later --> + :::image type="content" source="media/agent-windows/log-analytics-mma-setup-laworkspace.png" lightbox="media/agent-windows/log-analytics-mma-setup-laworkspace.png" alt-text="paste Workspace ID and Primary Key" border="false":::<br><br> 8. On the **Ready to Install** page, review your choices and then click **Install**. 9. On the **Configuration completed successfully** page, click **Finish**. To retrieve the product code from the agent install package directly, you can us After installation of the agent is finished, you can verify that it's successfully connected and reporting in two ways. -From the computer in **Control Panel**, find the item **Microsoft Monitoring Agent**. Select it, and on the **Azure Log Analytics** tab, the agent should display a message stating *The Microsoft Monitoring Agent has successfully connected to the Microsoft Operations Management Suite service.*<br><br> :::image type="content" source="media/agent-windows/log-analytics-mma-laworkspace-status.png" lightbox="media/agent-windows/log-analytics-mma-laworkspace-status.png" alt-text="Screenshot that shows the MMA connection status to Log Analytics message."::: +From the computer in **Control Panel**, find the item **Microsoft Monitoring Agent**. Select it, and on the **Azure Log Analytics** tab, the agent should display a message stating *The Microsoft Monitoring Agent has successfully connected to the Microsoft Operations Management Suite service.* +<!-- convertborder later --> You can also perform a log query in the Azure portal: |
azure-monitor | Azure Monitor Agent Data Collection Endpoint | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/azure-monitor-agent-data-collection-endpoint.md | Add the data collection endpoints to a new or existing [Azure Monitor Private Li > Other Azure Monitor resources like the Log Analytics workspaces configured in your data collection rules that you want to send data to must be part of this same AMPLS resource. For your data collection endpoints, ensure the **Accept access from public networks not connected through a Private Link Scope** option is set to **No** on the **Network Isolation** tab of your endpoint resource in the Azure portal. This setting ensures that public internet access is disabled and network communication only happens via private links.-+<!-- convertborder later --> ### Associate DCEs to target machines Associate the data collection endpoints to the target resources by editing the data collection rule in the Azure portal. On the **Resources** tab, select **Enable Data Collection Endpoints**. Select a DCE for each virtual machine. See [Configure data collection for Azure Monitor Agent](../agents/data-collection-rule-azure-monitor-agent.md).-+<!-- convertborder later --> ## Next steps |
azure-monitor | Azure Monitor Agent Manage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/azure-monitor-agent-manage.md | You can choose to use the individual policies from the preceding policy initiati The initiatives or policies will apply to each virtual machine as it's created. A [remediation task](../../governance/policy/how-to/remediate-resources.md) deploys the policy definitions in the initiative to existing resources, so you can configure Azure Monitor Agent for any resources that were already created. When you create the assignment by using the Azure portal, you have the option of creating a remediation task at the same time. For information on the remediation, see [Remediate non-compliant resources with Azure Policy](../../governance/policy/how-to/remediate-resources.md).-+<!-- convertborder later --> ## Next steps |
azure-monitor | Azure Monitor Agent Migration Tools | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/azure-monitor-agent-migration-tools.md | You can access the workbook **[here](https://portal.azure.com/#view/AppInsightsE :::image type="content" source="media/azure-monitor-migration-tools/ama-migration-helper.png" lightbox="media/azure-monitor-migration-tools/ama-migration-helper.png" alt-text="Screenshot of the Azure Monitor Agent Migration Helper workbook. The screenshot highlights the Subscription and Workspace dropdowns and shows the Azure Virtual Machines tab, on which you can track which agent is deployed on each virtual machine."::: **Automatic Migration Recommendations**-+<!-- convertborder later --> ## Installing and using DCR Config Generator Azure Monitor Agent relies only on [data collection rules (DCRs)](../essentials/data-collection-rule-overview.md) for configuration, whereas Log Analytics Agent inherits its configuration from Log Analytics workspaces. |
azure-monitor | Azure Monitor Agent Windows Client | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/azure-monitor-agent-windows-client.md | -This article provides instructions and guidance for using the client installer for Azure Monitor Agent. It also explains how to leverage Data Collection Rules on Windows client devices. +This article provides instructions and guidance for using the client installer for Azure Monitor Agent. It also explains how to use Data Collection Rules on Windows client devices. Using the new client installer described here, you can now collect telemetry data from your Windows client devices in addition to servers and virtual machines. Both the [extension](./azure-monitor-agent-manage.md#virtual-machine-extension-details) and this installer use Data Collection rules to configure the **same underlying agent**. > [!NOTE]-> This article provides specific guidance for installing the Azure Monitor agent on Windows client devices, subject to [limitations below](#limitations). For standard installation and management guidance for the agent, refer [the agent extension management guidance here](./azure-monitor-agent-manage.md) +> This article provides specific guidance for installing the Azure Monitor agent on Windows client devices, subject to the [limitations](#limitations). For standard installation and management guidance for the agent, refer [the agent extension management guidance here](./azure-monitor-agent-manage.md) ### Comparison with virtual machine extension Here is a comparison between client installer and VM extension for Azure Monitor agent: Here is a comparison between client installer and VM extension for Azure Monitor | Central configuration | Via Data collection rules | Same | | Associating config rules to agents | DCRs associates directly to individual VM resources | DCRs associate to Monitored Object (MO), which maps to all devices within the Microsoft Entra tenant | | Data upload to Log Analytics | Via Log Analytics endpoints | Same |-| Feature support | All features documented [here](./azure-monitor-agent-overview.md) | Features dependent on AMA agent extension that don't require additional extensions. This includes support for Sentinel Windows Event filtering | +| Feature support | All features documented [here](./azure-monitor-agent-overview.md) | Features dependent on AMA agent extension that don't require more extensions. This includes support for Sentinel Windows Event filtering | | [Networking options](./azure-monitor-agent-overview.md#networking) | Proxy support, Private link support | Proxy support only | Here is a comparison between client installer and VM extension for Azure Monitor | On-premises servers | No | [Virtual machine extension](./azure-monitor-agent-manage.md#virtual-machine-extension-details) (with Azure Arc agent) | Installs the agent using Azure extension framework, provided for on-premises by installing Arc agent | ## Limitations-1. The Windows client installer supports latest Windows machines only that are **Microsoft Entra joined** or Microsoft Entra hybrid joined. More information under [prerequisites](#prerequisites) below -2. The Data Collection rules can only target the Microsoft Entra tenant scope, i.e. all DCRs associated to the tenant (via Monitored Object) will apply to all Windows client machines within that tenant with the agent installed using this client installer. **Granular targeting using DCRs is not supported** for Windows client devices yet +1. 
The Windows client installer supports latest Windows machines only that are **Microsoft Entra joined** or Microsoft Entra hybrid joined. For more information, see the [prerequisites](#prerequisites). +2. The Data Collection rules can only target the Microsoft Entra tenant scope. That is, all DCRs associated to the tenant (via Monitored Object) will apply to all Windows client machines within that tenant with the agent installed using this client installer. **Granular targeting using DCRs is not supported** for Windows client devices yet 3. No support for Windows machines connected via **Azure private links** 4. The agent installed using the Windows client installer is designed mainly for Windows desktops or workstations that are **always connected**. While the agent can be installed via this method on laptops, it is not optimized for battery consumption and network limitations on a laptop. Here is a comparison between client installer and VM extension for Azure Monitor 7. Before using any PowerShell cmdlet, ensure cmdlet related PowerShell module is installed and imported. ## Install the agent-1. Download the Windows MSI installer for the agent using [this link](https://go.microsoft.com/fwlink/?linkid=2192409). You can also download it from **Monitor** > **Data Collection Rules** > **Create** experience on Azure portal (shown below): - :::image type="content" source="media/azure-monitor-agent-windows-client/azure-monitor-agent-client-installer-portal.png" lightbox="media/azure-monitor-agent-windows-client/azure-monitor-agent-client-installer-portal.png" alt-text="Diagram shows download agent link on Azure portal."::: +1. Download the Windows MSI installer for the agent using [this link](https://go.microsoft.com/fwlink/?linkid=2192409). You can also download it from **Monitor** > **Data Collection Rules** > **Create** experience on Azure portal (shown in the following screenshot): + <!-- convertborder later --> + :::image type="content" source="media/azure-monitor-agent-windows-client/azure-monitor-agent-client-installer-portal.png" lightbox="media/azure-monitor-agent-windows-client/azure-monitor-agent-client-installer-portal.png" alt-text="Diagram shows download agent link on Azure portal." border="false"::: 2. Open an elevated admin command prompt window and change directory to the location where you downloaded the installer. 3. To install with **default settings**, run the following command: ```cli msiexec /i AzureMonitorAgentClientSetup.msi /qn ```-4. To install with custom file paths, [network proxy settings](./azure-monitor-agent-overview.md#proxy-configuration), or on a Non-Public Cloud use the command below with the values from the following table: +4. To install with custom file paths, [network proxy settings](./azure-monitor-agent-overview.md#proxy-configuration), or on a Non-Public Cloud use the following command with the values from the following table: ```cli msiexec /i AzureMonitorAgentClientSetup.msi /qn DATASTOREDIR="C:\example\folder" Here is a comparison between client installer and VM extension for Azure Monitor ## Create and associate a 'Monitored Object' You need to create a 'Monitored Object' (MO) that creates a representation for the Microsoft Entra tenant within Azure Resource Manager (ARM). This ARM entity is what Data Collection Rules are then associated with. **This Monitored Object needs to be created only once for any number of machines in a single Microsoft Entra tenant**. 
Currently this association is only **limited** to the Microsoft Entra tenant scope, which means configuration applied to the Microsoft Entra tenant will be applied to all devices that are part of the tenant and running the agent installed via the client installer. Agents installed as virtual machine extension will not be impacted by this.-The image below demonstrates how this works: +The following image demonstrates how this works: +<!-- convertborder later --> --Then, proceed with the instructions below to create and associate them to a Monitored Object, using REST APIs or PowerShell commands. +Then, proceed with the following instructions to create and associate them to a Monitored Object, using REST APIs or PowerShell commands. ### Permissions required-Since MO is a tenant level resource, the scope of the permission would be higher than a subscription scope. Therefore, an Azure tenant admin may be needed to perform this step. [Follow these steps to elevate Microsoft Entra tenant admin as Azure Tenant Admin](../../role-based-access-control/elevate-access-global-admin.md). It will give the Microsoft Entra admin 'owner' permissions at the root scope. This is needed for all methods described below in this section. +Since MO is a tenant level resource, the scope of the permission would be higher than a subscription scope. Therefore, an Azure tenant admin may be needed to perform this step. [Follow these steps to elevate Microsoft Entra tenant admin as Azure Tenant Admin](../../role-based-access-control/elevate-access-global-admin.md). It gives the Microsoft Entra admin 'owner' permissions at the root scope. This is needed for all methods described in the following section. ### Using REST APIs PUT https://management.azure.com/providers/microsoft.insights/providers/microsof After this step is complete, **reauthenticate** your session and **reacquire** your ARM bearer token. #### 2. Create Monitored Object-This step creates the Monitored Object for the Microsoft Entra tenant scope. It will be used to represent client devices that are signed with that Microsoft Entra tenant identity. +This step creates the Monitored Object for the Microsoft Entra tenant scope. It's used to represent client devices that are signed with that Microsoft Entra tenant identity. **Permissions required**: Anyone who has 'Monitored Object Contributor' at an appropriate scope can perform this operation, as assigned in step 1. PUT https://management.azure.com/providers/Microsoft.Insights/monitoredObjects/{ | Name | In | Type | Description | |:|:|:|:|:|-| `AADTenantId` | path | string | ID of the Microsoft Entra tenant that the device(s) belong to. The MO will be created with the same ID | +| `AADTenantId` | path | string | ID of the Microsoft Entra tenant that the device(s) belong to. The MO is created with the same ID | **Headers** - Authorization: ARM Bearer Token PUT https://management.azure.com/providers/Microsoft.Insights/monitoredObjects/{ | Name | Description | |:|:|-| `location` | The Azure region where the MO object would be stored. It should be the **same region** where you created the Data Collection Rule. This is the location of the region from where agent communications would happen. | +| `location` | The Azure region where the MO object would be stored. It should be the **same region** where you created the Data Collection Rule. This region is the location where agent communications would happen. | #### 3. 
Associate DCR to Monitored Object $requestURL = "https://management.azure.com$RespondId/providers/microsoft.insigh ``` ## Verify successful setup Check the 'Heartbeat' table (and other tables you configured in the rules) in the Log Analytics workspace that you specified as a destination in the data collection rule(s).-The `SourceComputerId`, `Computer`, `ComputerIP` columns should all reflect the client device information respectively, and the `Category` column should say 'Azure Monitor Agent'. See example below: -+The `SourceComputerId`, `Computer`, `ComputerIP` columns should all reflect the client device information respectively, and the `Category` column should say 'Azure Monitor Agent'. See the following example: +<!-- convertborder later --> ### Using PowerShell for offboarding ```PowerShell You can use any of the following options to check the installed version of the a - Open **Control Panel** > **Programs and Features** > **Azure Monitor Agent** and click 'Uninstall' - Open **Settings** > **Apps** > **Apps and Features** > **Azure Monitor Agent** and click 'Uninstall' -If you face issues during 'Uninstall', refer to [troubleshooting guidance](#troubleshoot) below +If you face issues during 'Uninstall', refer to the [troubleshooting guidance](#troubleshoot). ### Update the agent In order to update the version, install the new version you wish to update to. |
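If you prefer to script the REST calls above instead of issuing them by hand, a sketch like the following works with `azure-identity` and `requests`. The tenant ID, region, and especially the `api-version` value are assumptions; use the role assignment from step 1 and the exact API version shown in the article's full request examples.

```python
import requests
from azure.identity import DefaultAzureCredential

tenant_id = "<your Microsoft Entra tenant ID>"   # placeholder
api_version = "2021-09-01-preview"               # assumption - confirm against the article

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# Create the Monitored Object for the tenant (step 2 above).
url = ("https://management.azure.com/providers/Microsoft.Insights/"
       f"monitoredObjects/{tenant_id}?api-version={api_version}")
body = {"properties": {"location": "eastus"}}    # same region as your data collection rule

response = requests.put(url, headers=headers, json=body)
response.raise_for_status()
print(response.json()["id"])                     # use this ID when associating the DCR
```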
azure-monitor | Data Sources Performance Counters | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/data-sources-performance-counters.md | description: Learn how to configure collection of performance counters for Windo Previously updated : 06/28/2022 Last updated : 10/19/2023 The following table provides different examples of log queries that retrieve per ## Next steps * [Collect performance counters from Linux applications](data-sources-linux-applications.md), including MySQL and Apache HTTP Server. * Learn about [log queries](../logs/log-query-overview.md) to analyze the data collected from data sources and solutions.-* Export collected data to [Power BI](../logs/log-powerbi.md) for more visualizations and analysis. +* Export collected data to [Power BI](../logs/log-powerbi.md) for more visualizations and analysis. |
azure-monitor | Vmext Troubleshoot | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/agents/vmext-troubleshoot.md | Title: Troubleshoot the Azure Log Analytics VM extension description: Describe the symptoms, causes, and resolution for the most common issues with the Log Analytics VM extension for Windows and Linux Azure VMs. Previously updated : 06/06/2019 Last updated : 10/19/2023 |
azure-monitor | Alerts Manage Alerts Previous Version | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/alerts/alerts-manage-alerts-previous-version.md | The current alert rule wizard is different from the earlier experience: 1. To make alerts stateful, select **Automatically resolve alerts (preview)**. 1. Specify if the alert rule should trigger one or more [action groups](./action-groups.md) when the alert condition is met. > [!NOTE]- > For limits on the actions that can be performed, see [Azure subscription service limits](../../azure-resource-manager/management/azure-subscription-service-limits.md). + > * For limits on the actions that can be performed, see [Azure subscription service limits](../../azure-resource-manager/management/azure-subscription-service-limits.md). + > * Search results were included in the payload of the triggered alert and its associated notifications. **Notice that**: The **email** included only **10 rows** from the unfiltered results while the **webhook payload** contained **1,000 unfiltered results**. 1. (Optional) Customize actions in log alert rules: - **Custom email subject**: Overrides the *email subject* of email actions. You can't modify the body of the mail and this field *isn't for email addresses*. - **Include custom Json payload for webhook**: Overrides the webhook JSON used by action groups, assuming that the action group contains a webhook action. Learn more about [webhook actions for log alerts](./alerts-log-webhook.md). |
azure-monitor | Container Insights Onboard | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/containers/container-insights-onboard.md | The following table lists the extra firewall configuration required for managed | `global.handler.control.monitor.azure.us` | Access control service | 443 | | `<cluster-region-name>.handler.control.monitor.azure.us` | Fetch data collection rules for specific AKS cluster | 443 | +## Troubleshooting +If you have registered your cluster and/or configured HCI Insights before November, 2023, features that use the AMA agent on HCI, such as Arc for Servers Insights, VM Insights, Container Insights, Defender for Cloud or Sentinel may not be collecting logs and event data properly. See [Repair AMA agent for HCI](/azure-stack/hci/manage/monitor-hci-single?tabs=22h2-and-later) for steps to reconfigure the AMA agent and HCI Insights. + ## Next steps After you've enabled monitoring, you can begin analyzing the performance of your Kubernetes clusters that are hosted on AKS, Azure Stack, or another environment. To learn how to use Container insights, see [View Kubernetes cluster performance](container-insights-analyze.md).+ |
azure-monitor | Log Standard Columns | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/logs/log-standard-columns.md | Event The **\_TimeReceived** column contains the date and time that the record was received by the Azure Monitor ingestion point in the Azure cloud. This can be useful for identifying latency issues between the data source and the cloud. An example would be a networking issue causing a delay with data being sent from an agent. See [Log data ingestion time in Azure Monitor](../logs/data-ingestion-time.md) for more details. > [!NOTE]-> The **\_TimeReceived** column is calculate each time it is used. This process is resource intensive. Refine from using it to filter large number of records. Using this function recurrently can lead to increased query execution duration. +> The **\_TimeReceived** column is calculated each time it is used. This process is resource intensive. Refrain from using it to filter a large number of records. Using this function recurrently can lead to increased query execution duration. The following query gives the average latency by hour for event records from an agent. This includes the time from the agent to the cloud and the total time for the record to be available for log queries. It is always more efficient to use the \_SubscriptionId column than extracting i ## \_SubscriptionId The **\_SubscriptionId** column holds the subscription ID of the resource that the record is associated with. This gives you a standard column to use to scope your query to only records from a particular subscription, or to compare different subscriptions. -For Azure resources, the value of **__SubscriptionId** is the subscription part of the [Azure resource ID URL](../../azure-resource-manager/templates/template-functions-resource.md). The column is limited to Azure resources, including [Azure Arc](../../azure-arc/overview.md) resources, or to custom logs that indicated the Resource ID during ingestion. +For Azure resources, the value of **__SubscriptionId** is the subscription part of the [Azure resource ID URL](../../azure-resource-manager/templates/template-functions-resource.md). The column is limited to Azure resources, including [Azure Arc](../../azure-arc/overview.md) resources, or to custom logs that indicated the Subscription ID during ingestion. > [!NOTE] > Some data types already have fields that contain Azure subscription ID. While these fields are kept for backward compatibility, it is recommended to use the \_SubscriptionId column to perform cross correlation since it will be more consistent. |
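To run a latency query like the one described above from code, the `azure-monitor-query` package can execute KQL against a workspace. The following is a minimal sketch; the workspace ID is a placeholder, and the time range is kept narrow because `_TimeReceived` is computed at query time.

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Average ingestion latency per hour, using the _TimeReceived standard column.
query = """
Event
| where TimeGenerated > ago(1d)
| extend LatencySeconds = datetime_diff('second', _TimeReceived, TimeGenerated)
| summarize AvgLatencySeconds = avg(LatencySeconds) by bin(TimeGenerated, 1h)
"""

response = client.query_workspace(
    workspace_id="<workspace-id>",   # placeholder
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```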
azure-monitor | Move Workspace | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-monitor/logs/move-workspace.md | -In this article, you'll learn the steps to move a Log Analytics workspace to another resource group or subscription in the same region. +In this article, you'll learn the steps to move a Log Analytics workspace to another resource group or subscription in the same region. To move a workspace across regions, see [Move a Log Analytics workspace to another region](./move-workspace-region.md). > [!TIP] > To learn more about how to move Azure resources through the Azure portal, PowerShell, the Azure CLI, or the REST API, see [Move resources to a new resource group or subscription](../../azure-resource-manager/management/move-resource-group-and-subscription.md). In this article, you'll learn the steps to move a Log Analytics workspace to ano ## Prerequisites - The subscription or resource group where you want to move your Log Analytics workspace must be located in the same region as the Log Analytics workspace you're moving.- > [!NOTE] - > To move a workspace across regions, see [Move a Log Analytics workspace to another region](./move-workspace-region.md). - The move operation requires that no services can be linked to the workspace. Prior to the move, delete solutions that rely on linked services, including an Azure Automation account. These solutions must be removed before you can unlink your Automation account. Data collection for the solutions will stop and their tables will be removed from the UI, but data will remain in the workspace per the table retention period. When you add solutions after the move, ingestion is restored and tables become visible with data. Linked services include: - Update management - Change tracking |
azure-netapp-files | Backup Introduction | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-netapp-files/backup-introduction.md | Azure NetApp Files backup is supported for the following regions: * Brazil South * Canada Central * Canada East+* Central US * East Asia * East US * East US 2 Azure NetApp Files backup is supported for the following regions: * North Central US * North Europe * Norway East+* Norway West * Qatar Central * South Africa North * South Central US |
azure-resource-manager | User Defined Data Types | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-resource-manager/bicep/user-defined-data-types.md | param serviceConfig ServiceConfig = { type: 'bar', value: true } output config object = serviceConfig ``` -The parameter value is validated based on the discriminated property value. In the preceeding example, if the *serviceConfig* parameter value is of type *foo*, it undersoes validation using the *FooConfig*type. Likewise, if the parameter value is of type *bar*, validation is performed usin the *BarConfig* type, and this pattern continues for other types as well. +The parameter value is validated based on the discriminated property value. In the preceding example, if the *serviceConfig* parameter value is of type *foo*, it undergoes validation using the *FooConfig* type. Likewise, if the parameter value is of type *bar*, validation is performed using the *BarConfig* type, and this pattern continues for other types as well. ## Import types between Bicep files (Preview) |
azure-vmware | Migrate Sql Server Always On Availability Group | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-vmware/migrate-sql-server-always-on-availability-group.md | For any of the following scenarios, ExpressRoute connectivity is recommended for - Production environments - Workloads with large database sizes-- Any case where there is a need to minimize downtime for migration the ExpressRoute connectivity is recommended for the migration.+- Any case where there is a need to minimize downtime Further downtime considerations are discussed in the next section. The following table indicates the estimated downtime for migration of each SQL S | **Scenario** | **Downtime expected** | **Notes** | |:|:--|:--|-| **Standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | -| **Always On SQL Server Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | -| **Always On SQL Server Failover Cluster Instance** | High | All nodes of the cluster are shutdown and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | +| **SQL Server standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | +| **SQL Server Always On Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | +| **SQL Server Always On Failover Cluster Instance** | High | All nodes of the cluster are shut down and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | ## Windows Server Failover Cluster quorum considerations |
azure-vmware | Migrate Sql Server Failover Cluster | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-vmware/migrate-sql-server-failover-cluster.md | For any of the following scenarios, ExpressRoute connectivity is recommended for - Production environments - Workloads with large database sizes-- Any case where there is a need to minimize downtime for migration the ExpressRoute connectivity is recommended for the migration.+- Any case where there is a need to minimize downtime Further downtime considerations are discussed in the next section. The following table indicates the estimated downtime for migration of each SQL S | **Scenario** | **Downtime expected** | **Notes** | |:|:--|:--|-| **Standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | -| **Always On SQL Server Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | -| **Always On SQL Server Failover Cluster Instance** | High | All nodes of the cluster are shutdown and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | +| **SQL Server standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | +| **SQL Server Always On Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | +| **SQL Server Always On Failover Cluster Instance** | High | All nodes of the cluster are shut down and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | ## Windows Server Failover Cluster quorum considerations |
azure-vmware | Migrate Sql Server Standalone Cluster | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/azure-vmware/migrate-sql-server-standalone-cluster.md | For any of the following scenarios, ExpressRoute connectivity is recommended for - Production environments - Workloads with large database sizes-- Any case where there is a need to minimize downtime for migration the ExpressRoute connectivity is recommended for the migration.+- Any case where there is a need to minimize downtime Further downtime considerations are discussed in the next section. The following table indicates the estimated downtime for migration of each SQL S | **Scenario** | **Downtime expected** | **Notes** | |:|:--|:--|-| **Standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | -| **Always On SQL Server Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | -| **Always On SQL Server Failover Cluster Instance** | High | All nodes of the cluster are shutdown and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | +| **SQL Server standalone instance** | Low | Migration is done using VMware vMotion, the database is available during migration time, but it isn't recommended to commit any critical data during it. | +| **SQL Server Always On Availability Group** | Low | The primary replica will always be available during the migration of the first secondary replica and the secondary replica will become the primary after the initial failover to Azure. | +| **SQL Server Always On Failover Cluster Instance** | High | All nodes of the cluster are shut down and migrated using VMware HCX Cold Migration. Downtime duration depends upon database size and private network speed to Azure cloud. | ## Executing the migration |
batch | Simplified Node Communication Pool No Public Ip | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/batch/simplified-node-communication-pool-no-public-ip.md | In a pool without public IP addresses, your virtual machines won't be able to ac Another way to provide outbound connectivity is to use a user-defined route (UDR). This method lets you route traffic to a proxy machine that has public internet access, for example [Azure Firewall](../firewall/overview.md). > [!IMPORTANT]-> There is no extra network resource (load balancer, network security group) created for simplified node communication pools without public IP addresses. Since the compute nodes in the pool are not bound to any load balancer, Azure may provide [Default Outbound Access](../virtual-network/ip-services/default-outbound-access.md). However, Default Outbound Access is not suitable for production workloads, so it is strongly recommended to bring your own Internet outbound access. +> There is no extra network resource (load balancer, network security group) created for simplified node communication pools without public IP addresses. Since the compute nodes in the pool are not bound to any load balancer, Azure may provide [Default Outbound Access](../virtual-network/ip-services/default-outbound-access.md). However, Default Outbound Access is not suitable for production workloads, and will be retired on September 30, 2025 (see the [official announcement](https://azure.microsoft.com/updates/default-outbound-access-for-vms-in-azure-will-be-retired-transition-to-a-new-method-of-internet-access/)). So if your workloads do require internet outbound access, or your pool doesn't use a private endpoint to access the Batch node management endpoint, you must provide your own solution to enable internet outbound access. ## Troubleshooting |
communication-services | Service Limits | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/communication-services/concepts/service-limits.md | Title: Service limits for Azure Communication Services description: Learn how to -+ This sandbox setup is designed to help developers begin building the application |Get chat message|per Chat thread|250|-| |List chat messages|per User per chat thread|50|200| |List chat messages|per Chat thread|250|400|-|Get read receipts|per User per chat thread|5|-| -|Get read receipts|per Chat thread|100|-| +|Get read receipts (20 participant limit**) |per User per chat thread|5|-| +|Get read receipts (20 participant limit**) |per Chat thread|100|-| |List chat thread participants|per User per chat thread|10|-| |List chat thread participants|per Chat thread|250|-| |Send message / update message / delete message|per Chat thread|10|30| This sandbox setup is designed to help developers begin building the application |Send typing indicator|per Chat thread|10|30| > [!NOTE] -> Read receipts and typing indicators are not supported on chat threads with more than 20 participants. +> ** Read receipts and typing indicators are not supported on chat threads with more than 20 participants. ### Chat storage Azure Communication Services stores chat messages indefinitely till they are deleted by the customer. |
communication-services | Sms Faq | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/communication-services/concepts/sms/sms-faq.md | Once you have submitted the short code program brief application in the Azure po - Lower case letters: a - z - Numbers: 0-9 - Spaces- - Special characters: *+*, *-*, _ , & ### Is a number purchase required to use alphanumeric sender ID? The use of alphanumeric sender ID does not require purchase of any phone number. Alphanumeric sender ID can be enabled through the Azure portal. See [enable alphanumeric sender ID quickstart](../../quickstarts/sms/enable-alphanumeric-sender-id.md) for instructions. |
container-apps | Service Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/container-apps/service-connector.md | -Azure Container Apps allows you to use Service Connector to connect to cloud services in just a few steps. Service Connector manages the configuration of the network settings and connection information between different services. To view all supported services, [learn more about Service Connector](../service-connector/overview.md#what-services-are-supported-in-service-connector). +Azure Container Apps allows you to use Service Connector to connect to cloud services in just a few steps. Service Connector manages the configuration of the network settings and connection information between different services. To view all supported services, [learn more about Service Connector](../service-connector/overview.md#what-services-are-supported-by-service-connector). In this article, you learn to connect a container app to Azure Blob Storage. |
cosmos-db | Configure Synapse Link | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cosmos-db/configure-synapse-link.md | The first step to use Synapse Link is to enable it for your Azure Cosmos DB data 1. [Create a new Azure account](create-sql-api-dotnet.md#create-account), or select an existing Azure Cosmos DB account. -1. Navigate to your Azure Cosmos DB account and open the **Azure Synapse Link** under Intergrations in the left pane. +1. Navigate to your Azure Cosmos DB account and open the **Azure Synapse Link** under Integrations in the left pane. 1. Select **Enable**. This process can take 1 to 5 minutes to complete. Please note the following details when enabling Azure Synapse Link on your exist * You won't be able to query analytical store of an existing container while Synapse Link is being enabled on that container. Your OLTP workload isn't impacted and you can keep on reading data normally. Data ingested after the start of the initial sync will be merged into analytical store by the regular analytical store auto-sync process. > [!NOTE]-> Currently you can't enable Synapse Link on your existing MongoDB API containers. Synapse Link can be enabled on newly created Mongo DB containers. +> Now you can enable Synapse Link on your existing MongoDB API collections, using Azure CLI or PowerShell. ### Azure portal Please note the following details when enabling Azure Synapse Link on your exist The following options enable Synapse Link in a container by using Azure CLI by setting the `--analytical-storage-ttl` property. -* [Create an Azure Cosmos DB MongoDB collection](/cli/azure/cosmosdb/mongodb/collection#az-cosmosdb-mongodb-collection-create-examples) +* [Create or update an Azure Cosmos DB MongoDB collection](/cli/azure/cosmosdb/mongodb/collection#az-cosmosdb-mongodb-collection-create-examples) * [Create or update an Azure Cosmos DB SQL API container](/cli/azure/cosmosdb/sql/container#az-cosmosdb-sql-container-create) ##### Use Azure CLI to enable Synapse Link for Azure Synapse Link for Gremlin API Graphs For existing graphs, replace `create` with `update`. The following options enable Synapse Link in a container by using Azure CLI by setting the `-AnalyticalStorageTtl` property. -* [Create an Azure Cosmos DB MongoDB collection](/powershell/module/az.cosmosdb/new-azcosmosdbmongodbcollection#description) +* [Create or update an Azure Cosmos DB MongoDB collection](/powershell/module/az.cosmosdb/new-azcosmosdbmongodbcollection#description) * [Create or update an Azure Cosmos DB SQL API container](/powershell/module/az.cosmosdb/new-azcosmosdbsqlcontainer) |
cosmos-db | Continuous Backup Restore Introduction | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cosmos-db/continuous-backup-restore-introduction.md | Currently, you can restore an Azure Cosmos DB account (API for NoSQL or MongoDB, By default, Azure Cosmos DB stores continuous mode backup data in locally redundant storage blobs. For the regions that have zone redundancy configured, the backup is stored in zone-redundant storage blobs. In continuous backup mode, you can't update the backup storage redundancy. ## Different ways to restore-Continuous backup mode supports two ways to restore deleted containers and databases. They can be restored into a [new account](restore-account-continuous-backup.md) as documented here or can be restored into an existing account as described [here](restore-account-continuous-backup.md). The choice between these two depends on the scenarios and impact. In most cases it is preferred to restore deleted containers and databases into an existing account to prevent the cost of data transfer which is required in the case they are restored to a new account. For scenarios where you have modified the data accidentally restore into new account could be the prefered option. +Continuous backup mode supports two ways to restore deleted containers and databases. They can be restored into a [new account](restore-account-continuous-backup.md) as documented here or can be restored into an existing account as described [here](restore-account-continuous-backup.md). The choice between these two depends on the scenario and impact. In most cases it's preferable to restore deleted containers and databases into an existing account, to avoid the data transfer cost that restoring to a new account requires. For scenarios where you have accidentally modified the data, restoring into a new account could be the preferred option. ## What is restored into a new account? Currently the point in time restore functionality has the following limitations: * Multi-regions write accounts aren't supported. -* Currently Azure Synapse Link isn't fully compatible with continuous backup mode. For more information about backup with analytical store, see [analytical store backup](analytical-store-introduction.md#backup). +* Currently Azure Synapse Link can be enabled, in preview, in continuous backup database accounts. The opposite situation isn't supported yet: it's not possible to turn on continuous backup in Synapse Link enabled database accounts. Also, analytical store isn't included in backups. For more information about backup and analytical store, see [analytical store backup](analytical-store-introduction.md#backup). * The restored account is created in the same region where your source account exists. You can't restore an account into a region where the source account didn't exist. |
cosmos-db | Synapse Link | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cosmos-db/synapse-link.md | Azure Synapse Link isn't recommended if you're looking for traditional data ware * Accessing the Azure Cosmos DB analytics store with Azure Synapse Dedicated SQL Pool currently isn't supported. -* Enabling Azure Synapse Link on existing Azure Cosmos DB containers is only supported for API for NoSQL accounts. Azure Synapse Link can be enabled on new containers for both API for NoSQL and MongoDB accounts. - * Although analytical store data isn't backed up, and therefore can't be restored, you can rebuild your analytical store by reenabling Azure Synapse Link in the restored container. Check the [analytical store documentation](analytical-store-introduction.md) for more information. -* Currently Azure Synapse Link isn't fully compatible with continuous backup mode. Check the [analytical store documentation](analytical-store-introduction.md) for more information. +* The capability to turn on Synapse Link in database accounts with continuous backup enabled is now in preview. The opposite situation, turning on continuous backup in Synapse Link enabled database accounts, isn't supported yet. * Granular role-based access control isn't supported when querying from Synapse. Users that have access to your Synapse workspace and have access to the Azure Cosmos DB account can access all containers within that account. We currently don't support more granular access to the containers. |
cosmos-db | Vector Search | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cosmos-db/vector-search.md | Implement RAG-patterns with Azure Cosmos DB for NoSQL and Azure Cognitive Search ### Code samples -- [.NET retail chatbot reference solution](https://github.com/Azure/Vector-Search-AI-Assistant/tree/cognitive-search-vector)-- [.NET samples - Hackathon project](https://github.com/AzureCosmosDB/OpenAIHackathon)+- [.NET RAG Pattern retail reference solution](https://github.com/Azure/Vector-Search-AI-Assistant-MongoDBvCore) +- [.NET samples - Hackathon project](https://github.com/Azure/Build-Modern-AI-Apps-Hackathon) - [.NET tutorial - recipe chatbot](https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/tree/main/C%23/CosmosDB-NoSQL_CognitiveSearch) - [.NET tutorial - recipe chatbot w/ Semantic Kernel](https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/tree/main/C%23/CosmosDB-NoSQL_CognitiveSearch_SemanticKernel) - [Python notebook tutorial - Azure product chatbot](https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/tree/main/Python/CosmosDB-NoSQL_CognitiveSearch) RAG can be applied using the native vector search feature in Azure Cosmos DB for ### Code samples -- [.NET retail chatbot sample](https://github.com/Azure/Vector-Search-AI-Assistant/tree/mongovcorev2)+- [.NET RAG Pattern retail reference solution](https://github.com/Azure/Vector-Search-AI-Assistant-MongoDBvCore) - [.NET tutorial - recipe chatbot](https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/tree/main/C%23/CosmosDB-MongoDBvCore) - [Python notebook tutorial - Azure product chatbot](https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/tree/main/Python/CosmosDB-MongoDB-vCore) You can employ RAG by utilizing native vector search within Azure Cosmos DB for + |
cost-management-billing | Pay By Invoice | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/cost-management-billing/manage/pay-by-invoice.md | If you signed up for Azure through a Microsoft representative, then your default When you switch to pay by wire transfer, you must pay your bill within 30 days of the invoice date by wire transfer. -Users with a Microsoft Customer Agreement must always submit a request [Submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer) to Azure support to enable pay by wire transfer. +Users with a Microsoft Customer Agreement must always [submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer) to Azure support to enable pay by wire transfer. -Customers who have a Microsoft Online Services Program (pay-as-you-go) account can use the Azure portal to [Request to pay by wire transfer](#request-to-pay-by-wire-transfer). +Customers who have a Microsoft Online Services Program (pay-as-you-go) account can use the Azure portal to [request to pay by wire transfer](#request-to-pay-by-wire-transfer). > [!IMPORTANT] > * Pay by wire transfer is only available for customers using Azure on behalf of a company. Customers who have a Microsoft Online Services Program (pay-as-you-go) account c ## Request to pay by wire transfer > [!NOTE]-> Currently only customers in the United States can get automatically approved to change their payment method to wire transfer. Support for other regions is being evaluated. If you are not in the United States, you must [Submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer) to change your payment method. +> Currently only customers in the United States can get automatically approved to change their payment method to wire transfer. Support for other regions is being evaluated. If you are not in the United States, you must [submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer) to change your payment method. 1. Sign in to the Azure portal. 1. Navigate to **Subscriptions** and then select the one that you want to set up wire transfer for. Customers who have a Microsoft Online Services Program (pay-as-you-go) account c 1. On the **Pay by wire transfer** page, you see a message stating that you can request to use wire transfer instead of automatic payment using a credit or debit card. Select **Continue** to start the check. 1. Depending on your approval status: - If you're automatically approved, the page shows a message stating that you've been approved to pay by wire transfer. Enter your **Company name** and then select **Save**. - - If the request couldn't be processed or if you're not approved, you need to follow the steps in the next [Submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer) section. + - If the request couldn't be processed or if you're not approved, you need to follow the steps in the next section [Submit a request to set up pay by wire transfer](#submit-a-request-to-set-up-pay-by-wire-transfer). 1. If you've been approved, on the Payment methods page under **Other payment methods**, to the right of **Wire transfer**, select the ellipsis (**...**) symbol and then select **Make default**. You're all set to pay by wire transfer. On the Payment methods page, select **Pay by wire transfer**. 
### Switch billing profile to wire transfer -Using the following steps to switch a billing profile to wire transfer. Only the person who signed up for Azure can change the default payment method of a billing profile. +Use the following steps to switch a billing profile to wire transfer. Only the person who signed up for Azure can change the default payment method of a billing profile. 1. Go to the Azure portal to view your billing information. Search for and select **Cost Management + Billing**. 1. In the menu, choose **Billing profiles**. |
data-factory | Compare Versions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/compare-versions.md | - Title: Compare Azure Data Factory with Data Factory version 1 -description: This article compares Azure Data Factory with Azure Data Factory version 1. ----- Previously updated : 04/12/2023--# Compare Azure Data Factory with Data Factory version 1 ---This article compares Data Factory with Data Factory version 1. For an introduction to Data Factory, see [Introduction to Data Factory](introduction.md).For an introduction to Data Factory version 1, see [Introduction to Azure Data Factory](v1/data-factory-introduction.md). --## Feature comparison -The following table compares the features of Data Factory with the features of Data Factory version 1. --| Feature | Version 1 | Current version | -| - | | | -| Datasets | A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data.<br/><br/>**Availability** defines the processing window slicing model for the dataset (for example, hourly, daily, and so on). | Datasets are the same in the current version. However, you do not need to define **availability** schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see [Triggers](concepts-pipeline-execution-triggers.md#trigger-execution-with-json) and [Datasets](concepts-datasets-linked-services.md). | -| Linked services | Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources. | Linked services are the same as in Data Factory V1, but with a new **connectVia** property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see [Integration runtime in Azure Data Factory](concepts-integration-runtime.md) and [Linked service properties for Azure Blob storage](connector-azure-blob-storage.md#linked-service-properties). | -| Pipelines | A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines. | Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as "workflow units" that you schedule separately via triggers. <br/><br/>Pipelines do not have "windows" of time execution in the current version of Data Factory. The Data Factory V1 concepts of startTime, endTime, and isPaused are no longer present in the current version of Data Factory. For more information, see [Pipeline execution and triggers](concepts-pipeline-execution-triggers.md) and [Pipelines and activities](concepts-pipelines-activities.md). | -| Activities | Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported. | In the current version of Data Factory, activities still are defined actions within a pipeline. 
The current version of Data Factory introduces new [control flow activities](concepts-pipelines-activities.md#control-flow-activities). You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version. | -| Hybrid data movement and activity dispatch | Now called Integration Runtime, [Data Management Gateway](v1/data-factory-data-management-gateway.md) supported moving data between on-premises and cloud.| Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1. <br/><br/> The Azure-SSIS Integration Runtime in the current version of Data Factory also supports deploying and running SQL Server Integration Services (SSIS) packages in the cloud. For more information, see [Integration runtime in Azure Data Factory](concepts-integration-runtime.md).| -| Parameters | NA | Parameters are key-value pairs of read-only configuration settings that are defined in pipelines. You can pass arguments for the parameters when you are manually running the pipeline. If you are using a scheduler trigger, the trigger can pass values for the parameters too. Activities within the pipeline consume the parameter values. | -| Expressions | Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties. | In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see [Expressions and functions in the current version of Data Factory](control-flow-expression-language-functions.md).| -| Pipeline runs | NA | A single instance of a pipeline execution. For example, say you have a pipeline that executes at 8 AM, 9 AM, and 10 AM. There would be three separate runs of the pipeline (pipeline runs) in this case. Each pipeline run has a unique pipeline run ID. The pipeline run ID is a GUID that uniquely defines that particular pipeline run. Pipeline runs are typically instantiated by passing arguments to parameters that are defined in the pipelines. | -| Activity runs | NA | An instance of an activity execution within a pipeline. | -| Trigger runs | NA | An instance of a trigger execution. For more information, see [Triggers](concepts-pipeline-execution-triggers.md). | -| Scheduling | Scheduling is based on pipeline start/end times and dataset availability. | Scheduler trigger or execution via external scheduler. For more information, see [Pipeline execution and triggers](concepts-pipeline-execution-triggers.md). | --The following sections provide more information about the capabilities of the current version. --## Control flow -To support diverse integration flows and patterns in the modern data warehouse, the current version of Data Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common flows that were previously not possible are now enabled. They are described in the following sections. --### Chaining activities -In V1, you had to configure the output of an activity as an input of another activity to chain them. in the current version, you can chain activities in a sequence within a pipeline. You can use the **dependsOn** property in an activity definition to chain it with an upstream activity. 
For more information and an example, see [Pipelines and activities](concepts-pipelines-activities.md#multiple-activities-in-a-pipeline) and [Branching and chaining activities](tutorial-control-flow.md). --### Branching activities -in the current version, you can branch activities within a pipeline. The [If-condition activity](control-flow-if-condition-activity.md) provides the same functionality that an `if` statement provides in programming languages. It evaluates a set of activities when the condition evaluates to `true` and another set of activities when the condition evaluates to `false`. For examples of branching activities, see the [Branching and chaining activities](tutorial-control-flow.md) tutorial. --### Parameters -You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more information, see [Pipelines and triggers](concepts-pipeline-execution-triggers.md). --### Custom state passing -Activity outputs including state can be consumed by a subsequent activity in the pipeline. For example, in the JSON definition of an activity, you can access the output of the previous activity by using the following syntax: `@activity('NameofPreviousActivity').output.value`. By using this feature, you can build workflows where values can pass through activities. --### Looping containers -The [ForEach activity](control-flow-for-each-activity.md) defines a repeating control flow in your pipeline. This activity iterates over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages. --The [Until](control-flow-until-activity.md) activity provides the same functionality that a do-until looping structure provides in programming languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to `true`. You can specify a timeout value for the until activity in Data Factory. --### Trigger-based flows -Pipelines can be triggered by on-demand (event-based, i.e. blob post) or wall-clock time. The [pipelines and triggers](concepts-pipeline-execution-triggers.md) article has detailed information about triggers. --### Invoking a pipeline from another pipeline -The [Execute Pipeline activity](control-flow-execute-pipeline-activity.md) allows a Data Factory pipeline to invoke another pipeline. --### Delta flows -A key use case in ETL patterns is "delta loads," in which only data that has changed since the last iteration of a pipeline is loaded. New capabilities in the current version, such as [lookup activity](control-flow-lookup-activity.md), flexible scheduling, and control flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see [Tutorial: Incremental copy](tutorial-incremental-copy-powershell.md). --### Other control flow activities -Following are a few more control flow activities that are supported by the current version of Data Factory. --Control activity | Description -- | ---[ForEach activity](control-flow-for-each-activity.md) | Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to Foreach looping structure in programming languages. 
-[Web activity](control-flow-web-activity.md) | Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity. -[Lookup activity](control-flow-lookup-activity.md) | Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities. -[Get metadata activity](control-flow-get-metadata-activity.md) | Retrieves the metadata of any data in Azure Data Factory. -[Wait activity](control-flow-wait-activity.md) | Pauses the pipeline for a specified period of time. --## Deploy SSIS packages to Azure -You use Azure-SSIS if you want to move your SSIS workloads to the cloud, create a data factory by using the current version, and provision an Azure-SSIS Integration Runtime. --The Azure-SSIS Integration Runtime is a fully managed cluster of Azure VMs (nodes) that are dedicated to running your SSIS packages in the cloud. After you provision Azure-SSIS Integration Runtime, you can use the same tools that you have been using to deploy SSIS packages to an on-premises SSIS environment. --For example, you can use SQL Server Data Tools or SQL Server Management Studio to deploy SSIS packages to this runtime on Azure. For step-by-step instructions, see the tutorial [Deploy SQL Server integration services packages to Azure](./tutorial-deploy-ssis-packages-azure.md). --## Flexible scheduling -In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to pipelines from a trigger for a flexible scheduling and execution model. --Pipelines do not have "windows" of time execution in the current version of Data Factory. The Data Factory V1 concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more information about how to build and then schedule a pipeline in the current version of Data Factory, see [Pipeline execution and triggers](concepts-pipeline-execution-triggers.md). --## Support for more data stores -The current version supports the copying of data to and from more data stores than V1. For a list of supported data stores, see the following articles: --- [Version 1 - supported data stores](v1/data-factory-data-movement-activities.md#supported-data-stores-and-formats)-- [Current version - supported data stores](copy-activity-overview.md#supported-data-stores-and-formats)--## Support for on-demand Spark cluster -The current version supports the creation of an on-demand Azure HDInsight Spark cluster. To create an on-demand Spark cluster, specify the cluster type as Spark in your on-demand, HDInsight linked service definition. Then you can configure the Spark activity in your pipeline to use this linked service. --At runtime, when the activity is executed, the Data Factory service automatically creates the Spark cluster for you. For more information, see the following articles: --- [Spark Activity in the current version of Data Factory](transform-data-using-spark.md)-- [Azure HDInsight on-demand linked service](compute-linked-services.md#azure-hdinsight-on-demand-linked-service)--## Custom activities -In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that implements the Execute method of the IDotNetActivity interface. 
Therefore, you need to write your custom code in .NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes. --In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run commands, scripts, and your own custom code compiled as an executable. --For more information, see [Difference between custom activity in Data Factory and version 1](transform-data-using-dotnet-custom-activity.md#compare-v2-v1). --## SDKs - the current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and monitor pipelines. --- **.NET SDK**: The .NET SDK is updated in the current version.--- **PowerShell**: The PowerShell cmdlets are updated in the current version. The cmdlets for the current version have **DataFactoryV2** in the name, for example: Get-AzDataFactoryV2. --- **Python SDK**: This SDK is new in the current version.--- **REST API**: The REST API is updated in the current version. --The SDKs that are updated in the current version are not backward-compatible with V1 clients. --## Authoring experience --| | Version 2 | Version 1 | -| | -- | -- | -| **Azure portal** | [Yes](quickstart-create-data-factory-portal.md) | No | -| **Azure PowerShell** | [Yes](quickstart-create-data-factory-powershell.md) | [Yes](./v1/data-factory-build-your-first-pipeline-using-powershell.md) | -| **.NET SDK** | [Yes](quickstart-create-data-factory-dot-net.md) | [Yes](./v1/data-factory-build-your-first-pipeline-using-vs.md) | -| **REST API** | [Yes](quickstart-create-data-factory-rest-api.md) | [Yes](./v1/data-factory-build-your-first-pipeline-using-rest-api.md) | -| **Python SDK** | [Yes](quickstart-create-data-factory-python.md) | No | -| **Resource Manager template** | [Yes](quickstart-create-data-factory-resource-manager-template.md) | [Yes](./v1/data-factory-build-your-first-pipeline-using-arm.md) | --## Roles and permissions --The Data Factory version 1 Contributor role can be used to create and manage the current version of Data Factory resources. For more info, see [Data Factory Contributor](../role-based-access-control/built-in-roles.md#data-factory-contributor). --## Monitoring experience -in the current version, you can also monitor data factories by using [Azure Monitor](monitor-using-azure-monitor.md). The new PowerShell cmdlets support monitoring of [integration runtimes](monitor-integration-runtime.md). Both V1 and V2 support visual monitoring via a monitoring application that can be launched from the Azure portal. ---## Next steps -Learn how to create a data factory by following step-by-step instructions in the following quickstarts: [PowerShell](quickstart-create-data-factory-powershell.md), [.NET](quickstart-create-data-factory-dot-net.md), [Python](quickstart-create-data-factory-python.md), [REST API](quickstart-create-data-factory-rest-api.md). |
data-factory | Concepts Datasets Linked Services | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/concepts-datasets-linked-services.md | Last updated 02/08/2023 # Datasets in Azure Data Factory and Azure Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-create-datasets.md) -> * [Current version](concepts-datasets-linked-services.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Concepts Linked Services | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/concepts-linked-services.md | Last updated 10/25/2022 # Linked services in Azure Data Factory and Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"] -> * [Version 1](v1/data-factory-create-datasets.md) -> * [Current version](concepts-linked-services.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article describes what linked services are, how they're defined in JSON format, and how they're used in Azure Data Factory and Azure Synapse Analytics. |
data-factory | Concepts Pipeline Execution Triggers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/concepts-pipeline-execution-triggers.md | -> [!div class="op_single_selector" title1="Select the version of the Data Factory service that you're using:"] -> * [Version 1](v1/data-factory-scheduling-and-execution.md) -> * [Current version](concepts-pipeline-execution-triggers.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] A _pipeline run_ in Azure Data Factory and Azure Synapse defines an instance of a pipeline execution. For example, say you have a pipeline that executes at 8:00 AM, 9:00 AM, and 10:00 AM. In this case, there are three separate runs of the pipeline or pipeline runs. Each pipeline run has a unique pipeline run ID. A run ID is a GUID that uniquely defines that particular pipeline run. |
data-factory | Concepts Pipelines Activities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/concepts-pipelines-activities.md | Last updated 10/24/2022 # Pipelines and activities in Azure Data Factory and Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"] -> * [Version 1](v1/data-factory-create-pipelines.md) -> * [Current version](concepts-pipelines-activities.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] [!INCLUDE[ML Studio (classic) retirement](../../includes/machine-learning-studio-classic-deprecation.md)] |
data-factory | Connector Amazon Redshift | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-amazon-redshift.md | Last updated 07/13/2023 # Copy data from Amazon Redshift using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-amazon-redshift-connector.md) -> * [Current version](connector-amazon-redshift.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Amazon Simple Storage Service | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-amazon-simple-storage-service.md | Last updated 06/05/2023 # Copy and transform data in Amazon Simple Storage Service using Azure Data Factory or Azure Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"] -> -> * [Version 1](v1/data-factory-amazon-simple-storage-service-connector.md) -> * [Current version](connector-amazon-simple-storage-service.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Azure Blob Storage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-blob-storage.md | Last updated 09/29/2023 # Copy and transform data in Azure Blob Storage by using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"] -> - [Version 1](v1/data-factory-azure-blob-connector.md) -> - [Current version](connector-azure-blob-storage.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy activity in Azure Data Factory and Azure Synapse pipelines to copy data from and to Azure Blob Storage. It also describes how to use the Data Flow activity to transform data in Azure Blob Storage. To learn more, read the [Azure Data Factory](introduction.md) and the [Azure Synapse Analytics](..\synapse-analytics\overview-what-is.md) introduction articles. |
data-factory | Connector Azure Cosmos Analytical Store | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-cosmos-analytical-store.md | Last updated 03/31/2023 # Copy and transform data in Azure Cosmos DB analytical store by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Current version](connector-azure-cosmos-analytical-store.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Data Flow to transform data in Azure Cosmos DB analytical store. To learn more, read the introductory articles for [Azure Data Factory](introduction.md) and [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). When transforming data in mapping data flow, you can read and write to collectio Settings specific to Azure Cosmos DB are available in the **Source Options** tab of the source transformation. -**Include system columns:** If true, ```id```, ```_ts```, and other system columns will be included in your data flow metadata from Azure Cosmos DB. When updating collections, it is important to include this so that you can grab the existing row ID. +**Include system columns:** If true, ```id```, ```_ts```, and other system columns are included in your data flow metadata from Azure Cosmos DB. When updating collections, it's important to include this so that you can grab the existing row ID. **Page size:** The number of documents per page of the query result. Default is "-1" which uses the service dynamic page up to 1000. Settings specific to Azure Cosmos DB are available in the **Source Options** tab **Preferred regions:** Choose the preferred read regions for this process. -**Change feed:** If true, you will get data from [Azure Cosmos DB change feed](../cosmos-db/change-feed.md) which is a persistent record of changes to a container in the order they occur from last run automatically. When you set it true, do not set both **Infer drifted column types** and **Allow schema drift** as true at the same time. For more details, see [Azure Cosmos DB change feed](#azure-cosmos-db-change-feed). +**Change feed:** If true, you'll get data from [Azure Cosmos DB change feed](../cosmos-db/change-feed.md), which is a persistent record of changes to a container in the order they occur from last run automatically. When you set it true, don't set both **Infer drifted column types** and **Allow schema drift** as true at the same time. For more information, see [Azure Cosmos DB change feed](#azure-cosmos-db-change-feed). -**Start from beginning:** If true, you will get initial load of full snapshot data in the first run, followed by capturing changed data in next runs. If false, the initial load will be skipped in the first run, followed by capturing changed data in next runs. The setting is aligned with the same setting name in [Azure Cosmos DB reference](https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references#reading-cosmosdb-collection-change-feed). For more details, see [Azure Cosmos DB change feed](#azure-cosmos-db-change-feed). +**Start from beginning:** If true, you'll get initial load of full snapshot data in the first run, followed by capturing changed data in next runs. If false, the initial load will be skipped in the first run, followed by capturing changed data in next runs. 
The setting is aligned with the same setting name in [Azure Cosmos DB reference](https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuration-references#reading-cosmosdb-collection-change-feed). For more information, see [Azure Cosmos DB change feed](#azure-cosmos-db-change-feed). ### Sink transformation Settings specific to Azure Cosmos DB are available in the **Settings** tab of th **Update method:** Determines what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter. **Collection action:** Determines whether to recreate the destination collection prior to writing.-* None: No action will be done to the collection. -* Recreate: The collection will get dropped and recreated +* None: No action is done to the collection. +* Recreate: The collection gets dropped and recreated **Batch size**: An integer that represents how many objects are being written to Azure Cosmos DB collection in each batch. Usually, starting with the default batch size is sufficient. To further tune this value, note: -- Azure Cosmos DB limits single request's size to 2MB. The formula is "Request Size = Single Document Size * Batch Size". If you hit error saying "Request size is too large", reduce the batch size value.+- Azure Cosmos DB limits single request's size to 2 MB. The formula is "Request Size = Single Document Size * Batch Size". If you hit error saying "Request size is too large", reduce the batch size value. - The larger the batch size, the better throughput the service can achieve, while make sure you allocate enough RUs to empower your workload. **Partition key:** Enter a string that represents the partition key for your collection. Example: ```/movies/title``` Settings specific to Azure Cosmos DB are available in the **Settings** tab of th ## Azure Cosmos DB change feed -Azure Data Factory can get data from [Azure Cosmos DB change feed](../cosmos-db/change-feed.md) by enabling it in the mapping data flow source transformation. With this connector option, you can read change feeds and apply transformations before loading transformed data into destination datasets of your choice. You do not have to use Azure functions to read the change feed and then write custom transformations. You can use this option to move data from one container to another, prepare change feed driven material views for fit purpose or automate container backup or recovery based on change feed, and enable many more such use cases using visual drag and drop capability of Azure Data Factory. +Azure Data Factory can get data from [Azure Cosmos DB change feed](../cosmos-db/change-feed.md) by enabling it in the mapping data flow source transformation. With this connector option, you can read change feeds and apply transformations before loading transformed data into destination datasets of your choice. You don't have to use Azure functions to read the change feed and then write custom transformations. You can use this option to move data from one container to another, prepare change feed driven material views for fit purpose or automate container backup or recovery based on change feed, and enable many more such use cases using visual drag and drop capability of Azure Data Factory. 
Make sure you keep the pipeline and activity name unchanged, so that the checkpoint can be recorded by ADF for you to get changed data from the last run automatically. If you change your pipeline name or activity name, the checkpoint will be reset, which leads you to start from beginning or get changes from now in the next run. -When you debug the pipeline, this feature works the same. Be aware that the checkpoint will be reset when you refresh your browser during the debug run. After you are satisfied with the pipeline result from debug run, you can go ahead to publish and trigger the pipeline. At the moment when you first time trigger your published pipeline, it automatically restarts from the beginning or gets changes from now on. +When you debug the pipeline, this feature works the same. The checkpoint will be reset when you refresh your browser during the debug run. After you're satisfied with the pipeline result from the debug run, you can go ahead to publish and trigger the pipeline. The first time you trigger your published pipeline, it automatically restarts from the beginning or gets changes from now on. -In the monitoring section, you always have the chance to rerun a pipeline. When you are doing so, the changed data is always captured from the previous checkpoint of your selected pipeline run. +In the monitoring section, you always have the chance to rerun a pipeline. When you're doing so, the changed data is always captured from the previous checkpoint of your selected pipeline run. In addition, Azure Cosmos DB analytical store now supports Change Data Capture (CDC) for Azure Cosmos DB API for NoSQL and Azure Cosmos DB API for Mongo DB (public preview). Azure Cosmos DB analytical store allows you to efficiently consume a continuous and incremental feed of changed (inserted, updated, and deleted) data from analytical store. |
data-factory | Connector Azure Cosmos Db | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-cosmos-db.md | Last updated 03/02/2023 # Copy and transform data in Azure Cosmos DB for NoSQL by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-azure-documentdb-connector.md) -> * [Current version](connector-azure-cosmos-db.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB for NoSQL, and use Data Flow to transform data in Azure Cosmos DB for NoSQL. To learn more, read the introductory articles for [Azure Data Factory](introduction.md) and [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Azure Data Lake Store | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-data-lake-store.md | Last updated 08/10/2023 # Copy data to or from Azure Data Lake Storage Gen1 using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Azure Data Factory that you're using:"] -> -> * [Version 1](v1/data-factory-azure-datalake-connector.md) -> * [Current version](connector-azure-data-lake-store.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to copy data to and from Azure Data Lake Storage Gen1. To learn more, read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Azure Search | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-search.md | Last updated 07/13/2023 # Copy data to an Azure Cognitive Search index using Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-azure-search-connector.md) -> * [Current version](connector-azure-search.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in an Azure Data Factory or Synapse Analytics pipeline to copy data into Azure Cognitive Search index. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Azure Sql Data Warehouse | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-sql-data-warehouse.md | Last updated 04/20/2023 # Copy and transform data in Azure Synapse Analytics by using Azure Data Factory or Synapse pipelines -> [!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"] -> -> - [Version1](v1/data-factory-azure-sql-data-warehouse-connector.md) -> - [Current version](connector-azure-sql-data-warehouse.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Copy Activity in Azure Data Factory or Synapse pipelines to copy data from and to Azure Synapse Analytics, and use Data Flow to transform data in Azure Data Lake Storage Gen2. To learn about Azure Data Factory, read the [introductory article](introduction.md). To use this feature, create an [Azure Blob Storage linked service](connector-azu ## Use PolyBase to load data into Azure Synapse Analytics -Using [PolyBase](/sql/relational-databases/polybase/polybase-guide) is an efficient way to load a large amount of data into Azure Synapse Analytics with high throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics](v1/data-factory-load-sql-data-warehouse.md). +Using [PolyBase](/sql/relational-databases/polybase/polybase-guide) is an efficient way to load a large amount of data into Azure Synapse Analytics with high throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. - If your source data is in **Azure Blob, Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2**, and the **format is PolyBase compatible**, you can use copy activity to directly invoke PolyBase to let Azure Synapse Analytics pull the data from source. For details, see **[Direct copy by using PolyBase](#direct-copy-by-using-polybase)**. - If your source data store and format isn't originally supported by PolyBase, use the **[Staged copy by using PolyBase](#staged-copy-by-using-polybase)** feature instead. The staged copy feature also provides you better throughput. It automatically converts the data into PolyBase-compatible format, stores the data in Azure Blob storage, then calls PolyBase to load data into Azure Synapse Analytics. |
data-factory | Connector Azure Sql Database | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-sql-database.md | Last updated 04/06/2023 # Copy and transform data in Azure SQL Database by using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Azure Data Factory that you're using:"] -> -> - [Version 1](v1/data-factory-azure-sql-connector.md) -> - [Current version](connector-azure-sql-database.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Copy Activity in Azure Data Factory or Azure Synapse pipelines to copy data from and to Azure SQL Database, and use Data Flow to transform data in Azure SQL Database. To learn more, read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Azure Table Storage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-azure-table-storage.md | Last updated 07/13/2023 # Copy data to and from Azure Table storage using Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-azure-table-connector.md) -> * [Current version](connector-azure-table-storage.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Cassandra | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-cassandra.md | Last updated 01/25/2023 # Copy data from Cassandra using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-cassandra-connector.md) -> * [Current version](connector-cassandra.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Db2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-db2.md | Last updated 07/13/2023 # Copy data from DB2 using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-db2-connector.md) -> * [Current version](connector-db2.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector File System | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-file-system.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-file-system-connector.md) -> * [Current version](connector-file-system.md) + [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] -This article outlines how to copy data to and from file system. To learn more read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). +This article outlines how to copy data to and from file system. To learn more, read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). ## Supported capabilities Specifically, this file system connector supports: Use the following steps to create a file system linked service in the Azure portal UI. -1. Browse to the Manage tab in your Azure Data Factory or Synapse workspace and select Linked Services, then click New: +1. Browse to the Manage tab in your Azure Data Factory or Synapse workspace and select Linked Services, then select New: # [Azure Data Factory](#tab/data-factory) The following properties are supported for file system under `location` settings | Property | Description | Required | | - | | -- | | type | The type property under `location` in dataset must be set to **FileServerLocation**. | Yes |-| folderPath | The path to folder. If you want to use wildcard to filter folder, skip this setting and specify in activity source settings. Note that you will need to setup the file share location in your Windows or Linux environment to expose the folder for sharing. | No | +| folderPath | The path to folder. If you want to use wildcard to filter folder, skip this setting and specify in activity source settings. You need to set up the file share location in your Windows or Linux environment to expose the folder for sharing. | No | | fileName | The file name under the given folderPath. If you want to use wildcard to filter files, skip this setting and specify in activity source settings. | No | **Example:** The following properties are supported for file system under `storeSettings` set | ***Locate the files to copy:*** | | | | OPTION 1: static path<br> | Copy from the given folder/file path specified in the dataset. If you want to copy all files from a folder, additionally specify `wildcardFileName` as `*`. | | | OPTION 2: server side filter<br>- fileFilter | File server side native filter, which provides better performance than OPTION 3 wildcard filter. Use `*` to match zero or more characters and `?` to match zero or single character. Learn more about the syntax and notes from the **Remarks** under [this section](/dotnet/api/system.io.directory.getfiles#system-io-directory-getfiles(system-string-system-string-system-io-searchoption)). | No |-| OPTION 3: client side filter<br>- wildcardFolderPath | The folder path with wildcard characters to filter source folders. Such filter happens within the service, which enumerate the folders/files under the given path then apply the wildcard filter.<br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. 
<br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | No | +| OPTION 3: client side filter<br>- wildcardFolderPath | The folder path with wildcard characters to filter source folders. Such filter happens within the service, which enumerates the folders/files under the given path then apply the wildcard filter.<br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | No | | OPTION 3: client side filter<br>- wildcardFileName | The file name with wildcard characters under the given folderPath/wildcardFolderPath to filter source files. Such filter happens within the service, which enumerates the files under the given path then apply the wildcard filter.<br>Allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual file name has wildcard or this escape char inside.<br>See more examples in [Folder and file filter examples](#folder-and-file-filter-examples). | Yes |-| OPTION 3: a list of files<br>- fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset.<br/>When using this option, do not specify file name in dataset. See more examples in [File list examples](#file-list-examples). |No | +| OPTION 3: a list of files<br>- fileListPath | Indicates to copy a given file set. Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset.<br/>When using this option, don't specify file name in dataset. See more examples in [File list examples](#file-list-examples). |No | | ***Additional settings:*** | | |-| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. <br>Allowed values are **true** (default) and **false**.<br>This property doesn't apply when you configure `fileListPath`. |No | -| deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from source store after successfully moving to the destination store. The file deletion is per file, so when copy activity fails, you will see some files have already been copied to the destination and deleted from source, while others are still remaining on source store. <br/>This property is only valid in binary files copy scenario. The default value: false. |No | -| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. <br>The files will be selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br> The properties can be NULL, which means no file attribute filter will be applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value will be selected. 
When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.<br/>This property doesn't apply when you configure `fileListPath`. | No | +| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. When recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink. <br>Allowed values are **true** (default) and **false**.<br>This property doesn't apply when you configure `fileListPath`. |No | +| deleteFilesAfterCompletion | Indicates whether the binary files will be deleted from source store after successfully moving to the destination store. The file deletion is per file, so when copy activity fails, you'll see some files have already been copied to the destination and deleted from source, while others are still remaining on source store. <br/>This property is only valid in binary files copy scenario. The default value: false. |No | +| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. <br>The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.<br/>This property doesn't apply when you configure `fileListPath`. | No | | modifiedDatetimeEnd | Same as above. | No |-| enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as additional source columns.<br/>Allowed values are **false** (default) and **true**. | No | -| partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns.<br/><br/>If it is not specified, by default,<br/>- When you use file path in dataset or list of files on source, partition root path is the path configured in dataset.<br/>- When you use wildcard folder filter, partition root path is the sub-path before the first wildcard.<br/><br/>For example, assuming you configure the path in dataset as "root/folder/year=2020/month=08/day=27":<br/>- If you specify partition root path as "root/folder/year=2020", copy activity will generate two more columns `month` and `day` with value "08" and "27" respectively, in addition to the columns inside the files.<br/>- If partition root path is not specified, no extra column will be generated. | No | +| enablePartitionDiscovery | For files that are partitioned, specify whether to parse the partitions from the file path and add them as extra source columns.<br/>Allowed values are **false** (default) and **true**. 
| No | +| partitionRootPath | When partition discovery is enabled, specify the absolute root path in order to read partitioned folders as data columns.<br/><br/>If it isn't specified, by default,<br/>- When you use file path in dataset or list of files on source, partition root path is the path configured in dataset.<br/>- When you use wildcard folder filter, partition root path is the subpath before the first wildcard.<br/><br/>For example, assuming you configure the path in dataset as "root/folder/year=2020/month=08/day=27":<br/>- If you specify partition root path as "root/folder/year=2020", copy activity generates two more columns `month` and `day` with value "08" and "27" respectively, in addition to the columns inside the files.<br/>- If partition root path isn't specified, no extra column is generated. | No | | maxConcurrentConnections |The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.| No | **Example:** This section describes the resulting behavior of the Copy operation for differen | true |preserveHierarchy | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the same structure as the source:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5. | | true |flattenHierarchy | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target Folder1 is created with the following structure: <br/><br/>Folder1<br/> autogenerated name for File1<br/> autogenerated name for File2<br/> autogenerated name for File3<br/> autogenerated name for File4<br/> autogenerated name for File5 | | true |mergeFiles | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target Folder1 is created with the following structure: <br/><br/>Folder1<br/> File1 + File2 + File3 + File4 + File 5 contents are merged into one file with autogenerated file name |-| false |preserveHierarchy | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1<br/> File2<br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. | +| false |preserveHierarchy | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1<br/> File2<br/><br/>Subfolder1 with File3, File4, and File5 aren't picked up. | | false |flattenHierarchy | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> autogenerated name for File1<br/> autogenerated name for File2<br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. |-| false |mergeFiles | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1 + File2 contents are merged into one file with autogenerated file name. autogenerated name for File1<br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. 
| +| false |mergeFiles | Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | The target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1 + File2 contents are merged into one file with autogenerated file name. autogenerated name for File1<br/><br/>Subfolder1 with File3, File4, and File5 aren't picked up. | ## Lookup activity properties To learn details about the properties, check [Lookup activity](control-flow-look ## GetMetadata activity properties -To learn details about the properties, check [GetMetadata activity](control-flow-get-metadata-activity.md) +To learn details about the properties, check [GetMetadata activity.](control-flow-get-metadata-activity.md) ## Delete activity properties -To learn details about the properties, check [Delete activity](delete-activity.md) +To learn details about the properties, check [Delete activity.](delete-activity.md) ## Legacy models To learn details about the properties, check [Delete activity](delete-activity.m |: |: |: | | type | The type property of the dataset must be set to: **FileShare** |Yes | | folderPath | Path to the folder. Wildcard filter is supported, allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character); use `^` to escape if your actual folder name has wildcard or this escape char inside. <br/><br/>Examples: rootfolder/subfolder/, see more examples in [Sample linked service and dataset definitions](#sample-linked-service-and-dataset-definitions) and [Folder and file filter examples](#folder-and-file-filter-examples). |No |-| fileName | **Name or wildcard filter** for the file(s) under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. <br/><br/>For filter, allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character).<br/>- Example 1: `"fileName": "*.csv"`<br/>- Example 2: `"fileName": "???20180427.txt"`<br/>Use `^` to escape if your actual file name has wildcard or this escape char inside.<br/><br/>When fileName isn't specified for an output dataset and **preserveHierarchy** isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "*Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]*", for example "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz"; if you copy from tabular source using table name instead of query, the name pattern is "*[table name].[format].[compression if configured]*", for example "MyTable.csv". |No | -| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement will be impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter will be applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value will be selected. 
When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.| No | -| modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files will be selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement will be impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter will be applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value will be selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value will be selected.| No | +| fileName | **Name or wildcard filter** for the files under the specified "folderPath". If you don't specify a value for this property, the dataset points to all files in the folder. <br/><br/>For filter, allowed wildcards are: `*` (matches zero or more characters) and `?` (matches zero or single character).<br/>- Example 1: `"fileName": "*.csv"`<br/>- Example 2: `"fileName": "???20180427.txt"`<br/>Use `^` to escape if your actual file name has wildcard or this escape char inside.<br/><br/>When fileName isn't specified for an output dataset and **preserveHierarchy** isn't specified in the activity sink, the copy activity automatically generates the file name with the following pattern: "*Data.[activity run ID GUID].[GUID if FlattenHierarchy].[format if configured].[compression if configured]*", for example "Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.gz"; if you copy from tabular source using table name instead of query, the name pattern is "*[table name].[format].[compression if configured]*", for example "MyTable.csv". |No | +| modifiedDatetimeStart | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". <br/><br/> Be aware the overall performance of data movement is impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.| No | +| modifiedDatetimeEnd | Files filter based on the attribute: Last Modified. The files are selected if their last modified time is greater than or equal to `modifiedDatetimeStart` and less than `modifiedDatetimeEnd`. The time is applied to UTC time zone in the format of "2018-12-01T05:00:00Z". 
<br/><br/> Be aware the overall performance of data movement is impacted by enabling this setting when you want to do file filter from huge amounts of files. <br/><br/> The properties can be NULL, which means no file attribute filter is applied to the dataset. When `modifiedDatetimeStart` has datetime value but `modifiedDatetimeEnd` is NULL, it means the files whose last modified attribute is greater than or equal with the datetime value are selected. When `modifiedDatetimeEnd` has datetime value but `modifiedDatetimeStart` is NULL, it means the files whose last modified attribute is less than the datetime value are selected.| No | | format | If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions.<br/><br/>If you want to parse or generate files with a specific format, the following file format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](supported-file-formats-and-compression-codecs-legacy.md#text-format), [Json Format](supported-file-formats-and-compression-codecs-legacy.md#json-format), [Avro Format](supported-file-formats-and-compression-codecs-legacy.md#avro-format), [Orc Format](supported-file-formats-and-compression-codecs-legacy.md#orc-format), and [Parquet Format](supported-file-formats-and-compression-codecs-legacy.md#parquet-format) sections. |No (only for binary copy scenario) | | compression | Specify the type and level of compression for the data. For more information, see [Supported file formats and compression codecs](supported-file-formats-and-compression-codecs-legacy.md#compression-support).<br/>Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**.<br/>Supported levels are: **Optimal** and **Fastest**. |No | To learn details about the properties, check [Delete activity](delete-activity.m | Property | Description | Required | |: |: |: | | type | The type property of the copy activity source must be set to: **FileSystemSource** |Yes |-| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note when recursive is set to true and sink is file-based store, empty folder/sub-folder will not be copied/created at sink.<br/>Allowed values are: **true** (default), **false** | No | +| recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. Note when recursive is set to true and sink is file-based store, empty folder/sub-folder won't be copied/created at sink.<br/>Allowed values are: **true** (default), **false** | No | | maxConcurrentConnections |The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections.| No | **Example:** |
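The file system dataset and copy-source tables in the row above translate into fairly small JSON payloads. The sketch below assembles one of each in Python so it can carry annotations: `FileServerLocation`, `recursive`, `wildcardFileName`, and `modifiedDatetimeStart` come from the property tables in this row, while the `Binary` format, the `FileServerReadSettings` store-settings type, and the linked service name are assumptions or placeholders.

```python
import json

# Hedged sketch of a dataset and copy source using the properties documented
# above. "FileServerLocation" is taken from the dataset table; "Binary",
# "BinarySource", and "FileServerReadSettings" are assumed names, and the
# linked service reference is a placeholder.
dataset = {
    "name": "FileSystemBinaryDataset",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {"referenceName": "<file system linked service>", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "FileServerLocation",
                "folderPath": "share/inbound"        # placeholder folder on the share
            }
        }
    }
}

copy_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "FileServerReadSettings",            # assumed store-settings type name
        "recursive": True,                           # read subfolders as well
        "wildcardFileName": "*.csv",                 # OPTION 3 client-side filter
        "modifiedDatetimeStart": "2018-12-01T05:00:00Z"
    }
}

print(json.dumps({"dataset": dataset, "copySource": copy_source}, indent=2))
```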
data-factory | Connector Ftp | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-ftp.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> -> * [Version 1](v1/data-factory-ftp-connector.md) > * [Current version](connector-ftp.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Hdfs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-hdfs.md | -> [!div class="op_single_selector" title1="Select the version of the Data Factory service that you are using:"] -> * [Version 1](v1/data-factory-hdfs-connector.md) -> * [Current version](connector-hdfs.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to copy data from the Hadoop Distributed File System (HDFS) server. To learn more, read the introductory articles for [Azure Data Factory](introduction.md) and [Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Http | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-http.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-http-connector.md) -> * [Current version](connector-http.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Copy Activity in Azure Data Factory and Azure Synapse to copy data from an HTTP endpoint. The article builds on [Copy Activity](copy-activity-overview.md), which presents a general overview of Copy Activity. |
data-factory | Connector Mongodb Legacy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-mongodb-legacy.md | Last updated 01/25/2023 # Copy data from MongoDB using Azure Data Factory or Synapse Analytics (legacy) -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-on-premises-mongodb-connector.md) -> * [Current version](connector-mongodb.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in an Azure Data Factory or Synapse Analytics pipeline to copy data from a MongoDB database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Mysql | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-mysql.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-mysql-connector.md) -> * [Current version](connector-mysql.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in Azure Data Factory and Synapse Analytics pipelines to copy data from a MySQL database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Odata | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-odata.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-odata-connector.md) -> * [Current version](connector-odata.md) This article outlines how to use Copy Activity in an Azure Data Factory or Synapse Analytics pipeline to copy data from an OData source. The article builds on [Copy Activity](copy-activity-overview.md), which presents a general overview of Copy Activity. |
data-factory | Connector Odbc | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-odbc.md | Last updated 10/25/2022 # Copy data from and to ODBC data stores using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-odbc-connector.md) -> * [Current version](connector-odbc.md) + [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in Azure Data Factory to copy data from and to an ODBC data store. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Oracle | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-oracle.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-oracle-connector.md) -> * [Current version](connector-oracle.md) - [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the copy activity in Azure Data Factory to copy data from and to an Oracle database. It builds on the [copy activity overview](copy-activity-overview.md). |
data-factory | Connector Postgresql | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-postgresql.md | Last updated 10/25/2022 # Copy data from PostgreSQL using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-postgresql-connector.md) -> * [Current version](connector-postgresql.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in Azure Data Factory and Synapse Analytics pipelines to copy data from a PostgreSQL database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Salesforce | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-salesforce.md | Last updated 07/13/2023 # Copy data from and to Salesforce using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-salesforce-connector.md) -> * [Current version](connector-salesforce.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Sap Business Warehouse | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-sap-business-warehouse.md | Last updated 10/25/2022 # Copy data from SAP Business Warehouse using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-sap-business-warehouse-connector.md) -> * [Current version](connector-sap-business-warehouse.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in Azure Data Factory and Synapse Analytics pipelines to copy data from an SAP Business Warehouse (BW). It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Sap Hana | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-sap-hana.md | Last updated 10/20/2022 # Copy data from SAP HANA using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-sap-hana-connector.md) -> * [Current version](connector-sap-hana.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in Azure Data Factory and Synapse Analytics pipelines to copy data from an SAP HANA database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Sftp | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-sftp.md | Last updated 04/12/2023 # Copy and transform data in SFTP server using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of the Data Factory service that you are using:"] -> * [Version 1](v1/data-factory-sftp-connector.md) -> * [Current version](connector-sftp.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use Copy Activity to copy data from and to the secure FTP (SFTP) server, and use Data Flow to transform data in SFTP server. To learn more read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Sql Server | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-sql-server.md | Last updated 07/13/2023 # Copy and transform data to and from SQL Server by using Azure Data Factory or Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Azure Data Factory that you're using:"] -> * [Version 1](v1/data-factory-sqlserver-connector.md) -> * [Current version](connector-sql-server.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the copy activity in Azure Data Factory and Azure Synapse pipelines to copy data from and to SQL Server database and use Data Flow to transform data in SQL Server database. To learn more read the introductory article for [Azure Data Factory](introduction.md) or [Azure Synapse Analytics](../synapse-analytics/overview-what-is.md). |
data-factory | Connector Sybase | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-sybase.md | Last updated 01/20/2023 # Copy data from Sybase using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-onprem-sybase-connector.md) -> * [Current version](connector-sybase.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in an Azure Data Factory or Synapse Analytics pipeline to copy data from a Sybase database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Connector Teradata | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-teradata.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> -> * [Version 1](v1/data-factory-onprem-teradata-connector.md) > * [Current version](connector-teradata.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Connector Web Table | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/connector-web-table.md | Last updated 01/18/2023 # Copy data from Web table by using Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-web-table-connector.md) -> * [Current version](connector-web-table.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article outlines how to use the Copy Activity in an Azure Data Factory or Synapse Analytics pipeline to copy data from a Web table database. It builds on the [copy activity overview](copy-activity-overview.md) article that presents a general overview of copy activity. |
data-factory | Control Flow Expression Language Functions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/control-flow-expression-language-functions.md | Last updated 10/25/2022 # Expressions and functions in Azure Data Factory and Azure Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-functions-variables.md) -> * [Current version/Synapse version](control-flow-expression-language-functions.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] This article provides details about expressions and functions supported by Azure Data Factory and Azure Synapse Analytics. |
data-factory | Copy Activity Fault Tolerance | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/copy-activity-fault-tolerance.md | Last updated 10/25/2022 # Fault tolerance of copy activity in Azure Data Factory and Synapse Analytics pipelines-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-activity-fault-tolerance.md) -> * [Current version](copy-activity-fault-tolerance.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Copy Activity Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/copy-activity-overview.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory that you're using:"] -> * [Version 1](v1/data-factory-data-movement-activities.md) -> * [Current version](copy-activity-overview.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Copy Activity Performance | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/copy-activity-performance.md | Last updated 10/25/2022 # Copy activity performance and scalability guide -> [!div class="op_single_selector" title1="Select the version of Azure Data Factory that you're using:"] -> * [Version 1](v1/data-factory-copy-activity-performance.md) -> * [Current version](copy-activity-performance.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Data Movement Security Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/data-movement-security-considerations.md | Last updated 02/01/2023 # Security considerations for data movement in Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> -> * [Version 1](v1/data-factory-data-movement-security-considerations.md) > * [Current version](data-movement-security-considerations.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | How To Expression Language Functions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/how-to-expression-language-functions.md | Last updated 07/17/2023 # How to use parameters, expressions and functions in Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-functions-variables.md) -> * [Current version](how-to-expression-language-functions.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] In this document, we will primarily focus on learning fundamental concepts with various examples to explore the ability to create parameterized data pipelines within Azure Data Factory. Parameterization and dynamic expressions are such notable additions to ADF because they can save a tremendous amount of time and allow for a much more flexible Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) solution, which will dramatically reduce the cost of solution maintenance and speed up the implementation of new features into existing pipelines. These gains are because parameterization minimizes the amount of hard coding and increases the number of reusable objects and processes in a solution. |
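Because this row is about the expression language, a small illustration may help. The sketch below shows the standard shape of a dynamic property: a string starting with `@`, wrapped in an object with `"type": "Expression"`. The functions `concat`, `formatDateTime`, and `utcnow` and the `@pipeline().parameters` accessor are documented expression-language features; the property and parameter names are placeholders chosen for this example.

```python
import json

# Hedged sketch of parameterized properties using Data Factory expressions.
# Property and parameter names are placeholders for illustration only.
dynamic_file_name = {
    "fileName": {
        "value": "@concat('sales_', formatDateTime(utcnow(), 'yyyyMMdd'), '.csv')",
        "type": "Expression"     # evaluated by the service at run time
    }
}

dynamic_folder = {
    "folderPath": {
        "value": "@pipeline().parameters.outputFolder",   # pipeline parameter reference
        "type": "Expression"
    }
}

print(json.dumps({"fileName": dynamic_file_name, "folderPath": dynamic_folder}, indent=2))
```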
data-factory | Pipeline Trigger Troubleshoot Guide | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/pipeline-trigger-troubleshoot-guide.md | Type=Microsoft.DataTransfer.Execution.Core.ExecutionException,Message=There are **Cause** -You've reached the integration runtime's capacity limit. You might be running a large amount of data flow by using the same integration runtime at the same time. See [Azure subscription and service limits, quotas, and constraints](../azure-resource-manager/management/azure-subscription-service-limits.md#version-2) for details. +You've reached the integration runtime's capacity limit. You might be running a large amount of data flow by using the same integration runtime at the same time. See [Azure subscription and service limits, quotas, and constraints](../azure-resource-manager/management/azure-subscription-service-limits.md#azure-data-factory-limits) for details. **Resolution** |
data-factory | Quickstart Create Data Factory Dot Net | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/quickstart-create-data-factory-dot-net.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](quickstart-create-data-factory-dot-net.md) [!INCLUDE[appliesto-adf-xxx-md](includes/appliesto-adf-xxx-md.md)] |
data-factory | Quickstart Create Data Factory Powershell | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/quickstart-create-data-factory-powershell.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](quickstart-create-data-factory-powershell.md) [!INCLUDE[appliesto-adf-xxx-md](includes/appliesto-adf-xxx-md.md)] |
data-factory | Quickstart Create Data Factory Python | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/quickstart-create-data-factory-python.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](quickstart-create-data-factory-python.md) [!INCLUDE[appliesto-adf-xxx-md](includes/appliesto-adf-xxx-md.md)] |
data-factory | Quickstart Create Data Factory Resource Manager Template | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/quickstart-create-data-factory-resource-manager-template.md | Last updated 10/25/2022 # Quickstart: Create an Azure Data Factory using ARM template -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-build-your-first-pipeline-using-arm.md) -> * [Current version](quickstart-create-data-factory-resource-manager-template.md) [!INCLUDE[appliesto-adf-xxx-md](includes/appliesto-adf-xxx-md.md)] |
data-factory | Quickstart Create Data Factory Rest Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/quickstart-create-data-factory-rest-api.md | -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](quickstart-create-data-factory-rest-api.md) [!INCLUDE[appliesto-adf-xxx-md](includes/appliesto-adf-xxx-md.md)] |
data-factory | Self Hosted Integration Runtime Troubleshoot Guide | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/self-hosted-integration-runtime-troubleshoot-guide.md | For Azure Data Factory v2 and Azure Synapse customers: - If automatic update is off and you've already upgraded your .NET Framework Runtime to 4.7.2 or later, you can manually download the latest 5.x and install it on your machine. - If automatic update is off and you haven't upgraded your .NET Framework Runtime to 4.7.2 or later, when you try to manually install self-hosted integration runtime 5.x and register the key, you will be required to upgrade your .NET Framework Runtime version first. --For Azure Data Factory v1 customers: -- Self-hosted integration runtime 5.X doesn't support Azure Data Factory v1.-- The self-hosted integration runtime will be automatically upgraded to the latest version of 4.x. And the latest version of 4.x won't expire. -- If you try to manually install self-hosted integration runtime 5.x and register the key, you'll be notified that self-hosted integration runtime 5.x doesn't support Azure Data Factory v1.- - ## Self-hosted IR connectivity issues ### Self-hosted integration runtime can't connect to the cloud service |
data-factory | Transform Data Using Custom Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-custom-activity.md | Last updated 08/10/2023 # Use custom activities in an Azure Data Factory or Azure Synapse Analytics pipeline -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-use-custom-activities.md) -> * [Current version](transform-data-using-dotnet-custom-activity.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] There are two types of activities that you can use in an Azure Data Factory or Synapse pipeline. This serialization is not truly secure, and is not intended to be secure. The in To access properties of type *SecureString* from a custom activity, read the `activity.json` file, which is placed in the same folder as your .EXE, deserialize the JSON, and then access the JSON property (extendedProperties => [propertyName] => value). -## <a name="compare-v2-v1"></a> Compare v2 Custom Activity and version 1 (Custom) DotNet Activity --In Azure Data Factory version 1, you implement a (Custom) DotNet Activity by creating a .NET Class Library project with a class that implements the `Execute` method of the `IDotNetActivity` interface. The Linked Services, Datasets, and Extended Properties in the JSON payload of a (Custom) DotNet Activity are passed to the execution method as strongly-typed objects. For details about the version 1 behavior, see [(Custom) DotNet in version 1](v1/data-factory-use-custom-activities.md). Because of this implementation, your version 1 DotNet Activity code has to target .NET Framework 4.5.2. The version 1 DotNet Activity also has to be executed on Windows-based Azure Batch Pool nodes. --In the Azure Data Factory V2 and Synapse pipelines Custom Activity, you are not required to implement a .NET interface. You can now directly run commands, scripts, and your own custom code, compiled as an executable. To configure this implementation, you specify the `Command` property together with the `folderPath` property. The Custom Activity uploads the executable and its dependencies to `folderpath` and executes the command for you. --The Linked Services, Datasets (defined in referenceObjects), and Extended Properties defined in the JSON payload of a Data Factory v2 or Synapse pipeline Custom Activity can be accessed by your executable as JSON files. You can access the required properties using a JSON serializer as shown in the preceding SampleApp.exe code sample. --With the changes introduced in the Data Factory V2 and Synapse pipeline Custom Activity, you can write your custom code logic in your preferred language and execute it on Windows and Linux Operation Systems supported by Azure Batch. 
--The following table describes the differences between the Data Factory V2 and Synapse pipeline Custom Activity and the Data Factory version 1 (Custom) DotNet Activity: --|Differences | Custom Activity | version 1 (Custom) DotNet Activity | -| - | - | - | -|How custom logic is defined |By providing an executable |By implementing a .NET DLL | -|Execution environment of the custom logic |Windows or Linux |Windows (.NET Framework 4.5.2) | -|Executing scripts |Supports executing scripts directly (for example "cmd /c echo hello world" on Windows VM) |Requires implementation in the .NET DLL | -|Dataset required |Optional |Required to chain activities and pass information | -|Pass information from activity to custom logic |Through ReferenceObjects (LinkedServices and Datasets) and ExtendedProperties (custom properties) |Through ExtendedProperties (custom properties), Input, and Output Datasets | -|Retrieve information in custom logic |Parses activity.json, linkedServices.json, and datasets.json stored in the same folder of the executable |Through .NET SDK (.NET Frame 4.5.2) | -|Logging |Writes directly to STDOUT |Implementing Logger in .NET DLL | --If you have existing .NET code written for a version 1 (Custom) DotNet Activity, you need to modify your code for it to work with the current version of the Custom Activity. Update your code by following these high-level guidelines: -- - Change the project from a .NET Class Library to a Console App. - - Start your application with the `Main` method. The `Execute` method of the `IDotNetActivity` interface is no longer required. - - Read and parse the Linked Services, Datasets and Activity with a JSON serializer, and not as strongly-typed objects. Pass the values of required properties to your main custom code logic. Refer to the preceding SampleApp.exe code as an example. - - The Logger object is no longer supported. Output from your executable can be printed to the console and is saved to stdout.txt. - - The Microsoft.Azure.Management.DataFactories NuGet package is no longer required. - - Compile your code, upload the executable and its dependencies to Azure Storage, and define the path in the `folderPath` property. --For a complete sample of how the end-to-end DLL and pipeline sample described in the Data Factory version 1 article [Use custom activities in an Azure Data Factory pipeline](./v1/data-factory-use-custom-activities.md) can be rewritten as a Custom Activity for Data Factory v2 and Synapse pipelines, see [Custom Activity sample](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ADFv2CustomActivitySample). - ## Auto-scaling of Azure Batch You can also create an Azure Batch pool with **autoscale** feature. For example, you could create an Azure batch pool with 0 dedicated VMs and an autoscale formula based on the number of pending tasks. |
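The retained text in this row describes how a custom-activity executable reads the `activity.json` file dropped next to it and walks extendedProperties => [propertyName] => value to reach a *SecureString*. A minimal Python sketch of that pattern follows; whether `extendedProperties` sits at the top level of the deserialized file or under `typeProperties` is an assumption here, and `connectionString` is a hypothetical property name.

```python
import json
from pathlib import Path

# Hedged sketch of a custom-activity executable reading the activity.json file
# placed next to it and extracting a SecureString extended property.
# The exact nesting of extendedProperties is an assumption, so check both
# the top level and typeProperties; "connectionString" is hypothetical.
activity = json.loads(Path("activity.json").read_text())

extended = (
    activity.get("typeProperties", {}).get("extendedProperties")
    or activity.get("extendedProperties", {})
)
connection_string = extended["connectionString"]["value"]

# Anything printed to stdout is captured in the activity's execution output.
print("Retrieved extended property with", len(connection_string), "characters")
```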
data-factory | Transform Data Using Data Lake Analytics | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-data-lake-analytics.md | Last updated 08/10/2023 # Process data by running U-SQL scripts on Azure Data Lake Analytics with Azure Data Factory and Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-usql-activity.md) -> * [Current version](transform-data-using-data-lake-analytics.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Hadoop Hive | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-hadoop-hive.md | Last updated 08/10/2023 # Transform data using Hadoop Hive activity in Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-hive-activity.md) -> * [Current version](transform-data-using-hadoop-hive.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Hadoop Map Reduce | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-hadoop-map-reduce.md | Last updated 08/10/2023 # Transform data using Hadoop MapReduce activity in Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-map-reduce.md) -> * [Current version](transform-data-using-hadoop-map-reduce.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Hadoop Pig | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-hadoop-pig.md | Last updated 08/10/2023 # Transform data using Hadoop Pig activity in Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-pig-activity.md) -> * [Current version](transform-data-using-hadoop-pig.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Hadoop Streaming | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-hadoop-streaming.md | Last updated 08/10/2023 # Transform data using Hadoop Streaming activity in Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-hadoop-streaming-activity.md) -> * [Current version](transform-data-using-hadoop-streaming.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Machine Learning | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-machine-learning.md | Last updated 08/10/2023 # Create a predictive pipeline using Machine Learning Studio (classic) with Azure Data Factory or Synapse Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-azure-ml-batch-execution-activity.md) -> * [Current version](transform-data-using-machine-learning.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Spark | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-spark.md | Last updated 08/10/2023 # Transform data using Spark activity in Azure Data Factory and Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-spark.md) -> * [Current version](transform-data-using-spark.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Transform Data Using Stored Procedure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/transform-data-using-stored-procedure.md | Last updated 08/10/2023 # Transform data by using the SQL Server Stored Procedure activity in Azure Data Factory or Synapse Analytics-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-stored-proc-activity.md) -> * [Current version](transform-data-using-stored-procedure.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Tutorial Copy Data Tool | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/tutorial-copy-data-tool.md | Last updated 08/10/2023 # Copy data from Azure Blob storage to a SQL Database by using the Copy Data tool -> [!div class="op_single_selector" title1="Select the version of the Data Factory service that you're using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](tutorial-copy-data-tool.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Tutorial Hybrid Copy Data Tool | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/tutorial-hybrid-copy-data-tool.md | Last updated 08/10/2023 # Copy data from a SQL Server database to Azure Blob storage by using the Copy Data tool-> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Current version](tutorial-hybrid-copy-data-tool.md) [!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)] |
data-factory | Data Factory Amazon Redshift Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-amazon-redshift-connector.md | - Title: Move data from Amazon Redshift by using Azure Data Factory -description: Learn how to move data from Amazon Redshift by using Azure Data Factory Copy Activity. ---- Previously updated : 04/12/2023----# Move data From Amazon Redshift using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-amazon-redshift-connector.md) -> * [Version 2 (current version)](../connector-amazon-redshift.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Amazon Redshift connector in V2](../connector-amazon-redshift.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from Amazon Redshift. The article builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --Data Factory currently supports only moving data from Amazon Redshift to a [supported sink data store](data-factory-data-movement-activities.md#supported-data-stores-and-formats). Moving data from other data stores to Amazon Redshift is not supported. --> [!TIP] -> To achieve the best performance when copying large amounts of data from Amazon Redshift, consider using the built-in Redshift **UNLOAD** command through Amazon Simple Storage Service (Amazon S3). For details, see [Use UNLOAD to copy data from Amazon Redshift](#use-unload-to-copy-data-from-amazon-redshift). --## Prerequisites -* If you are moving data to an on-premises data store, install [Data Management Gateway](data-factory-data-management-gateway.md) on an on-premises machine. Grant access for a gateway to the Amazon Redshift cluster by using the on-premises machine IP address. For instructions, see [Authorize access to the cluster](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-authorize-cluster-access.html). -* To move data to an Azure data store, see the [Compute IP address and SQL ranges that are used by the Microsoft Azure Datacenters](https://www.microsoft.com/download/details.aspx?id=41653). --## Getting started -You can create a pipeline with a copy activity to move data from an Amazon Redshift source by using different tools and APIs. --The easiest way to create a pipeline is to use the Azure Data Factory Copy Wizard. For a quick walkthrough on creating a pipeline by using the Copy Wizard, see the [Tutorial: Create a pipeline by using the Copy Wizard](data-factory-copy-data-wizard-tutorial.md). --You can also create a pipeline by using Visual Studio, Azure PowerShell, or other tools. Azure Resource Manager templates, the .NET API, or the REST API can also be used to create the pipeline. For step-by-step instructions to create a pipeline with a copy activity, see the [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create linked services to link input and output data stores to your data factory. -2. Create datasets to represent input and output data for the copy operation. -3. 
Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the Copy Wizard, JSON definitions for these Data Factory entities are automatically created. When you use tools or APIs (except the .NET API), you define the Data Factory entities by using the JSON format. The JSON example: Copy data from Amazon Redshift to Azure Blob storage shows the JSON definitions for the Data Factory entities that are used to copy data from an Amazon Redshift data store. --The following sections describe the JSON properties that are used to define the Data Factory entities for Amazon Redshift. --## Linked service properties --The following table provides descriptions for the JSON elements that are specific to an Amazon Redshift linked service. --| Property | Description | Required | -| | | | -| **type** |This property must be set to **AmazonRedshift**. |Yes | -| **server** |The IP address or host name of the Amazon Redshift server. |Yes | -| **port** |The number of the TCP port that the Amazon Redshift server uses to listen for client connections. |No (default is 5439) | -| **database** |The name of the Amazon Redshift database. |Yes | -| **username** |The name of the user who has access to the database. |Yes | -| **password** |The password for the user account. |Yes | --## Dataset properties --For a list of the sections and properties that are available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. The **structure**, **availability**, and **policy** sections are similar for all dataset types. Examples of dataset types include Azure SQL, Azure Blob storage, and Azure Table storage. --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the store. The **typeProperties** section for a dataset of type **RelationalTable**, which includes the Amazon Redshift dataset, has the following properties: --| Property | Description | Required | -| | | | -| **tableName** |The name of the table in the Amazon Redshift database that the linked service refers to. |No (if the **query** property of a copy activity of type **RelationalSource** is specified) | --## Copy Activity properties --For a list of sections and properties that are available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. The **name**, **description**, **inputs** table, **outputs** table, and **policy** properties are available for all types of activities. The properties that are available in the **typeProperties** section vary for each activity type. For Copy Activity, the properties vary depending on the types of data sources and sinks. --For Copy Activity, when the source is of type **AmazonRedshiftSource**, the following properties are available in the **typeProperties** section: --| Property | Description | Required | -| | | | -| **query** | Use the custom query to read the data. |No (if the **tableName** property of a dataset is specified) | -| **redshiftUnloadSettings** | The group of properties that are used when you copy data with the Redshift **UNLOAD** command. | No | -| **s3LinkedServiceName** | The Amazon S3 linked service to use as an interim store. Specify the name of an Azure Data Factory linked service of type **AwsAccessKey**. | Required when using the **redshiftUnloadSettings** property | -| **bucketName** | Indicates the Amazon S3 bucket to use to store the interim data. If this property is not provided, Copy Activity auto-generates a bucket. 
| Required when using the **redshiftUnloadSettings** property | --Alternatively, you can use the **RelationalSource** type, which includes Amazon Redshift, with the following property in the **typeProperties** section. Note this source type doesn't support the Redshift **UNLOAD** command. --| Property | Description | Required | -| | | | -| **query** |Use the custom query to read the data. | No (if the **tableName** property of a dataset is specified) | --## Use UNLOAD to copy data from Amazon Redshift --The Amazon Redshift [**UNLOAD**](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) command unloads the results of a query to one or more files on Amazon S3. This command is recommended by Amazon for copying large datasets from Redshift. --**Example: Copy data from Amazon Redshift to Azure Synapse Analytics** --This example copies data from Amazon Redshift to Azure Synapse Analytics. The example uses the Redshift **UNLOAD** command, staged copy data, and Microsoft PolyBase. --For this sample use case, Copy Activity first unloads the data from Amazon Redshift to Amazon S3 as configured in the **redshiftUnloadSettings** option. Next, the data is copied from Amazon S3 to Azure Blob storage as specified in the **stagingSettings** option. Finally, PolyBase loads the data into Azure Synapse Analytics. All of the interim formats are handled by Copy Activity. ---```json -{ - "name": "CopyFromRedshiftToSQLDW", - "type": "Copy", - "typeProperties": { - "source": { - "type": "AmazonRedshiftSource", - "query": "select * from MyTable", - "redshiftUnloadSettings": { - "s3LinkedServiceName":"MyAmazonS3StorageLinkedService", - "bucketName": "bucketForUnload" - } - }, - "sink": { - "type": "SqlDWSink", - "allowPolyBase": true - }, - "enableStaging": true, - "stagingSettings": { - "linkedServiceName": "MyAzureStorageLinkedService", - "path": "adfstagingcopydata" - }, - "cloudDataMovementUnits": 32 - ..... - } -} -``` --## JSON example: Copy data from Amazon Redshift to Azure Blob storage -This sample shows how to copy data from an Amazon Redshift database to Azure Blob Storage. Data can be copied directly to any [supported sink](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Copy Activity. --The sample has the following data factory entities: --* A linked service of type [AmazonRedshift](#linked-service-properties) -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](#dataset-properties) -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties) -* A [pipeline](data-factory-create-pipelines.md) with a copy activity that uses the [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) properties --The sample copies data from a query result in Amazon Redshift to an Azure blob hourly. The JSON properties that are used in the sample are described in the sections that follow the entity definitions. 
--**Amazon Redshift linked service** --```json -{ - "name": "AmazonRedshiftLinkedService", - "properties": - { - "type": "AmazonRedshift", - "typeProperties": - { - "server": "< The IP address or host name of the Amazon Redshift server >", - "port": "<The number of the TCP port that the Amazon Redshift server uses to listen for client connections.>", - "database": "<The database name of the Amazon Redshift database>", - "username": "<username>", - "password": "<password>" - } - } -} -``` --**Azure Blob storage linked service** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Amazon Redshift input dataset** --The **external** property is set to "true" to inform the Data Factory service that the dataset is external to the data factory. This property setting indicates that the dataset is not produced by an activity in the data factory. Set the property to true on an input dataset that is not produced by an activity in the pipeline. --```json -{ - "name": "AmazonRedshiftInputDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "AmazonRedshiftLinkedService", - "typeProperties": { - "tableName": "<Table name>" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --**Azure Blob output dataset** --Data is written to a new blob every hour by setting the **frequency** property to "Hour" and the **interval** property to 1. The **folderPath** property for the blob is dynamically evaluated. The property value is based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobOutputDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/fromamazonredshift/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with an Azure Redshift source (of type RelationalSource) and an Azure Blob sink** --The pipeline contains a copy activity that is configured to use the input and output datasets. The pipeline is scheduled to run every hour. In the JSON definition for the pipeline, the **source** type is set to **RelationalSource** and the **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data to copy from the past hour. 
--```json -{ - "name": "CopyAmazonRedshiftToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "AmazonRedshiftSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)", - "redshiftUnloadSettings": { - "s3LinkedServiceName":"myS3Storage", - "bucketName": "bucketForUnload" - } - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - }, - "cloudDataMovementUnits": 32 - }, - "inputs": [ - { - "name": "AmazonRedshiftInputDataset" - } - ], - "outputs": [ - { - "name": "AzureBlobOutputDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "AmazonRedshiftToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } -} -``` -### Type mapping for Amazon Redshift -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy Activity performs automatic type conversions from source type to sink type. The types are converted by using a two-step approach: --1. Convert from a native source type to a .NET type -2. Convert from a .NET type to a native sink type --The following mappings are used when Copy Activity converts the data from an Amazon Redshift type to a .NET type: --| Amazon Redshift type | .NET type | -| | | -| SMALLINT |Int16 | -| INTEGER |Int32 | -| BIGINT |Int64 | -| DECIMAL |Decimal | -| REAL |Single | -| DOUBLE PRECISION |Double | -| BOOLEAN |String | -| CHAR |String | -| VARCHAR |String | -| DATE |DateTime | -| TIMESTAMP |DateTime | -| TEXT |String | --## Map source to sink columns -To learn how to map columns in the source dataset to columns in the sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable reads from relational sources -When you copy data from a relational data store, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure the retry **policy** for a dataset to rerun a slice when a failure occurs. Make sure that the same data is read, no matter how many times the slice is rerun. Also make sure that the same data is read regardless of how you rerun the slice. For more information, see [Repeatable reads from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and tuning -Learn about key factors that affect the performance of Copy Activity and ways to optimize performance in the [Copy Activity Performance and Tuning Guide](data-factory-copy-activity-performance.md). --## Next steps -For step-by-step instructions for creating a pipeline with Copy Activity, see the [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). |
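The Amazon Redshift article above documents a **RelationalSource** alternative that doesn't use the Redshift **UNLOAD** command, but it shows full JSON only for **AmazonRedshiftSource**. The following is a minimal, hypothetical sketch of that alternative copy activity **source** section, using only the properties listed in the article's tables; the query text is illustrative.

```json
{
    "source": {
        "type": "RelationalSource",
        "query": "select * from MyTable"
    }
}
```

Because **RelationalSource** doesn't support **UNLOAD**, omit **redshiftUnloadSettings** when you use it; for large datasets, the article recommends the **AmazonRedshiftSource** path with **UNLOAD** instead.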
data-factory | Data Factory Amazon Simple Storage Service Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-amazon-simple-storage-service-connector.md | - Title: Move data from Amazon Simple Storage Service by using Data Factory -description: Learn about how to move data from Amazon Simple Storage Service (S3) by using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from Amazon Simple Storage Service by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-amazon-simple-storage-service-connector.md) -> * [Version 2 (current version)](../connector-amazon-simple-storage-service.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Amazon S3 connector in V2](../connector-amazon-simple-storage-service.md). --This article explains how to use the copy activity in Azure Data Factory to move data from Amazon Simple Storage Service (S3). It builds on the [Data movement activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from Amazon S3 to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from Amazon S3 to other data stores, but not moving data from other data stores to Amazon S3. --## Required permissions -To copy data from Amazon S3, make sure you have been granted the following permissions: --* `s3:GetObject` and `s3:GetObjectVersion` for Amazon S3 Object Operations. -* `s3:ListBucket` for Amazon S3 Bucket Operations. If you are using the Data Factory Copy Wizard, `s3:ListAllMyBuckets` is also required. --For details about the full list of Amazon S3 permissions, see [Specifying Permissions in a Policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html). --## Getting started -You can create a pipeline with a copy activity that moves data from an Amazon S3 source by using different tools or APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. For a quick walkthrough, see [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md). --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. For step-by-step instructions to create a pipeline with a copy activity, see the [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --Whether you use tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. 
When you use tools or APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an Amazon S3 data store, see the [JSON example: Copy data from Amazon S3 to Azure Blob](#json-example-copy-data-from-amazon-s3-to-azure-blob-storage) section of this article. --> [!NOTE] -> For details about supported file and compression formats for a copy activity, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md). --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Amazon S3. --## Linked service properties -A linked service links a data store to a data factory. You create a linked service of type **AwsAccessKey** to link your Amazon S3 data store to your data factory. The following table provides descriptions for the JSON elements that are specific to the Amazon S3 (**AwsAccessKey**) linked service. --| Property | Description | Allowed values | Required | -| | | | | -| accessKeyID |The access key ID. |string |Yes | -| secretAccessKey |The secret access key itself. |Encrypted secret string |Yes | -->[!NOTE] ->This connector requires access keys for an IAM account to copy data from Amazon S3. [Temporary Security Credential](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html) is not supported. -> --Here is an example: --```json -{ - "name": "AmazonS3LinkedService", - "properties": { - "type": "AwsAccessKey", - "typeProperties": { - "accessKeyId": "<access key id>", - "secretAccessKey": "<secret access key>" - } - } -} -``` --## Dataset properties -To specify a dataset that represents input data in Amazon S3, set the **type** property of the dataset to **AmazonS3**. Set the **linkedServiceName** property of the dataset to the name of the Amazon S3 linked service. For a full list of sections and properties available for defining datasets, see [Creating datasets](data-factory-create-datasets.md). --Sections such as structure, availability, and policy are similar for all dataset types (such as SQL database, Azure blob, and Azure table). The **typeProperties** section is different for each type of dataset, and provides information about the location of the data in the data store. The **typeProperties** section for a dataset of type **AmazonS3** has the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| bucketName |The S3 bucket name. |String |Yes | -| key |The S3 object key. |String |No | -| prefix |Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when key is empty. |String |No | -| version |The version of the S3 object, if S3 versioning is enabled. |String |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see the [Text format](data-factory-supported-file-and-compression-formats.md#text-format), [JSON format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. 
<br><br> If you want to copy files as-is between file-based stores (binary copy), skip the format section in both input and output dataset definitions. | |No | -| compression | Specify the type and level of compression for the data. The supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. The supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). | |No | ---> [!NOTE] -> **bucketName + key** specifies the location of the S3 object, where bucket is the root container for S3 objects, and key is the full path to the S3 object. --### Sample dataset with prefix --```json -{ - "name": "dataset-s3", - "properties": { - "type": "AmazonS3", - "linkedServiceName": "link- testS3", - "typeProperties": { - "prefix": "testFolder/test", - "bucketName": "testbucket", - "format": { - "type": "OrcFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` -### Sample dataset (with version) --```json -{ - "name": "dataset-s3", - "properties": { - "type": "AmazonS3", - "linkedServiceName": "link- testS3", - "typeProperties": { - "key": "testFolder/test.orc", - "bucketName": "testbucket", - "version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL", - "format": { - "type": "OrcFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --### Dynamic paths for S3 -The preceding sample uses fixed values for the **key** and **bucketName** properties in the Amazon S3 dataset. --```json -"key": "testFolder/test.orc", -"bucketName": "testbucket", -``` --You can have Data Factory calculate these properties dynamically at runtime, by using system variables such as SliceStart. --```json -"key": "$$Text.Format('{0:MM}/{0:dd}/test.orc', SliceStart)" -"bucketName": "$$Text.Format('{0:yyyy}', SliceStart)" -``` --You can do the same for the **prefix** property of an Amazon S3 dataset. For a list of supported functions and variables, see [Data Factory functions and system variables](data-factory-functions-variables.md). --## Copy activity properties -For a full list of sections and properties available for defining activities, see [Creating pipelines](data-factory-create-pipelines.md). Properties such as name, description, input and output tables, and policies are available for all types of activities. Properties available in the **typeProperties** section of the activity vary with each activity type. For the copy activity, properties vary depending on the types of sources and sinks. When a source in the copy activity is of type **FileSystemSource** (which includes Amazon S3), the following property is available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Specifies whether to recursively list S3 objects under the directory. |true/false |No | --## JSON example: Copy data from Amazon S3 to Azure Blob storage -This sample shows how to copy data from Amazon S3 to an Azure Blob storage. However, data can be copied directly to [any of the sinks that are supported](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using the copy activity in Data Factory. --The sample provides JSON definitions for the following Data Factory entities. 
You can use these definitions to create a pipeline to copy data from Amazon S3 to Blob storage, by using the [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). --* A linked service of type [AwsAccessKey](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [AmazonS3](#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with copy activity that uses [FileSystemSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from Amazon S3 to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --### Amazon S3 linked service --```json -{ - "name": "AmazonS3LinkedService", - "properties": { - "type": "AwsAccessKey", - "typeProperties": { - "accessKeyId": "<access key id>", - "secretAccessKey": "<secret access key>" - } - } -} -``` --### Azure Storage linked service --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### Amazon S3 input dataset --Setting **"external": true** informs the Data Factory service that the dataset is external to the data factory. Set this property to true on an input dataset that is not produced by an activity in the pipeline. --```json - { - "name": "AmazonS3InputDataset", - "properties": { - "type": "AmazonS3", - "linkedServiceName": "AmazonS3LinkedService", - "typeProperties": { - "key": "testFolder/test.orc", - "bucketName": "testbucket", - "format": { - "type": "OrcFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } - } -``` ---### Azure Blob output dataset --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobOutputDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/fromamazons3/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ---### Copy activity in a pipeline with an Amazon S3 source and a blob sink --The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. 
In the pipeline JSON definition, the **source** type is set to **FileSystemSource**, and **sink** type is set to **BlobSink**. --```json -{ - "name": "CopyAmazonS3ToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource", - "recursive": true - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "AmazonS3InputDataset" - } - ], - "outputs": [ - { - "name": "AzureBlobOutputDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "AmazonS3ToBlob" - } - ], - "start": "2014-08-08T18:00:00Z", - "end": "2014-08-08T19:00:00Z" - } -} -``` -> [!NOTE] -> To map columns from a source dataset to columns from a sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). ---## Next steps -See the following articles: --* To learn about key factors that impact performance of data movement (copy activity) in Data Factory, and various ways to optimize it, see the [Copy activity performance and tuning guide](data-factory-copy-activity-performance.md). --* For step-by-step instructions for creating a pipeline with a copy activity, see the [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). |
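The Amazon S3 article above notes that the **prefix** property, like **key** and **bucketName**, can be computed at runtime with Data Factory functions and system variables, but it doesn't show a complete dataset that does so. Here is a minimal, hypothetical sketch assembled from the article's own samples; the dataset name, bucket, and folder layout are placeholders.

```json
{
    "name": "AmazonS3DynamicPrefixDataset",
    "properties": {
        "type": "AmazonS3",
        "linkedServiceName": "AmazonS3LinkedService",
        "typeProperties": {
            "bucketName": "testbucket",
            "prefix": "$$Text.Format('logs/{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)",
            "format": {
                "type": "TextFormat"
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true
    }
}
```

Because **prefix** applies only when **key** is empty, this dataset selects every S3 object whose key starts with the prefix computed for the slice being processed.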
data-factory | Data Factory Api Change Log | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-api-change-log.md | - Title: Data Factory - .NET API Change Log -description: Describes breaking changes, feature additions, bug fixes, and so on, in a specific version of .NET API for the Azure Data Factory. -------- Previously updated : 04/12/2023---# Azure Data Factory - .NET API change log -> [!NOTE] -> This article applies to version 1 of Data Factory. --This article provides information about changes to Azure Data Factory SDK in a specific version. You can find the latest NuGet package for Azure Data Factory [here](https://www.nuget.org/packages/Microsoft.Azure.Management.DataFactories) --## Version 4.11.0 -Feature Additions: --* The following linked service types have been added: - * [OnPremisesMongoDbLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.onpremisesmongodblinkedservice) - * [AmazonRedshiftLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.amazonredshiftlinkedservice) - * [AwsAccessKeyLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.awsaccesskeylinkedservice) -* The following dataset types have been added: - * [MongoDbCollectionDataset](/dotnet/api/microsoft.azure.management.datafactories.models.mongodbcollectiondataset) - * [AmazonS3Dataset](/dotnet/api/microsoft.azure.management.datafactories.models.amazons3dataset) -* The following copy source types have been added: - * [MongoDbSource](/dotnet/api/microsoft.azure.management.datafactories.models.mongodbsource) --## Version 4.10.0 -* The following optional properties have been added to TextFormat: - * [SkipLineCount](/dotnet/api/microsoft.azure.management.datafactories.models.textformat) - * [FirstRowAsHeader](/dotnet/api/microsoft.azure.management.datafactories.models.textformat) - * [TreatEmptyAsNull](/dotnet/api/microsoft.azure.management.datafactories.models.textformat) -* The following linked service types have been added: - * [OnPremisesCassandraLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.onpremisescassandralinkedservice) - * [SalesforceLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.salesforcelinkedservice) -* The following dataset types have been added: - * [OnPremisesCassandraTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.onpremisescassandratabledataset) -* The following copy source types have been added: - * [CassandraSource](/dotnet/api/microsoft.azure.management.datafactories.models.cassandrasource) -* Add [WebServiceInputs](/dotnet/api/microsoft.azure.management.datafactories.models.azuremlbatchexecutionactivity) property to AzureMLBatchExecutionActivity - * Enable passing multiple web service inputs to an Azure Machine Learning experiment --## Version 4.9.1 -### Bug fix -* Deprecate WebApi-based authentication for [WebLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.weblinkedservice). --## Version 4.9.0 -### Feature Additions -* Add [EnableStaging](/dotnet/api/microsoft.azure.management.datafactories.models.copyactivity) and [StagingSettings](/dotnet/api/microsoft.azure.management.datafactories.models.stagingsettings) properties to CopyActivity. See [Staged copy](data-factory-copy-activity-performance.md#staged-copy) for details on the feature. 
--### Bug fix -* Introduce an overload of [ActivityWindowOperationExtensions.List](/dotnet/api/microsoft.azure.management.datafactories.activitywindowoperationsextensions) method, which takes an [ActivityWindowsByActivityListParameters](/dotnet/api/microsoft.azure.management.datafactories.models.activitywindowsbyactivitylistparameters) instance. -* Mark [WriteBatchSize](/dotnet/api/microsoft.azure.management.datafactories.models.copysink) and [WriteBatchTimeout](/dotnet/api/microsoft.azure.management.datafactories.models.copysink) as optional in CopySink. --## Version 4.8.0 -### Feature Additions -* The following optional properties have been added to Copy activity type to enable tuning of copy performance: - * [ParallelCopies](/dotnet/api/microsoft.azure.management.datafactories.models.copyactivity) - * [CloudDataMovementUnits](/dotnet/api/microsoft.azure.management.datafactories.models.copyactivity) --## Version 4.7.0 -### Feature Additions -* Added new StorageFormat type [OrcFormat](/dotnet/api/microsoft.azure.management.datafactories.models.orcformat) type to copy files in optimized row columnar (ORC) format. -* Add [AllowPolyBase](/dotnet/api/microsoft.azure.management.datafactories.models.sqldwsink) and PolyBaseSettings properties to SqlDWSink. - * Enables the use of PolyBase to copy data into Azure Synapse Analytics. --## Version 4.6.1 -### Bug Fixes -* Fixes HTTP request for listing activity windows. - * Removes the resource group name and the data factory name from the request payload. --## Version 4.6.0 -### Feature Additions -* The following properties have been added to [PipelineProperties](/dotnet/api/microsoft.azure.management.datafactories.models.pipelineproperties): - * [PipelineMode](/dotnet/api/microsoft.azure.management.datafactories.models.pipelineproperties) - * [ExpirationTime](/dotnet/api/microsoft.azure.management.datafactories.models.pipelineproperties) - * [Datasets](/dotnet/api/microsoft.azure.management.datafactories.models.pipelineproperties) -* The following properties have been added to [PipelineRuntimeInfo](/dotnet/api/microsoft.azure.management.datafactories.common.models.pipelineruntimeinfo): - * [PipelineState](/dotnet/api/microsoft.azure.management.datafactories.common.models.pipelineruntimeinfo) -* Added new [StorageFormat](/dotnet/api/microsoft.azure.management.datafactories.models.storageformat) type [JsonFormat](/dotnet/api/microsoft.azure.management.datafactories.models.jsonformat) type to define datasets whose data is in JSON format. --## Version 4.5.0 -### Feature Additions -* Added [list operations for activity window](/dotnet/api/microsoft.azure.management.datafactories.activitywindowoperationsextensions). - * Added methods to retrieve activity windows with filters based on the entity types (that is, data factories, datasets, pipelines, and activities). 
-* The following linked service types have been added: - * [ODataLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.odatalinkedservice), [WebLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.weblinkedservice) -* The following dataset types have been added: - * [ODataResourceDataset](/dotnet/api/microsoft.azure.management.datafactories.models.odataresourcedataset), [WebTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.webtabledataset) -* The following copy source types have been added: - * [WebSource](/dotnet/api/microsoft.azure.management.datafactories.models.websource) --## Version 4.4.0 -### Feature additions -* The following linked service type has been added as a data source and sink for copy activities: - * [AzureStorageSasLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.azurestoragesaslinkedservice). See [Azure Storage SAS Linked Service](data-factory-azure-blob-connector.md#azure-storage-sas-linked-service) for conceptual information and examples. --## Version 4.3.0 -### Feature additions -* The following linked service types have been added as data sources for copy activities: - * [HdfsLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.hdfslinkedservice). See [Move data from HDFS using Data Factory](data-factory-hdfs-connector.md) for conceptual information and examples. - * [OnPremisesOdbcLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.onpremisesodbclinkedservice). See [Move data From ODBC data stores using Azure Data Factory](data-factory-odbc-connector.md) for conceptual information and examples. --## Version 4.2.0 -### Feature additions -* The following new activity type has been added: [AzureMLUpdateResourceActivity](/dotnet/api/microsoft.azure.management.datafactories.models.azuremlupdateresourceactivity). For details about the activity, see [Updating Azure ML models using the Update Resource Activity](data-factory-azure-ml-batch-execution-activity.md). -* A new optional property [updateResourceEndpoint](/dotnet/api/microsoft.azure.management.datafactories.models.azuremllinkedservice) has been added to the [AzureMLLinkedService class](/dotnet/api/microsoft.azure.management.datafactories.models.azuremllinkedservice). -* [LongRunningOperationInitialTimeout](/dotnet/api/microsoft.azure.management.datafactories.datafactorymanagementclient) and [LongRunningOperationRetryTimeout](/dotnet/api/microsoft.azure.management.datafactories.datafactorymanagementclient) properties have been added to the [DataFactoryManagementClient](/dotnet/api/microsoft.azure.management.datafactories.datafactorymanagementclient) class. -* Allow configuration of the timeouts for client calls to the Data Factory service. 
--## Version 4.1.0 -### Feature additions -* The following linked service types have been added: - * [AzureDataLakeStoreLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestorelinkedservice) - * [AzureDataLakeAnalyticsLinkedService](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakeanalyticslinkedservice) -* The following activity types have been added: - * [DataLakeAnalyticsUSQLActivity](/dotnet/api/microsoft.azure.management.datafactories.models.datalakeanalyticsusqlactivity) -* The following dataset types have been added: - * [AzureDataLakeStoreDataset](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestoredataset) -* The following source and sink types for Copy Activity have been added: - * [AzureDataLakeStoreSource](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestoresource) - * [AzureDataLakeStoreSink](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestoresink) --## Version 4.0.1 -### Breaking changes -The following classes have been renamed. The new names were the original names of classes before 4.0.0 release. --| Name in 4.0.0 | Name in 4.0.1 | -|: |: | -| AzureSqlDataWarehouseDataset |[AzureSqlDataWarehouseTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.azuresqldatawarehousetabledataset) | -| AzureSqlDataset |[AzureSqlTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.azuresqltabledataset) | -| AzureDataset |[AzureTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.azuretabledataset) | -| OracleDataset |[OracleTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.oracletabledataset) | -| RelationalDataset |[RelationalTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.relationaltabledataset) | -| SqlServerDataset |[SqlServerTableDataset](/dotnet/api/microsoft.azure.management.datafactories.models.sqlservertabledataset) | --## Version 4.0.0 --### Breaking changes --* The Following classes/interfaces have been renamed. --| Old name | New name | -|: |: | -| ITableOperations |[IDatasetOperations](/dotnet/api/microsoft.azure.management.datafactories.idatasetoperations) | -| Table |[Dataset](/dotnet/api/microsoft.azure.management.datafactories.models.dataset) | -| TableProperties |[DatasetProperties](/dotnet/api/microsoft.azure.management.datafactories.models.datasetproperties) | -| TableTypeProprerties |[DatasetTypeProperties](/dotnet/api/microsoft.azure.management.datafactories.models.datasettypeproperties) | -| TableCreateOrUpdateParameters |[DatasetCreateOrUpdateParameters](/dotnet/api/microsoft.azure.management.datafactories.models.datasetcreateorupdateparameters) | -| TableCreateOrUpdateResponse |[DatasetCreateOrUpdateResponse](/dotnet/api/microsoft.azure.management.datafactories.models.datasetcreateorupdateresponse) | -| TableGetResponse |[DatasetGetResponse](/dotnet/api/microsoft.azure.management.datafactories.models.datasetgetresponse) | -| TableListResponse |[DatasetListResponse](/dotnet/api/microsoft.azure.management.datafactories.models.datasetlistresponse) | -| CreateOrUpdateWithRawJsonContentParameters |[DatasetCreateOrUpdateWithRawJsonContentParameters](/dotnet/api/microsoft.azure.management.datafactories.models.datasetcreateorupdatewithrawjsoncontentparameters) | --* The **List** methods return paged results now. 
If the response contains a non-empty **NextLink** property, the client application needs to continue fetching the next page until all pages are returned. Here is an example: -- ```csharp - PipelineListResponse response = client.Pipelines.List("ResourceGroupName", "DataFactoryName"); - var pipelines = new List<Pipeline>(response.Pipelines); -- string nextLink = response.NextLink; - while (!string.IsNullOrEmpty(nextLink)) - { - PipelineListResponse nextResponse = client.Pipelines.ListNext(nextLink); - pipelines.AddRange(nextResponse.Pipelines); -- nextLink = nextResponse.NextLink; - } - ``` --* **List** pipeline API returns only the summary of a pipeline instead of full details. For instance, activities in a pipeline summary only contain name and type. --### Feature additions -* The [SqlDWSink](/dotnet/api/microsoft.azure.management.datafactories.models.sqldwsink) class supports two new properties, **SliceIdentifierColumnName** and **SqlWriterCleanupScript**, to support idempotent copy to Azure Synapse Analytics. See the [Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md) article for details about these properties. -* We now support running stored procedure against Azure SQL Database and Azure Synapse Analytics sources as part of the Copy Activity. The [SqlSource](/dotnet/api/microsoft.azure.management.datafactories.models.sqlsource) and [SqlDWSource](/dotnet/api/microsoft.azure.management.datafactories.models.sqldwsource) classes have the following properties: **SqlReaderStoredProcedureName** and **StoredProcedureParameters**. See the [Azure SQL Database](data-factory-azure-sql-connector.md#sqlsource) and [Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#sqldwsource) articles on Azure.com for details about these properties. |
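To make the version 4.0.0 feature additions above more concrete, here is a hypothetical JSON sketch of a copy activity **typeProperties** section that reads from Azure SQL Database through a stored procedure and writes to Azure Synapse Analytics with one of the idempotent-copy properties. The camelCase JSON names (**sqlReaderStoredProcedureName**, **storedProcedureParameters**, **sliceIdentifierColumnName**) are assumed to mirror the .NET properties named in the change log, as described in the connector articles it links to; the stored procedure, parameter, and column names are placeholders.

```json
{
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "SqlSource",
            "sqlReaderStoredProcedureName": "usp_GetSliceData",
            "storedProcedureParameters": {
                "sliceStart": { "value": "$$Text.Format('{0:yyyy-MM-dd HH:mm}', WindowStart)" }
            }
        },
        "sink": {
            "type": "SqlDWSink",
            "allowPolyBase": true,
            "sliceIdentifierColumnName": "sliceIdentifier"
        }
    }
}
```

Per the change log, **sqlWriterCleanupScript** is the other SqlDWSink property that supports idempotent copy; see the Azure Synapse Analytics connector article it references for guidance on when to use each.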
data-factory | Data Factory Azure Blob Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-blob-connector.md | - Title: Copy data to/from Azure Blob Storage -description: 'Learn how to copy blob data in Azure Data Factory. Use our sample: How to copy data to and from Azure Blob Storage and Azure SQL Database.' ---- Previously updated : 04/12/2023----# Copy data to or from Azure Blob Storage using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-blob-connector.md) -> * [Version 2 (current version)](../connector-azure-blob-storage.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Blob Storage connector in V2](../connector-azure-blob-storage.md). ---This article explains how to use the Copy Activity in Azure Data Factory to copy data to and from Azure Blob Storage. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --## Overview -You can copy data from any supported source data store to Azure Blob Storage or from Azure Blob Storage to any supported sink data store. The following table provides a list of data stores supported as sources or sinks by the copy activity. For example, you can move data **from** a SQL Server database or a database in Azure SQL Database **to** an Azure blob storage. And, you can copy data **from** Azure blob storage **to** Azure Synapse Analytics or an Azure Cosmos DB collection. ---## Supported scenarios -You can copy data **from Azure Blob Storage** to the following data stores: ---You can copy data from the following data stores **to Azure Blob Storage**: ---> [!IMPORTANT] -> Copy Activity supports copying data from/to both general-purpose Azure Storage accounts and Hot/Cool Blob storage. The activity supports **reading from block, append, or page blobs**, but supports **writing to only block blobs**. Azure Premium Storage is not supported as a sink because it is backed by page blobs. -> -> Copy Activity does not delete data from the source after the data is successfully copied to the destination. If you need to delete source data after a successful copy, create a [custom activity](data-factory-use-custom-activities.md) to delete the data and use the activity in the pipeline. For an example, see the [Delete blob or folder sample on GitHub](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/DeleteBlobFileFolderCustomActivity). --## Get started -You can create a pipeline with a copy activity that moves data to/from an Azure Blob Storage by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. This article has a [walkthrough](#walkthrough-use-copy-wizard-to-copy-data-tofrom-blob-storage) for creating a pipeline to copy data from an Azure Blob Storage location to another Azure Blob Storage location. For a tutorial on creating a pipeline to copy data from an Azure Blob Storage to Azure SQL Database, see [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md). --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. 
See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to Azure SQL Database, you create two linked services to link your Azure storage account and Azure SQL Database to your data factory. For linked service properties that are specific to Azure Blob Storage, see [linked service properties](#linked-service-properties) section. -2. Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in Azure SQL Database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Blob Storage, see [dataset properties](#dataset-properties) section. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Blob Storage, see [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Blob Storage, see [JSON examples](#json-examples-for-copying-data-to-and-from-blob-storage) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Blob Storage. --## Linked service properties -There are two types of linked services you can use to link an Azure Storage to an Azure data factory. They are: **AzureStorage** linked service and **AzureStorageSas** linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage. Whereas, The Azure Storage SAS (Shared Access Signature) linked service provides the data factory with restricted/time-bound access to the Azure Storage. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services. ---## Dataset properties -To specify a dataset to represent input or output data in an Azure Blob Storage, you set the type property of the dataset to: **AzureBlob**. Set the **linkedServiceName** property of the dataset to the name of the Azure Storage or Azure Storage SAS linked service. 
The type properties of the dataset specify the **blob container** and the **folder** in the blob storage. --For a full list of JSON sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --Data factory supports the following CLS-compliant .NET based type values for providing type information in "structure" for schema-on-read data sources like Azure blob: Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, Timespan. Data Factory automatically performs type conversions when moving data from a source data store to a sink data store. --The **typeProperties** section is different for each type of dataset and provides information about the location, format, and so on, of the data in the data store. The **typeProperties** section for a dataset of type **AzureBlob** has the following properties: --| Property | Description | Required | -| | | | -| folderPath |Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\ |Yes | -| fileName |Name of the blob. fileName is optional and case-sensitive.<br/><br/>If you specify a filename, the activity (including Copy) works on the specific Blob.<br/><br/>When fileName is not specified, Copy includes all Blobs in the folderPath for the input dataset.<br/><br/>When **fileName** is not specified for an output dataset and **preserveHierarchy** is not specified in the activity sink, the name of the generated file is in the following format: `Data.<Guid>.txt` (for example, `Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt`). |No | -| partitionedBy |partitionedBy is an optional property. You can use it to specify a dynamic folderPath and filename for time series data. For example, folderPath can be parameterized for every hour of data. See the [Using partitionedBy property section](#using-partitionedby-property) for details and examples. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --### Using partitionedBy property -As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data with the **partitionedBy** property, [Data Factory functions, and the system variables](data-factory-functions-variables.md). 
--For more information on time series datasets, scheduling, and slices, see [Creating Datasets](data-factory-create-datasets.md) and [Scheduling & Execution](data-factory-scheduling-and-execution.md) articles. --#### Sample 1 --```json -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` --In this example, {Slice} is replaced with the value of Data Factory system variable SliceStart in the format (YYYYMMDDHH) specified. The SliceStart refers to start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104 --#### Sample 2 --```json -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": -[ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` --In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties. --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output datasets, and policies are available for all types of activities. Whereas, properties available in the **typeProperties** section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. If you are moving data from an Azure Blob Storage, you set the source type in the copy activity to **BlobSource**. Similarly, if you are moving data to an Azure Blob Storage, you set the sink type in the copy activity to **BlobSink**. This section provides a list of properties supported by BlobSource and BlobSink. --**BlobSource** supports the following properties in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. |True (default value), False |No | --**BlobSink** supports the following properties **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| copyBehavior |Defines the copy behavior when the source is BlobSource or FileSystem. |<b>PreserveHierarchy</b>: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><br/><b>FlattenHierarchy</b>: all files from the source folder are in the first level of target folder. The target files have auto generated name. <br/><br/><b>MergeFiles</b>: merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name would be the specified name; otherwise, would be auto-generated file name. |No | --**BlobSource** also supports these two properties for backward compatibility. --* **treatEmptyAsNull**: Specifies whether to treat null or empty string as null value. 
-* **skipHeaderLineCount** - Specifies how many lines need be skipped. It is applicable only when input dataset is using TextFormat. --Similarly, **BlobSink** supports the following property for backward compatibility. --* **blobWriterAddHeader**: Specifies whether to add a header of column definitions while writing to an output dataset. --Datasets now support the following properties that implement the same functionality: **treatEmptyAsNull**, **skipLineCount**, **firstRowAsHeader**. --The following table provides guidance on using the new dataset properties in place of these blob source/sink properties. --| Copy Activity property | Dataset property | -|: |: | -| skipHeaderLineCount on BlobSource |skipLineCount and firstRowAsHeader. Lines are skipped first and then the first row is read as a header. | -| treatEmptyAsNull on BlobSource |treatEmptyAsNull on input dataset | -| blobWriterAddHeader on BlobSink |firstRowAsHeader on output dataset | --See [Specifying TextFormat](data-factory-supported-file-and-compression-formats.md#text-format) section for detailed information on these properties. --### recursive and copyBehavior examples -This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. --| recursive | copyBehavior | Resulting behavior | -| | | | -| true |preserveHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the same structure as the source<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5. | -| true |flattenHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/> auto-generated name for File3<br/> auto-generated name for File4<br/> auto-generated name for File5 | -| true |mergeFiles |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> File1 + File2 + File3 + File4 + File 5 contents are merged into one file with auto-generated file name | -| false |preserveHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1<br/> File2<br/><br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. | -| false |flattenHierarchy |For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/><br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. 
| -| false |mergeFiles |For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1 + File2 contents are merged into one file with an auto-generated file name.<br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. | --## Walkthrough: Use Copy Wizard to copy data to/from Blob Storage -Let's look at how to quickly copy data to/from an Azure blob storage. In this walkthrough, both the source and destination data stores are of type Azure Blob Storage. The pipeline in this walkthrough copies data from a folder to another folder in the same blob container. This walkthrough is intentionally simple to show you the settings or properties used when Blob Storage is a source or sink. --### Prerequisites -1. Create a general-purpose **Azure Storage Account** if you don't have one already. You use the blob storage as both **source** and **destination** data store in this walkthrough. If you don't have an Azure storage account, see the [Create a storage account](../../storage/common/storage-account-create.md) article for steps to create one. -2. Create a blob container named **adfblobconnector** in the storage account. -3. Create a folder named **input** in the **adfblobconnector** container. -4. Create a file named **emp.txt** with the following content and upload it to the **input** folder by using tools such as [Azure Storage Explorer](https://azure.microsoft.com/features/storage-explorer/): - ```text - John, Doe - Jane, Doe - ``` --### Create the data factory -1. Sign in to the [Azure portal](https://portal.azure.com). -2. Click **Create a resource** from the top-left corner, click **Intelligence + analytics**, and click **Data Factory**. -3. In the **New data factory** pane: - 1. Enter **ADFBlobConnectorDF** for the **name**. The name of the Azure data factory must be globally unique. If you receive the error: `Data factory name "ADFBlobConnectorDF" is not available`, change the name of the data factory (for example, yournameADFBlobConnectorDF) and try creating again. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. - 2. Select your Azure **subscription**. - 3. For Resource Group, select **Use existing** to select an existing resource group (or) select **Create new** to enter a name for a resource group. - 4. Select a **location** for the data factory. - 5. Select **Pin to dashboard** check box at the bottom of the blade. - 6. Click **Create**. -4. After the creation is complete, you see the **Data Factory** blade as shown in the following image: - :::image type="content" source="./media/data-factory-azure-blob-connector/data-factory-home-page.png" alt-text="Data factory home page"::: --### Copy Wizard -1. On the Data Factory home page, click the **Copy data** tile to launch **Copy Data Wizard** in a separate tab. -- > [!NOTE] - > If you see that the web browser is stuck at "Authorizing...", disable/uncheck **Block third-party cookies and site data** setting (or) keep it enabled and create an exception for **login.microsoftonline.com** and then try launching the wizard again. -2. In the **Properties** page: - 1. Enter **CopyPipeline** for **Task name**. The task name is the name of the pipeline in your data factory. - 2. Enter a **description** for the task (optional). - 3.
For **Task cadence or Task schedule**, keep the **Run regularly on schedule** option. If you want to run this task only once instead of repeatedly on a schedule, select **Run once now**. If you select the **Run once now** option, a [one-time pipeline](data-factory-create-pipelines.md#onetime-pipeline) is created. - 4. Keep the settings for **Recurring pattern**. This task runs daily between the start and end times you specify in the next step. - 5. Change the **Start date time** to **04/21/2017**. - 6. Change the **End date time** to **04/25/2017**. You may want to type the date instead of browsing through the calendar. - 7. Click **Next**. - :::image type="content" source="./media/data-factory-azure-blob-connector/copy-tool-properties-page.png" alt-text="Copy Tool - Properties page"::: -3. On the **Source data store** page, click the **Azure Blob Storage** tile. You use this page to specify the source data store for the copy task. You can use an existing data store linked service (or) specify a new data store. To use an existing linked service, you would select **FROM EXISTING LINKED SERVICES** and select the right linked service. - :::image type="content" source="./media/data-factory-azure-blob-connector/copy-tool-source-data-store-page.png" alt-text="Copy Tool - Source data store page"::: -4. On the **Specify the Azure Blob storage account** page: - 1. Keep the auto-generated name for **Connection name**. The connection name is the name of the linked service of type: Azure Storage. - 2. Confirm that the **From Azure subscriptions** option is selected for **Account selection method**. - 3. Select your Azure subscription or keep **Select all** for **Azure subscription**. - 4. Select an **Azure storage account** from the list of Azure storage accounts available in the selected subscription. You can also choose to enter storage account settings manually by selecting the **Enter manually** option for the **Account selection method**. - 5. Click **Next**. - :::image type="content" source="./media/data-factory-azure-blob-connector/copy-tool-specify-azure-blob-storage-account.png" alt-text="Copy Tool - Specify the Azure Blob storage account"::: -5. On the **Choose the input file or folder** page: - 1. Double-click **adfblobconnector**. - 2. Select **input**, and click **Choose**. In this walkthrough, you select the input folder. You could also select the emp.txt file in the folder instead. - :::image type="content" source="./media/data-factory-azure-blob-connector/copy-tool-choose-input-file-or-folder.png" alt-text="Copy Tool - Choose the input file or folder 1"::: -6. On the **Choose the input file or folder** page: - 1. Confirm that the **file or folder** is set to **adfblobconnector/input**. If the files are in sub folders, for example, 2017/04/01, 2017/04/02, and so on, enter adfblobconnector/input/{year}/{month}/{day} for file or folder. When you press TAB out of the text box, you see three drop-down lists to select formats for year (yyyy), month (MM), and day (dd). - 2. Do not select **Copy file recursively**. This option recursively traverses through folders for files to be copied to the destination. - 3. Do not select the **Binary copy** option. This option performs a binary copy of the source file to the destination. Leave it cleared for this walkthrough so that you can see more options on the next pages. - 4. Confirm that the **Compression type** is set to **None**. Select a value for this option if your source files are compressed in one of the supported formats. - 5. Click **Next**.
- :::image type="content" source="./media/data-factory-azure-blob-connector/chose-input-file-folder.png" alt-text="Copy Tool - Choose the input file or folder 2"::: -7. On the **File format settings** page, you see the delimiters and the schema that is auto-detected by the wizard by parsing the file. - 1. Confirm the following options: - a. The **file format** is set to **Text format**. You can see all the supported formats in the drop-down list. For example: JSON, Avro, ORC, Parquet. - b. The **column delimiter** is set to `Comma (,)`. You can see the other column delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter. - c. The **row delimiter** is set to `Carriage Return + Line feed (\r\n)`. You can see the other row delimiters supported by Data Factory in the drop-down list. You can also specify a custom delimiter. - d. The **skip line count** is set to **0**. If you want a few lines to be skipped at the top of the file, enter the number here. - e. The **first data row contains column names** is not set. If the source files contain column names in the first row, select this option. - f. The **treat empty column value as null** option is set. - 2. Expand **Advanced settings** to see advanced option available. - 3. At the bottom of the page, see the **preview** of data from the emp.txt file. - 4. Click **SCHEMA** tab at the bottom to see the schema that the copy wizard inferred by looking at the data in the source file. - 5. Click **Next** after you review the delimiters and preview data. - :::image type="content" source="./media/data-factory-azure-blob-connector/copy-tool-file-format-settings.png" alt-text="Copy Tool - File format settings"::: -8. On the **Destination data store page**, select **Azure Blob Storage**, and click **Next**. You are using the Azure Blob Storage as both the source and destination data stores in this walkthrough. - :::image type="content" source="media/data-factory-azure-blob-connector/select-destination-data-store.png" alt-text="Copy Tool - select destination data store"::: -9. On **Specify the Azure Blob storage account** page: - 1. Enter **AzureStorageLinkedService** for the **Connection name** field. - 2. Confirm that **From Azure subscriptions** option is selected for **Account selection method**. - 3. Select your Azure **subscription**. - 4. Select your Azure storage account. - 5. Click **Next**. -10. On the **Choose the output file or folder** page: - 1. specify **Folder path** as **adfblobconnector/output/{year}/{month}/{day}**. Enter **TAB**. - 1. For the **year**, select **yyyy**. - 1. For the **month**, confirm that it is set to **MM**. - 1. For the **day**, confirm that it is set to **dd**. - 1. Confirm that the **compression type** is set to **None**. - 1. Confirm that the **copy behavior** is set to **Merge files**. If the output file with the same name already exists, the new content is added to the same file at the end. - 1. Click **Next**. - :::image type="content" source="media/data-factory-azure-blob-connector/choose-the-output-file-or-folder.png" alt-text="Copy Tool - Choose output file or folder"::: -11. On the **File format settings** page, review the settings, and click **Next**. One of the additional options here is to add a header to the output file. If you select that option, a header row is added with names of the columns from the schema of the source. You can rename the default column names when viewing the schema for the source. 
For example, you could change the first column to First Name and second column to Last Name. Then, the output file is generated with a header with these names as column names. - :::image type="content" source="media/data-factory-azure-blob-connector/file-format-destination.png" alt-text="Copy Tool - File format settings for destination"::: -12. On the **Performance settings** page, confirm that **cloud units** and **parallel copies** are set to **Auto**, and click Next. For details about these settings, see [Copy activity performance and tuning guide](data-factory-copy-activity-performance.md#parallel-copy). - :::image type="content" source="media/data-factory-azure-blob-connector/copy-performance-settings.png" alt-text="Copy Tool - Performance settings"::: -14. On the **Summary** page, review all settings (task properties, settings for source and destination, and copy settings), and click **Next**. - :::image type="content" source="media/data-factory-azure-blob-connector/copy-tool-summary-page.png" alt-text="Copy Tool - Summary page"::: -15. Review information in the **Summary** page, and click **Finish**. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard). - :::image type="content" source="media/data-factory-azure-blob-connector/copy-tool-deployment-page.png" alt-text="Copy Tool - Deployment page"::: --### Monitor the pipeline (copy task) --1. Click the link `Click here to monitor copy pipeline` on the **Deployment** page. -2. You should see the **Monitor and Manage application** in a separate tab. - :::image type="content" source="media/data-factory-azure-blob-connector/monitor-manage-app.png" alt-text="Monitor and Manage App"::: -3. Change the **start** time at the top to `04/19/2017` and **end** time to `04/27/2017`, and then click **Apply**. -4. You should see five activity windows in the **ACTIVITY WINDOWS** list. The **WindowStart** times should cover all days from pipeline start to pipeline end times. -5. Click **Refresh** button for the **ACTIVITY WINDOWS** list a few times until you see the status of all the activity windows is set to Ready. -6. Now, verify that the output files are generated in the output folder of adfblobconnector container. You should see the following folder structure in the output folder: -- ```output - 2017/04/21 - 2017/04/22 - 2017/04/23 - 2017/04/24 - 2017/04/25 - ``` -- For detailed information about monitoring and managing data factories, see [Monitor and manage Data Factory pipeline](data-factory-monitor-manage-app.md) article. --### Data Factory entities -Now, switch back to the tab with the Data Factory home page. Notice that there are two linked services, two datasets, and one pipeline in your data factory now. ---Click **Author and deploy** to launch Data Factory Editor. ---You should see the following Data Factory entities in your data factory: --- Two linked services. One for the source and the other one for the destination. Both the linked services refer to the same Azure Storage account in this walkthrough.-- Two datasets. An input dataset and an output dataset. In this walkthrough, both use the same blob container but refer to different folders (input and output).-- A pipeline. The pipeline contains a copy activity that uses a blob source and a blob sink to copy data from an Azure blob location to another Azure blob location.--The following sections provide more information about these entities. 
--#### Linked services -You should see two linked services. One for the source and the other one for the destination. In this walkthrough, both definitions look the same except for the names. The **type** of the linked service is set to **AzureStorage**. Most important property of the linked service definition is the **connectionString**, which is used by Data Factory to connect to your Azure Storage account at runtime. Ignore the hubName property in the definition. --##### Source blob storage linked service -```json -{ - "name": "Source-BlobStorage-z4y", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********" - } - } -} -``` --##### Destination blob storage linked service --```json -{ - "name": "Destination-BlobStorage-z4y", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=**********" - } - } -} -``` --For more information about Azure Storage linked service, see [Linked service properties](#linked-service-properties) section. --#### Datasets -There are two datasets: an input dataset and an output dataset. The type of the dataset is set to **AzureBlob** for both. --The input dataset points to the **input** folder of the **adfblobconnector** blob container. The **external** property is set to **true** for this dataset as the data is not produced by the pipeline with the copy activity that takes this dataset as an input. --The output dataset points to the **output** folder of the same blob container. The output dataset also uses the year, month, and day of the **SliceStart** system variable to dynamically evaluate the path for the output file. For a list of functions and system variables supported by Data Factory, see [Data Factory functions and system variables](data-factory-functions-variables.md). The **external** property is set to **false** (default value) because this dataset is produced by the pipeline. --For more information about properties supported by Azure Blob dataset, see [Dataset properties](#dataset-properties) section. 
--##### Input dataset --```json -{ - "name": "InputDataset-z4y", - "properties": { - "structure": [ - { "name": "Prop_0", "type": "String" }, - { "name": "Prop_1", "type": "String" } - ], - "type": "AzureBlob", - "linkedServiceName": "Source-BlobStorage-z4y", - "typeProperties": { - "folderPath": "adfblobconnector/input/", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` --##### Output dataset --```json -{ - "name": "OutputDataset-z4y", - "properties": { - "structure": [ - { "name": "Prop_0", "type": "String" }, - { "name": "Prop_1", "type": "String" } - ], - "type": "AzureBlob", - "linkedServiceName": "Destination-BlobStorage-z4y", - "typeProperties": { - "folderPath": "adfblobconnector/output/{year}/{month}/{day}", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - }, - "partitionedBy": [ - { "name": "year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } } - ] - }, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "external": false, - "policy": {} - } -} -``` --#### Pipeline -The pipeline has just one activity. The **type** of the activity is set to **Copy**. In the type properties for the activity, there are two sections, one for source and the other one for sink. The source type is set to **BlobSource** as the activity is copying data from a blob storage. The sink type is set to **BlobSink** as the activity copying data to a blob storage. The copy activity takes InputDataset-z4y as the input and OutputDataset-z4y as the output. --For more information about properties supported by BlobSource and BlobSink, see [Copy activity properties](#copy-activity-properties) section. --```json -{ - "name": "CopyPipeline", - "properties": { - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource", - "recursive": false - }, - "sink": { - "type": "BlobSink", - "copyBehavior": "MergeFiles", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "InputDataset-z4y" - } - ], - "outputs": [ - { - "name": "OutputDataset-z4y" - } - ], - "policy": { - "timeout": "1.00:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "style": "StartOfInterval", - "retry": 3, - "longRetry": 0, - "longRetryInterval": "00:00:00" - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "Activity-0-Blob path_ adfblobconnector_input_->OutputDataset-z4y" - } - ], - "start": "2017-04-21T22:34:00Z", - "end": "2017-04-25T05:00:00Z", - "isPaused": false, - "pipelineMode": "Scheduled" - } -} -``` --## JSON examples for copying data to and from Blob Storage -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from Azure Blob Storage and Azure SQL Database. However, data can be copied **directly** from any of sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. 
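The examples that follow use the **AzureStorage** linked service, which authenticates with the storage account key in the connection string. If you want the restricted, time-bound access described in the [Linked service properties](#linked-service-properties) section, you can substitute an **AzureStorageSas** linked service wherever the storage linked service is referenced. A minimal sketch, assuming the shared access signature is supplied through the `sasUri` type property (the URI shown is a placeholder):

```json
{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "<SAS URI of the Azure Storage resource, for example a blob container SAS>"
        }
    }
}
```

Because a SAS grants scoped, expiring access, make sure its validity window covers the start and end times of any scheduled pipeline that uses it.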
--### JSON Example: Copy data from Blob Storage to SQL Database -The following sample shows: --1. A linked service of type [AzureSqlDatabase](data-factory-azure-sql-connector.md#linked-service-properties). -2. A linked service of type [AzureStorage](#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureSqlTable](data-factory-azure-sql-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with a Copy activity that uses [BlobSource](#copy-activity-properties) and [SqlSink](data-factory-azure-sql-connector.md#copy-activity-properties). --The sample copies time-series data from an Azure blob to an Azure SQL table hourly. The JSON properties used in these samples are described in sections following the samples. --**Azure SQL linked service:** --```json -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -**Azure Storage linked service:** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -Azure Data Factory supports two types of Azure Storage linked --**Azure Blob input dataset:** --Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. "external": "true" setting informs Data Factory that the table is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**Azure SQL output dataset:** --The sample copies data to a table named "MyTable" in Azure SQL Database. Create the table in your SQL database with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. 
--```json -{ - "name": "AzureSqlOutput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**A copy activity in a pipeline with Blob source and SQL sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and **sink** type is set to **SqlSink**. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureSqlOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -### JSON Example: Copy data from Azure SQL to Azure Blob -The following sample shows: --1. A linked service of type [AzureSqlDatabase](data-factory-azure-sql-connector.md#linked-service-properties). -2. A linked service of type [AzureStorage](#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureSqlTable](data-factory-azure-sql-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [SqlSource](data-factory-azure-sql-connector.md#copy-activity-properties) and [BlobSink](#copy-activity-properties). --The sample copies time-series data from an Azure SQL table to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples. --**Azure SQL linked service:** --```json -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -**Azure Storage linked service:** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -Azure Data Factory supports two types of Azure Storage linked --**Azure SQL input dataset:** --The sample assumes you have created a table "MyTable" in Azure SQL and it contains a column called "timestampcolumn" for time series data. --Setting "external": "true" informs Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. 
--```json -{ - "name": "AzureSqlInput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/", - "partitionedBy": [ - { - "name": "Year", - "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" } } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**A copy activity in a pipeline with SQL source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **SqlReaderQuery** property selects the data in the past hour to copy. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureSQLtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureSQLInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
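When you author the copy activity JSON yourself instead of using the Copy Wizard (which sets **cloud units** and **parallel copies** to **Auto** in the walkthrough above), the equivalent knobs can be set explicitly in the activity's **typeProperties**. A minimal sketch, assuming the `parallelCopies` and `cloudDataMovementUnits` settings described in the performance and tuning guide; the activity name and values are illustrative, and inputs, outputs, scheduler, and policy are omitted for brevity:

```json
{
    "type": "Copy",
    "name": "CopyWithExplicitParallelism",
    "typeProperties": {
        "source": { "type": "SqlSource" },
        "sink": { "type": "BlobSink" },
        "parallelCopies": 4,
        "cloudDataMovementUnits": 4
    }
}
```

Leave both settings at their defaults (or the wizard's **Auto**) unless measurements against the guidance in the performance and tuning guide suggest otherwise.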
data-factory | Data Factory Azure Copy Wizard | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-copy-wizard.md | - Title: Data Factory Azure Copy Wizard -description: Learn about how to use the Data Factory Azure Copy Wizard to copy data from supported data sources to sinks. ---- Previously updated : 04/12/2023-----# Azure Data Factory Copy Wizard --> [!NOTE] -> This article applies to version 1 of Data Factory. --The Azure Data Factory Copy Wizard eases the process of ingesting data, which is usually a first step in an end-to-end data integration scenario. When going through the Azure Data Factory Copy Wizard, you do not need to understand any JSON definitions for linked services, data sets, and pipelines. The wizard automatically creates a pipeline to copy data from the selected data source to the selected destination. In addition, the Copy Wizard helps you to validate the data being ingested at the time of authoring. This saves time, especially when you are ingesting data for the first time from the data source. To start the Copy Wizard, click the **Copy data** tile on the home page of your data factory. ---## Designed for big data -This wizard allows you to easily move data from a wide variety of sources to destinations in minutes. After you go through the wizard, a pipeline with a copy activity is automatically created for you, along with dependent Data Factory entities (linked services and data sets). No additional steps are required to create the pipeline. ---> [!NOTE] -> For step-by-step instructions to create a sample pipeline to copy data from an Azure blob to an Azure SQL Database table, see the [Copy Wizard tutorial](data-factory-copy-data-wizard-tutorial.md). --The wizard is designed with big data in mind from the start, with support for diverse data and object types. You can author Data Factory pipelines that move hundreds of folders, files, or tables. The wizard supports automatic data preview, schema capture and mapping, and data filtering. --## Automatic data preview -You can preview part of the data from the selected data source in order to validate whether the data is what you want to copy. In addition, if the source data is in a text file, the Copy Wizard parses the text file to learn the row and column delimiters and schema automatically. ---## Schema capture and mapping -The schema of input data may not match the schema of output data in some cases. In this scenario, you need to map columns from the source schema to columns from the destination schema. --> [!TIP] -> When copying data from SQL Server or Azure SQL Database into Azure Synapse Analytics, if the table does not exist in the destination store, Data Factory support auto table creation using source's schema. Learn more from [Move data to and from Azure Synapse Analytics using Azure Data Factory](./data-factory-azure-sql-data-warehouse-connector.md). --Use a drop-down list to select a column from the source schema to map to a column in the destination schema. The Copy Wizard tries to understand your pattern for column mapping. It applies the same pattern to the rest of the columns, so that you do not need to select each of the columns individually to complete the schema mapping. If you prefer, you can override these mappings by using the drop-down lists to map the columns one by one. The pattern becomes more accurate as you map more columns. 
The Copy Wizard constantly updates the pattern, and ultimately reaches the right pattern for the column mapping you want to achieve. ---## Filtering data -You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces the volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy operation. It provides a flexible way to filter data in a relational database by using the SQL query language, or files in an Azure blob folder by using [Data Factory functions and variables](data-factory-functions-variables.md). --### Filtering of data in a database -The following screenshot shows a SQL query using the `Text.Format` function and `WindowStart` variable. ---### Filtering of data in an Azure blob folder -You can use variables in the folder path to copy data from a folder that is determined at runtime based on [system variables](data-factory-functions-variables.md#data-factory-system-variables). The supported variables are: **{year}**, **{month}**, **{day}**, **{hour}**, **{minute}**, and **{custom}**. For example: inputfolder/{year}/{month}/{day}. --Suppose that you have input folders in the following format: --```text -2016/03/01/01 -2016/03/01/02 -2016/03/01/03 -... -``` --Click the **Browse** button for **File or folder**, browse to one of these folders (for example, 2016->03->01->02), and click **Choose**. You should see `2016/03/01/02` in the text box. Now, replace **2016** with **{year}**, **03** with **{month}**, **01** with **{day}**, and **02** with **{hour}**, and press the **Tab** key. You should see drop-down lists to select the format for these four variables: ---As shown in the following screenshot, you can also use a **custom** variable and any [supported format strings](/dotnet/standard/base-types/custom-date-and-time-format-strings). To select a folder with that structure, use the **Browse** button first. Then replace a value with **{custom}**, and press the **Tab** key to see the text box where you can type the format string. ---## Scheduling options -You can run the copy operation once or on a schedule (hourly, daily, and so on). Both of these options can be used for the breadth of the connectors across environments, including on-premises, cloud, and local desktop copy. --A one-time copy operation enables data movement from a source to a destination only once. It applies to data of any size and any supported format. The scheduled copy allows you to copy data on a prescribed recurrence. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy. ---## Troubleshooting --This section explores common troubleshooting methods for Copy Wizard in Azure Data Factory. --> [!NOTE] -> These troubleshooting tips apply to copy wizard in version 1 of Data Factory. For Data Factory v2, see troubleshooting guide at [Troubleshoot Azure Data Factory Studio](../data-factory-ux-troubleshoot-guide.md). --### Error code: Unable to validate in Copy Wizard --- **Symptoms**: In the first step of Copy Wizard, you encounter the warning message of "Unable to Validate".-- **Causes**: This could happen when all third-party cookies are disabled.-- **Resolution**: - - Use Internet Explorer or Microsoft Edge browser. - - If you are using Chrome browser, follow instructions below to add Cookies exception for *microsoftonline.com* and *windows.net*. - 1. Open the Chrome browser. - 2. Click the wrench or three lines on the right (Customize and control Google Chrome). - 3. 
Click **Settings**. - 4. Search for **Cookies** or go to **Privacy** under Advanced Settings. - 5. Select **Content Settings**. - 6. Cookies should be set to **allow local data to be set (recommended)**. - 7. Click **Manage exceptions**. Under **hostname pattern**, enter the following, and make sure the behavior is set to **Allow**. - - login.microsoftonline.com - - login.windows.net - 8. Close the browser and relaunch. - - If you are using the Firefox browser, follow the instructions below to add a cookies exception. - 1. From the Firefox menu, go to **Tools** > **Options**. - 2. Under **Privacy** > **History**, you may see that the current setting is **Use Custom settings for history**. - 3. If **Accept third-party cookies** is set to **Never**, click **Exceptions** on the right to add the following sites. - - https://login.microsoftonline.com - - https://login.windows.net - 4. Close the browser and relaunch. ---### Error code: Unable to open login page and enter password --- **Symptoms**: The Copy Wizard redirects you to the login page, but the login page doesn't load successfully.-- **Causes**: This issue could happen if you changed your network environment from an office network to a home network, because browsers cache stale data. -- **Resolution**: - 1. Close the browser and try again. Go to the next step if the issue still exists. - 2. If you are using the Internet Explorer browser, try to open it in private mode (press "Ctrl" + "Shift" + "P"). If you are using the Chrome browser, try to open it in incognito mode (press "Ctrl" + "Shift" + "N"). Go to the next step if the issue still exists. - 3. Try to use another browser. ---## Next steps -For a quick walkthrough of using the Data Factory Copy Wizard to create a pipeline with Copy Activity, see [Tutorial: Create a pipeline using the Copy Wizard](data-factory-copy-data-wizard-tutorial.md). |
data-factory | Data Factory Azure Datalake Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-datalake-connector.md | - Title: Copy data to and from Azure Data Lake Storage Gen1 -description: Learn how to copy data to and from Data Lake Store by using Azure Data Factory ---- Previously updated : 04/12/2023-----# Copy data to and from Data Lake Storage Gen1 by using Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-datalake-connector.md) -> * [Version 2 (current version)](../connector-azure-data-lake-store.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Data Lake Storage Gen1 connector in V2](../connector-azure-data-lake-store.md). --This article explains how to use Copy Activity in Azure Data Factory to move data to and from Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store). It builds on the [Data movement activities](data-factory-data-movement-activities.md) article, an overview of data movement with Copy Activity. --## Supported scenarios -You can copy data **from Azure Data Lake Store** to the following data stores: ---You can copy data from the following data stores **to Azure Data Lake Store**: ---> [!NOTE] -> Create a Data Lake Store account before creating a pipeline with Copy Activity. For more information, see [Get started with Azure Data Lake Store](../../data-lake-store/data-lake-store-get-started-portal.md). --## Supported authentication types -The Data Lake Store connector supports these authentication types: -* Service principal authentication -* User credential (OAuth) authentication --We recommend that you use service principal authentication, especially for a scheduled data copy. Token expiration behavior can occur with user credential authentication. For configuration details, see the [Linked service properties](#linked-service-properties) section. --## Get started -You can create a pipeline with a copy activity that moves data to/from an Azure Data Lake Store by using different tools/APIs. --The easiest way to create a pipeline to copy data is to use the **Copy Wizard**. For a tutorial on creating a pipeline by using the Copy Wizard, see [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md). --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to an Azure Data Lake Store, you create two linked services to link your Azure storage account and Azure Data Lake store to your data factory. For linked service properties that are specific to Azure Data Lake Store, see [linked service properties](#linked-service-properties) section. -2. 
Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the folder and file path in the Data Lake store that holds the data copied from the blob storage. For dataset properties that are specific to Azure Data Lake Store, see [dataset properties](#dataset-properties) section. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and AzureDataLakeStoreSink as a sink for the copy activity. Similarly, if you are copying from Azure Data Lake Store to Azure Blob Storage, you use AzureDataLakeStoreSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Data Lake Store, see [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Data Lake Store, see [JSON examples](#json-examples-for-copying-data-to-and-from-data-lake-store) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Data Lake Store. --## Linked service properties -A linked service links a data store to a data factory. You create a linked service of type **AzureDataLakeStore** to link your Data Lake Store data to your data factory. The following table describes JSON elements specific to Data Lake Store linked services. You can choose between service principal and user credential authentication. --| Property | Description | Required | -|: |: |: | -| **type** | The type property must be set to **AzureDataLakeStore**. | Yes | -| **dataLakeStoreUri** | Information about the Azure Data Lake Store account. This information takes one of the following formats: `https://[accountname].azuredatalakestore.net/webhdfs/v1` or `adl://[accountname].azuredatalakestore.net/`. | Yes | -| **subscriptionId** | Azure subscription ID to which the Data Lake Store account belongs. | Required for sink | -| **resourceGroupName** | Azure resource group name to which the Data Lake Store account belongs. | Required for sink | --### Service principal authentication (recommended) -To use service principal authentication, register an application entity in Microsoft Entra ID and grant it the access to Data Lake Store. For detailed steps, see [Service-to-service authentication](../../data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory.md). Make note of the following values, which you use to define the linked service: -* Application ID -* Application key -* Tenant ID --> [!IMPORTANT] -> Make sure you grant the service principal proper permission in Azure Data Lake Store: ->- **To use Data Lake Store as source**, grant at least **Read + Execute** data access permission to list and copy the contents of a folder, or **Read** permission to copy a single file. No requirement on account level access control. 
->- **To use Data Lake Store as sink**, grant at least **Write + Execute** data access permission to create child items in the folder. And if you use Azure IR to empower copy (both source and sink are in cloud), in order to let Data Factory detect Data Lake Store's region, grant at least **Reader** role in account access control (IAM). If you want to avoid this IAM role, [specify executionLocation](data-factory-data-movement-activities.md#global) with the location of your Data Lake Store in copy activity. ->- If you **use Copy Wizard to author pipelines**, grant at least **Reader** role in account access control (IAM). Also, grant at least **Read + Execute** permission to your Data Lake Store root ("/") and its children. Otherwise you might see the message "The credentials provided are invalid." --Use service principal authentication by specifying the following properties: --| Property | Description | Required | -|: |: |: | -| **servicePrincipalId** | Specify the application's client ID. | Yes | -| **servicePrincipalKey** | Specify the application's key. | Yes | -| **tenant** | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes | --**Example: Service principal authentication** -```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>", - "subscriptionId": "<subscription of ADLS>", - "resourceGroupName": "<resource group of ADLS>" - } - } -} -``` --### User credential authentication -Alternatively, you can use user credential authentication to copy from or to Data Lake Store by specifying the following properties: --| Property | Description | Required | -|: |: |: | -| **authorization** | Click the **Authorize** button in the Data Factory Editor and enter your credential that assigns the autogenerated authorization URL to this property. | Yes | -| **sessionId** | OAuth session ID from the OAuth authorization session. Each session ID is unique and can be used only once. This setting is automatically generated when you use the Data Factory Editor. | Yes | --> [!IMPORTANT] -> Make sure you grant the user proper permission in Azure Data Lake Store: ->- **To use Data Lake Store as source**, grant at least **Read + Execute** data access permission to list and copy the contents of a folder, or **Read** permission to copy a single file. No requirement on account level access control. ->- **To use Data Lake Store as sink**, grant at least **Write + Execute** data access permission to create child items in the folder. And if you use Azure IR to empower copy (both source and sink are in cloud), in order to let Data Factory detect Data Lake Store's region, grant at least **Reader** role in account access control (IAM). If you want to avoid this IAM role, [specify executionLocation](data-factory-data-movement-activities.md#global) with the location of your Data Lake Store in copy activity. ->- If you **use Copy Wizard to author pipelines**, grant at least **Reader** role in account access control (IAM). Also, grant at least **Read + Execute** permission to your Data Lake Store root ("/") and its children. 
Otherwise you might see the message "The credentials provided are invalid." --**Example: User credential authentication** -```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "sessionId": "<session ID>", - "authorization": "<authorization URL>", - "subscriptionId": "<subscription of ADLS>", - "resourceGroupName": "<resource group of ADLS>" - } - } -} -``` --#### Token expiration -The authorization code that you generate by using the **Authorize** button expires after a certain amount of time. The following message means that the authentication token has expired: --Credential operation error: invalid_grant - AADSTS70002: Error validating credentials. AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1 Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21-09-31Z. --The following table shows the expiration times of different types of user accounts: --| User type | Expires after | -|: |: | -| User accounts *not* managed by Microsoft Entra ID (for example, @hotmail.com or @live.com) |12 hours | -| Users accounts managed by Microsoft Entra ID |14 days after the last slice run <br/><br/>90 days, if a slice based on an OAuth-based linked service runs at least once every 14 days | --If you change your password before the token expiration time, the token expires immediately. You will see the message mentioned earlier in this section. --You can reauthorize the account by using the **Authorize** button when the token expires to redeploy the linked service. You can also generate values for the **sessionId** and **authorization** properties programmatically by using the following code: ---```csharp -if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService || - linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService) -{ - AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type); -- WindowsFormsWebAuthenticationDialog authenticationDialog = new WindowsFormsWebAuthenticationDialog(null); - string authorization = authenticationDialog.AuthenticateAAD(authorizationSession.AuthorizationSession.Endpoint, new Uri("urn:ietf:wg:oauth:2.0:oob")); -- AzureDataLakeStoreLinkedService azureDataLakeStoreProperties = linkedService.Properties.TypeProperties as AzureDataLakeStoreLinkedService; - if (azureDataLakeStoreProperties != null) - { - azureDataLakeStoreProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeStoreProperties.Authorization = authorization; - } -- AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties = linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService; - if (azureDataLakeAnalyticsProperties != null) - { - azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeAnalyticsProperties.Authorization = authorization; - } -} -``` -For details about the Data Factory classes used in the code, see the [AzureDataLakeStoreLinkedService Class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestorelinkedservice), [AzureDataLakeAnalyticsLinkedService Class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakeanalyticslinkedservice), and 
[AuthorizationSessionGetResponse Class](/dotnet/api/microsoft.azure.management.datafactories.models.authorizationsessiongetresponse) topics. Add a reference to version `2.9.10826.1824` of `Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll` for the `WindowsFormsWebAuthenticationDialog` class used in the code. --## Troubleshooting tips --**Symptom:** When copying data **into** Azure Data Lake Store, your copy activity fails with the following error: -- ``` - Failed to detect the region for Azure Data Lake account {your account name}. Please make sure that the Resource Group name: {resource group name} and subscription ID: {subscription ID} of this Azure Data Lake Store resource are correct. - ``` --**Root cause:** There are two possible reasons: --1. The `resourceGroupName` and/or `subscriptionId` specified in the Azure Data Lake Store linked service is incorrect. -2. The user or the service principal doesn't have the needed permission. --**Resolution:** --1. Make sure the `subscriptionId` and `resourceGroupName` you specify in the linked service `typeProperties` are indeed the ones that your data lake account belongs to. --1. Grant, at a minimum, the **Reader** role to the user or service principal on the data lake account. -- For detailed steps, see [Assign Azure roles using the Azure portal](../../role-based-access-control/role-assignments-portal.md). --1. If you don't want to grant the **Reader** role to the user or service principal, an alternative is to [explicitly specify an execution location](data-factory-data-movement-activities.md#global) in the copy activity with the location of your Data Lake Store. Example: -- ```json - { - "name": "CopyToADLS", - "type": "Copy", - ...... - "typeProperties": { - "source": { - "type": "<source type>" - }, - "sink": { - "type": "AzureDataLakeStoreSink" - }, - "executionLocation": "West US" - } - } - ``` --## Dataset properties -To specify a dataset to represent input data in a Data Lake Store, you set the **type** property of the dataset to **AzureDataLakeStore**. Set the **linkedServiceName** property of the dataset to the name of the Data Lake Store linked service. For a full list of JSON sections and properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections of a dataset in JSON, such as **structure**, **availability**, and **policy**, are similar for all dataset types (Azure SQL database, Azure blob, and Azure table, for example). The **typeProperties** section is different for each type of dataset and provides information such as location and format of the data in the data store. --The **typeProperties** section for a dataset of type **AzureDataLakeStore** contains the following properties: --| Property | Description | Required | -|: |: |: | -| **folderPath** |Path to the container and folder in Data Lake Store. |Yes | -| **fileName** |Name of the file in Azure Data Lake Store. The **fileName** property is optional and case-sensitive. <br/><br/>If you specify **fileName**, the activity (including Copy) works on the specific file.<br/><br/>When **fileName** is not specified, Copy includes all files in **folderPath** in the input dataset.<br/><br/>When **fileName** is not specified for an output dataset and **preserveHierarchy** is not specified in activity sink, the name of the generated file is in the format `Data._Guid_.txt`. For example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt. |No | -| **partitionedBy** |The **partitionedBy** property is optional. 
You can use it to specify a dynamic path and file name for time-series data. For example, **folderPath** can be parameterized for every hour of data. For details and examples, see The partitionedBy property. |No | -| **format** | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, and **ParquetFormat**. Set the **type** property under **format** to one of these values. For more information, see the [Text format](data-factory-supported-file-and-compression-formats.md#text-format), [JSON format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro format](data-factory-supported-file-and-compression-formats.md#avro-format), [ORC format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections in the [File and compression formats supported by Azure Data Factory](data-factory-supported-file-and-compression-formats.md) article. <br><br> If you want to copy files "as-is" between file-based stores (binary copy), skip the `format` section in both input and output dataset definitions. |No | -| **compression** | Specify the type and level of compression for the data. Supported types are **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are **Optimal** and **Fastest**. For more information, see [File and compression formats supported by Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --### The partitionedBy property -You can specify dynamic **folderPath** and **fileName** properties for time-series data with the **partitionedBy** property, Data Factory functions, and system variables. For details, see the [Azure Data Factory - functions and system variables](data-factory-functions-variables.md) article. ---In the following example, `{Slice}` is replaced with the value of the Data Factory system variable `SliceStart` in the format specified (`yyyyMMddHH`). The name `SliceStart` refers to the start time of the slice. The `folderPath` property is different for each slice, as in `wikidatagateway/wikisampledataout/2014100103` or `wikidatagateway/wikisampledataout/2014100104`. --```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` --In the following example, the year, month, day, and time of `SliceStart` are extracted into separate variables that are used by the `folderPath` and `fileName` properties: -```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": -[ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` -For more details on time-series datasets, scheduling, and slices, see the [Datasets in Azure Data Factory](data-factory-create-datasets.md) and [Data Factory scheduling and execution](data-factory-scheduling-and-execution.md) articles. 
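To see how the dataset properties described above fit together, the following is a minimal sketch of an **AzureDataLakeStore** dataset that combines **folderPath**, **fileName**, **partitionedBy**, **format**, and **compression**. The dataset name, linked service name, folder path, and file name are placeholders for illustration only:

```JSON
{
    "name": "AzureDataLakeStoreSampleDataset",
    "properties": {
        "type": "AzureDataLakeStore",
        "linkedServiceName": "AzureDataLakeStoreLinkedService",
        "typeProperties": {
            "folderPath": "datalake/logs/{Year}/{Month}/{Day}",
            "fileName": "events.csv.gz",
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ],
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```

Because **fileName** is specified in this sketch, each daily slice works on that single compressed file under the slice-specific folder.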
---## Copy activity properties -For a full list of sections and properties available for defining activities, see the [Creating pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --The properties available in the **typeProperties** section of an activity vary with each activity type. For a copy activity, they vary depending on the types of sources and sinks. --**AzureDataLakeStoreSource** supports the following property in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| **recursive** |Indicates whether the data is read recursively from the subfolders or only from the specified folder. |True (default value), False |No | --**AzureDataLakeStoreSink** supports the following properties in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| **copyBehavior** |Specifies the copy behavior. |<b>PreserveHierarchy</b>: Preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><br/><b>FlattenHierarchy</b>: All files from the source folder are created in the first level of the target folder. The target files are created with autogenerated names.<br/><br/><b>MergeFiles</b>: Merges all files from the source folder to one file. If the file or blob name is specified, the merged file name is the specified name. Otherwise, the file name is autogenerated. |No | --### recursive and copyBehavior examples -This section describes the resulting behavior of the Copy operation for different combinations of recursive and copyBehavior values. --| recursive | copyBehavior | Resulting behavior | -| | | | -| true |preserveHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the same structure as the source<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5. | -| true |flattenHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/> auto-generated name for File3<br/> auto-generated name for File4<br/> auto-generated name for File5 | -| true |mergeFiles |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> File1 + File2 + File3 + File4 + File 5 contents are merged into one file with auto-generated file name | -| false |preserveHierarchy |For a source folder Folder1 with the following structure: <br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1<br/> File2<br/><br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. 
| -| false |flattenHierarchy |For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/><br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. | -| false |mergeFiles |For a source folder Folder1 with the following structure:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure<br/><br/>Folder1<br/> File1 + File2 contents are merged into one file with auto-generated file name. auto-generated name for File1<br/><br/>Subfolder1 with File3, File4, and File5 are not picked up. | --## Supported file and compression formats -For details, see the [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md) article. --## JSON examples for copying data to and from Data Lake Store -The following examples provide sample JSON definitions. You can use these sample definitions to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). The examples show how to copy data to and from Data Lake Store and Azure Blob storage. However, data can be copied _directly_ from any of the sources to any of the supported sinks. For more information, see the section "Supported data stores and formats" in the [Move data by using Copy Activity](data-factory-data-movement-activities.md) article. --### Example: Copy data from Azure Blob Storage to Azure Data Lake Store -The example code in this section shows: --* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* A linked service of type [AzureDataLakeStore](#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureDataLakeStore](#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with a copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and [AzureDataLakeStoreSink](#copy-activity-properties). --The examples show how time-series data from Azure Blob Storage is copied to Data Lake Store every hour. --**Azure Storage linked service** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**Azure Data Lake Store linked service** --```JSON -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>", - "subscriptionId": "<subscription of ADLS>", - "resourceGroupName": "<resource group of ADLS>" - } - } -} -``` --> [!NOTE] -> For configuration details, see the [Linked service properties](#linked-service-properties) section. 
-> --**Azure blob input dataset** --In the following example, data is picked up from a new blob every hour (`"frequency": "Hour", "interval": 1`). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, and day portion of the start time. The file name uses the hour portion of the start time. The `"external": true` setting informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Data Lake Store output dataset** --The following example copies data to Data Lake Store. New data is copied to Data Lake Store every hour. --```JSON -{ - "name": "AzureDataLakeStoreOutput", - "properties": { - "type": "AzureDataLakeStore", - "linkedServiceName": "AzureDataLakeStoreLinkedService", - "typeProperties": { - "folderPath": "datalake/output/" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with a blob source and a Data Lake Store sink** --In the following example, the pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the `source` type is set to `BlobSource`, and the `sink` type is set to `AzureDataLakeStoreSink`. --```json -{ - "name":"SamplePipeline", - "properties": - { - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities": - [ - { - "name": "AzureBlobtoDataLake", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureDataLakeStoreOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureDataLakeStoreSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --### Example: Copy data from Azure Data Lake Store to an Azure blob -The example code in this section shows: --* A linked service of type [AzureDataLakeStore](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [AzureDataLakeStore](#dataset-properties). 
-* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with a copy activity that uses [AzureDataLakeStoreSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The code copies time-series data from Data Lake Store to an Azure blob every hour. --**Azure Data Lake Store linked service** --```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>" - } - } -} -``` --> [!NOTE] -> For configuration details, see the [Linked service properties](#linked-service-properties) section. -> --**Azure Storage linked service** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Data Lake input dataset** --In this example, setting `"external"` to `true` informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "AzureDataLakeStoreInput", - "properties": - { - "type": "AzureDataLakeStore", - "linkedServiceName": "AzureDataLakeStoreLinkedService", - "typeProperties": { - "folderPath": "datalake/input/", - "fileName": "SearchLog.tsv", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**Azure blob output dataset** --In the following example, data is written to a new blob every hour (`"frequency": "Hour", "interval": 1`). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hours portion of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**A copy activity in a pipeline with an Azure Data Lake Store source and a blob sink** --In the following example, the pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. 
In the pipeline JSON definition, the `source` type is set to `AzureDataLakeStoreSource`, and the `sink` type is set to `BlobSink`. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureDakeLaketoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureDataLakeStoreInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "AzureDataLakeStoreSource", - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --In the copy activity definition, you can also map columns from the source dataset to columns in the sink dataset. For details, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and tuning -To learn about the factors that affect Copy Activity performance and how to optimize it, see the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md) article. |
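As noted above, the copy activity definition can also map source columns to sink columns. The following fragment is a sketch only (it is not part of the original samples) of how a **translator** of type **TabularTranslator** could be added to the copy activity's **typeProperties** for a Data Lake Store-to-blob copy; the column names in the mapping are hypothetical. See [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md) for the supported syntax.

```json
"typeProperties": {
    "source": {
        "type": "AzureDataLakeStoreSource"
    },
    "sink": {
        "type": "BlobSink"
    },
    "translator": {
        "type": "TabularTranslator",
        "ColumnMappings": "UserId: Id, UserName: Name"
    }
}
```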
data-factory | Data Factory Azure Documentdb Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-documentdb-connector.md | - Title: Move data to/from Azure Cosmos DB -description: Learn how move data to/from Azure Cosmos DB collection using Azure Data Factory ------ Previously updated : 04/12/2023---# Move data to and from Azure Cosmos DB using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-documentdb-connector.md) -> * [Version 2 (current version)](../connector-azure-cosmos-db.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Cosmos DB connector in V2](../connector-azure-cosmos-db.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Cosmos DB for NoSQL. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from any supported source data store to Azure Cosmos DB or from Azure Cosmos DB to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. --> [!IMPORTANT] -> The Azure Cosmos DB connector only supports Azure Cosmos DB for NoSQL. --To copy data as-is to/from JSON files or another Azure Cosmos DB collection, see [Import/Export JSON documents](#importexport-json-documents). --## Getting started -You can create a pipeline with a copy activity that moves data to/from Azure Cosmos DB by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Cosmos DB, see [JSON examples](#json-examples) section of this article. 
--The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Cosmos DB: --## Linked service properties -The following table provides description for JSON elements specific to Azure Cosmos DB linked service. --| **Property** | **Description** | **Required** | -| | | | -| type |The type property must be set to: **DocumentDb** |Yes | -| connectionString |Specify information needed to connect to Azure Cosmos DB database. |Yes | --Example: --```json -{ - "name": "CosmosDbLinkedService", - "properties": { - "type": "DocumentDb", - "typeProperties": { - "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>" - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets please refer to the [Creating datasets](data-factory-create-datasets.md) article. Sections like structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type **DocumentDbCollection** has the following properties. --| **Property** | **Description** | **Required** | -| | | | -| collectionName |Name of the Azure Cosmos DB document collection. |Yes | --Example: --```json -{ - "name": "PersonCosmosDbTable", - "properties": { - "type": "DocumentDbCollection", - "linkedServiceName": "CosmosDbLinkedService", - "typeProperties": { - "collectionName": "Person" - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` -### Schema by Data Factory -For schema-free data stores such as Azure Cosmos DB, the Data Factory service infers the schema in one of the following ways: --1. If you specify the structure of data by using the **structure** property in the dataset definition, the Data Factory service honors this structure as the schema. In this case, if a row does not contain a value for a column, a null value will be provided for it. -2. If you do not specify the structure of data by using the **structure** property in the dataset definition, the Data Factory service infers the schema by using the first row in the data. In this case, if the first row does not contain the full schema, some columns will be missing in the result of copy operation. --Therefore, for schema-free data sources, the best practice is to specify the structure of data using the **structure** property. --## Copy activity properties -For a full list of sections & properties available for defining activities please refer to the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --> [!NOTE] -> The Copy Activity takes only one input and produces only one output. --Properties available in the typeProperties section of the activity on the other hand vary with each activity type and in case of Copy activity they vary depending on the types of sources and sinks. --In case of Copy activity when source is of type **DocumentDbCollectionSource** -the following properties are available in **typeProperties** section: --| **Property** | **Description** | **Allowed values** | **Required** | -| | | | | -| query |Specify the query to read data. |Query string supported by Azure Cosmos DB. 
<br/><br/>Example: `SELECT c.BusinessEntityID, c.PersonType, c.NameStyle, c.Title, c.Name.First AS FirstName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"` |No <br/><br/>If not specified, the SQL statement that is executed is `select <columns defined in structure> from mycollection`. | -| nestingSeparator |Special character to indicate that the document is nested. |Any character. <br/><br/>Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables you to denote hierarchy via nestingSeparator, which is "." in the above examples. With the separator, the copy activity generates the "Name" object with three child elements First, Middle, and Last, according to "Name.First", "Name.Middle", and "Name.Last" in the table definition. |No | --**DocumentDbCollectionSink** supports the following properties: --| **Property** | **Description** | **Allowed values** | **Required** | -| | | | | -| nestingSeparator |A special character in the source column name to indicate that a nested document is needed. <br/><br/>For example, `Name.First` in the output table produces the following JSON structure in the Azure Cosmos DB document:<br/><br/>"Name": {<br/> "First": "John"<br/>}, |Character that is used to separate nesting levels.<br/><br/>Default value is `.` (dot). |No | -| writeBatchSize |Number of parallel requests to the Azure Cosmos DB service to create documents.<br/><br/>You can fine-tune the performance when copying data to/from Azure Cosmos DB by using this property. You can expect better performance when you increase writeBatchSize because more parallel requests are sent to Azure Cosmos DB. However, you need to avoid throttling, which can throw the error message "Request rate is large".<br/><br/>Throttling is determined by a number of factors, including the size of documents, the number of terms in documents, and the indexing policy of the target collection. For copy operations, you can use a higher-tier collection (for example, S3) to have the most throughput available (2,500 request units/second). |Integer |No (default: 5) | -| writeBatchTimeout |Wait time for the operation to complete before it times out. |timespan<br/><br/> Example: "00:30:00" (30 minutes). |No | --## Import/Export JSON documents -Using this Azure Cosmos DB connector, you can easily: --* Import JSON documents from various sources into Azure Cosmos DB, including Azure Blob storage, Azure Data Lake, an on-premises file system, or other file-based stores supported by Azure Data Factory. -* Export JSON documents from an Azure Cosmos DB collection into various file-based stores. -* Migrate data between two Azure Cosmos DB collections as-is. --To achieve such schema-agnostic copy: -* When using the Copy Wizard, check the **"Export as-is to JSON files or Azure Cosmos DB collection"** option. -* When using JSON editing, do not specify the "structure" section in the Azure Cosmos DB dataset(s) or the "nestingSeparator" property on the Azure Cosmos DB source/sink in the copy activity. To import from or export to JSON files, in the file store dataset specify the format type as "JsonFormat", configure "filePattern", and skip the rest of the format settings (a sketch follows this list). For details, see the [JSON format](data-factory-supported-file-and-compression-formats.md#json-format) section. 
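For example, here is a minimal sketch (the dataset name, linked service name, and folder path are placeholders) of a blob dataset that could be used for such an as-is import/export. Note that it has no "structure" section, and the format only sets "JsonFormat" and a "filePattern":

```json
{
    "name": "JsonDocumentsBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/jsonexport/",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            }
        },
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}
```

See the [JSON format](data-factory-supported-file-and-compression-formats.md#json-format) section linked above for the supported **filePattern** values.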
--## JSON examples -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from Azure Cosmos DB and Azure Blob Storage. However, data can be copied **directly** from any of the sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --## Example: Copy data from Azure Cosmos DB to Azure Blob -The sample below shows: --1. A linked service of type [DocumentDb](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [DocumentDbCollection](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [DocumentDbCollectionSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from Azure Cosmos DB to an Azure blob. The JSON properties used in these samples are described in sections following the samples. --**Azure Cosmos DB linked service:** --```json -{ - "name": "CosmosDbLinkedService", - "properties": { - "type": "DocumentDb", - "typeProperties": { - "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>" - } - } -} -``` -**Azure Blob storage linked service:** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Cosmos DB input dataset:** --The sample assumes you have a collection named **Person** in an Azure Cosmos DB database. --Setting `"external": true` (and optionally specifying an **externalData** policy) informs the Azure Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "PersonCosmosDbTable", - "properties": { - "type": "DocumentDbCollection", - "linkedServiceName": "CosmosDbLinkedService", - "typeProperties": { - "collectionName": "Person" - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --**Azure Blob output dataset:** --Data is copied to a blob in the **docdb** folder. The dataset is produced daily (`"frequency": "Day", "interval": 1`). --```json -{ - "name": "PersonBlobTableOut", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "docdb", - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "nullValue": "NULL" - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --Sample JSON document in the Person collection in an Azure Cosmos DB database: --```json -{ - "PersonId": 2, - "Name": { - "First": "Jane", - "Middle": "", - "Last": "Doe" - } -} -``` -Azure Cosmos DB supports querying documents using a SQL-like syntax over hierarchical JSON documents. 
--Example: --```sql -SELECT Person.PersonId, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person -``` --The following pipeline copies data from the Person collection in the Azure Cosmos DB database to an Azure blob. As part of the copy activity the input and output datasets have been specified. --```json -{ - "name": "DocDbToBlobPipeline", - "properties": { - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "DocumentDbCollectionSource", - "query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person", - "nestingSeparator": "." - }, - "sink": { - "type": "BlobSink", - "blobWriterAddHeader": true, - "writeBatchSize": 1000, - "writeBatchTimeout": "00:00:59" - } - }, - "inputs": [ - { - "name": "PersonCosmosDbTable" - } - ], - "outputs": [ - { - "name": "PersonBlobTableOut" - } - ], - "policy": { - "concurrency": 1 - }, - "name": "CopyFromDocDbToBlob" - } - ], - "start": "2015-04-01T00:00:00Z", - "end": "2015-04-02T00:00:00Z" - } -} -``` -## Example: Copy data from Azure Blob to Azure Cosmos DB -The sample below shows: --1. A linked service of type DocumentDb. -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type DocumentDbCollection. -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and DocumentDbCollectionSink. --The sample copies data from Azure blob to Azure Cosmos DB. The JSON properties used in these samples are described in sections following the samples. --**Azure Blob storage linked service:** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Cosmos DB linked service:** --```json -{ - "name": "CosmosDbLinkedService", - "properties": { - "type": "DocumentDb", - "typeProperties": { - "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>" - } - } -} -``` -**Azure Blob input dataset:** --```json -{ - "name": "PersonBlobTableIn", - "properties": { - "structure": [ - { - "name": "Id", - "type": "Int" - }, - { - "name": "FirstName", - "type": "String" - }, - { - "name": "MiddleName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "fileName": "input.csv", - "folderPath": "docdb", - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "nullValue": "NULL" - } - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` -**Azure Cosmos DB output dataset:** --The sample copies data to a collection named "Person". 
--```json -{ - "name": "PersonCosmosDbTableOut", - "properties": { - "structure": [ - { - "name": "Id", - "type": "Int" - }, - { - "name": "Name.First", - "type": "String" - }, - { - "name": "Name.Middle", - "type": "String" - }, - { - "name": "Name.Last", - "type": "String" - } - ], - "type": "DocumentDbCollection", - "linkedServiceName": "CosmosDbLinkedService", - "typeProperties": { - "collectionName": "Person" - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --The following pipeline copies data from Azure Blob storage to the Person collection in the Azure Cosmos DB instance. As part of the copy activity the input and output datasets have been specified. --```json -{ - "name": "BlobToDocDbPipeline", - "properties": { - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "DocumentDbCollectionSink", - "nestingSeparator": ".", - "writeBatchSize": 2, - "writeBatchTimeout": "00:00:00" - }, - "translator": { - "type": "TabularTranslator", - "ColumnMappings": "FirstName: Name.First, MiddleName: Name.Middle, LastName: Name.Last, BusinessEntityID: BusinessEntityID, PersonType: PersonType, NameStyle: NameStyle, Title: Title, Suffix: Suffix, EmailPromotion: EmailPromotion, rowguid: rowguid, ModifiedDate: ModifiedDate" - } - }, - "inputs": [ - { - "name": "PersonBlobTableIn" - } - ], - "outputs": [ - { - "name": "PersonCosmosDbTableOut" - } - ], - "policy": { - "concurrency": 1 - }, - "name": "CopyFromBlobToDocDb" - } - ], - "start": "2015-04-14T00:00:00Z", - "end": "2015-04-15T00:00:00Z" - } -} -``` -If the sample blob input is as --``` -1,John,,Doe -``` -Then the output JSON in Azure Cosmos DB will be: --```json -{ - "Id": 1, - "Name": { - "First": "John", - "Middle": null, - "Last": "Doe" - }, - "id": "a5e8595c-62ec-4554-a118-3940f4ff70b6" -} -``` -Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables user to denote hierarchy via **nestingSeparator**, which is "." in this example. With the separator, the copy activity will generate the "Name" object with three children elements First, Middle and Last, according to "Name.First", "Name.Middle" and "Name.Last" in the table definition. --## Appendix -1. **Question:** - Does the Copy Activity support update of existing records? -- **Answer:** - No. -2. **Question:** - How does a retry of a copy to Azure Cosmos DB deal with already copied records? -- **Answer:** - If records have an "ID" field and the copy operation tries to insert a record with the same ID, the copy operation throws an error. -3. **Question:** - Does Data Factory support [range or hash-based data partitioning](../../cosmos-db/partitioning-overview.md)? -- **Answer:** - No. -4. **Question:** - Can I specify more than one Azure Cosmos DB collection for a table? -- **Answer:** - No. Only one collection can be specified at this time. --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
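If you need more write throughput against Azure Cosmos DB, the main knob exposed by this connector is **writeBatchSize** on the sink, described earlier. The following fragment is an illustrative sketch only; the values are examples, and raising **writeBatchSize** too far can trigger the "Request rate is large" throttling error:

```json
"sink": {
    "type": "DocumentDbCollectionSink",
    "nestingSeparator": ".",
    "writeBatchSize": 10,
    "writeBatchTimeout": "00:05:00"
}
```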
data-factory | Data Factory Azure Ml Batch Execution Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-ml-batch-execution-activity.md | - Title: Create predictive data pipelines using Azure Data Factory -description: Describes how to create predictive pipelines using Azure Data Factory and Machine Learning Studio (classic) ------ Previously updated : 04/12/2023---# Create predictive pipelines using Machine Learning Studio (classic) and Azure Data Factory ---> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --## Introduction -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using machine learning in Data Factory](../transform-data-using-machine-learning.md). --### Machine Learning Studio (classic) -[ML Studio (classic)](../../machine-learning/index.yml) enables you to build, test, and deploy predictive analytics solutions. From a high-level point of view, it is done in three steps: --1. **Create a training experiment**. You do this step by using ML Studio (classic). Studio (classic) is a collaborative visual development environment that you use to train and test a predictive analytics model using training data. -2. **Convert it to a predictive experiment**. Once your model has been trained with existing data and you are ready to use it to score new data, you prepare and streamline your experiment for scoring. -3. **Deploy it as a web service**. You can publish your scoring experiment as an Azure web service. You can send data to your model via this web service endpoint and receive result predictions from the model. --### Azure Data Factory -Data Factory is a cloud-based data integration service that orchestrates and automates the **movement** and **transformation** of data. You can create data integration solutions using Azure Data Factory that can ingest data from various data stores, transform/process the data, and publish the result data to the data stores. --The Data Factory service allows you to create data pipelines that move and transform data, and then run the pipelines on a specified schedule (hourly, daily, weekly, etc.). It also provides rich visualizations to display the lineage and dependencies between your data pipelines, and to monitor all your data pipelines from a single unified view so that you can easily pinpoint issues and set up monitoring alerts. --See the [Introduction to Azure Data Factory](data-factory-introduction.md) and [Build your first pipeline](data-factory-build-your-first-pipeline.md) articles to quickly get started with the Azure Data Factory service. 
--### Data Factory and Machine Learning Studio (classic) together -Azure Data Factory enables you to easily create pipelines that use a published [ML Studio (classic)][azure-machine-learning] web service for predictive analytics. Using the **Batch Execution Activity** in an Azure Data Factory pipeline, you can invoke a Studio (classic) web service to make predictions on the data in batch. See the Invoking a web service using Batch Execution Activity section for details. --Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new input datasets. You can retrain a Studio (classic) model from a Data Factory pipeline by doing the following steps: --1. Publish the training experiment (not the predictive experiment) as a web service. You do this step in Studio (classic) as you did to expose the predictive experiment as a web service in the previous scenario. -2. Use the Studio (classic) Batch Execution Activity to invoke the web service for the training experiment. You can use the Studio (classic) Batch Execution activity to invoke both the training web service and the scoring web service. --After you are done with retraining, update the scoring web service (the predictive experiment exposed as a web service) with the newly trained model by using the **ML Studio (classic) Update Resource Activity**. See the [Updating models using Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) article for details. --## Invoking a web service using Batch Execution Activity -You use Azure Data Factory to orchestrate data movement and processing, and then perform batch execution using Studio (classic). Here are the top-level steps: --1. Create an ML Studio (classic) linked service. You need the following values: -- 1. **Request URI** for the Batch Execution API. You can find the Request URI by clicking the **BATCH EXECUTION** link in the web services page. - 2. **API key** for the published Studio (classic) web service. You can find the API key by clicking the web service that you have published. - 3. Use the **AzureMLBatchExecution** activity. -- :::image type="content" source="./media/data-factory-azure-ml-batch-execution-activity/AzureMLDashboard.png" alt-text="Machine Learning Studio (classic) Dashboard"::: -- :::image type="content" source="./media/data-factory-azure-ml-batch-execution-activity/batch-uri.png" alt-text="Batch URI"::: --### Scenario: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage -In this scenario, the Studio (classic) Web service makes predictions using data from a file in Azure blob storage and stores the prediction results in the blob storage. The following JSON defines a Data Factory pipeline with an AzureMLBatchExecution activity. The activity has the dataset **DecisionTreeInputBlob** as input and **DecisionTreeResultBlob** as the output. The **DecisionTreeInputBlob** is passed as an input to the web service by using the **webServiceInput** JSON property. The **DecisionTreeResultBlob** is passed as an output to the Web service by using the **webServiceOutputs** JSON property. --> [!IMPORTANT] -> If the web service takes multiple inputs, use the **webServiceInputs** property instead of using **webServiceInput**. See the [Web service requires multiple inputs](#web-service-requires-multiple-inputs) section for an example of using the webServiceInputs property. 
-> -> Datasets that are referenced by the **webServiceInput**/**webServiceInputs** and **webServiceOutputs** properties (in **typeProperties**) must also be included in the Activity **inputs** and **outputs**. -> -> In your Studio (classic) experiment, web service input and output ports and global parameters have default names ("input1", "input2") that you can customize. The names you use for webServiceInputs, webServiceOutputs, and globalParameters settings must exactly match the names in the experiments. You can view the sample request payload on the Batch Execution Help page for your Studio (classic) endpoint to verify the expected mapping. -> -> --```json -{ - "name": "PredictivePipeline", - "properties": { - "description": "use AzureML model", - "activities": [ - { - "name": "MLActivity", - "type": "AzureMLBatchExecution", - "description": "prediction analysis on batch input", - "inputs": [ - { - "name": "DecisionTreeInputBlob" - } - ], - "outputs": [ - { - "name": "DecisionTreeResultBlob" - } - ], - "linkedServiceName": "MyAzureMLLinkedService", - "typeProperties": - { - "webServiceInput": "DecisionTreeInputBlob", - "webServiceOutputs": { - "output1": "DecisionTreeResultBlob" - } - }, - "policy": { - "concurrency": 3, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - } - ], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } -} -``` --> [!NOTE] -> Only inputs and outputs of the AzureMLBatchExecution activity can be passed as parameters to the Web service. For example, in the above JSON snippet, DecisionTreeInputBlob is an input to the AzureMLBatchExecution activity, which is passed as an input to the Web service via webServiceInput parameter. -> -> --### Example -This example uses Azure Storage to hold both the input and output data. --We recommend that you go through the [Build your first pipeline with Data Factory][adf-build-1st-pipeline] tutorial before going through this example. Use the Data Factory Editor to create Data Factory artifacts (linked services, datasets, pipeline) in this example. --1. Create a **linked service** for your **Azure Storage**. If the input and output files are in different storage accounts, you need two linked services. Here is a JSON example: -- ```json - { - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName= [acctName];AccountKey=[acctKey]" - } - } - } - ``` --2. Create the **input** Azure Data Factory **dataset**. Unlike some other Data Factory datasets, these datasets must contain both **folderPath** and **fileName** values. You can use partitioning to cause each batch execution (each data slice) to process or produce unique input and output files. You may need to include some upstream activity to transform the input into the CSV file format and place it in the storage account for each slice. In that case, you would not include the **external** and **externalData** settings shown in the following example, and your DecisionTreeInputBlob would be the output dataset of a different Activity. 
-- ```json - { - "name": "DecisionTreeInputBlob", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "azuremltesting/input", - "fileName": "in.csv", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } - } - ``` -- Your input csv file must have the column header row. If you are using the **Copy Activity** to create/move the csv into the blob storage, you should set the sink property **blobWriterAddHeader** to **true**. For example: -- ```json - sink: - { - "type": "BlobSink", - "blobWriterAddHeader": true - } - ``` -- If the csv file does not have the header row, you may see the following error: **Error in Activity: Error reading string. Unexpected token: StartObject. Path '', line 1, position 1**. --3. Create the **output** Azure Data Factory **dataset**. This example uses partitioning to create a unique output path for each slice execution. Without the partitioning, the activity would overwrite the file. -- ```json - { - "name": "DecisionTreeResultBlob", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "azuremltesting/scored/{folderpart}/", - "fileName": "{filepart}result.csv", - "partitionedBy": [ - { - "name": "folderpart", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyyMMdd" - } - }, - { - "name": "filepart", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HHmmss" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Day", - "interval": 15 - } - } - } - ``` --4. Create a **linked service** of type: **AzureMLLinkedService**, providing the API key and model batch execution URL. -- ```json - { - "name": "MyAzureMLLinkedService", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://[batch execution endpoint]/jobs", - "apiKey": "[apikey]" - } - } - } - ``` --5. Finally, author a pipeline containing an **AzureMLBatchExecution** Activity. At runtime, pipeline performs the following steps: -- 1. Gets the location of the input file from your input datasets. - 2. Invokes the Studio (classic) batch execution API - 3. Copies the batch execution output to the blob given in your output dataset. -- > [!NOTE] - > AzureMLBatchExecution activity can have zero or more inputs and one or more outputs. 
- > - > -- ```json - { - "name": "PredictivePipeline", - "properties": { - "description": "use AzureML model", - "activities": [ - { - "name": "MLActivity", - "type": "AzureMLBatchExecution", - "description": "prediction analysis on batch input", - "inputs": [ - { - "name": "DecisionTreeInputBlob" - } - ], - "outputs": [ - { - "name": "DecisionTreeResultBlob" - } - ], - "linkedServiceName": "MyAzureMLLinkedService", - "typeProperties": - { - "webServiceInput": "DecisionTreeInputBlob", - "webServiceOutputs": { - "output1": "DecisionTreeResultBlob" - } - }, - "policy": { - "concurrency": 3, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - } - ], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } - } - ``` -- Both **start** and **end** datetime values must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601), such as `2014-10-14T16:32:41Z`. The **end** time is optional. If you do not specify value for the **end** property, it is calculated as "**start + 48 hours.**" To run the pipeline indefinitely, specify **9999-09-09** as the value for the **end** property. See [JSON Scripting Reference](/previous-versions/azure/dn835050(v=azure.100)) for details about JSON properties. -- > [!NOTE] - > Specifying input for the AzureMLBatchExecution activity is optional. --### Scenario: Experiments using Reader/Writer Modules to refer to data in various storages -Another common scenario when creating Studio (classic) experiments is to use Reader and Writer modules. The reader module is used to load data into an experiment and the writer module is to save data from your experiments. For details about reader and writer modules, see [Reader](/azure/machine-learning/studio-module-reference/import-data) and [Writer](/azure/machine-learning/studio-module-reference/export-data) topics on MSDN Library. --When using the reader and writer modules, it's good practice to use a Web service parameter for each property of these reader/writer modules. These web parameters enable you to configure the values during runtime. For example, you could create an experiment with a reader module that uses an Azure SQL Database instance: `XXX.database.windows.net`. After the web service has been deployed, you want to enable the consumers of the web service to specify another logical SQL Server instance called `YYY.database.windows.net`. You can use a Web service parameter to allow this value to be configured. --> [!NOTE] -> Web service input and output are different from Web service parameters. In the first scenario, you have seen how an input and output can be specified for a Studio (classic) Web service. In this scenario, you pass parameters for a Web service that correspond to properties of reader/writer modules. --Let's look at a scenario for using Web service parameters. You have a deployed Studio (classic) web service that uses a reader module to read data from one of the data sources supported by Studio (classic) (for example: Azure SQL Database). After the batch execution is performed, the results are written using a Writer module (Azure SQL Database). No web service inputs and outputs are defined in the experiments. In this case, we recommend that you configure relevant web service parameters for the reader and writer modules. This configuration allows the reader/writer modules to be configured when using the AzureMLBatchExecution activity. You specify Web service parameters in the **globalParameters** section in the activity JSON as follows. 
--```json -"typeProperties": { - "globalParameters": { - "Param 1": "Value 1", - "Param 2": "Value 2" - } -} -``` --You can also use [Data Factory Functions](data-factory-functions-variables.md) in passing values for the Web service parameters as shown in the following example: --```json -"typeProperties": { - "globalParameters": { - "Database query": "$$Text.Format('SELECT * FROM myTable WHERE timeColumn = \\'{0:yyyy-MM-dd HH:mm:ss}\\'', Time.AddHours(WindowStart, 0))" - } -} -``` --> [!NOTE] -> The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones exposed by the Web service. --### Using a Reader module to read data from multiple files in Azure Blob -Big data pipelines with activities such as Pig and Hive can produce one or more output files with no extensions. For example, when you specify an external Hive table, the data for the external Hive table can be stored in Azure blob storage with the following name 000000_0. You can use the reader module in an experiment to read multiple files, and use them for predictions. --When using the reader module in a Studio (classic) experiment, you can specify Azure Blob as an input. The files in the Azure blob storage can be the output files (Example: 000000_0) that are produced by a Pig and Hive script running on HDInsight. The reader module allows you to read files (with no extensions) by configuring the **Path to container, directory/blob**. The **Path to container** points to the container and **directory/blob** points to folder that contains the files as shown in the following image. The asterisk that is, \*) **specifies that all the files in the container/folder (that is, data/aggregateddata/year=2014/month-6/\*)** are read as part of the experiment. ---### Example -#### Pipeline with AzureMLBatchExecution activity with Web Service Parameters --```json -{ - "name": "MLWithSqlReaderSqlWriter", - "properties": { - "description": "ML Studio (classic) model with sql azure reader/writer", - "activities": [ - { - "name": "MLSqlReaderSqlWriterActivity", - "type": "AzureMLBatchExecution", - "description": "test", - "inputs": [ - { - "name": "MLSqlInput" - } - ], - "outputs": [ - { - "name": "MLSqlOutput" - } - ], - "linkedServiceName": "MLSqlReaderSqlWriterDecisionTreeModel", - "typeProperties": - { - "webServiceInput": "MLSqlInput", - "webServiceOutputs": { - "output1": "MLSqlOutput" - } - "globalParameters": { - "Database server name": "<myserver>.database.windows.net", - "Database name": "<database>", - "Server user account name": "<user name>", - "Server user account password": "<password>" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - }, - } - ], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } -} -``` --In the above JSON example: --* The deployed Studio (classic) Web service uses a reader and a writer module to read/write data from/to an Azure SQL Database. This Web service exposes the following four parameters: Database server name, Database name, Server user account name, and Server user account password. -* Both **start** and **end** datetimes must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2014-10-14T16:32:41Z. The **end** time is optional. If you do not specify value for the **end** property, it is calculated as "**start + 48 hours.**" To run the pipeline indefinitely, specify **9999-09-09** as the value for the **end** property. 
See [JSON Scripting Reference](/previous-versions/azure/dn835050(v=azure.100)) for details about JSON properties. --### Other scenarios -#### Web service requires multiple inputs -If the web service takes multiple inputs, use the **webServiceInputs** property instead of using **webServiceInput**. Datasets that are referenced by the **webServiceInputs** must also be included in the Activity **inputs**. --In your ML Studio (classic) experiment, web service input and output ports and global parameters have default names ("input1", "input2") that you can customize. The names you use for webServiceInputs, webServiceOutputs, and globalParameters settings must exactly match the names in the experiments. You can view the sample request payload on the Batch Execution Help page for your Studio (classic) endpoint to verify the expected mapping. --```json -{ - "name": "PredictivePipeline", - "properties": { - "description": "use AzureML model", - "activities": [{ - "name": "MLActivity", - "type": "AzureMLBatchExecution", - "description": "prediction analysis on batch input", - "inputs": [{ - "name": "inputDataset1" - }, { - "name": "inputDataset2" - }], - "outputs": [{ - "name": "outputDataset" - }], - "linkedServiceName": "MyAzureMLLinkedService", - "typeProperties": { - "webServiceInputs": { - "input1": "inputDataset1", - "input2": "inputDataset2" - }, - "webServiceOutputs": { - "output1": "outputDataset" - } - }, - "policy": { - "concurrency": 3, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - }], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } -} -``` --#### Web Service does not require an input -ML Studio (classic) batch execution web services can be used to run any workflows, for example R or Python scripts, that may not require any inputs. Or, the experiment might be configured with a Reader module that does not expose any GlobalParameters. In that case, the AzureMLBatchExecution Activity would be configured as follows: --```json -{ - "name": "scoring service", - "type": "AzureMLBatchExecution", - "outputs": [ - { - "name": "myBlob" - } - ], - "typeProperties": { - "webServiceOutputs": { - "output1": "myBlob" - } - }, - "linkedServiceName": "mlEndpoint", - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } -}, -``` --#### Web Service does not require an input/output -The ML Studio (classic) batch execution web service might not have any Web Service output configured. In this example, there is no Web Service input or output, nor are any GlobalParameters configured. There is still an output configured on the activity itself, but it is not given as a webServiceOutput. --```json -{ - "name": "retraining", - "type": "AzureMLBatchExecution", - "outputs": [ - { - "name": "placeholderOutputDataset" - } - ], - "typeProperties": { - }, - "linkedServiceName": "mlEndpoint", - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } -}, -``` --#### Web Service uses readers and writers, and the activity runs only when other activities have succeeded -The ML Studio (classic) web service reader and writer modules might be configured to run with or without any GlobalParameters. However, you may want to embed service calls in a pipeline that uses dataset dependencies to invoke the service only when some upstream processing has completed. 
You can also trigger some other action after the batch execution has completed using this approach. In that case, you can express the dependencies using activity inputs and outputs, without naming any of them as Web Service inputs or outputs. --```json -{ - "name": "retraining", - "type": "AzureMLBatchExecution", - "inputs": [ - { - "name": "upstreamData1" - }, - { - "name": "upstreamData2" - } - ], - "outputs": [ - { - "name": "downstreamData" - } - ], - "typeProperties": { - }, - "linkedServiceName": "mlEndpoint", - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } -}, -``` --The **takeaways** are: --* If your experiment endpoint uses a webServiceInput: it is represented by a blob dataset and is included in the activity inputs and the webServiceInput property. Otherwise, the webServiceInput property is omitted. -* If your experiment endpoint uses webServiceOutput(s): they are represented by blob datasets and are included in the activity outputs and in the webServiceOutputs property. The activity outputs and webServiceOutputs are mapped by the name of each output in the experiment. Otherwise, the webServiceOutputs property is omitted. -* If your experiment endpoint exposes globalParameter(s), they are given in the activity globalParameters property as key, value pairs. Otherwise, the globalParameters property is omitted. The keys are case-sensitive. [Azure Data Factory functions](data-factory-functions-variables.md) may be used in the values. -* Additional datasets may be included in the Activity inputs and outputs properties, without being referenced in the Activity typeProperties. These datasets govern execution using slice dependencies but are otherwise ignored by the AzureMLBatchExecution Activity. ---## Updating models using Update Resource Activity -After you are done with retraining, update the scoring web service (predictive experiment exposed as a web service) with the newly trained model by using the **ML Studio (classic) Update Resource Activity**. See [Updating models using Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) article for details. --### Reader and Writer Modules -A common scenario for using Web service parameters is the use of Azure SQL Readers and Writers. The reader module is used to load data into an experiment from data management services outside Studio (classic). The writer module is to save data from your experiments into data management services outside Studio (classic). --For details about Azure Blob/Azure SQL reader/writer, see [Reader](/azure/machine-learning/studio-module-reference/import-data) and [Writer](/azure/machine-learning/studio-module-reference/export-data) topics on MSDN Library. The example in the previous section used the Azure Blob reader and Azure Blob writer. This section discusses using Azure SQL reader and Azure SQL writer. --## Frequently asked questions -**Q:** I have multiple files that are generated by my big data pipelines. Can I use the AzureMLBatchExecution Activity to work on all the files? --**A:** Yes. See the **Using a Reader module to read data from multiple files in Azure Blob** section for details. --## ML Studio (classic) Batch Scoring Activity -If you are using the **AzureMLBatchScoring** activity to integrate with ML Studio (classic), we recommend that you use the latest **AzureMLBatchExecution** activity. --The AzureMLBatchExecution activity is introduced in the August 2015 release of Azure SDK and Azure PowerShell. 
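If you are migrating an existing **AzureMLBatchScoring** activity, the change is typically limited to the activity **type** and the **typeProperties** section. The following sketch shows a possible **AzureMLBatchExecution** equivalent of the batch scoring example that follows; it reuses that example's dataset and linked service names and assumes the experiment's output port has the default name `output1`, so adjust the names to match your experiment:

```json
{
    "name": "MLActivity",
    "type": "AzureMLBatchExecution",
    "description": "prediction analysis on batch input",
    "inputs": [
        {
            "name": "ScoringInputBlob"
        }
    ],
    "outputs": [
        {
            "name": "ScoringResultBlob"
        }
    ],
    "linkedServiceName": "MyAzureMLLinkedService",
    "typeProperties": {
        "webServiceInput": "ScoringInputBlob",
        "webServiceOutputs": {
            "output1": "ScoringResultBlob"
        }
    },
    "policy": {
        "concurrency": 3,
        "executionPriorityOrder": "NewestFirst",
        "retry": 1,
        "timeout": "02:00:00"
    }
}
```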
--If you want to continue using the AzureMLBatchScoring activity, continue reading through this section. --### ML Studio (classic) Batch Scoring activity using Azure Storage for input/output --```json -{ - "name": "PredictivePipeline", - "properties": { - "description": "use AzureML model", - "activities": [ - { - "name": "MLActivity", - "type": "AzureMLBatchScoring", - "description": "prediction analysis on batch input", - "inputs": [ - { - "name": "ScoringInputBlob" - } - ], - "outputs": [ - { - "name": "ScoringResultBlob" - } - ], - "linkedServiceName": "MyAzureMLLinkedService", - "policy": { - "concurrency": 3, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - } - ], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } -} -``` --### Web Service Parameters -To specify values for Web service parameters, add a **typeProperties** section to the **AzureMLBatchScoringActivity** section in the pipeline JSON as shown in the following example: --```json -"typeProperties": { - "webServiceParameters": { - "Param 1": "Value 1", - "Param 2": "Value 2" - } -} -``` --You can also use [Data Factory Functions](data-factory-functions-variables.md) in passing values for the Web service parameters as shown in the following example: --```json -"typeProperties": { - "webServiceParameters": { - "Database query": "$$Text.Format('SELECT * FROM myTable WHERE timeColumn = \\'{0:yyyy-MM-dd HH:mm:ss}\\'', Time.AddHours(WindowStart, 0))" - } -} -``` --> [!NOTE] -> The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones exposed by the Web service. --## See also -* [Azure blog post: Getting started with Azure Data Factory and ML Studio (classic)](https://azure.microsoft.com/blog/getting-started-with-azure-data-factory-and-azure-machine-learning-4/) --[adf-build-1st-pipeline]: data-factory-build-your-first-pipeline.md --[azure-machine-learning]: https://azure.microsoft.com/services/machine-learning/ |
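Tying together the FAQ and the Web service parameter scenarios above: if the experiment exposes the reader module's **Path to container, directory/blob** property as a Web service parameter, the wildcard path can be supplied at run time through the activity's **globalParameters** (or **webServiceParameters** for the older AzureMLBatchScoring activity). This is a sketch only; the parameter name shown is an assumption and must match the name your experiment actually exposes:

```json
"typeProperties": {
    "globalParameters": {
        "Path to container, directory/blob": "data/aggregateddata/year=2014/month-6/*"
    }
}
```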
data-factory | Data Factory Azure Ml Update Resource Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-ml-update-resource-activity.md | - Title: Update Machine Learning models using Azure Data Factory -description: Describes how to create predictive pipelines using Azure Data Factory v1 and ML Studio (classic) ------- Previously updated : 04/12/2023---# Updating ML Studio (classic) models using Update Resource Activity ---> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) ---> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [update machine learning models in Data Factory](../update-machine-learning-models.md). --This article complements the main Azure Data Factory - ML Studio (classic) integration article: [Create predictive pipelines using ML Studio (classic) and Azure Data Factory](data-factory-azure-ml-batch-execution-activity.md). If you haven't already done so, review the main article before reading through this article. --## Overview -Over time, the predictive models in the ML Studio (classic) scoring experiments need to be retrained using new input datasets. After you are done with retraining, you want to update the scoring web service with the retrained ML model. The typical steps to enable retraining and updating Studio (classic) models via web services are: --1. Create an experiment in [ML Studio (classic)](https://studio.azureml.net). -2. When you are satisfied with the model, use ML Studio (classic) to publish web services for both the **training experiment** and scoring/**predictive experiment**. --The following table describes the web services used in this example. See [Retrain ML Studio (classic) models programmatically](../../machine-learning/classic/retrain-machine-learning-model.md) for details. --- **Training web service** - Receives training data and produces trained models. The output of the retraining is an .ilearner file in an Azure Blob storage. The **default endpoint** is automatically created for you when you publish the training experiment as a web service. You can create more endpoints but the example uses only the default endpoint.-- **Scoring web service** - Receives unlabeled data examples and makes predictions. The output of prediction could have various forms, such as a .csv file or rows in Azure SQL Database, depending on the configuration of the experiment. The default endpoint is automatically created for you when you publish the predictive experiment as a web service. --The following picture depicts the relationship between training and scoring endpoints in ML Studio (classic). ---You can invoke the **training web service** by using the **ML Studio (classic) Batch Execution Activity**. 
Invoking a training web service is same as invoking an ML Studio (classic) web service (scoring web service) for scoring data. The preceding sections cover how to invoke an ML Studio (classic) web service from an Azure Data Factory pipeline in detail. --You can invoke the **scoring web service** by using the **ML Studio (classic) Update Resource Activity** to update the web service with the newly trained model. The following examples provide linked service definitions: --## Scoring web service is a classic web service -If the scoring web service is a **classic web service**, create the second **non-default and updatable endpoint** by using the Azure portal. See [Create Endpoints](../../machine-learning/classic/create-endpoint.md) article for steps. After you create the non-default updatable endpoint, do the following steps: --* Click **BATCH EXECUTION** to get the URI value for the **mlEndpoint** JSON property. -* Click **UPDATE RESOURCE** link to get the URI value for the **updateResourceEndpoint** JSON property. The API key is on the endpoint page itself (in the bottom-right corner). ---The following example provides a sample JSON definition for the AzureML linked service. The linked service uses the apiKey for authentication. --```json -{ - "name": "updatableScoringEndpoint2", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--scoring experiment--/jobs", - "apiKey": "endpoint2Key", - "updateResourceEndpoint": "https://management.azureml.net/workspaces/xxx/webservices/--scoring experiment--/endpoints/endpoint2" - } - } -} -``` --## Scoring web service is Azure Resource Manager web service -If the web service is the new type of web service that exposes an Azure Resource Manager endpoint, you do not need to add the second **non-default** endpoint. The **updateResourceEndpoint** in the linked service is of the format: --``` -https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview. -``` --You can get values for place holders in the URL when querying the web service on the [ML Studio (classic) Web Services Portal](https://services.azureml.net/). The new type of update resource endpoint requires a Microsoft Entra token. Specify **servicePrincipalId** and **servicePrincipalKey** in the Studio (classic) linked service. See [how to create service principal and assign permissions to manage Azure resource](../../active-directory/develop/howto-create-service-principal-portal.md). Here is a sample AzureML linked service definition: --```json -{ - "name": "AzureMLLinkedService", - "properties": { - "type": "AzureML", - "description": "The linked service for AML web service.", - "typeProperties": { - "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/0000000000000000000000000000000000000/services/0000000000000000000000000000000000000/jobs?api-version=2.0", - "apiKey": "xxxxxxxxxxxx", - "updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myRG/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview", - "servicePrincipalId": "000000000-0000-0000-0000-0000000000000", - "servicePrincipalKey": "xxxxx", - "tenant": "mycompany.com" - } - } -} -``` --The following scenario provides more details. 
It has an example for retraining and updating Studio (classic) models from an Azure Data Factory pipeline. --## Scenario: retraining and updating a Studio (classic) model -This section provides a sample pipeline that uses the **ML Studio (classic) Batch Execution activity** to retrain a model. The pipeline also uses the **ML Studio (classic) Update Resource activity** to update the model in the scoring web service. The section also provides JSON snippets for all the linked services, datasets, and pipeline in the example. --Here is the diagram view of the sample pipeline. As you can see, the Studio (classic) Batch Execution Activity takes the training input and produces a training output (iLearner file). The Studio (classic) Update Resource Activity takes this training output and updates the model in the scoring web service endpoint. The Update Resource Activity does not produce any output. The placeholderBlob is just a dummy output dataset that is required by the Azure Data Factory service to run the pipeline. ---### Azure Blob storage linked service: -The Azure Storage holds the following data: --* training data. The input data for the Studio (classic) training web service. -* iLearner file. The output from the Studio (classic) training web service. This file is also the input to the Update Resource activity. --Here is the sample JSON definition of the linked service: --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=name;AccountKey=key" - } - } -} -``` --### Training input dataset: -The following dataset represents the input training data for the Studio (classic) training web service. The Studio (classic) Batch Execution activity takes this dataset as an input. --```JSON -{ - "name": "trainingData", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "labeledexamples", - "fileName": "labeledexamples.arff", - "format": { - "type": "TextFormat" - } - }, - "availability": { - "frequency": "Week", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --### Training output dataset: -The following dataset represents the output iLearner file from the ML Studio (classic) training web service. The ML Studio (classic) Batch Execution Activity produces this dataset. This dataset is also the input to the ML Studio (classic) Update Resource activity. --```JSON -{ - "name": "trainedModelBlob", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "trainingoutput", - "fileName": "model.ilearner", - "format": { - "type": "TextFormat" - } - }, - "availability": { - "frequency": "Week", - "interval": 1 - } - } -} -``` --### Linked service for Studio (classic) training endpoint -The following JSON snippet defines a Studio (classic) linked service that points to the default endpoint of the training web service. --```JSON -{ - "name": "trainingEndpoint", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--training experiment--/jobs", - "apiKey": "myKey" - } - } -} -``` --In **ML Studio (classic)**, do the following to get values for **mlEndpoint** and **apiKey**: --1. Click **WEB SERVICES** on the left menu. -2. 
Click the **training web service** in the list of web services. -3. Click copy next to **API key** text box. Paste the key in the clipboard into the Data Factory JSON editor. -4. In the **ML Studio (classic)**, click **BATCH EXECUTION** link. -5. Copy the **Request URI** from the **Request** section and paste it into the Data Factory JSON editor. --### Linked Service for Studio (classic) updatable scoring endpoint: -The following JSON snippet defines a Studio (classic) linked service that points to the non-default updatable endpoint of the scoring web service. --```JSON -{ - "name": "updatableScoringEndpoint2", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/00000000026734a5889e02fbb1f65cefd/jobs?api-version=2.0", - "apiKey": "sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==", - "updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/Default-MachineLearning-SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview", - "servicePrincipalId": "fe200044-c008-4008-a005-94000000731", - "servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=", - "tenant": "mycompany.com" - } - } -} -``` --### Placeholder output dataset: -The Studio (classic) Update Resource activity does not generate any output. However, Azure Data Factory requires an output dataset to drive the schedule of a pipeline. Therefore, we use a dummy/placeholder dataset in this example. --```JSON -{ - "name": "placeholderBlob", - "properties": { - "availability": { - "frequency": "Week", - "interval": 1 - }, - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "any", - "format": { - "type": "TextFormat" - } - } - } -} -``` --### Pipeline -The pipeline has two activities: **AzureMLBatchExecution** and **AzureMLUpdateResource**. The ML Studio (classic) Batch Execution activity takes the training data as input and produces an iLearner file as an output. The activity invokes the training web service (training experiment exposed as a web service) with the input training data and receives the ilearner file from the webservice. The placeholderBlob is just a dummy output dataset that is required by the Azure Data Factory service to run the pipeline. 
---```JSON -{ - "name": "pipeline", - "properties": { - "activities": [ - { - "name": "retraining", - "type": "AzureMLBatchExecution", - "inputs": [ - { - "name": "trainingData" - } - ], - "outputs": [ - { - "name": "trainedModelBlob" - } - ], - "typeProperties": { - "webServiceInput": "trainingData", - "webServiceOutputs": { - "output1": "trainedModelBlob" - } - }, - "linkedServiceName": "trainingEndpoint", - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - }, - { - "type": "AzureMLUpdateResource", - "typeProperties": { - "trainedModelName": "Training Exp for ADF ML [trained model]", - "trainedModelDatasetName" : "trainedModelBlob" - }, - "inputs": [ - { - "name": "trainedModelBlob" - } - ], - "outputs": [ - { - "name": "placeholderBlob" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "name": "AzureML Update Resource", - "linkedServiceName": "updatableScoringEndpoint2" - } - ], - "start": "2016-02-13T00:00:00Z", - "end": "2016-02-14T00:00:00Z" - } -} -``` |
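Viewed on its own, the Update Resource activity from the preceding pipeline reduces to the sketch below. The activity **inputs** and the **trainedModelDatasetName** both reference the dataset that holds the retrained .ilearner file, **trainedModelName** identifies the trained model being updated in the scoring web service, and the only output is the scheduling placeholder; the names reuse the example above:

```JSON
{
    "name": "AzureML Update Resource",
    "type": "AzureMLUpdateResource",
    "inputs": [
        {
            "name": "trainedModelBlob"
        }
    ],
    "outputs": [
        {
            "name": "placeholderBlob"
        }
    ],
    "typeProperties": {
        "trainedModelName": "Training Exp for ADF ML [trained model]",
        "trainedModelDatasetName": "trainedModelBlob"
    },
    "linkedServiceName": "updatableScoringEndpoint2",
    "policy": {
        "concurrency": 1,
        "retry": 3,
        "timeout": "01:00:00"
    }
}
```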
data-factory | Data Factory Azure Search Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-search-connector.md | - Title: Push data to Search index by using Data Factory -description: 'Learn about how to push data to Azure Cognitive Search Index by using Azure Data Factory.' ---- Previously updated : 04/12/2023-----# Push data to an Azure Cognitive Search index by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-search-connector.md) -> * [Version 2 (current version)](../connector-azure-search.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Cognitive Search connector in V2](../connector-azure-search.md). --This article describes how to use the Copy Activity to push data from a supported source data store to an Azure Cognitive Search index. Supported source data stores are listed in the Source column of the [supported sources and sinks](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. This article builds on the [data movement activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with Copy Activity and supported data store combinations. --## Enabling connectivity -To allow Data Factory service connect to an on-premises data store, you install Data Management Gateway in your on-premises environment. You can install gateway on the same machine that hosts the source data store or on a separate machine to avoid competing for resources with the data store. --Data Management Gateway connects on-premises data sources to cloud services in a secure and managed way. See [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for details about Data Management Gateway. --## Getting started -You can create a pipeline with a copy activity that pushes data from a source data store to a search index by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. 
For a sample with JSON definitions for Data Factory entities that are used to copy data to search index, see [JSON example: Copy data from SQL Server to an Azure Cognitive Search index](#json-example-copy-data-from-sql-server-to-azure-cognitive-search-index) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a search Index: --## Linked service properties --The following table provides descriptions for JSON elements that are specific to the Azure Cognitive Search linked service. --| Property | Description | Required | -| -- | -- | -- | -| type | The type property must be set to: **AzureSearch**. | Yes | -| url | URL for the search service. | Yes | -| key | Admin key for the search service. | Yes | --## Dataset properties --For a full list of sections and properties that are available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. The **typeProperties** section is different for each type of dataset. The typeProperties section for a dataset of the type **AzureSearchIndex** has the following properties: --| Property | Description | Required | -| -- | -- | -- | -| type | The type property must be set to **AzureSearchIndex**.| Yes | -| indexName | Name of the search index. Data Factory does not create the index. The index must exist in Azure Cognitive Search. | Yes | ---## Copy activity properties -For a full list of sections and properties that are available for defining activities, see the [Creating pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and various policies are available for all types of activities. Whereas, properties available in the typeProperties section vary with each activity type. For Copy Activity, they vary depending on the types of sources and sinks. --For Copy Activity, when the sink is of the type **AzureSearchIndexSink**, the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| -- | -- | -- | -- | -| WriteBehavior | Specifies whether to merge or replace when a document already exists in the index. See the [WriteBehavior property](#writebehavior-property).| Merge (default)<br/>Upload| No | -| WriteBatchSize | Uploads data into the search index when the buffer size reaches writeBatchSize. See the [WriteBatchSize property](#writebatchsize-property) for details. | 1 to 1,000. Default value is 1000. | No | --### WriteBehavior property -AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key already exists in the search index, Azure Cognitive Search updates the existing document rather than throwing a conflict exception. --The AzureSearchSink provides the following two upsert behaviors (by using AzureSearch SDK): --- **Merge**: combine all the columns in the new document with the existing one. For columns with null value in the new document, the value in the existing one is preserved.-- **Upload**: The new document replaces the existing one. For columns not specified in the new document, the value is set to null whether there is a non-null value in the existing document or not.--The default behavior is **Merge**. --### WriteBatchSize Property -Azure Cognitive Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions. 
An action handles one document to perform the upload/merge operation. --### Data type support -The following table specifies whether an Azure Cognitive Search data type is supported or not. --| Azure Cognitive Search data type | Supported in Azure Cognitive Search Sink | -| - | | -| String | Y | -| Int32 | Y | -| Int64 | Y | -| Double | Y | -| Boolean | Y | -| DataTimeOffset | Y | -| String Array | N | -| GeographyPoint | N | --## JSON example: Copy data from SQL Server to Azure Cognitive Search index --The following sample shows: --1. A linked service of type [AzureSearch](#linked-service-properties). -2. A linked service of type [OnPremisesSqlServer](data-factory-sqlserver-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [SqlServerTable](data-factory-sqlserver-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureSearchIndex](#dataset-properties). -4. A [pipeline](data-factory-create-pipelines.md) with a Copy activity that uses [SqlSource](data-factory-sqlserver-connector.md#copy-activity-properties) and [AzureSearchIndexSink](#copy-activity-properties). --The sample copies time-series data from a SQL Server database to search index hourly. The JSON properties used in this sample are described in sections following the samples. --As a first step, setup the data management gateway on your on-premises machine. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**Azure Cognitive Search linked service:** --```JSON -{ - "name": "AzureSearchLinkedService", - "properties": { - "type": "AzureSearch", - "typeProperties": { - "url": "https://<service>.search.windows.net", - "key": "<AdminKey>" - } - } -} -``` --**SQL Server linked service** --```JSON -{ - "Name": "SqlServerLinkedService", - "properties": { - "type": "OnPremisesSqlServer", - "typeProperties": { - "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;", - "gatewayName": "<gatewayname>" - } - } -} -``` --**SQL Server input dataset** --The sample assumes you have created a table "MyTable" in SQL Server and it contains a column called "timestampcolumn" for time series data. You can query over multiple tables within the same database using a single dataset, but a single table must be used for the dataset's tableName typeProperty. --Setting "external": "true" informs Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "SqlServerDataset", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlServerLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Cognitive Search output dataset:** --The sample copies data to an Azure Cognitive Search index named **products**. Data Factory does not create the index. To test the sample, create an index with this name. Create the search index with the same number of columns as in the input dataset. New entries are added to the search index every hour. 
--```JSON -{ - "name": "AzureSearchIndexDataset", - "properties": { - "type": "AzureSearchIndex", - "linkedServiceName": "AzureSearchLinkedService", - "typeProperties" : { - "indexName": "products", - }, - "availability": { - "frequency": "Minute", - "interval": 15 - } - } -} -``` --**Copy activity in a pipeline with SQL source and Azure Cognitive Search Index sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlSource** and **sink** type is set to **AzureSearchIndexSink**. The SQL query specified for the **SqlReaderQuery** property selects the data in the past hour to copy. --```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "SqlServertoAzureSearchIndex", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": " SqlServerInput" - } - ], - "outputs": [ - { - "name": "AzureSearchIndexDataset" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "AzureSearchIndexSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --If you are copying data from a cloud data store into Azure Cognitive Search, `executionLocation` property is required. The following JSON snippet shows the change needed under Copy Activity `typeProperties` as an example. Check [Copy data between cloud data stores](data-factory-data-movement-activities.md#global) section for supported values and more details. --```JSON -"typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureSearchIndexSink" - }, - "executionLocation": "West US" -} -``` ---## Copy from a cloud source -If you are copying data from a cloud data store into Azure Cognitive Search, `executionLocation` property is required. The following JSON snippet shows the change needed under Copy Activity `typeProperties` as an example. Check [Copy data between cloud data stores](data-factory-data-movement-activities.md#global) section for supported values and more details. --```JSON -"typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureSearchIndexSink" - }, - "executionLocation": "West US" -} -``` --You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and tuning -See the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) and various ways to optimize it. --## Next steps -See the following articles: --* [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions for creating a pipeline with a Copy Activity. |
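The copy activity in this article relies on the sink defaults described earlier (**Merge** behavior and a batch size of 1,000). A sink that sets both optional properties explicitly might look like the following sketch; the camelCase property names follow the convention used in the other JSON snippets in this article, so treat this as an illustration rather than a verified definition:

```JSON
"sink": {
    "type": "AzureSearchIndexSink",
    "writeBehavior": "Upload",
    "writeBatchSize": 500
}
```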
data-factory | Data Factory Azure Sql Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-sql-connector.md | - Title: Copy data to/from Azure SQL Database -description: Learn how to copy data to/from Azure SQL Database using Azure Data Factory. ---- Previously updated : 04/12/2023----# Copy data to and from Azure SQL Database using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-sql-connector.md) -> * [Version 2 (current version)](../connector-azure-sql-database.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure SQL Database connector in V2](../connector-azure-sql-database.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data to and from Azure SQL Database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --## Supported scenarios -You can copy data **from Azure SQL Database** to the following data stores: ---You can copy data from the following data stores **to Azure SQL Database**: ---## Supported authentication type -Azure SQL Database connector supports basic authentication. --## Getting started -You can create a pipeline with a copy activity that moves data to/from an Azure SQL Database by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to Azure SQL Database, you create two linked services to link your Azure storage account and Azure SQL Database to your data factory. For linked service properties that are specific to Azure SQL Database, see [linked service properties](#linked-service-properties) section. -3. Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the SQL table in Azure SQL Database that holds the data copied from the blob storage. For dataset properties that are specific to Azure Data Lake Store, see [dataset properties](#dataset-properties) section. -4. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlSink as a sink for the copy activity. 
Similarly, if you are copying from Azure SQL Database to Azure Blob Storage, you use SqlSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL Database, see [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure SQL Database, see [JSON examples](#json-examples-for-copying-data-to-and-from-sql-database) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure SQL Database: --## Linked service properties -An Azure SQL linked service links Azure SQL Database to your data factory. The following table provides description for JSON elements specific to Azure SQL linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **AzureSqlDatabase** |Yes | -| connectionString |Specify information needed to connect to the Azure SQL Database instance for the connectionString property. Only basic authentication is supported. |Yes | --> [!IMPORTANT] -> Configure [Azure SQL Database Firewall](/previous-versions/azure/ee621782(v=azure.100)#ConnectingFromAzure) the database server to [allow Azure Services to access the server](/previous-versions/azure/ee621782(v=azure.100)#ConnectingFromAzure). Additionally, if you are copying data to Azure SQL Database from outside Azure including from on-premises data sources with data factory gateway, configure appropriate IP address range for the machine that is sending data to Azure SQL Database. --## Dataset properties -To specify a dataset to represent input or output data in Azure SQL Database, you set the type property of the dataset to: **AzureSqlTable**. Set the **linkedServiceName** property of the dataset to the name of the Azure SQL linked service. --For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for the dataset of type **AzureSqlTable** has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the Azure SQL Database instance that linked service refers to. |Yes | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --> [!NOTE] -> The Copy Activity takes only one input and produces only one output. --Whereas, properties available in the **typeProperties** section of the activity vary with each activity type. 
For Copy activity, they vary depending on the types of sources and sinks. --If you are moving data from Azure SQL Database, you set the source type in the copy activity to **SqlSource**. Similarly, if you are moving data to Azure SQL Database, you set the sink type in the copy activity to **SqlSink**. This section provides a list of properties supported by SqlSource and SqlSink. --### SqlSource -In copy activity, when the source is of type **SqlSource**, the following properties are available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. Example: `select * from MyTable`. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. |Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | --If the **sqlReaderQuery** is specified for the SqlSource, the Copy Activity runs this query against the Azure SQL Database source to get the data. Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). --If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (`select column1, column2 from mytable`) to run against the Azure SQL Database. If the dataset definition does not have the structure, all columns are selected from the table. --> [!NOTE] -> When you use **sqlReaderStoredProcedureName**, you still need to specify a value for the **tableName** property in the dataset JSON. There are no validations performed against this table though. -> -> --### SqlSource example --```JSON -"source": { - "type": "SqlSource", - "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters", - "storedProcedureParameters": { - "stringData": { "value": "str3" }, - "identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"} - } -} -``` --**The stored procedure definition:** --```SQL -CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters -( - @stringData varchar(20), - @identifier int -) -AS -SET NOCOUNT ON; -BEGIN - select * - from dbo.UnitTestSrcTable - where dbo.UnitTestSrcTable.stringData != stringData - and dbo.UnitTestSrcTable.identifier != identifier -END -GO -``` --### SqlSink -**SqlSink** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: "00:30:00" (30 minutes). |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 10000) | -| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For more information, see [repeatable copy](#repeatable-copy). |A query statement. |No | -| sliceIdentifierColumnName |Specify a column name for Copy Activity to fill with auto generated slice identifier, which is used to clean up data of a specific slice when rerun. For more information, see [repeatable copy](#repeatable-copy). 
|Column name of a column with data type of binary(32). |No | -| sqlWriterStoredProcedureName |Name of the stored procedure that defines how to apply source data into target table, e.g. to do upserts or transform using your own business logic. <br/><br/>Note this stored procedure will be **invoked per batch**. If you want to do operation that only runs once and has nothing to do with source data e.g. delete/truncate, use `sqlWriterCleanupScript` property. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | -| sqlWriterTableType |Specify a table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. |A table type name. |No | --#### SqlSink example --```JSON -"sink": { - "type": "SqlSink", - "writeBatchSize": 1000000, - "writeBatchTimeout": "00:05:00", - "sqlWriterStoredProcedureName": "CopyTestStoredProcedureWithParameters", - "sqlWriterTableType": "CopyTestTableType", - "storedProcedureParameters": { - "identifier": { "value": "1", "type": "Int" }, - "stringData": { "value": "str1" }, - "decimalData": { "value": "1", "type": "Decimal" } - } -} -``` --## JSON examples for copying data to and from SQL Database -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from Azure SQL Database and Azure Blob Storage. However, data can be copied **directly** from any of sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --### Example: Copy data from Azure SQL Database to Azure Blob -The same defines the following Data Factory entities: --1. A linked service of type [AzureSqlDatabase](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureSqlTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [Azure Blob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with a Copy activity that uses [SqlSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies time-series data (hourly, daily, etc.) from a table in Azure SQL Database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --**Azure SQL Database linked service:** --```JSON -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -See the Azure SQL Linked Service section for the list of properties supported by this linked service. 
--**Azure Blob storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -See the [Azure Blob](data-factory-azure-blob-connector.md#azure-storage-linked-service) article for the list of properties supported by this linked service. ---**Azure SQL input dataset:** --The sample assumes you have created a table "MyTable" in Azure SQL and it contains a column called "timestampcolumn" for time series data. --Setting "external": "true" informs the Azure Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureSqlInput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --See the Azure SQL dataset type properties section for the list of properties supported by this dataset type. --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}/", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -See the [Azure Blob dataset type properties](data-factory-azure-blob-connector.md#dataset-properties) section for the list of properties supported by this dataset type. --**A copy activity in a pipeline with SQL source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **SqlReaderQuery** property selects the data in the past hour to copy. 
--```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureSQLtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureSQLInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -In the example, **sqlReaderQuery** is specified for the SqlSource. The Copy Activity runs this query against the Azure SQL Database source to get the data. Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). --If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query to run against the Azure SQL Database. For example: `select column1, column2 from mytable`. If the dataset definition does not have the structure, all columns are selected from the table. --See the [Sql Source](#sqlsource) section and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) for the list of properties supported by SqlSource and BlobSink. --### Example: Copy data from Azure Blob to Azure SQL Database -The sample defines the following Data Factory entities: --1. A linked service of type [AzureSqlDatabase](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureSqlTable](#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and [SqlSink](#copy-activity-properties). --The sample copies time-series data (hourly, daily, etc.) from Azure blob to a table in Azure SQL Database every hour. The JSON properties used in these samples are described in sections following the samples. --**Azure SQL linked service:** --```JSON -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -See the Azure SQL Linked Service section for the list of properties supported by this linked service. 
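The Blob-to-SQL example that follows writes into a destination table that must already exist in Azure SQL Database, with the same number of columns as the blob CSV files contain (see the output dataset later in this example). A minimal, hypothetical definition, assuming three CSV columns, might look like this:

```SQL
-- Hypothetical destination table for this example. The name must match the
-- tableName in the output dataset, and the columns must line up with the CSV files.
CREATE TABLE [dbo].[MyOutputTable]
(
    [column1] NVARCHAR(100),
    [column2] NVARCHAR(100),
    [column3] NVARCHAR(100)
);
```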
--**Azure Blob storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -See the [Azure Blob](data-factory-azure-blob-connector.md#azure-storage-linked-service) article for the list of properties supported by this linked service. ---**Azure Blob input dataset:** --Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. "external": "true" setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -See the [Azure Blob dataset type properties](data-factory-azure-blob-connector.md#dataset-properties) section for the list of properties supported by this dataset type. --**Azure SQL Database output dataset:** --The sample copies data to a table named "MyTable" in Azure SQL. Create the table in Azure SQL with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. --```JSON -{ - "name": "AzureSqlOutput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -See the Azure SQL dataset type properties section for the list of properties supported by this dataset type. --**A copy activity in a pipeline with Blob source and SQL sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and **sink** type is set to **SqlSink**. 
--```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureSqlOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -See the [Sql Sink](#sqlsink) section and [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) for the list of properties supported by SqlSink and BlobSource. --## Identity columns in the target database -This section provides an example for copying data from a source table without an identity column to a destination table with an identity column. --**Source table:** --```SQL -create table dbo.SourceTbl -( - name varchar(100), - age int -) -``` -**Destination table:** --```SQL -create table dbo.TargetTbl -( - identifier int identity(1,1), - name varchar(100), - age int -) -``` -Notice that the target table has an identity column. --**Source dataset JSON definition** --```JSON -{ - "name": "SampleSource", - "properties": { - "type": " SqlServerTable", - "linkedServiceName": "TestIdentitySQL", - "typeProperties": { - "tableName": "SourceTbl" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` -**Destination dataset JSON definition** --```JSON -{ - "name": "SampleTarget", - "properties": { - "structure": [ - { "name": "name" }, - { "name": "age" } - ], - "type": "AzureSqlTable", - "linkedServiceName": "TestIdentitySQLSource", - "typeProperties": { - "tableName": "TargetTbl" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": false, - "policy": {} - } -} -``` --Notice that as your source and target table have different schema (target has an additional column with identity). In this scenario, you need to specify **structure** property in the target dataset definition, which doesn't include the identity column. --## Invoke stored procedure from SQL sink -For an example of invoking a stored procedure from SQL sink in a copy activity of a pipeline, see [Invoke stored procedure for SQL sink in copy activity](data-factory-invoke-stored-procedure-from-copy-activity.md) article. --## Type mapping for Azure SQL Database -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to and from Azure SQL Database, the following mappings are used from SQL type to .NET type and vice versa. The mapping is same as the SQL Server Data Type Mapping for ADO.NET. 
--| SQL Server Database Engine type | .NET Framework type | -| | | -| bigint |Int64 | -| binary |Byte[] | -| bit |Boolean | -| char |String, Char[] | -| date |DateTime | -| Datetime |DateTime | -| datetime2 |DateTime | -| Datetimeoffset |DateTimeOffset | -| Decimal |Decimal | -| FILESTREAM attribute (varbinary(max)) |Byte[] | -| Float |Double | -| image |Byte[] | -| int |Int32 | -| money |Decimal | -| nchar |String, Char[] | -| ntext |String, Char[] | -| numeric |Decimal | -| nvarchar |String, Char[] | -| real |Single | -| rowversion |Byte[] | -| smalldatetime |DateTime | -| smallint |Int16 | -| smallmoney |Decimal | -| sql_variant |Object * | -| text |String, Char[] | -| time |TimeSpan | -| timestamp |Byte[] | -| tinyint |Byte | -| uniqueidentifier |Guid | -| varbinary |Byte[] | -| varchar |String, Char[] | -| xml |Xml | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable copy -When copying data to SQL Server Database, the copy activity appends data to the sink table by default. To perform an UPSERT instead, See [Repeatable write to SqlSink](data-factory-repeatable-copy.md#repeatable-write-to-sqlsink) article. --When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
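Related to the repeatable-copy guidance above, a common pattern is to use the `sqlWriterCleanupScript` sink property to delete the slice's window of data before the slice is written, so that reruns don't create duplicates. The following is a minimal sketch only, assuming a hypothetical target table with a `timestampcolumn` column; in the sink JSON the two boundary values would typically be produced with `$$Text.Format` and the `WindowStart`/`WindowEnd` system variables, as in the reader queries shown earlier.

```SQL
-- Hypothetical cleanup statement for sqlWriterCleanupScript: remove any rows that
-- belong to the slice window before the copy activity rewrites that slice.
DELETE FROM [dbo].[MyTable]
WHERE [timestampcolumn] >= '2014-06-01 18:00'   -- WindowStart
  AND [timestampcolumn] <  '2014-06-01 19:00';  -- WindowEnd
```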
data-factory | Data Factory Azure Sql Data Warehouse Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-sql-data-warehouse-connector.md | - Title: Copy data to/from Azure Synapse Analytics -description: Learn how to copy data to/from Azure Synapse Analytics using Azure Data Factory ---- Previously updated : 04/12/2023----# Copy data to and from Azure Synapse Analytics using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-sql-data-warehouse-connector.md) -> * [Version 2 (current version)](../connector-azure-sql-data-warehouse.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Synapse Analytics connector in V2](../connector-azure-sql-data-warehouse.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Synapse Analytics. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --> [!TIP] -> To achieve best performance, use PolyBase to load data into Azure Synapse Analytics. The [Use PolyBase to load data into Azure Synapse Analytics](#use-polybase-to-load-data-into-azure-synapse-analytics) section has details. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md). --## Supported scenarios -You can copy data **from Azure Synapse Analytics** to the following data stores: ---You can copy data from the following data stores **to Azure Synapse Analytics**: ---> [!TIP] -> When copying data from SQL Server or Azure SQL Database to Azure Synapse Analytics, if the table does not exist in the destination store, Data Factory can automatically create the table in Azure Synapse Analytics by using the schema of the table in the source data store. See [Auto table creation](#auto-table-creation) for details. --## Supported authentication type -Azure Synapse Analytics connector support basic authentication. --## Getting started -You can create a pipeline with a copy activity that moves data to/from an Azure Synapse Analytics by using different tools/APIs. --The easiest way to create a pipeline that copies data to/from Azure Synapse Analytics is to use the Copy data wizard. See [Tutorial: Load data into Azure Synapse Analytics with Data Factory](../load-azure-sql-data-warehouse.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. 
For example, if you are copying data from an Azure blob storage to an Azure Synapse Analytics, you create two linked services to link your Azure storage account and Azure Synapse Analytics to your data factory. For linked service properties that are specific to Azure Synapse Analytics, see [linked service properties](#linked-service-properties) section. -3. Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the table in the Azure Synapse Analytics that holds the data copied from the blob storage. For dataset properties that are specific to Azure Synapse Analytics, see [dataset properties](#dataset-properties) section. -4. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlDWSink as a sink for the copy activity. Similarly, if you are copying from Azure Synapse Analytics to Azure Blob Storage, you use SqlDWSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure Synapse Analytics, see [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from Azure Synapse Analytics, see [JSON examples](#json-examples-for-copying-data-to-and-from-azure-synapse-analytics) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Synapse Analytics: --## Linked service properties -The following table provides description for JSON elements specific to Azure Synapse Analytics linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **AzureSqlDW** |Yes | -| connectionString |Specify information needed to connect to the Azure Synapse Analytics instance for the connectionString property. Only basic authentication is supported. |Yes | --> [!IMPORTANT] -> Configure [Azure SQL Database Firewall](/previous-versions/azure/ee621782(v=azure.100)#ConnectingFromAzure) and the database server to [allow Azure Services to access the server](/previous-versions/azure/ee621782(v=azure.100)#ConnectingFromAzure). Additionally, if you are copying data to Azure Synapse Analytics from outside Azure including from on-premises data sources with data factory gateway, configure appropriate IP address range for the machine that is sending data to Azure Synapse Analytics. --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. 
The **typeProperties** section for the dataset of type **AzureSqlDWTable** has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the Azure Synapse Analytics database that the linked service refers to. |Yes | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --> [!NOTE] -> The Copy Activity takes only one input and produces only one output. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --### SqlDWSource -When source is of type **SqlDWSource**, the following properties are available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. For example: select * from MyTable. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. |Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | --If the **sqlReaderQuery** is specified for the SqlDWSource, the Copy Activity runs this query against the Azure Synapse Analytics source to get the data. --Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). --If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query to run against Azure Synapse Analytics. Example: `select column1, column2 from mytable`. If the dataset definition does not have the structure, all columns are selected from the table. --#### SqlDWSource example --```JSON -"source": { - "type": "SqlDWSource", - "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters", - "storedProcedureParameters": { - "stringData": { "value": "str3" }, - "identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"} - } -} -``` -**The stored procedure definition:** --```SQL -CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters -( - @stringData varchar(20), - @identifier int -) -AS -SET NOCOUNT ON; -BEGIN - select * - from dbo.UnitTestSrcTable - where dbo.UnitTestSrcTable.stringData != stringData - and dbo.UnitTestSrcTable.identifier != identifier -END -GO -``` --### SqlDWSink -**SqlDWSink** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For details, see [repeatability section](#repeatability-during-copy). |A query statement. |No | -| allowPolyBase |Indicates whether to use PolyBase (when applicable) instead of BULKINSERT mechanism. 
<br/><br/> **Using PolyBase is the recommended way to load data into Azure Synapse Analytics.** See the [Use PolyBase to load data into Azure Synapse Analytics](#use-polybase-to-load-data-into-azure-synapse-analytics) section for constraints and details. |True <br/>False (default) |No |
-| polyBaseSettings |A group of properties that can be specified when the **allowPolyBase** property is set to **true**. | |No |
-| rejectValue |Specifies the number or percentage of rows that can be rejected before the query fails. <br/><br/>Learn more about PolyBase's reject options in the **Arguments** section of the [CREATE EXTERNAL TABLE (Transact-SQL)](/sql/t-sql/statements/create-external-table-transact-sql) topic. |0 (default), 1, 2, … |No |
-| rejectType |Specifies whether the rejectValue option is specified as a literal value or a percentage. |Value (default), Percentage |No |
-| rejectSampleValue |Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. |1, 2, … |Yes, if **rejectType** is **percentage** |
-| useTypeDefault |Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.<br/><br/>Learn more about this property from the Arguments section in [CREATE EXTERNAL FILE FORMAT (Transact-SQL)](/sql/t-sql/statements/create-external-file-format-transact-sql). |True, False (default) |No |
-| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 10000) |
-| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: "00:30:00" (30 minutes). |No |
--#### SqlDWSink example
--```JSON
-"sink": {
-    "type": "SqlDWSink",
-    "allowPolyBase": true
-}
-```
--## Use PolyBase to load data into Azure Synapse Analytics
-Using **[PolyBase](/sql/relational-databases/polybase/polybase-guide)** is an efficient way of loading large amounts of data into Azure Synapse Analytics with high throughput. You can see a large gain in throughput by using PolyBase instead of the default BULKINSERT mechanism. See the [copy performance reference numbers](data-factory-copy-activity-performance.md#performance-reference) for a detailed comparison. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md).
--* If your source data is in **Azure Blob or Azure Data Lake Store**, and the format is compatible with PolyBase, you can copy directly to Azure Synapse Analytics using PolyBase. See **[Direct copy using PolyBase](#direct-copy-using-polybase)** for details.
-* If your source data store and format are not natively supported by PolyBase, you can use the **[Staged Copy using PolyBase](#staged-copy-using-polybase)** feature instead. It also provides better throughput by automatically converting the data into a PolyBase-compatible format and storing the data in Azure Blob storage. It then loads data into Azure Synapse Analytics.
--To have Azure Data Factory use PolyBase to copy data into Azure Synapse Analytics, set the `allowPolyBase` property to **true** as shown in the following example. When you set allowPolyBase to true, you can specify PolyBase-specific properties using the `polyBaseSettings` property group. See the [SqlDWSink](#sqldwsink) section for details about properties that you can use with polyBaseSettings.
--```JSON -"sink": { - "type": "SqlDWSink", - "allowPolyBase": true, - "polyBaseSettings": - { - "rejectType": "percentage", - "rejectValue": 10.0, - "rejectSampleValue": 100, - "useTypeDefault": true - } -} -``` --### Direct copy using PolyBase -Azure Synapse Analytics PolyBase directly support Azure Blob and Azure Data Lake Store (using service principal) as source and with specific file format requirements. If your source data meets the criteria described in this section, you can directly copy from source data store to Azure Synapse Analytics using PolyBase. Otherwise, you can use [Staged Copy using PolyBase](#staged-copy-using-polybase). --> [!TIP] -> To copy data from Data Lake Store to Azure Synapse Analytics efficiently, learn more from [Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with Azure Synapse Analytics](/archive/blogs/azuredatalake/azure-data-factory-makes-it-even-easier-and-convenient-to-uncover-insights-from-data-when-using-data-lake-store-with-sql-data-warehouse). --If the requirements are not met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement. --1. **Source linked service** is of type: **AzureStorage** or **AzureDataLakeStore with service principal authentication**. -2. The **input dataset** is of type: **AzureBlob** or **AzureDataLakeStore**, and the format type under `type` properties is **OrcFormat**, **ParquetFormat**, or **TextFormat** with the following configurations: -- 1. `rowDelimiter` must be **\n**. - 2. `nullValue` is set to **empty string** (""), or `treatEmptyAsNull` is set to **true**. - 3. `encodingName` is set to **utf-8**, which is **default** value. - 4. `escapeChar`, `quoteChar`, `firstRowAsHeader`, and `skipLineCount` are not specified. - 5. `compression` can be **no compression**, **GZip**, or **Deflate**. -- ```JSON - "typeProperties": { - "folderPath": "<blobpath>", - "format": { - "type": "TextFormat", - "columnDelimiter": "<any delimiter>", - "rowDelimiter": "\n", - "nullValue": "", - "encodingName": "utf-8" - }, - "compression": { - "type": "GZip", - "level": "Optimal" - } - }, - ``` --3. There is no `skipHeaderLineCount` setting under **BlobSource** or **AzureDataLakeStore** for the Copy activity in the pipeline. -4. There is no `sliceIdentifierColumnName` setting under **SqlDWSink** for the Copy activity in the pipeline. (PolyBase guarantees that all data is updated or nothing is updated in a single run. To achieve **repeatability**, you could use `sqlWriterCleanupScript`). -5. There is no `columnMapping` being used in the associated in Copy activity. --### Staged Copy using PolyBase -When your source data doesn't meet the criteria introduced in the previous section, you can enable copying data via an interim staging Azure Blob Storage (cannot be Premium Storage). In this case, Azure Data Factory automatically performs transformations on the data to meet data format requirements of PolyBase, then use PolyBase to load data into Azure Synapse Analytics, and at last clean-up your temp data from the Blob storage. See [Staged Copy](data-factory-copy-activity-performance.md#staged-copy) for details on how copying data via a staging Azure Blob works in general. 
--
> [!NOTE]
-> When copying data from an on-premises data store into Azure Synapse Analytics using PolyBase and staging, if your Data Management Gateway version is below 2.4, JRE (Java Runtime Environment) is required on your gateway machine, which is used to transform your source data into the proper format. We suggest that you upgrade your gateway to the latest version to avoid this dependency.
->
--To use this feature, create an [Azure Storage linked service](data-factory-azure-blob-connector.md#azure-storage-linked-service) that refers to the Azure Storage Account that has the interim blob storage, then specify the `enableStaging` and `stagingSettings` properties for the Copy Activity as shown in the following code:
--```json
-"activities":[
-{
-    "name": "Sample copy activity from SQL Server to Azure Synapse Analytics via PolyBase",
-    "type": "Copy",
-    "inputs": [{ "name": "OnpremisesSQLServerInput" }],
-    "outputs": [{ "name": "AzureSQLDWOutput" }],
-    "typeProperties": {
-        "source": {
-            "type": "SqlSource"
-        },
-        "sink": {
-            "type": "SqlDwSink",
-            "allowPolyBase": true
-        },
-        "enableStaging": true,
-        "stagingSettings": {
-            "linkedServiceName": "MyStagingBlob"
-        }
-    }
-}
-]
-```
--## Best practices when using PolyBase
-The following sections provide best practices in addition to the ones mentioned in [Best practices for Azure Synapse Analytics](../../synapse-analytics/sql/best-practices-dedicated-sql-pool.md).
--### Required database permission
-To use PolyBase, the user that loads data into Azure Synapse Analytics must have the ["CONTROL" permission](/sql/relational-databases/security/permissions-database-engine) on the target database. One way to achieve that is to add that user as a member of the "db_owner" role. Learn how to do that by following [this section](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-manage-security.md#authorization).
--### Row size and data type limitation
-PolyBase loads are limited to rows smaller than **1 MB** and cannot load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX) columns. For details, see the [service capacity limits](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-service-capacity-limits.md#loads).
--If you have source data with rows larger than 1 MB, you may want to split the source tables vertically into several small ones, each with a maximum row size that does not exceed the limit. The smaller tables can then be loaded using PolyBase and merged together in Azure Synapse Analytics.
--### Azure Synapse Analytics resource class
-To achieve the best possible throughput, consider assigning a larger resource class to the user that loads data into Azure Synapse Analytics via PolyBase. Learn how to do that by following the [Change a user resource class example](../../synapse-analytics/sql-data-warehouse/resource-classes-for-workload-management.md).
--### tableName in Azure Synapse Analytics
-The following table provides examples of how to specify the **tableName** property in the dataset JSON for various combinations of schema and table name.
--| DB Schema | Table name | tableName JSON property |
-| | | |
-| dbo |MyTable |MyTable or dbo.MyTable or [dbo].[MyTable] |
-| dbo1 |MyTable |dbo1.MyTable or [dbo1].[MyTable] |
-| dbo |My.Table |[My.Table] or [dbo].[My.Table] |
-| dbo1 |My.Table |[dbo1].[My.Table] |
--If you see the following error, it could be an issue with the value you specified for the tableName property. See the table for the correct way to specify values for the tableName JSON property.
--``` -Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider -``` --### Columns with default values -Currently, PolyBase feature in Data Factory only accepts the same number of columns as in the target table. Say, you have a table with four columns and one of them is defined with a default value. The input data should still contain four columns. Providing a 3-column input dataset would yield an error similar to the following message: --``` -All columns of the table must be specified in the INSERT BULK statement. -``` -NULL value is a special form of default value. If the column is nullable, the input data (in blob) for that column could be empty (cannot be missing from the input dataset). PolyBase inserts NULL for them in the Azure Synapse Analytics. --## Auto table creation -If you are using Copy Wizard to copy data from SQL Server or Azure SQL Database to Azure Synapse Analytics and the table that corresponds to the source table does not exist in the destination store, Data Factory can automatically create the table in the data warehouse by using the source table schema. --Data Factory creates the table in the destination store with the same table name in the source data store. The data types for columns are chosen based on the following type mapping. If needed, it performs type conversions to fix any incompatibilities between source and destination stores. It also uses Round Robin table distribution. --| Source SQL Database column type | Destination Azure Synapse Analytics column type (size limitation) | -| | | -| Int | Int | -| BigInt | BigInt | -| SmallInt | SmallInt | -| TinyInt | TinyInt | -| Bit | Bit | -| Decimal | Decimal | -| Numeric | Decimal | -| Float | Float | -| Money | Money | -| Real | Real | -| SmallMoney | SmallMoney | -| Binary | Binary | -| Varbinary | Varbinary (up to 8000) | -| Date | Date | -| DateTime | DateTime | -| DateTime2 | DateTime2 | -| Time | Time | -| DateTimeOffset | DateTimeOffset | -| SmallDateTime | SmallDateTime | -| Text | Varchar (up to 8000) | -| NText | NVarChar (up to 4000) | -| Image | VarBinary (up to 8000) | -| UniqueIdentifier | UniqueIdentifier | -| Char | Char | -| NChar | NChar | -| VarChar | VarChar (up to 8000) | -| NVarChar | NVarChar (up to 4000) | -| Xml | Varchar (up to 8000) | ---## Type mapping for Azure Synapse Analytics -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to & from Azure Synapse Analytics, the following mappings are used from SQL type to .NET type and vice versa. --The mapping is same as the [SQL Server Data Type Mapping for ADO.NET](/dotnet/framework/data/adonet/sql-server-data-type-mappings). 
--| SQL Server Database Engine type | .NET Framework type | -| | | -| bigint |Int64 | -| binary |Byte[] | -| bit |Boolean | -| char |String, Char[] | -| date |DateTime | -| Datetime |DateTime | -| datetime2 |DateTime | -| Datetimeoffset |DateTimeOffset | -| Decimal |Decimal | -| FILESTREAM attribute (varbinary(max)) |Byte[] | -| Float |Double | -| image |Byte[] | -| int |Int32 | -| money |Decimal | -| nchar |String, Char[] | -| ntext |String, Char[] | -| numeric |Decimal | -| nvarchar |String, Char[] | -| real |Single | -| rowversion |Byte[] | -| smalldatetime |DateTime | -| smallint |Int16 | -| smallmoney |Decimal | -| sql_variant |Object * | -| text |String, Char[] | -| time |TimeSpan | -| timestamp |Byte[] | -| tinyint |Byte | -| uniqueidentifier |Guid | -| varbinary |Byte[] | -| varchar |String, Char[] | -| xml |Xml | --You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## JSON examples for copying data to and from Azure Synapse Analytics -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from Azure Synapse Analytics and Azure Blob Storage. However, data can be copied **directly** from any of sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --### Example: Copy data from Azure Synapse Analytics to Azure Blob -The sample defines the following Data Factory entities: --1. A linked service of type [AzureSqlDW](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureSqlDWTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [SqlDWSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies time-series (hourly, daily, etc.) data from a table in Azure Synapse Analytics database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --**Azure Synapse Analytics linked service:** --```JSON -{ - "name": "AzureSqlDWLinkedService", - "properties": { - "type": "AzureSqlDW", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -**Azure Blob storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Synapse Analytics input dataset:** --The sample assumes you have created a table "MyTable" in Azure Synapse Analytics and it contains a column called "timestampcolumn" for time series data. 
--Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureSqlDWInput", - "properties": { - "type": "AzureSqlDWTable", - "linkedServiceName": "AzureSqlDWLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with SqlDWSource and BlobSink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlDWSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **SqlReaderQuery** property selects the data in the past hour to copy. --```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureSQLDWtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureSqlDWInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "SqlDWSource", - "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -> [!NOTE] -> In the example, **sqlReaderQuery** is specified for the SqlDWSource. The Copy Activity runs this query against the Azure Synapse Analytics source to get the data. -> -> Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). 
-> -> If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (select column1, column2 from mytable) to run against Azure Synapse Analytics. If the dataset definition does not have the structure, all columns are selected from the table. -> -> --### Example: Copy data from Azure Blob to Azure Synapse Analytics -The sample defines the following Data Factory entities: --1. A linked service of type [AzureSqlDW](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureSqlDWTable](#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and [SqlDWSink](#copy-activity-properties). --The sample copies time-series data (hourly, daily, etc.) from Azure blob to a table in an Azure Synapse Analytics database every hour. The JSON properties used in these samples are described in sections following the samples. --**Azure Synapse Analytics linked service:** --```JSON -{ - "name": "AzureSqlDWLinkedService", - "properties": { - "type": "AzureSqlDW", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` -**Azure Blob storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Blob input dataset:** --Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. "external": "true" setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory. 
--```JSON -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**Azure Synapse Analytics output dataset:** --The sample copies data to a table named "MyTable" in Azure Synapse Analytics. Create the table in Azure Synapse Analytics with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. --```JSON -{ - "name": "AzureSqlDWOutput", - "properties": { - "type": "AzureSqlDWTable", - "linkedServiceName": "AzureSqlDWLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**Copy activity in a pipeline with BlobSource and SqlDWSink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and **sink** type is set to **SqlDWSink**. --```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "AzureBlobtoSQLDW", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureSqlDWOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlDWSink", - "allowPolyBase": true - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -For a walkthrough, see the see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md) and [Load data with Azure Data Factory](../load-azure-sql-data-warehouse.md) article in the Azure Synapse Analytics documentation. --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
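As a companion to the PolyBase best practices earlier in this article (required database permission and resource class), here is a hedged T-SQL sketch of how a load user might be set up. The user name and resource class are hypothetical, and your organization may prefer a narrower permission than membership in the db_owner role:

```SQL
-- Hypothetical load user; replace with the account your linked service connects as.
-- One way to satisfy the CONTROL permission requirement is db_owner membership.
EXEC sp_addrolemember 'db_owner', 'LoaderUser';

-- Assign a larger resource class (for example, largerc) to improve PolyBase load throughput.
EXEC sp_addrolemember 'largerc', 'LoaderUser';
```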
data-factory | Data Factory Azure Table Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-azure-table-connector.md | - Title: Move data to/from Azure Table -description: Learn how to move data to/from Azure Table Storage using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data to and from Azure Table using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-azure-table-connector.md) -> * [Version 2 (current version)](../connector-azure-table-storage.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Azure Table Storage connector in V2](../connector-azure-table-storage.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure Table Storage. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from any supported source data store to Azure Table Storage or from Azure Table Storage to any supported sink data store. For a list of data stores supported as sources or sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. ---## Getting started -You can create a pipeline with a copy activity that moves data to/from an Azure Table Storage by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure Table Storage, see [JSON examples](#json-examples) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure Table Storage: --## Linked service properties -There are two types of linked services you can use to link an Azure blob storage to an Azure data factory. They are: **AzureStorage** linked service and **AzureStorageSas** linked service. The Azure Storage linked service provides the data factory with global access to the Azure Storage. 
The Azure Storage SAS (Shared Access Signature) linked service, on the other hand, provides the data factory with restricted, time-bound access to the Azure Storage. There are no other differences between these two linked services. Choose the linked service that suits your needs. The following sections provide more details on these two linked services.
---## Dataset properties
-For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
--The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for the dataset of type **AzureTable** has the following properties.
--| Property | Description | Required |
-| | | |
-| tableName |Name of the table in the Azure Table storage instance that the linked service refers to. |Yes. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination. |
--### Schema by Data Factory
-For schema-free data stores such as Azure Table, the Data Factory service infers the schema in one of the following ways:
--1. If you specify the structure of data by using the **structure** property in the dataset definition, the Data Factory service honors this structure as the schema. In this case, if a row does not contain a value for a column, a null value is provided for it.
-2. If you don't specify the structure of data by using the **structure** property in the dataset definition, Data Factory infers the schema by using the first row in the data. In this case, if the first row does not contain the full schema, some columns are missing from the result of the copy operation.
--Therefore, for schema-free data sources, the best practice is to specify the structure of data using the **structure** property.
--## Copy activity properties
-For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output datasets, and policies are available for all types of activities.
--Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks.
--**AzureTableSource** supports the following properties in the typeProperties section:
--| Property | Description | Allowed values | Required |
-| | | | |
-| azureTableSourceQuery |Use a custom query to read data. |Azure table query string. See examples in the next section. |No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfy the query are copied to the destination. |
-| azureTableSourceIgnoreTableNotFound |Indicates whether to swallow the exception when the table does not exist. 
|TRUE<br/>FALSE |No | --### azureTableSourceQuery examples -If Azure Table column is of string type: --```JSON -azureTableSourceQuery": "$$Text.Format('PartitionKey ge \\'{0:yyyyMMddHH00_0000}\\' and PartitionKey le \\'{0:yyyyMMddHH00_9999}\\'', SliceStart)" -``` --If Azure Table column is of datetime type: --```JSON -"azureTableSourceQuery": "$$Text.Format('DeploymentEndTime gt datetime\\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' and DeploymentEndTime le datetime\\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', SliceStart, SliceEnd)" -``` --**AzureTableSink** supports the following properties in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| azureTableDefaultPartitionKeyValue |Default partition key value that can be used by the sink. |A string value. |No | -| azureTablePartitionKeyName |Specify name of the column whose values are used as partition keys. If not specified, AzureTableDefaultPartitionKeyValue is used as the partition key. |A column name. |No | -| azureTableRowKeyName |Specify name of the column whose column values are used as row key. If not specified, use a GUID for each row. |A column name. |No | -| azureTableInsertType |The mode to insert data into Azure table.<br/><br/>This property controls whether existing rows in the output table with matching partition and row keys have their values replaced or merged. <br/><br/>To learn about how these settings (merge and replace) work, see [Insert or Merge Entity](/rest/api/storageservices/Insert-Or-Merge-Entity) and [Insert or Replace Entity](/rest/api/storageservices/Insert-Or-Replace-Entity) topics. <br/><br> This setting applies at the row level, not the table level, and neither option deletes rows in the output table that do not exist in the input. |merge (default)<br/>replace |No | -| writeBatchSize |Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. |Integer (number of rows) |No (default: 10000) | -| writeBatchTimeout |Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit |timespan<br/><br/>Example: "00:20:00" (20 minutes) |No (Default to storage client default timeout value 90 sec) | --### azureTablePartitionKeyName -Map a source column to a destination column using the translator JSON property before you can use the destination column as the azureTablePartitionKeyName. --In the following example, source column DivisionID is mapped to the destination column: DivisionID. --```JSON -"translator": { - "type": "TabularTranslator", - "columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName" -} -``` -The DivisionID is specified as the partition key. --```JSON -"sink": { - "type": "AzureTableSink", - "azureTablePartitionKeyName": "DivisionID", - "writeBatchSize": 100, - "writeBatchTimeout": "01:00:00" -} -``` -## JSON examples -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from Azure Table Storage and Azure Blob Database. However, data can be copied **directly** from any of the sources to any of the supported sinks. For more information, see the section "Supported data stores and formats" in [Move data by using Copy Activity](data-factory-data-movement-activities.md). --## Example: Copy data from Azure Table to Azure Blob -The following sample shows: --1. 
A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties) (used for both table & blob). -2. An input [dataset](data-factory-create-datasets.md) of type [AzureTable](#dataset-properties). -3. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -4. The [pipeline](data-factory-create-pipelines.md) with Copy activity that uses AzureTableSource and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data belonging to the default partition in an Azure Table to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --**Azure storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -Azure Data Factory supports two types of Azure Storage linked --**Azure Table input dataset:** --The sample assumes you have created a table "MyTable" in Azure Table. --Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureTableInput", - "properties": { - "type": "AzureTable", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with AzureTableSource and BlobSink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **AzureTableSource** and **sink** type is set to **BlobSink**. The SQL query specified with **AzureTableSourceQuery** property selects the data from the default partition every hour to copy. 
--```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureTabletoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureTableInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "AzureTableSource", - "AzureTableSourceQuery": "PartitionKey eq 'DefaultPartitionKey'" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --## Example: Copy data from Azure Blob to Azure Table -The following sample shows: --1. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties) (used for both table & blob) -2. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -3. An output [dataset](data-factory-create-datasets.md) of type [AzureTable](#dataset-properties). -4. The [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and [AzureTableSink](#copy-activity-properties). --The sample copies time-series data from an Azure blob to an Azure table hourly. The JSON properties used in these samples are described in sections following the samples. --**Azure storage (for both Azure Table & Blob) linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --Azure Data Factory supports two types of Azure Storage linked --**Azure Blob input dataset:** --Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. "external": "true" setting informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. 
--```JSON -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Table output dataset:** --The sample copies data to a table named "MyTable" in Azure Table. Create an Azure table with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. --```JSON -{ - "name": "AzureTableOutput", - "properties": { - "type": "AzureTable", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with BlobSource and AzureTableSink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and **sink** type is set to **AzureTableSink**. --```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "AzureBlobtoTable", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureTableOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureTableSink", - "writeBatchSize": 100, - "writeBatchTimeout": "01:00:00" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -## Type Mapping for Azure Table -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach. --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to & from Azure Table, the following [mappings defined by Azure Table service](/rest/api/storageservices/Understanding-the-Table-Service-Data-Model) are used from Azure Table OData types to .NET type and vice versa. --| OData Data Type | .NET Type | Details | -| | | | -| Edm.Binary |byte[] |An array of bytes up to 64 KB. | -| Edm.Boolean |bool |A Boolean value. | -| Edm.DateTime |DateTime |A 64-bit value expressed as Coordinated Universal Time (UTC). The supported DateTime range begins from 12:00 midnight, January 1, 1601 A.D. (C.E.), UTC. 
The range ends at December 31, 9999. | -| Edm.Double |double |A 64-bit floating point value. | -| Edm.Guid |Guid |A 128-bit globally unique identifier. | -| Edm.Int32 |Int32 |A 32-bit integer. | -| Edm.Int64 |Int64 |A 64-bit integer. | -| Edm.String |String |A UTF-16-encoded value. String values may be up to 64 KB. | --### Type Conversion Sample -The following sample shows how to copy data from an Azure blob to an Azure table with type conversions. --Suppose the Blob dataset is in CSV format and contains three columns. One of them is a datetime column with a custom datetime format that uses abbreviated French names for the day of the week. --Define the Blob source dataset as follows, along with the type definitions for the columns. --```JSON -{ - "name": "AzureBlobInput", - "properties": - { - "structure": - [ - { "name": "userid", "type": "Int64"}, - { "name": "name", "type": "String"}, - { "name": "lastlogindate", "type": "Datetime", "culture": "fr-fr", "format": "ddd-MM-yyyy"} - ], - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder", - "fileName": "myfile.csv", - "format": - { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "external": true, - "availability": - { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -Given the type mapping from Azure Table OData type to .NET type, you would define the table in Azure Table with the following schema. --**Azure Table schema:** --| Column name | Type | -| | | -| userid |Edm.Int64 | -| name |Edm.String | -| lastlogindate |Edm.DateTime | --Next, define the Azure Table dataset as follows. You do not need to specify the "structure" section with the type information, because the type information is already specified in the underlying data store. --```JSON -{ - "name": "AzureTableOutput", - "properties": { - "type": "AzureTable", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --In this case, Data Factory automatically performs the type conversions, including the Datetime field with the custom datetime format and the "fr-fr" culture, when moving data from Blob storage to Azure Table. --> [!NOTE] -> To map columns from the source dataset to columns in the sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and Tuning -To learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md). |
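The comparison at the start of this article mentions the Azure Storage SAS linked service as a restricted, time-bound alternative to the connection-string-based linked service, but the JSON examples above only show the latter. The following is a minimal sketch of the SAS-based variant, assuming you have already generated a SAS URI for the storage account; the linked service name and the URI are placeholders:

```JSON
{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "<storage-sas-uri>"
        }
    }
}
```

Datasets reference this linked service through `linkedServiceName` exactly as they reference `StorageLinkedService` in the samples above; the only practical difference is that access is limited to what the SAS grants and stops working when the SAS expires.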
data-factory | Data Factory Build Your First Pipeline Using Arm | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-arm.md | - Title: Build your first data factory (Resource Manager template) -description: In this tutorial, you create a sample Azure Data Factory pipeline using an Azure Resource Manager template. ------- Previously updated : 04/12/2023---# Tutorial: Build your first Azure data factory using Azure Resource Manager template -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Resource Manager Template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) -> --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Quickstart: Create a data factory using Azure Data Factory](../quickstart-create-data-factory-dot-net.md). --In this article, you use an Azure Resource Manager template to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. --The pipeline in this tutorial has one activity: **HDInsight Hive activity**. This activity runs a hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. --> [!NOTE] -> The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data using Azure Data Factory, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -> -> The pipeline in this tutorial has only one activity of type: HDInsightHive. A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --## Prerequisites ---* Read through [Tutorial Overview](data-factory-build-your-first-pipeline.md) article and complete the **prerequisite** steps. -* Follow instructions in [How to install and configure Azure PowerShell](/powershell/azure/) article to install latest version of Azure PowerShell on your computer. -* See [Authoring Azure Resource Manager Templates](../../azure-resource-manager/templates/syntax.md) to learn about Azure Resource Manager templates. --## In this tutorial --| Entity | Description | -| | | -| Azure Storage linked service |Links your Azure Storage account to the data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. | -| HDInsight on-demand linked service |Links an on-demand HDInsight cluster to the data factory. The cluster is automatically created for you to process data and is deleted after the processing is done. | -| Azure Blob input dataset |Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the input data. 
| -| Azure Blob output dataset |Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the output data. | -| Data pipeline |The pipeline has one activity of type HDInsightHive, which consumes the input dataset and produces the output dataset. | --A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two types of activities: [data movement activities](data-factory-data-movement-activities.md) and [data transformation activities](data-factory-data-transformation-activities.md). In this tutorial, you create a pipeline with one activity (Hive activity). --The following section provides the complete Resource Manager template for defining Data Factory entities so that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is defined, see [Data Factory entities in the template](#data-factory-entities-in-the-template) section. To learn about the JSON syntax and properties for Data Factory resources in a template, see [Microsoft.DataFactory resource types](/azure/templates/microsoft.datafactory/allversions). --## Data Factory JSON template -The top-level Resource Manager template for defining a data factory is: --```json -{ - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", - "contentVersion": "1.0.0.0", - "parameters": { ... - }, - "variables": { ... - }, - "resources": [ - { - "name": "[parameters('dataFactoryName')]", - "apiVersion": "[variables('apiVersion')]", - "type": "Microsoft.DataFactory/factories", - "location": "westus", - "resources": [ - { ... }, - { ... }, - { ... }, - { ... } - ] - } - ] -} -``` -Create a JSON file named **ADFTutorialARM.json** in **C:\ADFGetStarted** folder with the following content: --```json -{ - "contentVersion": "1.0.0.0", - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", - "parameters": { - "storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage account that contains the input/output data." } }, - "storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure storage account." } }, - "blobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in the Azure Storage account." } }, - "inputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob container that has the input file." } }, - "inputBlobName": { "type": "string", "metadata": { "description": "Name of the input file/blob." } }, - "outputBlobFolder": { "type": "string", "metadata": { "description": "The folder in the blob container that will hold the transformed data." } }, - "hiveScriptFolder": { "type": "string", "metadata": { "description": "The folder in the blob container that contains the Hive query file." } }, - "hiveScriptFile": { "type": "string", "metadata": { "description": "Name of the hive query (HQL) file." 
} } - }, - "variables": { - "dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]", - "azureStorageLinkedServiceName": "AzureStorageLinkedService", - "hdInsightOnDemandLinkedServiceName": "HDInsightOnDemandLinkedService", - "blobInputDatasetName": "AzureBlobInput", - "blobOutputDatasetName": "AzureBlobOutput", - "pipelineName": "HiveTransformPipeline" - }, - "resources": [ - { - "name": "[variables('dataFactoryName')]", - "apiVersion": "2015-10-01", - "type": "Microsoft.DataFactory/factories", - "location": "West US", - "resources": [ - { - "type": "linkedservices", - "name": "[variables('azureStorageLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureStorage", - "description": "Azure Storage linked service", - "typeProperties": { - "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]" - } - } - }, - { - "type": "linkedservices", - "name": "[variables('hdInsightOnDemandLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]" - } - } - }, - { - "type": "datasets", - "name": "[variables('blobInputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "typeProperties": { - "fileName": "[parameters('inputBlobName')]", - "folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'))]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true - } - }, - { - "type": "datasets", - "name": "[variables('blobOutputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "typeProperties": { - "folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } - }, - { - "type": "datapipelines", - "name": "[variables('pipelineName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]", - "[variables('hdInsightOnDemandLinkedServiceName')]", - "[variables('blobInputDatasetName')]", - "[variables('blobOutputDatasetName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "description": "Pipeline that transforms data using Hive script.", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]", - "scriptLinkedService": "[variables('azureStorageLinkedServiceName')]", - "defines": { - "inputtable": "[concat('wasb://', parameters('blobContainer'), '@', 
parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]", - "partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]" - } - }, - "inputs": [ - { - "name": "[variables('blobInputDatasetName')]" - } - ], - "outputs": [ - { - "name": "[variables('blobOutputDatasetName')]" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]" - } - ], - "start": "2017-07-01T00:00:00Z", - "end": "2017-07-02T00:00:00Z", - "isPaused": false - } - } - ] - } - ] -} -``` --> [!NOTE] -> You can find another example of Resource Manager template for creating an Azure data factory on [Tutorial: Create a pipeline with Copy Activity using an Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md). -> -> --## Parameters JSON -Create a JSON file named **ADFTutorialARM-Parameters.json** that contains parameters for the Azure Resource Manager template. --> [!IMPORTANT] -> Specify the name and key of your Azure Storage account for the **storageAccountName** and **storageAccountKey** parameters in this parameter file. -> -> --```json -{ - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#", - "contentVersion": "1.0.0.0", - "parameters": { - "storageAccountName": { - "value": "<Name of your Azure Storage account>" - }, - "storageAccountKey": { - "value": "<Key of your Azure Storage account>" - }, - "blobContainer": { - "value": "adfgetstarted" - }, - "inputBlobFolder": { - "value": "inputdata" - }, - "inputBlobName": { - "value": "input.log" - }, - "outputBlobFolder": { - "value": "partitioneddata" - }, - "hiveScriptFolder": { - "value": "script" - }, - "hiveScriptFile": { - "value": "partitionweblogs.hql" - } - } -} -``` --> [!IMPORTANT] -> You may have separate parameter JSON files for development, testing, and production environments that you can use with the same Data Factory JSON template. By using a Power Shell script, you can automate deploying Data Factory entities in these environments. --## Create data factory --1. Start **Azure PowerShell** and run the following command: -- * Run the following command and enter the user name and password that you use to sign in to the Azure portal. -- ```powershell - Connect-AzAccount - ``` -- * Run the following command to view all the subscriptions for this account. -- ```powershell - Get-AzSubscription - ``` -- * Run the following command to select the subscription that you want to work with. This subscription should be the same as the one you used in the Azure portal. -- ```powershell - Get-AzSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzContext - ``` --2. Run the following command to deploy Data Factory entities using the Resource Manager template you created in Step 1. -- ```powershell - New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFTutorialARM-Parameters.json - ``` --## Monitor pipeline --1. After logging in to the [Azure portal](https://portal.azure.com/), Click **Browse** and select **Data factories**. 
- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-arm/BrowseDataFactories.png" alt-text="Browse->Data factories"::: -2. In the **Data Factories** blade, click the data factory (**TutorialFactoryARM**) you created. -3. In the **Data Factory** blade for your data factory, click **Diagram**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-arm/DiagramTile.png" alt-text="Diagram Tile"::: -4. In the **Diagram View**, you see an overview of the pipelines and datasets used in this tutorial. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-arm/DiagramView.png" alt-text="Diagram View"::: -5. In the Diagram View, double-click the dataset **AzureBlobOutput**. You see the slice that is currently being processed. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-arm/AzureBlobOutput.png" alt-text="Screenshot that shows the AzureBlobOutput dataset."::: -6. When processing is done, you see the slice in the **Ready** state. Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take **approximately 30 minutes** to process the slice. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-arm/SliceReady.png" alt-text="Dataset"::: -7. When the slice is in the **Ready** state, check the **partitioneddata** folder in the **adfgetstarted** container in your blob storage for the output data. --See [Monitor datasets and pipeline](data-factory-monitor-manage-pipelines.md) for instructions on how to use the Azure portal blades to monitor the pipeline and datasets you have created in this tutorial. --You can also use the Monitor and Manage app to monitor your data pipelines. See [Monitor and manage Azure Data Factory pipelines using Monitoring App](data-factory-monitor-manage-app.md) for details about using the application. --> [!IMPORTANT] -> The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container. -> -> --## Data Factory entities in the template -### Define data factory -You define a data factory in the Resource Manager template as shown in the following sample: --```json -"resources": [ - { - "name": "[variables('dataFactoryName')]", - "apiVersion": "2015-10-01", - "type": "Microsoft.DataFactory/factories", - "location": "West US" - } -] -``` --The dataFactoryName is defined as: --```json -"dataFactoryName": "[concat('HiveTransformDF', uniqueString(resourceGroup().id))]", -``` -It is a unique string based on the resource group ID. --### Defining Data Factory entities -The following Data Factory entities are defined in the JSON template: --* [Azure Storage linked service](#azure-storage-linked-service) -* [HDInsight on-demand linked service](#hdinsight-on-demand-linked-service) -* [Azure blob input dataset](#azure-blob-input-dataset) -* [Azure blob output dataset](#azure-blob-output-dataset) -* [Data pipeline with a Hive activity](#data-pipeline) --#### Azure Storage linked service -You specify the name and key of your Azure storage account in this section. See [Azure Storage linked service](data-factory-azure-blob-connector.md#azure-storage-linked-service) for details about the JSON properties used to define an Azure Storage linked service.
--```json -{ - "type": "linkedservices", - "name": "[variables('azureStorageLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureStorage", - "description": "Azure Storage linked service", - "typeProperties": { - "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]" - } - } -} -``` -The **connectionString** uses the storageAccountName and storageAccountKey parameters. The values for these parameters passed by using a configuration file. The definition also uses variables: azureStorageLinkedService and dataFactoryName defined in the template. --#### HDInsight on-demand linked service -See [Compute linked services](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) article for details about JSON properties used to define an HDInsight on-demand linked service. --```json -{ - "type": "linkedservices", - "name": "[variables('hdInsightOnDemandLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]" - } - } -} -``` --Note the following points: --* The Data Factory creates a **Linux-based** HDInsight cluster for you with the above JSON. See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. -* You could use **your own HDInsight cluster** instead of using an on-demand HDInsight cluster. See [HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) for details. -* The HDInsight cluster creates a **default container** in the blob storage you specified in the JSON (**linkedServiceName**). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice needs to be processed unless there is an existing live cluster (**timeToLive**) and is deleted when the processing is done. -- As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adf**yourdatafactoryname**-**linkedservicename**-datetimestamp". Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your Azure blob storage. --See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. --#### Azure blob input dataset -You specify the names of blob container, folder, and file that contains the input data. See [Azure Blob dataset properties](data-factory-azure-blob-connector.md#dataset-properties) for details about JSON properties used to define an Azure Blob dataset. 
--```json -{ - "type": "datasets", - "name": "[variables('blobInputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "typeProperties": { - "fileName": "[parameters('inputBlobName')]", - "folderPath": "[concat(parameters('blobContainer'), '/', parameters('inputBlobFolder'))]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true - } -} -``` -This definition uses the following parameters defined in parameter template: blobContainer, inputBlobFolder, and inputBlobName. --#### Azure Blob output dataset -You specify the names of blob container and folder that holds the output data. See [Azure Blob dataset properties](data-factory-azure-blob-connector.md#dataset-properties) for details about JSON properties used to define an Azure Blob dataset. --```json -{ - "type": "datasets", - "name": "[variables('blobOutputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "typeProperties": { - "folderPath": "[concat(parameters('blobContainer'), '/', parameters('outputBlobFolder'))]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } -} -``` --This definition uses the following parameters defined in the parameter template: blobContainer and outputBlobFolder. --#### Data pipeline -You define a pipeline that transform data by running Hive script on an on-demand Azure HDInsight cluster. See [Pipeline JSON](data-factory-create-pipelines.md#pipeline-json) for descriptions of JSON elements used to define a pipeline in this example. 
--```json -{ - "type": "datapipelines", - "name": "[variables('pipelineName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]", - "[variables('hdInsightOnDemandLinkedServiceName')]", - "[variables('blobInputDatasetName')]", - "[variables('blobOutputDatasetName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "description": "Pipeline that transforms data using Hive script.", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "[concat(parameters('blobContainer'), '/', parameters('hiveScriptFolder'), '/', parameters('hiveScriptFile'))]", - "scriptLinkedService": "[variables('azureStorageLinkedServiceName')]", - "defines": { - "inputtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('inputBlobFolder'))]", - "partitionedtable": "[concat('wasb://', parameters('blobContainer'), '@', parameters('storageAccountName'), '.blob.core.windows.net/', parameters('outputBlobFolder'))]" - } - }, - "inputs": [ - { - "name": "[variables('blobInputDatasetName')]" - } - ], - "outputs": [ - { - "name": "[variables('blobOutputDatasetName')]" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "[variables('hdInsightOnDemandLinkedServiceName')]" - } - ], - "start": "2017-07-01T00:00:00Z", - "end": "2017-07-02T00:00:00Z", - "isPaused": false - } -} -``` --## Reuse the template -In the tutorial, you created a template for defining Data Factory entities and a template for passing values for parameters. To use the same template to deploy Data Factory entities to different environments, you create a parameter file for each environment and use it when deploying to that environment. --Example: --```powershell -New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Dev.json --New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Test.json --New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFTutorialARM.json -TemplateParameterFile ADFTutorialARM-Parameters-Production.json -``` --Notice that the first command uses parameter file for the development environment, second one for the test environment, and the third one for the production environment. --You can also reuse the template to perform repeated tasks. For example, you need to create many data factories with one or more pipelines that implement the same logic but each data factory uses different Azure storage and Azure SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or production) with different parameter files to create data factories. --## Resource Manager template for creating a gateway -Here is a sample Resource Manager template for creating a logical gateway in the back. Install a gateway on your on-premises computer or Azure IaaS VM and register the gateway with Data Factory service using a key. See [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) for details. 
--```json -{ - "contentVersion": "1.0.0.0", - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", - "parameters": { - }, - "variables": { - "dataFactoryName": "GatewayUsingArmDF", - "apiVersion": "2015-10-01", - "singleQuote": "'" - }, - "resources": [ - { - "name": "[variables('dataFactoryName')]", - "apiVersion": "[variables('apiVersion')]", - "type": "Microsoft.DataFactory/factories", - "location": "eastus", - "resources": [ - { - "dependsOn": [ "[concat('Microsoft.DataFactory/factories/', variables('dataFactoryName'))]" ], - "type": "gateways", - "apiVersion": "[variables('apiVersion')]", - "name": "GatewayUsingARM", - "properties": { - "description": "my gateway" - } - } - ] - } - ] -} -``` -This template creates a data factory named GatewayUsingArmDF with a gateway named: GatewayUsingARM. --## See Also --| Topic | Description | -|: |: | -| [Pipelines](data-factory-create-pipelines.md) |This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. | -| [Datasets](data-factory-create-datasets.md) |This article helps you understand datasets in Azure Data Factory. | -| [Scheduling and execution](data-factory-scheduling-and-execution.md) |This article explains the scheduling and execution aspects of Azure Data Factory application model. | -| [Monitor and manage pipelines using Monitoring App](data-factory-monitor-manage-app.md) |This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. | |
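The "Reuse the template" section above deploys the same template with per-environment parameter files (for example, ADFTutorialARM-Parameters-Dev.json), but only the base parameter file is shown earlier. As a sketch, a development parameter file keeps the same shape as ADFTutorialARM-Parameters.json and swaps in environment-specific values; the storage account name, key, and container below are placeholders:

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "storageAccountName": { "value": "<name of your dev Azure Storage account>" },
        "storageAccountKey": { "value": "<key of your dev Azure Storage account>" },
        "blobContainer": { "value": "adfgetstarted-dev" },
        "inputBlobFolder": { "value": "inputdata" },
        "inputBlobName": { "value": "input.log" },
        "outputBlobFolder": { "value": "partitioneddata" },
        "hiveScriptFolder": { "value": "script" },
        "hiveScriptFile": { "value": "partitionweblogs.hql" }
    }
}
```

The test and production parameter files follow the same pattern with their own storage accounts and containers, so the Data Factory JSON template itself never changes between environments.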
data-factory | Data Factory Build Your First Pipeline Using Editor | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-editor.md | - Title: Build your first data factory (Azure portal) -description: In this tutorial, you create a sample Azure Data Factory pipeline by using the Data Factory Editor in the Azure portal. ------ Previously updated : 04/12/2023---# Tutorial: Build your first data factory by using the Azure portal -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Azure Resource Manager template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) ---> [!NOTE] -> This article applies to version 1 of Azure Data Factory, which is generally available. If you use the current version of the Data Factory service, see [Quickstart: Create a data factory by using Data Factory](../quickstart-create-data-factory-dot-net.md). --> [!WARNING] -> The JSON editor in Azure Portal for authoring & deploying ADF v1 pipelines will be turned OFF on 31st July 2019. After 31st July 2019, you can continue to use [ADF v1 PowerShell cmdlets](/powershell/module/az.datafactory/), [ADF v1 .NET SDK](/dotnet/api/microsoft.azure.management.datafactories.models), [ADF v1 REST APIs](/rest/api/datafactory/) to author & deploy your ADF v1 pipelines. --In this article, you learn how to use the [Azure portal](https://portal.azure.com/) to create your first data factory. To do the tutorial by using other tools/SDKs, select one of the options from the drop-down list. --The pipeline in this tutorial has one activity: an Azure HDInsight Hive activity. This activity runs a Hive script on an HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. --> [!NOTE] -> The data pipeline in this tutorial transforms input data to produce output data. For a tutorial on how to copy data by using Data Factory, see [Tutorial: Copy data from Azure Blob storage to Azure SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -> -> A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [Scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --## Prerequisites -Read [Tutorial overview](data-factory-build-your-first-pipeline.md), and follow the steps in the "Prerequisites" section. --This article doesn't provide a conceptual overview of the Data Factory service. For more information about the service, read [Introduction to Azure Data Factory](data-factory-introduction.md). --## Create a data factory -A data factory can have one or more pipelines. A pipeline can have one or more activities in it. An example is a Copy activity that copies data from a source to a destination data store. Another example is an HDInsight Hive activity that runs a Hive script to transform input data to produce output data. --To create a data factory, follow these steps: --1. Sign in to the [Azure portal](https://portal.azure.com/). --1. 
Select **New** > **Data + Analytics** > **Data Factory**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/create-blade.png" alt-text="Create blade"::: --1. On the **New data factory** blade, under **Name**, enter **GetStartedDF**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/new-data-factory-blade.png" alt-text="New data factory blade"::: -- > [!IMPORTANT] - > The name of the data factory must be globally unique. If you receive the error "Data factory name GetStartedDF is not available," change the name of the data factory. For example, use yournameGetStartedDF, and create the data factory again. For more information on naming rules, see [Data Factory: Naming rules](data-factory-naming-rules.md). - > - > The name of the data factory might be registered as a DNS name in the future, and it might become publicly visible. - > - > -1. Under **Subscription**, select the Azure subscription where you want the data factory to be created. --1. Select an existing resource group, or create a resource group. For the tutorial, create a resource group named **ADFGetStartedRG**. --1. Under **Location**, select a location for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list. --1. Select the **Pin to dashboard** check box. --1. Select **Create**. -- > [!IMPORTANT] - > To create Data Factory instances, you must be a member of the [Data Factory contributor](../../role-based-access-control/built-in-roles.md#data-factory-contributor) role at the subscription/resource group level. - > - > -1. On the dashboard, you see the following tile with the status **Deploying Data Factory**: -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/creating-data-factory-image.png" alt-text="Deploying Data Factory status"::: --1. After the data factory is created, you see the **Data factory** page, which shows you the contents of the data factory. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/data-factory-blade.png" alt-text="Data factory blade"::: --Before you create a pipeline in the data factory, you need to create a few data factory entities first. You first create linked services to link data stores/computes to your data store. Then you define input and output datasets to represent input/output data in linked data stores. Finally, you create the pipeline with an activity that uses these datasets. --## Create linked services -In this step, you link your Azure Storage account and an on-demand HDInsight cluster to your data factory. The storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what [data store](data-factory-data-movement-activities.md)/[compute services](data-factory-compute-linked-services.md) are used in your scenario. Then link those services to the data factory by creating linked services. --### Create a Storage linked service -In this step, you link your storage account to your data factory. In this tutorial, you use the same storage account to store input/output data and the HQL script file. --1. On the **Data factory** blade for **GetStartedDF**, select **Author and deploy**. You see the Data Factory Editor. 
-- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/data-factory-author-deploy.png" alt-text="Author and deploy tile"::: --1. Select **New data store**, and choose **Azure Storage**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/new-data-store-azure-storage-menu.png" alt-text="New data store blade"::: --1. You see the JSON script for creating a Storage linked service in the editor. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/azure-storage-linked-service.png" alt-text="Storage linked service"::: --1. Replace **account name** with the name of your storage account. Replace **account key** with the access key of the storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). --1. Select **Deploy** on the command bar to deploy the linked service. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/deploy-button.png" alt-text="Deploy button"::: -- After the linked service is deployed successfully, the Draft-1 window disappears. You see **AzureStorageLinkedService** in the tree view on the left. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/StorageLinkedServiceInTree.png" alt-text="AzureStorageLinkedService"::: --### Create an HDInsight linked service -In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime. After it's done processing and idle for the specified amount of time, it's deleted. --1. In the Data Factory Editor, select **More** > **New compute** > **On-demand HDInsight cluster**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/new-compute-menu.png" alt-text="New compute"::: --1. Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties that are used to create the HDInsight cluster on demand. -- ```JSON - { - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "AzureStorageLinkedService" - } - } - } - ``` -- The following table provides descriptions for the JSON properties used in the snippet. -- | Property | Description | - |: |: | - | clusterSize |Specifies the size of the HDInsight cluster. | - | timeToLive | Specifies the idle time for the HDInsight cluster before it's deleted. | - | linkedServiceName | Specifies the storage account that is used to store the logs that are generated by HDInsight. | -- Note the following points: -- a. The data factory creates a Linux-based HDInsight cluster for you with the JSON properties. For more information, see [On-demand HDInsight linked service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service). -- b. You can use your own HDInsight cluster instead of using an on-demand HDInsight cluster. For more information, see [HDInsight linked service](data-factory-compute-linked-services.md#azure-hdinsight-linked-service). -- c. The HDInsight cluster creates a default container in the blob storage you specified in the JSON property (**linkedServiceName**). HDInsight doesn't delete this container when the cluster is deleted. This behavior is by design. 
With on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (**timeToLive**). The cluster is automatically deleted when the processing is finished. -- As more slices are processed, you see many containers in your blob storage. If you don't need them for troubleshooting of the jobs, you might want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adf**yourdatafactoryname**-**linkedservicename**-datetimestamp." Use tools such as [Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your blob storage. -- For more information, see [On-demand HDInsight linked service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service). --1. Select **Deploy** on the command bar to deploy the linked service. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/ondemand-hdinsight-deploy.png" alt-text="Deploy option"::: --1. Confirm that you see both **AzureStorageLinkedService** and **HDInsightOnDemandLinkedService** in the tree view on the left. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/tree-view-linked-services.png" alt-text="Screenshot that shows that AzureStorageLinkedService and HDInsightOnDemandLinkedService are linked together."::: --## Create datasets -In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to AzureStorageLinkedService that you created previously in this tutorial. The linked service points to a storage account. Datasets specify the container, folder, and file name in the storage that holds input and output data. --### Create the input dataset -1. In the Data Factory Editor, select **More** > **New dataset** > **Azure Blob storage**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/new-data-set.png" alt-text="New dataset"::: --1. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you create a dataset called **AzureBlobInput** that represents input data for an activity in the pipeline. In addition, you specify that the input data is in the blob container called **adfgetstarted** and the folder called **inputdata**. -- ```JSON - { - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } - } - ``` - The following table provides descriptions for the JSON properties used in the snippet. -- | Property | Nested under | Description | - |: |: |: | - | type | properties |The type property is set to **AzureBlob** because data resides in blob storage. | - | linkedServiceName | format |Refers to AzureStorageLinkedService that you created previously. | - | folderPath | typeProperties | Specifies the blob container and the folder that contains input blobs. | - | fileName | typeProperties |This property is optional. If you omit this property, all the files from folderPath are picked. In this tutorial, only the input.log file is processed. | - | type | format |The log files are in text format, so use **TextFormat**. 
| - | columnDelimiter | format |Columns in the log files are delimited by the comma character (`,`). | - | frequency/interval | availability |Frequency is set to **Month** and the interval is **1**, which means that the input slices are available monthly. | - | external | properties | This property is set to **true** if the input data isn't generated by this pipeline. In this tutorial, the input.log file isn't generated by this pipeline, so the property is set to **true**. | -- For more information about these JSON properties, see [Azure Blob connector](data-factory-azure-blob-connector.md#dataset-properties). --1. Select **Deploy** on the command bar to deploy the newly created dataset. You see the dataset in the tree view on the left. --### Create the output dataset -Now, you create the output dataset to represent the output data stored in the blob storage. --1. In the Data Factory Editor, select **More** > **New dataset** > **Azure Blob storage**. --1. Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you create a dataset called **AzureBlobOutput** to specify the structure of the data that is produced by the Hive script. You also specify that the results are stored in the blob container called **adfgetstarted** and the folder called **partitioneddata**. The **availability** section specifies that the output dataset is produced monthly. -- ```JSON - { - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "adfgetstarted/partitioneddata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } - } - ``` - For descriptions of these properties, see the section "Create the input dataset." You do not set the external property on an output dataset because the dataset is produced by the Data Factory service. --1. Select **Deploy** on the command bar to deploy the newly created dataset. --1. Verify that the dataset is created successfully. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/tree-view-data-set.png" alt-text="Tree view with linked services"::: --## Create a pipeline -In this step, you create your first pipeline with an HDInsight Hive activity. The input slice is available monthly (frequency is Month, interval is 1). The output slice is produced monthly. The scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity doesn't produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON snippet are explained at the end of this section. --1. In the Data Factory Editor, select **More** > **New pipeline**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/new-pipeline-button.png" alt-text="New pipeline option"::: --1. Copy and paste the following snippet to the Draft-1 window. -- > [!IMPORTANT] - > Replace **storageaccountname** with the name of your storage account in the JSON snippet. 
- > - > -- ```JSON - { - "name": "MyFirstPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "AzureStorageLinkedService", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - } - ], - "start": "2017-07-01T00:00:00Z", - "end": "2017-07-02T00:00:00Z", - "isPaused": false - } - } - ``` -- In the JSON snippet, you create a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster. -- The Hive script file, **partitionweblogs.hql**, is stored in the storage account, which is specified by scriptLinkedService that is called **AzureStorageLinkedService**. You can find it in the **script** folder in the container **adfgetstarted**. -- The **defines** section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values. Examples are ${hiveconf:inputtable} and ${hiveconf:partitionedtable}. -- The **start** and **end** properties of the pipeline specify the active period of the pipeline. -- In the activity JSON, you specify that the Hive script runs on the compute specified by **linkedServiceName**: **HDInsightOnDemandLinkedService**. -- > [!NOTE] - > For more information about the JSON properties used in the example, see the "Pipeline JSON" section in [Pipelines and activities in Data Factory](data-factory-create-pipelines.md). - > - > -1. Confirm the following: -- a. The **input.log** file exists in the **inputdata** folder of the **adfgetstarted** container in the blob storage. -- b. The **partitionweblogs.hql** file exists in the **script** folder of the **adfgetstarted** container in the blob storage. If you don't see these files, follow the steps in the "Prerequisites" section in [Tutorial overview](data-factory-build-your-first-pipeline.md). -- c. You replaced **storageaccountname** with the name of your storage account in the pipeline JSON. --1. Select **Deploy** on the command bar to deploy the pipeline. Because the **start** and **end** times are set in the past and **isPaused** is set to **false**, the pipeline (activity in the pipeline) runs immediately after you deploy it. --1. Confirm that you see the pipeline in the tree view. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/tree-view-pipeline.png" alt-text="Tree view with pipeline"::: ----## Monitor a pipeline -### Monitor a pipeline by using the Diagram view -1. On the **Data factory** blade, select **Diagram**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/diagram-tile.png" alt-text="Diagram tile"::: --1. In the **Diagram** view, you see an overview of the pipelines and datasets used in this tutorial. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/diagram-view-2.png" alt-text="Diagram view"::: --1. 
To view all activities in the pipeline, right-click the pipeline in the diagram, and select **Open pipeline**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/open-pipeline-menu.png" alt-text="Open pipeline menu"::: --1. Confirm that you see **Hive Activity** in the pipeline. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/open-pipeline-view.png" alt-text="Open pipeline view"::: -- To go back to the previous view, select **Data factory** in the menu at the top. --1. In the **Diagram** view, double-click the dataset **AzureBlobInput**. Confirm that the slice is in the **Ready** state. It might take a couple of minutes for the slice to show up as **Ready**. If it doesn't appear after you wait for some time, see if you have the input file (**input.log**) placed in the right container (**adfgetstarted**) and folder (**inputdata**). -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/input-slice-ready.png" alt-text="Input slice in Ready state"::: --1. Close the **AzureBlobInput** blade. --1. In the **Diagram** view, double-click the dataset **AzureBlobOutput**. You see the slice that is currently being processed. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/dataset-blade.png" alt-text="Dataset processing in progress"::: --1. After the processing is finished, you see the slice in the **Ready** state. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/dataset-slice-ready.png" alt-text="Dataset in Ready state"::: -- > [!IMPORTANT] - > Creation of an on-demand HDInsight cluster usually takes approximately 20 minutes. Expect the pipeline to take approximately 30 minutes to process the slice. - > - > --1. When the slice is in the **Ready** state, check the **partitioneddata** folder in the **adfgetstarted** container in your blob storage for the output data. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/three-ouptut-files.png" alt-text="Output data"::: --1. Select the slice to see more information about it in a **Data slice** blade. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/data-slice-details.png" alt-text="Data slice information"::: --1. In the **Activity runs** list, select an activity run to see more information about it. (In this scenario, it's a Hive activity.) The information appears in an **Activity run details** blade. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/activity-window-blade.png" alt-text="Activity run details window"::: -- From the log files, you can see the Hive query that was executed and the status information. These logs are useful for troubleshooting any issues. - For more information, see [Monitor and manage pipelines by using Azure portal blades](data-factory-monitor-manage-pipelines.md). --> [!IMPORTANT] -> The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (**input.log**) to the **inputdata** folder of the **adfgetstarted** container. -> -> --### Monitor a pipeline by using the Monitor & Manage app -You also can use the Monitor & Manage application to monitor your pipelines. 
For more information about how to use this application, see [Monitor and manage Data Factory pipelines by using the Monitor & Manage app](data-factory-monitor-manage-app.md). --1. Select the **Monitor & Manage** tile on the home page for your data factory. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/monitor-and-manage-tile.png" alt-text="Monitor & Manage tile"::: --1. In the Monitor & Manage application, change the **Start time** and **End time** to match the start and end times of your pipeline. Select **Apply**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/monitor-and-manage-app.png" alt-text="Monitor & Manage app"::: --1. Select an activity window in the **Activity Windows** list to see information about it. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/activity-window-details.png" alt-text="Activity Windows list"::: --## Summary -In this tutorial, you created a data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following: --* Create a data factory. -* Create two linked - * A Storage linked service to link your blob storage that holds input/output files to the data factory. - * An HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Data Factory creates an HDInsight Hadoop cluster just in time to process input data and produce output data. -* Create two datasets, which describe input and output data for an HDInsight Hive activity in the pipeline. -* Create a pipeline with an HDInsight Hive activity. --## Next steps -In this article, you created a pipeline with a transformation activity (HDInsight activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy activity to copy data from blob storage to Azure SQL Database, see [Tutorial: Copy data from Blob storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## See also -| Topic | Description | -|: |: | -| [Pipelines](data-factory-create-pipelines.md) |This article helps you understand pipelines and activities in Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. | -| [Datasets](data-factory-create-datasets.md) |This article helps you understand datasets in Data Factory. | -| [Scheduling and execution](data-factory-scheduling-and-execution.md) |This article explains the scheduling and execution aspects of the Data Factory application model. | -| [Monitor and manage pipelines by using the Monitor & Manage app](data-factory-monitor-manage-app.md) |This article describes how to monitor, manage, and debug pipelines by using the Monitor & Manage app. | |
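The monitoring section above notes that **input.log** is deleted when the slice is processed successfully, so you must upload it again before rerunning the slice or repeating the tutorial. If you prefer to script that step instead of using the portal or Azure Storage Explorer, the following PowerShell sketch shows one way to do it. It is not part of the original tutorial: it assumes the Az.Storage module, the **adfgetstarted** container and **inputdata** folder used above, and a local copy of the file at `C:\ADFGetStarted\input.log` (the local path, account placeholders, and variable names are illustrative).

```powershell
# Minimal sketch: re-upload input.log so the slice can be rerun.
# Assumes the Az.Storage module is installed and the placeholder values are replaced.
$storageAccountName = "<storageaccountname>"        # your storage account name
$storageAccountKey  = "<storageaccountkey>"         # your storage account key
$localInputFile     = "C:\ADFGetStarted\input.log"  # assumed local copy of input.log

$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# Upload (or overwrite) the blob at inputdata/input.log in the adfgetstarted container.
Set-AzStorageBlobContent -File $localInputFile `
    -Container "adfgetstarted" `
    -Blob "inputdata/input.log" `
    -Context $context `
    -Force
```

Because `-Force` overwrites an existing blob without prompting, you can run the same snippet between reruns without any extra cleanup.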
data-factory | Data Factory Build Your First Pipeline Using Powershell | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-powershell.md | - Title: Build your first data factory (PowerShell) -description: In this tutorial, you create a sample Azure Data Factory pipeline using Azure PowerShell. ------- Previously updated : 04/12/2023---# Tutorial: Build your first Azure data factory using Azure PowerShell -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Resource Manager Template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) ---> [!NOTE] -> This article applies to version 1 of Data Factory. Version 1 is in maintenance mode. The document exists for legacy users. If you are using the current version of the Data Factory service, see [Quickstart: Create a data factory using Azure Data Factory](../quickstart-create-data-factory-powershell.md). --In this article, you use Azure PowerShell to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. --The pipeline in this tutorial has one activity: **HDInsight Hive activity**. This activity runs a hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. --> [!NOTE] -> The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -> -> A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --## Prerequisites ---* Read through [Tutorial Overview](data-factory-build-your-first-pipeline.md) article and complete the **prerequisite** steps. -* Follow instructions in [How to install and configure Azure PowerShell](/powershell/azure/) article to install latest version of Azure PowerShell on your computer. -* (optional) This article does not cover all the Data Factory cmdlets. See [Data Factory Cmdlet Reference](/powershell/module/az.datafactory) for comprehensive documentation on Data Factory cmdlets. --## Create data factory -In this step, you use Azure PowerShell to create an Azure Data Factory named **FirstDataFactoryPSH**. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive script to transform input data. Let's start with creating the data factory in this step. --1. Start Azure PowerShell and run the following command. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run these commands again. 
-- * Run the following command and enter the user name and password that you use to sign in to the Azure portal. -- ```powershell - Connect-AzAccount - ``` -- * Run the following command to view all the subscriptions for this account. -- ```powershell - Get-AzSubscription - ``` -- * Run the following command to select the subscription that you want to work with. This subscription should be the same as the one you used in the Azure portal. -- ```powershell - Get-AzSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzContext - ``` --2. Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command: -- ```powershell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` -- Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use it in place of ADFTutorialResourceGroup in this tutorial. --3. Run the **New-AzDataFactory** cmdlet that creates a data factory named **FirstDataFactoryPSH**. -- ```powershell - New-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH -Location "West US" - ``` --Note the following: --* The name of the Azure Data Factory must be globally unique. If you receive the error **Data factory name ΓÇ£FirstDataFactoryPSHΓÇ¥ is not available**, change the name (for example, yournameFirstDataFactoryPSH). Use this name in place of ADFTutorialFactoryPSH while performing steps in this tutorial. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. -* To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription -* The name of the data factory may be registered as a DNS name in the future and hence become publicly visible. -* If you receive the error: "**This subscription is not registered to use namespace Microsoft.DataFactory**", do one of the following and try publishing again: -- * In Azure PowerShell, run the following command to register the Data Factory provider: -- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` -- You can run the following command to confirm that the Data Factory provider is registered: -- ```powershell - Get-AzResourceProvider - ``` -- * Login using the Azure subscription into the [Azure portal](https://portal.azure.com) and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. --Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data store, define input and output datasets to represent input/output data in linked data stores, and then create the pipeline with an activity that uses these datasets. --## Create linked services -In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. Identify what data store/compute services are used in your scenario and link those services to the data factory by creating linked services. --### Create Azure Storage linked service -In this step, you link your Azure Storage account to your data factory. 
You use the same Azure Storage account to store input/output data and the HQL script file. --1. Create a JSON file named StorageLinkedService.json in the C:\ADFGetStarted folder with the following content. Create the folder ADFGetStarted if it does not already exist. -- ```json - { - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "description": "", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } - } - ``` - Replace **account name** with the name of your Azure Storage account and **account key** with the access key of the Azure Storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). -2. In Azure PowerShell, switch to the ADFGetStarted folder. -3. You can use the **New-AzDataFactoryLinkedService** cmdlet that creates a linked service. This cmdlet and other Data Factory cmdlets you use in this tutorial requires you to pass values for the *ResourceGroupName* and *DataFactoryName* parameters. Alternatively, you can use **Get-AzDataFactory** to get a **DataFactory** object and pass the object without typing *ResourceGroupName* and *DataFactoryName* each time you run a cmdlet. Run the following command to assign the output of the **Get-AzDataFactory** cmdlet to a **$df** variable. -- ```powershell - $df = Get-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH - ``` -4. Now, run the **New-AzDataFactoryLinkedService** cmdlet that creates the linked **StorageLinkedService** service. -- ```powershell - New-AzDataFactoryLinkedService $df -File .\StorageLinkedService.json - ``` -- If you hadn't run the **Get-AzDataFactory** cmdlet and assigned the output to the **$df** variable, you would have to specify values for the *ResourceGroupName* and *DataFactoryName* parameters as follows. -- ```powershell - New-AzDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName FirstDataFactoryPSH -File .\StorageLinkedService.json - ``` -- If you close Azure PowerShell in the middle of the tutorial, you have to run the **Get-AzDataFactory** cmdlet next time you start Azure PowerShell to complete the tutorial. --### Create Azure HDInsight linked service -In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See [Compute Linked Services](data-factory-compute-linked-services.md) for details. --1. Create a JSON file named **HDInsightOnDemandLinkedService**.json in the **C:\ADFGetStarted** folder with the following content. -- ```json - { - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "StorageLinkedService" - } - } - } - ``` - The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | clusterSize |Specifies the size of the HDInsight cluster. | - | timeToLive |Specifies that the idle time for the HDInsight cluster, before it is deleted. 
| - | linkedServiceName |Specifies the storage account that is used to store the logs that are generated by HDInsight | -- Note the following points: -- * The Data Factory creates a **Linux-based** HDInsight cluster for you with the JSON. See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. - * You could use **your own HDInsight cluster** instead of using an on-demand HDInsight cluster. See [HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) for details. - * The HDInsight cluster creates a **default container** in the blob storage you specified in the JSON (**linkedServiceName**). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (**timeToLive**). The cluster is automatically deleted when the processing is done. -- As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adf**yourdatafactoryname**-**linkedservicename**-datetimestamp". Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your Azure blob storage. -- See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. -2. Run the **New-AzDataFactoryLinkedService** cmdlet that creates the linked service called HDInsightOnDemandLinkedService. -- ```powershell - New-AzDataFactoryLinkedService $df -File .\HDInsightOnDemandLinkedService.json - ``` --## Create datasets -In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the **StorageLinkedService** you have created earlier in this tutorial. The linked service points to an Azure Storage account and datasets specify container, folder, file name in the storage that holds input and output data. --### Create input dataset -1. Create a JSON file named **InputTable.json** in the **C:\ADFGetStarted** folder with the following content: -- ```json - { - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } - } - ``` - The JSON defines a dataset named **AzureBlobInput**, which represents input data for an activity in the pipeline. In addition, it specifies that the input data is located in the blob container called **adfgetstarted** and the folder called **inputdata**. -- The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | type |The type property is set to AzureBlob because data resides in Azure blob storage. | - | linkedServiceName |refers to the StorageLinkedService you created earlier. | - | fileName |This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only the input.log is processed. 
| - | type |The log files are in text format, so we use TextFormat. | - | columnDelimiter |columns in the log files are delimited by the comma character (,). | - | frequency/interval |frequency set to Month and interval is 1, which means that the input slices are available monthly. | - | external |this property is set to true if the input data is not generated by the Data Factory service. | -2. Run the following command in Azure PowerShell to create the Data Factory dataset: -- ```powershell - New-AzDataFactoryDataset $df -File .\InputTable.json - ``` --### Create output dataset -Now, you create the output dataset to represent the output data stored in the Azure Blob storage. --1. Create a JSON file named **OutputTable.json** in the **C:\ADFGetStarted** folder with the following content: -- ```json - { - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "adfgetstarted/partitioneddata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } - } - ``` -- The JSON defines a dataset named **AzureBlobOutput**, which represents output data for an activity in the pipeline. In addition, it specifies that the results are stored in the blob container called **adfgetstarted** and the folder called **partitioneddata**. The **availability** section specifies that the output dataset is produced on a monthly basis. --2. Run the following command in Azure PowerShell to create the Data Factory dataset: -- ```powershell - New-AzDataFactoryDataset $df -File .\OutputTable.json - ``` --## Create pipeline -In this step, you create your first pipeline with a **HDInsightHive** activity. Input slice is available monthly (frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section. --1. Create a JSON file named MyFirstPipelinePSH.json in the C:\ADFGetStarted folder with the following content: -- > [!IMPORTANT] - > Replace **storageaccountname** with the name of your storage account in the JSON. 
- > - > -- ```json - { - "name": "MyFirstPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - } - ], - "start": "2017-07-01T00:00:00Z", - "end": "2017-07-02T00:00:00Z", - "isPaused": false - } - } - ``` - In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process Data on an HDInsight cluster. -- The Hive script file, **partitionweblogs.hql**, is stored in the Azure Storage account (specified by the scriptLinkedService, called **StorageLinkedService**), and in **script** folder in the container **adfgetstarted**. -- The **defines** section is used to specify the runtime settings that be passed to the hive script as Hive configuration values (e.g ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). -- The **start** and **end** properties of the pipeline specifies the active period of the pipeline. -- In the activity JSON, you specify that the Hive script runs on the compute specified by the **linkedServiceName** - **HDInsightOnDemandLinkedService**. -- > [!NOTE] - > See "Pipeline JSON" in [Pipelines and activities in Azure Data Factory](data-factory-create-pipelines.md) for details about JSON properties that are used in the example. --2. Confirm that you see the **input.log** file in the **adfgetstarted/inputdata** folder in the Azure blob storage, and run the following command to deploy the pipeline. Since the **start** and **end** times are set in the past and **isPaused** is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy. -- ```powershell - New-AzDataFactoryPipeline $df -File .\MyFirstPipelinePSH.json - ``` --3. Congratulations, you have successfully created your first pipeline using Azure PowerShell! --## Monitor pipeline -In this step, you use Azure PowerShell to monitor whatΓÇÖs going on in an Azure data factory. --1. Run **Get-AzDataFactory** and assign the output to a **$df** variable. -- ```powershell - $df = Get-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH - ``` --2. Run **Get-AzDataFactorySlice** to get details about all slices of the **EmpSQLTable**, which is the output table of the pipeline. -- ```powershell - Get-AzDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01 - ``` -- Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. Here is the sample output: -- ```output - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : FirstDataFactoryPSH - DatasetName : AzureBlobOutput - Start : 7/1/2017 12:00:00 AM - End : 7/2/2017 12:00:00 AM - RetryCount : 0 - State : InProgress - SubState : - LatencyStatus : - LongRetryCount : 0 - ``` --3. Run **Get-AzDataFactoryRun** to get the details of activity runs for a specific slice. 
-- ```powershell - Get-AzDataFactoryRun $df -DatasetName AzureBlobOutput -StartDateTime 2017-07-01 - ``` -- Here is the sample output: -- ```output - Id : 0f6334f2-d56c-4d48-b427-d4f0fb4ef883_635268096000000000_635292288000000000_AzureBlobOutput - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : FirstDataFactoryPSH - DatasetName : AzureBlobOutput - ProcessingStartTime : 12/18/2015 4:50:33 AM - ProcessingEndTime : 12/31/9999 11:59:59 PM - PercentComplete : 0 - DataSliceStart : 7/1/2017 12:00:00 AM - DataSliceEnd : 7/2/2017 12:00:00 AM - Status : AllocatingResources - Timestamp : 12/18/2015 4:50:33 AM - RetryAttempt : 0 - Properties : {} - ErrorMessage : - ActivityName : RunSampleHiveActivity - PipelineName : MyFirstPipeline - Type : Script - ``` -- You can keep running this cmdlet until you see the slice in **Ready** state or **Failed** state. When the slice is in Ready state, check the **partitioneddata** folder in the **adfgetstarted** container in your blob storage for the output data. Creation of an on-demand HDInsight cluster usually takes some time. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-powershell/three-ouptut-files.png" alt-text="output data"::: --> [!IMPORTANT] -> Creation of an on-demand HDInsight cluster usually takes sometime (approximately 20 minutes). Therefore, expect the pipeline to take **approximately 30 minutes** to process the slice. -> -> The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container. -> -> --## Summary -In this tutorial, you created an Azure data factory to process data by running Hive script on a HDInsight hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps: --1. Created an Azure **data factory**. -2. Created two **linked services**: - 1. **Azure Storage** linked service to link your Azure blob storage that holds input/output files to the data factory. - 2. **Azure HDInsight** on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data and produce output data. -3. Created two **datasets**, which describe input and output data for HDInsight Hive activity in the pipeline. -4. Created a **pipeline** with a **HDInsight Hive** activity. --## Next steps -In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see [Tutorial: Copy data from an Azure Blob to Azure SQL](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## See Also --| Topic | Description | -|: |: | -| [Data Factory Cmdlet Reference](/powershell/module/az.datafactory) |See comprehensive documentation on Data Factory cmdlets | -| [Pipelines](data-factory-create-pipelines.md) |This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. | -| [Datasets](data-factory-create-datasets.md) |This article helps you understand datasets in Azure Data Factory. 
| -| [Scheduling and Execution](data-factory-scheduling-and-execution.md) |This article explains the scheduling and execution aspects of the Azure Data Factory application model. | -| [Monitor and manage pipelines using Monitoring App](data-factory-monitor-manage-app.md) |This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. | |
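If you'd rather not rerun **Get-AzDataFactorySlice** by hand until the slice reaches the **Ready** or **Failed** state, a small polling loop works too. This is a minimal sketch rather than part of the tutorial: it reuses the `$df` variable set in the monitoring steps above, assumes the same **AzureBlobOutput** dataset and start time, and checks the `State` value shown in the sample output.

```powershell
# Minimal polling sketch: wait until the output slice leaves the in-progress states.
# Assumes $df was set earlier with:
#   $df = Get-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name FirstDataFactoryPSH
$startDateTime = "2017-07-01"   # same start time as in the pipeline JSON

do {
    # Get the current status of the slices for the output dataset.
    $slices = Get-AzDataFactorySlice $df -DatasetName AzureBlobOutput -StartDateTime $startDateTime
    $state  = ($slices | Select-Object -First 1).State
    Write-Host "$(Get-Date -Format T) slice state: $state"

    if ($state -notin @("Ready", "Failed")) {
        # The on-demand HDInsight cluster alone takes roughly 20 minutes to create, so poll slowly.
        Start-Sleep -Seconds 300
    }
} while ($state -notin @("Ready", "Failed"))

Write-Host "Final slice state: $state"
```

A five-minute polling interval is a reasonable default here, because the tutorial expects the whole slice to take about 30 minutes to process.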
data-factory | Data Factory Build Your First Pipeline Using Rest Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-rest-api.md | - Title: Build your first data factory (REST) -description: In this tutorial, you create a sample Azure Data Factory pipeline using Data Factory REST API. ------ Previously updated : 04/12/2023---# Tutorial: Build your first Azure data factory using Data Factory REST API -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Resource Manager Template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) -> -> ---> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Quickstart: Create a data factory using Azure Data Factory](../quickstart-create-data-factory-rest-api.md). --In this article, you use Data Factory REST API to create your first Azure data factory. To do the tutorial using other tools/SDKs, select one of the options from the drop-down list. --The pipeline in this tutorial has one activity: **HDInsight Hive activity**. This activity runs a hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. --> [!NOTE] -> This article does not cover all the REST API. For comprehensive documentation on REST API, see [Data Factory REST API Reference](/rest/api/datafactory/). -> -> A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). ---## Prerequisites ---* Read through [Tutorial Overview](data-factory-build-your-first-pipeline.md) article and complete the **prerequisite** steps. -* Install [Curl](https://curl.haxx.se/dlwiz/) on your machine. You use the CURL tool with REST commands to create a data factory. -* Follow instructions from [this article](../../active-directory/develop/howto-create-service-principal-portal.md) to: - 1. Create a Web application named **ADFGetStartedApp** in Microsoft Entra ID. - 2. Get **client ID** and **secret key**. - 3. Get **tenant ID**. - 4. Assign the **ADFGetStartedApp** application to the **Data Factory Contributor** role. -* Install [Azure PowerShell](/powershell/azure/). -* Launch **PowerShell** and run the following command. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again. - 1. Run **Connect-AzAccount** and enter the user name and password that you use to sign in to the Azure portal. - 2. Run **Get-AzSubscription** to view all the subscriptions for this account. - 3. Run **Get-AzSubscription -SubscriptionName NameOfAzureSubscription | Set-AzContext** to select the subscription that you want to work with. Replace **NameOfAzureSubscription** with the name of your Azure subscription. 
-* Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command in the PowerShell: -- ```powershell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` -- Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. --## Create JSON definitions -Create following JSON files in the folder where curl.exe is located. --### datafactory.json -> [!IMPORTANT] -> Name must be globally unique, so you may want to prefix/suffix ADFCopyTutorialDF to make it a unique name. -> -> --```JSON -{ - "name": "FirstDataFactoryREST", - "location": "WestUS" -} -``` --### azurestoragelinkedservice.json -> [!IMPORTANT] -> Replace **accountname** and **accountkey** with name and key of your Azure storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). -> -> --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### hdinsightondemandlinkedservice.json --```JSON -{ - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "AzureStorageLinkedService" - } - } -} -``` --The following table provides descriptions for the JSON properties used in the snippet: --| Property | Description | -|: |: | -| clusterSize |Size of the HDInsight cluster. | -| timeToLive |Specifies that the idle time for the HDInsight cluster, before it is deleted. | -| linkedServiceName |Specifies the storage account that is used to store the logs that are generated by HDInsight | --Note the following points: --* The Data Factory creates a **Linux-based** HDInsight cluster for you with the above JSON. See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. -* You could use **your own HDInsight cluster** instead of using an on-demand HDInsight cluster. See [HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) for details. -* The HDInsight cluster creates a **default container** in the blob storage you specified in the JSON (**linkedServiceName**). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (**timeToLive**) and is deleted when the processing is done. -- As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adf**yourdatafactoryname**-**linkedservicename**-datetimestamp". Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your Azure blob storage. --See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details. 
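If you want to clean up those leftover `adf<datafactoryname>-<linkedservicename>-<datetimestamp>` containers from PowerShell rather than Storage Explorer, the following sketch shows one approach. It is not part of the tutorial: it assumes the Az.Storage module, and the account placeholders and the `$containerPrefix` value are assumptions that you should adjust to match your storage account and data factory name.

```powershell
# Minimal cleanup sketch: remove leftover containers created by on-demand HDInsight runs.
# Container names follow the pattern "adf<datafactoryname>-<linkedservicename>-<datetimestamp>".
$storageAccountName = "<accountname>"            # same storage account as in azurestoragelinkedservice.json
$storageAccountKey  = "<accountkey>"
$containerPrefix    = "adffirstdatafactoryrest"  # assumption: adjust to "adf<yourdatafactoryname>"

$context = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# List the matching containers first, then remove them.
$leftovers = Get-AzStorageContainer -Context $context -Prefix $containerPrefix
$leftovers | ForEach-Object {
    Write-Host "Removing container $($_.Name)"
    Remove-AzStorageContainer -Name $_.Name -Context $context -Force
}
```

Consider printing `$leftovers` and reviewing the list before running the removal loop, since deleted containers can't be recovered.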
--### inputdataset.json --```JSON -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` --The JSON defines a dataset named **AzureBlobInput**, which represents input data for an activity in the pipeline. In addition, it specifies that the input data is located in the blob container called **adfgetstarted** and the folder called **inputdata**. --The following table provides descriptions for the JSON properties used in the snippet: --| Property | Description | -|: |: | -| type |The type property is set to AzureBlob because data resides in Azure blob storage. | -| linkedServiceName |refers to the StorageLinkedService you created earlier. | -| fileName |This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only the input.log is processed. | -| type |The log files are in text format, so we use TextFormat. | -| columnDelimiter |columns in the log files are delimited by a comma character (,) | -| frequency/interval |frequency set to Month and interval is 1, which means that the input slices are available monthly. | -| external |this property is set to true if the input data is not generated by the Data Factory service. | --### outputdataset.json --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "adfgetstarted/partitioneddata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } -} -``` --The JSON defines a dataset named **AzureBlobOutput**, which represents output data for an activity in the pipeline. In addition, it specifies that the results are stored in the blob container called **adfgetstarted** and the folder called **partitioneddata**. The **availability** section specifies that the output dataset is produced on a monthly basis. --### pipeline.json -> [!IMPORTANT] -> Replace **storageaccountname** with name of your Azure storage account. -> -> --```JSON -{ - "name": "MyFirstPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [{ - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "AzureStorageLinkedService", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>t.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - }], - "start": "2017-07-10T00:00:00Z", - "end": "2017-07-11T00:00:00Z", - "isPaused": false - } -} -``` --In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on a HDInsight cluster. 
--The Hive script file, **partitionweblogs.hql**, is stored in the Azure storage account (specified by the scriptLinkedService, called **StorageLinkedService**), and in **script** folder in the container **adfgetstarted**. --The **defines** section specifies runtime settings that are passed to the hive script as Hive configuration values (e.g ${hiveconf:inputtable}, ${hiveconf:partitionedtable}). --The **start** and **end** properties of the pipeline specifies the active period of the pipeline. --In the activity JSON, you specify that the Hive script runs on the compute specified by the **linkedServiceName** - **HDInsightOnDemandLinkedService**. --> [!NOTE] -> See "Pipeline JSON" in [Pipelines and activities in Azure Data Factory](data-factory-create-pipelines.md) for details about JSON properties used in the preceding example. -> -> --## Set global variables -In Azure PowerShell, execute the following commands after replacing the values with your own: --> [!IMPORTANT] -> See [Prerequisites](#prerequisites) section for instructions on getting client ID, client secret, tenant ID, and subscription ID. -> -> --```powershell -$client_id = "<client ID of application in AAD>" -$client_secret = "<client key of application in AAD>" -$tenant = "<Azure tenant ID>"; -$subscription_id="<Azure subscription ID>"; --$rg = "ADFTutorialResourceGroup" -$adf = "FirstDataFactoryREST" -``` ---<a name='authenticate-with-aad'></a> --## Authenticate with Microsoft Entra ID --```powershell -$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret }; -$responseToken = Invoke-Command -scriptblock $cmd; -$accessToken = (ConvertFrom-Json $responseToken).access_token; --(ConvertFrom-Json $responseToken) -``` ---## Create data factory -In this step, you create an Azure Data Factory named **FirstDataFactoryREST**. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive script to transform data. Run the following commands to create the data factory: --1. Assign the command to variable named **cmd**. -- Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in the **datafactory.json**. -- ```powershell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@datafactory.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/FirstDataFactoryREST?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` --Note the following points: --* The name of the Azure Data Factory must be globally unique. If you see the error in results: **Data factory name ΓÇ£FirstDataFactoryRESTΓÇ¥ is not available**, do the following steps: - 1. Change the name (for example, yournameFirstDataFactoryREST) in the **datafactory.json** file. 
See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. - 2. In the first command where the **$cmd** variable is assigned a value, replace FirstDataFactoryREST with the new name and run the command. - 3. Run the next two commands to invoke the REST API to create the data factory and print the results of the operation. -* To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription -* The name of the data factory may be registered as a DNS name in the future and hence become publicly visible. -* If you receive the error: "**This subscription is not registered to use namespace Microsoft.DataFactory**", do one of the following and try publishing again: -- * In Azure PowerShell, run the following command to register the Data Factory provider: -- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` -- You can run the following command to confirm that the Data Factory provider is registered: -- ```powershell - Get-AzResourceProvider - ``` -- * Login using the Azure subscription into the [Azure portal](https://portal.azure.com) and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. --Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link data stores/computes to your data store, define input and output datasets to represent data in linked data stores. --## Create linked services -In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run a Hive script specified in the activity of the pipeline in this sample. --### Create Azure Storage linked service -In this step, you link your Azure Storage account to your data factory. With this tutorial, you use the same Azure Storage account to store input/output data and the HQL script file. --1. Assign the command to variable named **cmd**. -- ```powershell - $cmd = { .\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data ΓÇ£@azurestoragelinkedservice.jsonΓÇ¥ https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01 }; - ``` --2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` --3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` --### Create Azure HDInsight linked service -In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and idle for the specified amount of time. You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See [Compute Linked Services](data-factory-compute-linked-services.md) for details. --1. Assign the command to variable named **cmd**. 
-- ```powershell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@hdinsightondemandlinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/hdinsightondemandlinkedservice?api-version=2015-10-01}; - ``` --2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` --3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` --## Create datasets -In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the **StorageLinkedService** you have created earlier in this tutorial. The linked service points to an Azure Storage account and datasets specify container, folder, file name in the storage that holds input and output data. --### Create input dataset -In this step, you create the input dataset to represent input data stored in the Azure Blob storage. --1. Assign the command to variable named **cmd**. -- ```powershell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` --### Create output dataset -In this step, you create the output dataset to represent output data stored in the Azure Blob storage. --1. Assign the command to variable named **cmd**. -- ```powershell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobOutput?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` --## Create pipeline -In this step, you create your first pipeline with a **HDInsightHive** activity. Input slice is available monthly (frequency: Month, interval: 1), output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. --Confirm that you see the **input.log** file in the **adfgetstarted/inputdata** folder in the Azure blob storage, and run the following command to deploy the pipeline. 
Since the **start** and **end** times are set in the past and **isPaused** is set to false, the pipeline (activity in the pipeline) runs immediately after you deploy. --1. Assign the command to variable named **cmd**. -- ```powershell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/MyFirstPipeline?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```powershell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. -- ```powershell - Write-Host $results - ``` -4. Congratulations, you have successfully created your first pipeline using Azure PowerShell! --## Monitor pipeline -In this step, you use Data Factory REST API to monitor slices being produced by the pipeline. --```powershell -$ds ="AzureBlobOutput" --$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=1970-01-01T00%3a00%3a00.0000000Z"&"end=2016-08-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01}; --$results2 = Invoke-Command -scriptblock $cmd; --IF ((ConvertFrom-Json $results2).value -ne $NULL) { - ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table -} else { - (convertFrom-Json $results2).RemoteException -} -``` --> [!IMPORTANT] -> Creation of an on-demand HDInsight cluster usually takes sometime (approximately 20 minutes). Therefore, expect the pipeline to take **approximately 30 minutes** to process the slice. -> -> --Run the Invoke-Command and the next one until you see the slice in **Ready** state or **Failed** state. When the slice is in Ready state, check the **partitioneddata** folder in the **adfgetstarted** container in your blob storage for the output data. The creation of an on-demand HDInsight cluster usually takes some time. ---> [!IMPORTANT] -> The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container. -> -> --You can also use Azure portal to monitor slices and troubleshoot any issues. See [Monitor pipelines using Azure portal](data-factory-monitor-manage-pipelines.md) details. --## Summary -In this tutorial, you created an Azure data factory to process data by running Hive script on a HDInsight hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps: --1. Created an Azure **data factory**. -2. Created two **linked services**: - 1. **Azure Storage** linked service to link your Azure blob storage that holds input/output files to the data factory. - 2. **Azure HDInsight** on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data and produce output data. -3. Created two **datasets**, which describe input and output data for HDInsight Hive activity in the pipeline. -4. Created a **pipeline** with a **HDInsight Hive** activity. 
--## Next steps -In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand Azure HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see [Tutorial: Copy data from an Azure Blob to Azure SQL](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## See Also -| Topic | Description | -|: |: | -| [Data Factory REST API Reference](/rest/api/datafactory/) |See comprehensive documentation on the Data Factory REST API operations. | -| [Pipelines](data-factory-create-pipelines.md) |This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. | -| [Datasets](data-factory-create-datasets.md) |This article helps you understand datasets in Azure Data Factory. | -| [Scheduling and Execution](data-factory-scheduling-and-execution.md) |This article explains the scheduling and execution aspects of the Azure Data Factory application model. | -| [Monitor and manage pipelines using Monitoring App](data-factory-monitor-manage-app.md) |This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. | |
data-factory | Data Factory Build Your First Pipeline Using Vs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline-using-vs.md | - Title: Build your first data factory (Visual Studio) -description: In this tutorial, you create a sample Azure Data Factory pipeline using Visual Studio. ------- Previously updated : 04/12/2023---# Tutorial: Create a data factory by using Visual Studio -> [!div class="op_single_selector" title="Tools/SDKs"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Resource Manager Template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) ---> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Quickstart: Create a data factory using Azure Data Factory](../quickstart-create-data-factory-dot-net.md). --This tutorial shows you how to create an Azure Data Factory by using Visual Studio. You create a Visual Studio project using the Data Factory project template, define Data Factory entities (linked services, datasets, and pipeline) in JSON format, and then publish/deploy these entities to the cloud. --The pipeline in this tutorial has one activity: **HDInsight Hive activity**. This activity runs a hive script on an Azure HDInsight cluster that transforms input data to produce output data. The pipeline is scheduled to run once a month between the specified start and end times. --> [!NOTE] -> This tutorial does not show how copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure Data Factory, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -> -> A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). ---## Walkthrough: Create and publish Data Factory entities -Here are the steps you perform as part of this walkthrough: --1. Create two linked - - In this tutorial, both input and output data for the hive activity are in the same Azure Blob Storage. You use an on-demand HDInsight cluster to process existing input data to produce output data. The on-demand HDInsight cluster is automatically created for you by Azure Data Factory at run time when the input data is ready to be processed. You need to link your data stores or computes to your data factory so that the Data Factory service can connect to them at runtime. Therefore, you link your Azure Storage Account to the data factory by using the AzureStorageLinkedService1, and link an on-demand HDInsight cluster by using the HDInsightOnDemandLinkedService1. When publishing, you specify the name for the data factory to be created or an existing data factory. -2. Create two datasets: **InputDataset** and **OutputDataset**, which represent the input/output data that is stored in the Azure blob storage. - - These dataset definitions refer to the Azure Storage linked service you created in the previous step. 
For the InputDataset, you specify the blob container (adfgetstarted) and the folder (inptutdata) that contains a blob with the input data. For the OutputDataset, you specify the blob container (adfgetstarted) and the folder (partitioneddata) that holds the output data. You also specify other properties such as structure, availability, and policy. -3. Create a pipeline named **MyFirstPipeline**. - - In this walkthrough, the pipeline has only one activity: **HDInsight Hive Activity**. This activity transform input data to produce output data by running a hive script on an on-demand HDInsight cluster. To learn more about hive activity, see [Hive Activity](data-factory-hive-activity.md) -4. Create a data factory named **DataFactoryUsingVS**. Deploy the data factory and all Data Factory entities (linked services, tables, and the pipeline). -5. After you publish, you use Azure portal blades and Monitoring & Management App to monitor the pipeline. - -### Prerequisites ---1. Read through [Tutorial Overview](data-factory-build-your-first-pipeline.md) article and complete the **prerequisite** steps. You can also select the **Overview and prerequisites** option in the drop-down list at the top to switch to the article. After you complete the prerequisites, switch back to this article by selecting **Visual Studio** option in the drop-down list. -2. To create Data Factory instances, you must be a member of the [Data Factory Contributor](../../role-based-access-control/built-in-roles.md#data-factory-contributor) role at the subscription/resource group level. -3. You must have the following installed on your computer: - * Visual Studio 2013 or Visual Studio 2015 - * Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to [Azure Download Page](https://azure.microsoft.com/downloads/) and click **VS 2013** or **VS 2015** in the **.NET** section. - * Download the latest Azure Data Factory plugin for Visual Studio: [VS 2013](https://visualstudiogallery.msdn.microsoft.com/754d998c-8f92-4aa7-835b-e89c8c954aa5) or [VS 2015](https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005). You can also update the plugin by doing the following steps: On the menu, click **Tools** -> **Extensions and Updates** -> **Online** -> **Visual Studio Gallery** -> **Microsoft Azure Data Factory Tools for Visual Studio** -> **Update**. --Now, let's use Visual Studio to create an Azure Data Factory. --### Create Visual Studio project -1. Launch **Visual Studio 2013** or **Visual Studio 2015**. Click **File**, point to **New**, and click **Project**. You should see the **New Project** dialog box. -2. In the **New Project** dialog, select the **DataFactory** template, and click **Empty Data Factory Project**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/new-project-dialog.png" alt-text="New project dialog box"::: -3. Enter a **name** for the project, **location**, and a name for the **solution**, and click **OK**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/solution-explorer.png" alt-text="Solution Explorer"::: --### Create linked services -In this step, you create two linked --The Azure Storage linked service links your Azure Storage account to the data factory by providing the connection information. Data Factory service uses the connection string from the linked service setting to connect to the Azure storage at runtime. 
This storage holds input and output data for the pipeline, and the hive script file used by the hive activity. --With on-demand HDInsight linked service, The HDInsight cluster is automatically created at runtime when the input data is ready to processed. The cluster is deleted after it is done processing and idle for the specified amount of time. --> [!NOTE] -> You create a data factory by specifying its name and settings at the time of publishing your Data Factory solution. --#### Create Azure Storage linked service -1. Right-click **Linked Services** in the solution explorer, point to **Add**, and click **New Item**. -2. In the **Add New Item** dialog box, select **Azure Storage Linked Service** from the list, and click **Add**. - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/new-azure-storage-linked-service.png" alt-text="Screenshot that highlights Azure Storage Linked Service in the list."::: -3. Replace `<accountname>` and `<accountkey>` with the name of your Azure storage account and its key. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/azure-storage-linked-service.png" alt-text="Azure Storage Linked Service"::: -4. Save the **AzureStorageLinkedService1.json** file. --#### Create Azure HDInsight linked service -1. In the **Solution Explorer**, right-click **Linked Services**, point to **Add**, and click **New Item**. -2. Select **HDInsight On Demand Linked Service**, and click **Add**. -3. Replace the **JSON** with the following JSON: -- ```json - { - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "AzureStorageLinkedService1" - } - } - } - ``` -- The following table provides descriptions for the JSON properties used in the snippet: -- Property | Description - -- | -- - ClusterSize | Specifies the size of the HDInsight Hadoop cluster. - TimeToLive | Specifies that the idle time for the HDInsight cluster, before it is deleted. - linkedServiceName | Specifies the storage account that is used to store the logs that are generated by HDInsight Hadoop cluster. -- > [!IMPORTANT] - > The HDInsight cluster creates a **default container** in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With on-demand HDInsight linked service, a HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done. - > - > As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: `adf<yourdatafactoryname>-<linkedservicename>-datetimestamp`. Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your Azure blob storage. -- For more information about JSON properties, see [Compute linked services](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) article. -4. Save the **HDInsightOnDemandLinkedService1.json** file. 
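The Azure Storage linked service above needs your storage account name and key. If you prefer not to copy the key from the portal, the following is a minimal sketch of retrieving it with Azure PowerShell; it assumes the Az module is installed and that the resource group and account names shown are replaced with your own.

```powershell
# Sign in and look up the storage account key referenced by AzureStorageLinkedService1.json.
Connect-AzAccount
$keys = Get-AzStorageAccountKey -ResourceGroupName "myResourceGroup" -Name "mystorageaccount"
# Build the connection string that replaces the <accountname>/<accountkey> placeholders.
$connectionString = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=$($keys[0].Value)"
Write-Host $connectionString
```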
--### Create datasets -In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the **AzureStorageLinkedService1** you have created earlier in this tutorial. The linked service points to an Azure Storage account and datasets specify container, folder, file name in the storage that holds input and output data. --#### Create input dataset -1. In the **Solution Explorer**, right-click **Tables**, point to **Add**, and click **New Item**. -2. Select **Azure Blob** from the list, change the name of the file to **InputDataSet.json**, and click **Add**. -3. Replace the **JSON** in the editor with the following JSON snippet: -- ```json - { - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService1", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } - } - ``` - This JSON snippet defines a dataset called **AzureBlobInput** that represents input data for the hive activity in the pipeline. You specify that the input data is located in the blob container called `adfgetstarted` and the folder called `inputdata`. -- The following table provides descriptions for the JSON properties used in the snippet: -- Property | Description | - -- | -- | - type |The type property is set to **AzureBlob** because data resides in Azure Blob Storage. - linkedServiceName | Refers to the AzureStorageLinkedService1 you created earlier. - fileName |This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only the input.log is processed. - type | The log files are in text format, so we use TextFormat. | - columnDelimiter | columns in the log files are delimited by the comma character (`,`) - frequency/interval | frequency set to Month and interval is 1, which means that the input slices are available monthly. - external | This property is set to true if the input data for the activity is not generated by the pipeline. This property is only specified on input datasets. For the input dataset of the first activity, always set it to true. -4. Save the **InputDataset.json** file. --#### Create output dataset -Now, you create the output dataset to represent output data stored in the Azure Blob storage. --1. In the **Solution Explorer**, right-click **tables**, point to **Add**, and click **New Item**. -2. Select **Azure Blob** from the list, change the name of the file to **OutputDataset.json**, and click **Add**. -3. Replace the **JSON** in the editor with the following JSON: - - ```json - { - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService1", - "typeProperties": { - "folderPath": "adfgetstarted/partitioneddata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - } - } - } - ``` - The JSON snippet defines a dataset called **AzureBlobOutput** that represents output data produced by the hive activity in the pipeline. You specify that the output data is produced by the hive activity is placed in the blob container called `adfgetstarted` and the folder called `partitioneddata`. - - The **availability** section specifies that the output dataset is produced on a monthly basis. 
The output dataset drives the schedule of the pipeline. The pipeline runs monthly between its start and end times. -- See **Create the input dataset** section for descriptions of these properties. You do not set the external property on an output dataset as the dataset is produced by the pipeline. -4. Save the **OutputDataset.json** file. --### Create pipeline -You have created the Azure Storage linked service, and input and output datasets so far. Now, you create a pipeline with a **HDInsightHive** activity. The **input** for the hive activity is set to **AzureBlobInput** and **output** is set to **AzureBlobOutput**. A slice of an input dataset is available monthly (frequency: Month, interval: 1), and the output slice is produced monthly too. --1. In the **Solution Explorer**, right-click **Pipelines**, point to **Add**, and click **New Item.** -2. Select **Hive Transformation Pipeline** from the list, and click **Add**. -3. Replace the **JSON** with the following snippet: -- > [!IMPORTANT] - > Replace `<storageaccountname>` with the name of your storage account. -- ```json - { - "name": "MyFirstPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "AzureStorageLinkedService1", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - } - ], - "start": "2016-04-01T00:00:00Z", - "end": "2016-04-02T00:00:00Z", - "isPaused": false - } - } - ``` -- > [!IMPORTANT] - > Replace `<storageaccountname>` with the name of your storage account. -- The JSON snippet defines a pipeline that consists of a single activity (Hive Activity). This activity runs a Hive script to process input data on an on-demand HDInsight cluster to produce output data. In the activities section of the pipeline JSON, you see only one activity in the array with type set to **HDInsightHive**. -- In the type properties that are specific to HDInsight Hive activity, you specify what Azure Storage linked service has the hive script file, the path to the script file, and parameters to the script file. -- The Hive script file, **partitionweblogs.hql**, is stored in the Azure storage account (specified by the scriptLinkedService), and in the `script` folder in the container `adfgetstarted`. -- The `defines` section is used to specify the runtime settings that are passed to the hive script as Hive configuration values (e.g `${hiveconf:inputtable}`, `${hiveconf:partitionedtable})`. -- The **start** and **end** properties of the pipeline specifies the active period of the pipeline. You configured the dataset to be produced monthly, therefore, only once slice is produced by the pipeline (because the month is same in start and end dates). -- In the activity JSON, you specify that the Hive script runs on the compute specified by the **linkedServiceName** - **HDInsightOnDemandLinkedService**. -4. Save the **HiveActivity1.json** file. 
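Before publishing, you can optionally substitute the `<storageaccountname>` placeholder from a script and confirm the pipeline JSON still parses. The following is a minimal sketch, assuming the file is named **HiveActivity1.json** in the current folder and that the storage account name shown is replaced with your own.

```powershell
# Replace the <storageaccountname> placeholder in the pipeline JSON and verify it parses.
$storageAccount = "mystorageaccount"     # your storage account name
$path = ".\HiveActivity1.json"           # pipeline file in the Visual Studio project folder
(Get-Content $path -Raw) -replace '<storageaccountname>', $storageAccount | Set-Content $path
Get-Content $path -Raw | ConvertFrom-Json | Out-Null   # throws an error if the JSON is malformed
```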
--### Add partitionweblogs.hql and input.log as a dependency -1. Right-click **Dependencies** in the **Solution Explorer** window, point to **Add**, and click **Existing Item**. -2. Navigate to the **C:\ADFGettingStarted** folder, select the **partitionweblogs.hql** and **input.log** files, and click **Add**. You created these two files as part of the prerequisites from the [Tutorial Overview](data-factory-build-your-first-pipeline.md). --When you publish the solution in the next step, the **partitionweblogs.hql** file is uploaded to the **script** folder in the `adfgetstarted` blob container. --### Publish/deploy Data Factory entities -In this step, you publish the Data Factory entities (linked services, datasets, and pipeline) in your project to the Azure Data Factory service. In the process of publishing, you specify the name for your data factory. --1. Right-click the project in the Solution Explorer, and click **Publish**. -2. If you see the **Sign in to your Microsoft account** dialog box, enter your credentials for the account that has your Azure subscription, and click **Sign in**. -3. You should see the following dialog box: -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/publish.png" alt-text="Publish dialog box"::: -4. In the **Configure data factory** page, do the following steps: -- :::image type="content" source="media/data-factory-build-your-first-pipeline-using-vs/publish-new-data-factory.png" alt-text="Publish - New data factory settings"::: -- 1. Select the **Create New Data Factory** option. - 2. Enter a unique **name** for the data factory. For example: **DataFactoryUsingVS09152016**. The name must be globally unique. - 3. Select the right subscription for the **Subscription** field. - > [!IMPORTANT] - > If you do not see any subscription, ensure that you logged in using an account that is an admin or co-admin of the subscription. - 4. Select the **resource group** for the data factory to be created. - 5. Select the **region** for the data factory. - 6. Click **Next** to switch to the **Publish Items** page. (Press **TAB** to move out of the Name field if the **Next** button is disabled.) - > [!IMPORTANT] - > If you receive the error **Data factory name "DataFactoryUsingVS" is not available** when publishing, change the name (for example, yournameDataFactoryUsingVS). See the [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. --5. In the **Publish Items** page, ensure that all the Data Factory entities are selected, and click **Next** to switch to the **Summary** page. -- :::image type="content" source="media/data-factory-build-your-first-pipeline-using-vs/publish-items-page.png" alt-text="Publish items page"::: -6. Review the summary and click **Next** to start the deployment process and view the **Deployment Status**. -- :::image type="content" source="media/data-factory-build-your-first-pipeline-using-vs/summary-page.png" alt-text="Summary page"::: -7. In the **Deployment Status** page, you should see the status of the deployment process. Click **Finish** after the deployment is done. --Important points to note: --- If you receive the error: **This subscription is not registered to use namespace Microsoft.DataFactory**, do one of the following and try publishing again:-- - In Azure PowerShell, run the following command to register the Data Factory provider.
-- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` -- You can run the following command to confirm that the Data Factory provider is registered. -- ```powershell - Get-AzResourceProvider - ``` -- - Login using the Azure subscription in to the [Azure portal](https://portal.azure.com) and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. --- The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.--- To create Data Factory instances, you need to be an admin or co-admin of the Azure subscription--### Monitor pipeline --In this step, you monitor the pipeline using Diagram View of the data factory. --#### Monitor pipeline using Diagram View -1. Log in to the [Azure portal](https://portal.azure.com/), do the following steps: - 1. Click **More services** and click **Data factories**. - - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/browse-datafactories.png" alt-text="Browse data factories"::: - 2. Select the name of your data factory (for example: **DataFactoryUsingVS09152016**) from the list of data factories. - - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/select-first-data-factory.png" alt-text="Select your data factory"::: -2. In the home page for your data factory, click **Diagram**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/diagram-tile.png" alt-text="Diagram tile"::: -3. In the Diagram View, you see an overview of the pipelines, and datasets used in this tutorial. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/diagram-view-2.png" alt-text="Diagram View"::: -4. To view all activities in the pipeline, right-click pipeline in the diagram and click Open Pipeline. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/open-pipeline-menu.png" alt-text="Open pipeline menu"::: -5. Confirm that you see the HDInsightHive activity in the pipeline. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/open-pipeline-view.png" alt-text="Open pipeline view"::: -- To navigate back to the previous view, click **Data factory** in the breadcrumb menu at the top. -6. In the **Diagram View**, double-click the dataset **AzureBlobInput**. Confirm that the slice is in **Ready** state. It may take a couple of minutes for the slice to show up in Ready state. If it does not happen after you wait for sometime, see if you have the input file (input.log) placed in the right container (`adfgetstarted`) and folder (`inputdata`). And, make sure that the **external** property on the input dataset is set to **true**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/input-slice-ready.png" alt-text="Input slice in ready state"::: -7. Click **X** to close **AzureBlobInput** blade. -8. In the **Diagram View**, double-click the dataset **AzureBlobOutput**. You see that the slice that is currently being processed. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/dataset-blade.png" alt-text="Screenshot that highlights the AzureBlobOutput dataset."::: -9. When processing is done, you see the slice in **Ready** state. 
-- > [!IMPORTANT] - > Creation of an on-demand HDInsight cluster usually takes sometime (approximately 20 minutes). Therefore, expect the pipeline to take **approximately 30 minutes** to process the slice. - - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/dataset-slice-ready.png" alt-text="Dataset"::: -10. When the slice is in **Ready** state, check the `partitioneddata` folder in the `adfgetstarted` container in your blob storage for the output data. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/three-ouptut-files.png" alt-text="output data"::: -11. Click the slice to see details about it in a **Data slice** blade. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/data-slice-details.png" alt-text="Data slice details"::: -12. Click an activity run in the **Activity runs list** to see details about an activity run (Hive activity in our scenario) in an **Activity run details** window. - - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/activity-window-blade.png" alt-text="Activity run details"::: -- From the log files, you can see the Hive query that was executed and status information. These logs are useful for troubleshooting any issues. --See [Monitor datasets and pipeline](data-factory-monitor-manage-pipelines.md) for instructions on how to use the Azure portal to monitor the pipeline and datasets you have created in this tutorial. --#### Monitor pipeline using Monitor & Manage App -You can also use Monitor & Manage application to monitor your pipelines. For detailed information about using this application, see [Monitor and manage Azure Data Factory pipelines using Monitoring and Management App](data-factory-monitor-manage-app.md). --1. Click Monitor & Manage tile. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/monitor-and-manage-tile.png" alt-text="Monitor & Manage tile"::: -2. You should see Monitor & Manage application. Change the **Start time** and **End time** to match start (04-01-2016 12:00 AM) and end times (04-02-2016 12:00 AM) of your pipeline, and click **Apply**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/monitor-and-manage-app.png" alt-text="Monitor & Manage App"::: -3. To see details about an activity window, select it in the **Activity Windows list** to see details about it. - :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/activity-window-details.png" alt-text="Activity window details"::: --> [!IMPORTANT] -> The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the `inputdata` folder of the `adfgetstarted` container. --### Additional notes -- A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive script to transform input data. See [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) for all the sources and sinks supported by the Copy Activity. 
See [compute linked services](data-factory-compute-linked-services.md) for the list of compute services supported by Data Factory.-- Linked services link data stores or compute services to an Azure Data Factory. See [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) for all the sources and sinks supported by the Copy Activity. See [compute linked services](data-factory-compute-linked-services.md) for the list of compute services supported by Data Factory and [transformation activities](data-factory-data-transformation-activities.md) that can run on them.-- See [Move data from/to Azure Blob](data-factory-azure-blob-connector.md#azure-storage-linked-service) for details about JSON properties used in the Azure Storage linked service definition.-- You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. See [Compute Linked Services](data-factory-compute-linked-services.md) for details.-- Data Factory creates a **Linux-based** HDInsight cluster for you with the preceding JSON. See [On-demand HDInsight Linked Service](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details.-- The HDInsight cluster creates a **default container** in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.- - As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: `adf<yourdatafactoryname>-<linkedservicename>-datetimestamp`. Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in your Azure blob storage. -- Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. -- This tutorial does not show how to copy data by using Azure Data Factory. For a tutorial on how to copy data using Azure Data Factory, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md).---## Use Server Explorer to view data factories -1. In **Visual Studio**, click **View** on the menu, and click **Server Explorer**. -2. In the Server Explorer window, expand **Azure** and expand **Data Factory**. If you see **Sign in to Visual Studio**, enter the **account** associated with your Azure subscription, and click **Continue**. Enter your **password**, and click **Sign in**. Visual Studio tries to get information about all Azure data factories in your subscription. You see the status of this operation in the **Data Factory Task List** window. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/server-explorer.png" alt-text="Server Explorer"::: -3. You can right-click a data factory, and select **Export Data Factory to New Project** to create a Visual Studio project based on an existing data factory.
-- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/export-data-factory-menu.png" alt-text="Export data factory"::: --## Update Data Factory tools for Visual Studio -To update Azure Data Factory tools for Visual Studio, do the following steps: --1. Click **Tools** on the menu and select **Extensions and Updates**. -2. Select **Updates** in the left pane and then select **Visual Studio Gallery**. -3. Select **Azure Data Factory tools for Visual Studio** and click **Update**. If you do not see this entry, you already have the latest version of the tools. --## Use configuration files -You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines differently for each environment. --Consider the following JSON definition for an Azure Storage linked service. To specify **connectionString** with different values for accountname and accountkey based on the environment (Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by using separate configuration file for each environment. --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "description": "", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### Add a configuration file -Add a configuration file for each environment by performing the following steps: --1. Right-click the Data Factory project in your Visual Studio solution, point to **Add**, and click **New item**. -2. Select **Config** from the list of installed templates on the left, select **Configuration File**, enter a **name** for the configuration file, and click **Add**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/add-config-file.png" alt-text="Add configuration file"::: -3. Add configuration parameters and their values in the following format: -- ```json - { - "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json", - "AzureStorageLinkedService1": [ - { - "name": "$.properties.typeProperties.connectionString", - "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - ], - "AzureSqlLinkedService1": [ - { - "name": "$.properties.typeProperties.connectionString", - "value": "Server=tcp:<logical SQL server name>.database.windows.net,1433;Database=<Azure Sql database>;User ID=<user name>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - ] - } - ``` -- This example configures connectionString property of an Azure Storage linked service and an Azure SQL linked service. Notice that the syntax for specifying name is [JsonPath](https://goessner.net/articles/JsonPath/). 
-- If JSON has a property that has an array of values as shown in the following code: -- ```json - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ] - ``` -- Configure properties as shown in the following configuration file (use zero-based indexing): -- ```json - { - "name": "$.properties.structure[0].name", - "value": "FirstName" - }, - { - "name": "$.properties.structure[0].type", - "value": "String" - }, - { - "name": "$.properties.structure[1].name", - "value": "LastName" - }, - { - "name": "$.properties.structure[1].type", - "value": "String" - } - ``` --### Property names with spaces --If a property name has spaces in it, use square brackets as shown in the following example (Database server name): --```json -{ - "name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']", - "value": "MyAsqlServer.database.windows.net" -} -``` --### Deploy solution using a configuration -When you are publishing Azure Data Factory entities in VS, you can specify the configuration that you want to use for that publishing operation. --To publish entities in an Azure Data Factory project using configuration file: --1. Right-click Data Factory project and click **Publish** to see the **Publish Items** dialog box. -2. Select an existing data factory or specify values for creating a data factory on the **Configure data factory** page, and click **Next**. -3. On the **Publish Items** page: you see a drop-down list with available configurations for the **Select Deployment Config** field. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/select-config-file.png" alt-text="Select config file"::: -4. Select the **configuration file** that you would like to use and click **Next**. -5. Confirm that you see the name of JSON file in the **Summary** page and click **Next**. -6. Click **Finish** after the deployment operation is finished. --When you deploy, the values from the configuration file are used to set values for properties in the JSON files before the entities are deployed to Azure Data Factory service. --## Use Azure Key Vault -It is not advisable and often against security policy to commit sensitive data such as connection strings to the code repository. See [ADF Secure Publish](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ADFSecurePublish) sample on GitHub to learn about storing sensitive information in Azure Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual Studio allows the secrets to be stored in Key Vault and only references to them are specified in linked services/ deployment configurations. These references are resolved when you publish Data Factory entities to Azure. These files can then be committed to source repository without exposing any secrets. --## Summary -In this tutorial, you created an Azure Data Factory to process data by running Hive script on a HDInsight hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps: --1. Created an Azure **data factory**. -2. Created two **linked services**: - 1. **Azure Storage** linked service to link your Azure blob storage that holds input/output files to the data factory. - 2. **Azure HDInsight** on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates a HDInsight Hadoop cluster just-in-time to process input data and produce output data. 
-3. Created two **datasets**, which describe input and output data for HDInsight Hive activity in the pipeline. -4. Created a **pipeline** with a **HDInsight Hive** activity. --## Next steps -In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see [Tutorial: Copy data from an Azure blob to Azure SQL](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See [Scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md) for detailed information. ---## See Also --| Topic | Description | -|: |: | -| [Pipelines](data-factory-create-pipelines.md) |This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct data-driven workflows for your scenario or business. | -| [Datasets](data-factory-create-datasets.md) |This article helps you understand datasets in Azure Data Factory. | -| [Data Transformation Activities](data-factory-data-transformation-activities.md) |This article provides a list of data transformation activities (such as HDInsight Hive transformation you used in this tutorial) supported by Azure Data Factory. | -| [Scheduling and execution](data-factory-scheduling-and-execution.md) |This article explains the scheduling and execution aspects of Azure Data Factory application model. | -| [Monitor and manage pipelines using Monitoring App](data-factory-monitor-manage-app.md) |This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. | |
data-factory | Data Factory Build Your First Pipeline | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-build-your-first-pipeline.md | - Title: 'Data Factory tutorial: First data pipeline ' -description: This Azure Data Factory tutorial shows you how to create and schedule a data factory that processes data using a Hive script on a Hadoop cluster. ------ Previously updated : 04/12/2023---# Tutorial: Build your first pipeline to transform data using Hadoop cluster -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-build-your-first-pipeline.md) -> * [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) -> * [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md) -> * [Resource Manager template](data-factory-build-your-first-pipeline-using-arm.md) -> * [REST API](data-factory-build-your-first-pipeline-using-rest-api.md) ---> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Quickstart: Create a data factory using Azure Data Factory](../quickstart-create-data-factory-dot-net.md). --In this tutorial, you build your first Azure data factory with a data pipeline. The pipeline transforms input data by running a Hive script on an Azure HDInsight (Hadoop) cluster to produce output data. --This article provides an overview and the prerequisites for the tutorial. After you complete the prerequisites, you can do the tutorial using one of the following tools/SDKs: Visual Studio, PowerShell, a Resource Manager template, or the REST API. Select one of the options in the drop-down list at the beginning of this article (or the links at the end) to do the tutorial using one of these options. --## Tutorial overview -In this tutorial, you perform the following steps: --1. Create a **data factory**. A data factory can contain one or more data pipelines that move and transform data. -- In this tutorial, you create one pipeline in the data factory. -2. Create a **pipeline**. A pipeline can have one or more activities (examples: Copy Activity, HDInsight Hive Activity). This sample uses the HDInsight Hive activity that runs a Hive script on an HDInsight Hadoop cluster. The script first creates a table that references the raw web log data stored in Azure blob storage and then partitions the raw data by year and month. -- In this tutorial, the pipeline uses the Hive Activity to transform data by running a Hive query on an Azure HDInsight Hadoop cluster. -3. Create **linked services**. You create a linked service to link a data store or a compute service to the data factory. A data store such as Azure Storage holds input/output data of activities in the pipeline. A compute service such as an HDInsight Hadoop cluster processes/transforms data. -- In this tutorial, you create two linked services. -4. Create input and output **datasets**. An input dataset represents the input for an activity in the pipeline and an output dataset represents the output for the activity. -- In this tutorial, the input and output datasets specify locations of input and output data in Azure Blob Storage. The Azure Storage linked service specifies what Azure Storage account is used. An input dataset specifies where the input files are located and an output dataset specifies where the output files are placed. ---See the [Introduction to Azure Data Factory](data-factory-introduction.md) article for a detailed overview of Azure Data Factory.
--Here is the **diagram view** of the sample data factory you build in this tutorial. **MyFirstPipeline** has one activity of type Hive that consumes **AzureBlobInput** dataset as an input and produces **AzureBlobOutput** dataset as an output. ----In this tutorial, **inputdata** folder of the **adfgetstarted** Azure blob container contains one file named input.log. This log file has entries from three months: January, February, and March of 2016. Here are the sample rows for each month in the input file. --``` -2016-01-01,02:01:09,SAMPLEWEBSITE,GET,/blogposts/mvc4/step2.png,X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,53175,871 -2016-02-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,30184,871 -2016-03-01,02:01:10,SAMPLEWEBSITE,GET,/blogposts/mvc4/step7.png,X-ARR-LOG-ID=d7472a26-431a-4a4d-99eb-c7b4fda2cf4c,80,-,1.54.23.196,Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36,-,http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx,\N,200,0,0,30184,871 -``` --When the file is processed by the pipeline with HDInsight Hive Activity, the activity runs a Hive script on the HDInsight cluster that partitions input data by year and month. The script creates three output folders that contain a file with entries from each month. --``` -adfgetstarted/partitioneddata/year=2016/month=1/000000_0 -adfgetstarted/partitioneddata/year=2016/month=2/000000_0 -adfgetstarted/partitioneddata/year=2016/month=3/000000_0 -``` --From the sample lines shown above, the first one (with 2016-01-01) is written to the 000000_0 file in the month=1 folder. Similarly, the second one is written to the file in the month=2 folder and the third one is written to the file in the month=3 folder. --## Prerequisites -Before you begin this tutorial, you must have the following prerequisites: --1. **Azure subscription** - If you don't have an Azure subscription, you can create a free trial account in just a couple of minutes. See the [Free Trial](https://azure.microsoft.com/pricing/free-trial/) article on how you can obtain a free trial account. -2. **Azure Storage** - You use an Azure storage account for storing the data in this tutorial. If you don't have an Azure storage account, see the [Create a storage account](../../storage/common/storage-account-create.md) article. After you have created the storage account, note down the **account name** and **access key**. For information about how to retrieve the storage account access keys, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). -3. Download and review the Hive query file (**HQL**) located at: ```https://adftutorialfiles.blob.core.windows.net/hivetutorial/partitionweblogs.hql```. This query transforms input data to produce output data. -4. 
Download and review the sample input file (**input.log**) located at: ```https://adftutorialfiles.blob.core.windows.net/hivetutorial/input.log``` -5. Create a blob container named **adfgetstarted** in your Azure Blob Storage. -6. Upload the **partitionweblogs.hql** file to the **script** folder in the **adfgetstarted** container. Use tools such as [Microsoft Azure Storage Explorer](https://storageexplorer.com/), or script the upload with Azure PowerShell as shown in the sketch at the end of this article. -7. Upload the **input.log** file to the **inputdata** folder in the **adfgetstarted** container. --After you complete the prerequisites, select one of the following tools/SDKs to do the tutorial: --- [Visual Studio](data-factory-build-your-first-pipeline-using-vs.md)-- [PowerShell](data-factory-build-your-first-pipeline-using-powershell.md)-- [Resource Manager template](data-factory-build-your-first-pipeline-using-arm.md)-- [REST API](data-factory-build-your-first-pipeline-using-rest-api.md)--Visual Studio provides a GUI way of building your data factories, whereas the PowerShell, Resource Manager template, and REST API options provide a scripting/programming way of building them. --> [!NOTE] -> The data pipeline in this tutorial transforms input data to produce output data. It does not copy data from a source data store to a destination data store. For a tutorial on how to copy data using Azure Data Factory, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -> -> You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See [Scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md) for detailed information. |
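If you prefer to script prerequisite steps 5-7 instead of using a GUI tool, the following is a minimal sketch that uses Azure PowerShell. It assumes the Az module is installed, that the HQL and log files are in the current folder, and that the storage account name and key shown are replaced with your own.

```powershell
# Create the adfgetstarted container and upload the Hive script and sample input file.
$ctx = New-AzStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey "<account key>"
New-AzStorageContainer -Name "adfgetstarted" -Context $ctx
Set-AzStorageBlobContent -File ".\partitionweblogs.hql" -Container "adfgetstarted" -Blob "script/partitionweblogs.hql" -Context $ctx
Set-AzStorageBlobContent -File ".\input.log" -Container "adfgetstarted" -Blob "inputdata/input.log" -Context $ctx
```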
data-factory | Data Factory Compute Linked Services | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-compute-linked-services.md | - Title: Compute environments supported by Azure Data Factory version 1 -description: Learn about compute environments that you can use in Azure Data Factory pipelines (such as Azure HDInsight) to transform or process data. ------ Previously updated : 04/12/2023---# Compute environments supported by Azure Data Factory version 1 ---> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see [Compute linked services in](../compute-linked-services.md). --This article explains the compute environments that you can use to process or transform data. It also provides details about different configurations (on-demand versus bring-your-own) that Data Factory supports when you configure linked services that link these compute environments to an Azure data factory. --The following table provides a list of compute environments that are supported by Data Factory, and the activities that can run on them. --| Compute environment | Activities | -| - | - | -| [On-demand Azure HDInsight cluster](#azure-hdinsight-on-demand-linked-service) or [your own HDInsight cluster](#azure-hdinsight-linked-service) | [DotNet](data-factory-use-custom-activities.md), [Hive](data-factory-hive-activity.md), [Pig](data-factory-pig-activity.md), [MapReduce](data-factory-map-reduce.md), [Hadoop Streaming](data-factory-hadoop-streaming-activity.md) | -| [Azure Batch](#azure-batch-linked-service) | [DotNet](data-factory-use-custom-activities.md) | -| [ML Studio (classic)](#ml-studio-classic-linked-service) | [Studio (classic) activities: Batch Execution and Update Resource](data-factory-azure-ml-batch-execution-activity.md) | -| [Azure Data Lake Analytics](#azure-data-lake-analytics-linked-service) | [Data Lake Analytics U-SQL](data-factory-usql-activity.md) | -| [Azure SQL](#azure-sql-linked-service), [Azure Synapse Analytics](#azure-synapse-analytics-linked-service), [SQL Server](#sql-server-linked-service) | [Stored Procedure Activity](data-factory-stored-proc-activity.md) | --## <a name="supported-hdinsight-versions-in-azure-data-factory"></a>HDInsight versions supported in Data Factory -Azure HDInsight supports multiple Hadoop cluster versions that you can deploy at any time. Each supported version creates a specific version of the Hortonworks Data Platform (HDP) distribution and a set of components in the distribution. --Microsoft updates the list of supported HDInsight versions with the latest Hadoop ecosystem components and fixes. For detailed information, see [Supported HDInsight versions](../../hdinsight/hdinsight-component-versioning.md#supported-hdinsight-versions). --> [!IMPORTANT] -> Linux-based HDInsight version 3.3 was retired July 31, 2017. Data Factory version 1 on-demand HDInsight linked services customers were given until December 15, 2017, to test and upgrade to a later version of HDInsight. Windows-based HDInsight will be retired July 31, 2018. -> -> --### After the retirement date --After December 15, 2017: --- You can no longer create Linux-based HDInsight version 3.3 (or earlier versions) clusters by using an on-demand HDInsight linked service in Data Factory version 1. 
-- If the [**osType** and **Version** properties](#azure-hdinsight-on-demand-linked-service) are not explicitly specified in the JSON definition for an existing Data Factory version 1 on-demand HDInsight linked service, the default value is changed from **Version=3.1, osType=Windows** to **Version=\<latest HDI default version\>, osType=Linux**.--After July 31, 2018: --- You can no longer create any version of Windows-based HDInsight clusters by using an on-demand HDInsight linked service in Data Factory version 1. --### Recommended actions --- To ensure that you can use the latest Hadoop ecosystem components and fixes, update the [**osType** and **Version** properties](#azure-hdinsight-on-demand-linked-service) of affected Data Factory version 1 on-demand HDInsight linked service definitions to newer Linux-based HDInsight versions (HDInsight 3.6). -- Before December 15, 2017, test Data Factory version 1 Hive, Pig, MapReduce, and Hadoop streaming activities that reference the affected linked service. Ensure that they are compatible with the new **osType** and **Version** default values (**Version=3.6**, **osType=Linux**) or the explicit HDInsight version and OS type that you are upgrading to. - To learn more about compatibility, see [Migrate from a Windows-based HDInsight cluster to a Linux-based cluster](../../hdinsight/index.yml) and [What are the Hadoop components and versions available with HDInsight?](../../hdinsight/hdinsight-component-versioning.md). -- To continue using a Data Factory version 1 on-demand HDInsight linked service to create Windows-based HDInsight clusters, explicitly set **osType** to **Windows** before December 15, 2017. We recommend that you migrate to Linux-based HDInsight clusters before July 31, 2018. -- If you are using an on-demand HDInsight linked service to execute a Data Factory version 1 DotNet Custom Activity, update the DotNet Custom Activity JSON definition to instead use an Azure Batch linked service. For more information, see [Use custom activities in a Data Factory pipeline](./data-factory-use-custom-activities.md). --> [!Note] -> If you use your existing, bring-your-own cluster HDInsight linked service in Data Factory version 1, or a bring-your-own or on-demand HDInsight linked service in Azure Data Factory, no action is required. In those scenarios, the latest version support policy of HDInsight clusters is already enforced. -> -> ---## On-demand compute environment -In an on-demand configuration, Data Factory fully manages the compute environment. Data Factory automatically creates the compute environment before a job is submitted for processing data. When the job is finished, Data Factory removes the compute environment. --You can create a linked service for an on-demand compute environment. Use the linked service to configure the compute environment, and to control granular settings for job execution, cluster management, and bootstrapping actions. --> [!NOTE] -> Currently, the on-demand configuration is supported only for HDInsight clusters. -> --## Azure HDInsight on-demand linked service -Data Factory can automatically create a Windows-based or Linux-based on-demand HDInsight cluster for processing data. The cluster is created in the same region as the storage account that's associated with the cluster. Use the JSON **linkedServiceName** property to create the cluster. --Note the following *key* points about the on-demand HDInsight linked service: --* The on-demand HDInsight cluster doesn't appear in your Azure subscription.
The Data Factory service manages the on-demand HDInsight cluster on your behalf. -* The logs for jobs that are run on an on-demand HDInsight cluster are copied to the storage account that's associated with the HDInsight cluster. To access these logs, in the Azure portal, go to the **Activity Run Details** pane. For more information, see [Monitor and manage pipelines](data-factory-monitor-manage-pipelines.md). -* You are charged only for the time that the HDInsight cluster is up and running jobs. --> [!IMPORTANT] -> It typically takes *20 minutes* or more to provision an on-demand HDInsight cluster. -> -> --### Example -The following JSON defines a Linux-based on-demand HDInsight linked service. Data Factory automatically creates a *Linux-based* HDInsight cluster when it processes a data slice. --```json -{ - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.6", - "osType": "Linux", - "clusterSize": 1, - "timeToLive": "00:05:00", - "linkedServiceName": "AzureStorageLinkedService" - } - } -} -``` --> [!IMPORTANT] -> The HDInsight cluster creates a *default container* in the Azure Blob storage that you specify in the JSON **linkedServiceName** property. By design, HDInsight doesn't delete this container when the cluster is deleted. In an on-demand HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed, unless there's an existing live cluster (**timeToLive**). The cluster is deleted when processing is finished. -> -> As more slices are processed, you see many containers in your Blob storage. If you don't need the containers for troubleshooting jobs, you might want to delete the containers to reduce the storage cost. The names of these containers follow a pattern: `adf<your Data Factory name>-<linked service name>-<date and time>`. You can use a tool like [Microsoft Azure Storage Explorer](https://storageexplorer.com/) to delete containers in Blob storage. -> -> --### Properties -| Property | Description | Required | -| - | - | -- | -| type | Set the type property to **HDInsightOnDemand**. | Yes | -| clusterSize | The number of worker and data nodes in the cluster. The HDInsight cluster is created with 2 head nodes, in addition to the number of worker nodes that you specify for this property. The nodes are of size Standard_D3, which has 4 cores. A 4-worker node cluster takes 24 cores (4\*4 = 16 cores for worker nodes, plus 2\*4 = 8 cores for head nodes). For details about the Standard_D3 tier, see [Create Linux-based Hadoop clusters in HDInsight](../../hdinsight/hdinsight-hadoop-provision-linux-clusters.md). | Yes | -| timeToLive | The allowed idle time for the on-demand HDInsight cluster. Specifies how long the on-demand HDInsight cluster stays alive when an activity run is finished, if there are no other active jobs in the cluster.<br /><br />For example, if an activity run takes 6 minutes and **timeToLive** is set to 5 minutes, the cluster stays alive for 5 minutes after the 6 minutes of processing the activity run. If another activity run is executed in the 6-minute window, it's processed by the same cluster.<br /><br />Creating an on-demand HDInsight cluster is an expensive operation (it might take a while). Use this setting as needed to improve performance of a data factory by reusing an on-demand HDInsight cluster.<br /><br />If you set the **timeToLive** value to **0**, the cluster is deleted as soon as the activity run finishes. 
However, if you set a high value, the cluster might stay idle, unnecessarily resulting in high costs. It's important to set the appropriate value based on your needs.<br /><br />If the **timeToLive** value is appropriately set, multiple pipelines can share the instance of the on-demand HDInsight cluster. | Yes | -| version | The version of the HDInsight cluster. For allowed HDInsight versions, see [Supported HDInsight versions](../../hdinsight/hdinsight-component-versioning.md#supported-hdinsight-versions). If this value isn't specified, the [latest HDI default version](../../hdinsight/hdinsight-component-versioning.md) is used. | No | -| linkedServiceName | The Azure Storage linked service to be used by the on-demand cluster for storing and processing data. The HDInsight cluster is created in the same region as this storage account.<p>Currently, you can't create an on-demand HDInsight cluster that uses Azure Data Lake Store as the storage. If you want to store the result data from HDInsight processing in Data Lake Store, use Copy Activity to copy the data from Blob storage to Data Lake Store. </p> | Yes | -| additionalLinkedServiceNames | Specifies additional storage accounts for the HDInsight linked service. Data Factory registers the storage accounts on your behalf. These storage accounts must be in the same region as the HDInsight cluster. The HDInsight cluster is created in the same region as the storage account that's specified by the **linkedServiceName** property. | No | -| osType | The type of operating system. Allowed values are **Linux** and **Windows**. If this value isn't specified, **Linux** is used. <br /><br />We strongly recommend using Linux-based HDInsight clusters. The retirement date for HDInsight on Windows is July 31, 2018. | No | -| hcatalogLinkedServiceName | The name of the Azure SQL linked service that points to the HCatalog database. The on-demand HDInsight cluster is created by using the SQL database as the metastore. | No | --#### Example: LinkedServiceNames JSON --```json -"additionalLinkedServiceNames": [ - "otherLinkedServiceName1", - "otherLinkedServiceName2" - ] -``` --### Advanced properties -For granular configuration of the on-demand HDInsight cluster, you can specify the following properties: --| Property | Description | Required | -| : | : | :- | -| coreConfiguration | Specifies the core configuration parameters (core-site.xml) for the HDInsight cluster to be created. | No | -| hBaseConfiguration | Specifies the HBase configuration parameters (hbase-site.xml) for the HDInsight cluster. | No | -| hdfsConfiguration | Specifies the HDFS configuration parameters (hdfs-site.xml) for the HDInsight cluster. | No | -| hiveConfiguration | Specifies the Hive configuration parameters (hive-site.xml) for the HDInsight cluster. | No | -| mapReduceConfiguration | Specifies the MapReduce configuration parameters (mapred-site.xml) for the HDInsight cluster. | No | -| oozieConfiguration | Specifies the Oozie configuration parameters (oozie-site.xml) for the HDInsight cluster. | No | -| stormConfiguration | Specifies the Storm configuration parameters (storm-site.xml) for the HDInsight cluster. | No | -| yarnConfiguration | Specifies the YARN configuration parameters (yarn-site.xml) for the HDInsight cluster. 
| No | --#### Example: On-demand HDInsight cluster configuration with advanced properties --```json -{ - "name": " HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.6", - "osType": "Linux", - "clusterSize": 16, - "timeToLive": "01:30:00", - "linkedServiceName": "adfods1", - "coreConfiguration": { - "templeton.mapper.memory.mb": "5000" - }, - "hiveConfiguration": { - "templeton.mapper.memory.mb": "5000" - }, - "mapReduceConfiguration": { - "mapreduce.reduce.java.opts": "-Xmx4000m", - "mapreduce.map.java.opts": "-Xmx4000m", - "mapreduce.map.memory.mb": "5000", - "mapreduce.reduce.memory.mb": "5000", - "mapreduce.job.reduce.slowstart.completedmaps": "0.8" - }, - "yarnConfiguration": { - "yarn.app.mapreduce.am.resource.mb": "5000", - "mapreduce.map.memory.mb": "5000" - }, - "additionalLinkedServiceNames": [ - "datafeeds", - "adobedatafeed" - ] - } - } -} -``` --### Node sizes -To specify the size of head, data, and ZooKeeper nodes, use the following properties: --| Property | Description | Required | -| :- | : | :- | -| headNodeSize | Sets the size of the head node. The default value is **Standard_D3**. For details, see [Specify node sizes](#specify-node-sizes). | No | -| dataNodeSize | Sets the size of the data node. The default value is **Standard_D3**. | No | -| zookeeperNodeSize | Sets the size of the ZooKeeper node. The default value is **Standard_D3**. | No | --#### Specify node sizes -For string values that you must specify for the properties described in the preceding section, see [Virtual machine sizes](../../virtual-machines/sizes.md). The values must conform to the cmdlets and APIs referenced in [Virtual machine sizes](../../virtual-machines/sizes.md). The Large (default) data node size has 7 GB of memory. This might not be sufficient for your scenario. --If you want to create D4-size head nodes and worker nodes, specify **Standard_D4** as the value for the **headNodeSize** and **dataNodeSize** properties: --```json -"headNodeSize": "Standard_D4", -"dataNodeSize": "Standard_D4", -``` --If you set an incorrect value for these properties, you might see the following message: -- Failed to create cluster. Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind state: 'Error'. Message: 'PreClusterCreationValidationFailure'. - -If you see this message, ensure that you are using the cmdlet and API names from the table in [Virtual machine sizes](../../virtual-machines/sizes.md). --> [!NOTE] -> Currently, Data Factory doesn't support HDInsight clusters that use Data Lake Store as the primary store. Use Azure Storage as the primary store for HDInsight clusters. -> -> ---## Bring-your-own compute environment -You can register an existing compute environment as a linked service in Data Factory. You manage the compute environment. The Data Factory service uses the compute environment to execute activities. --This type of configuration is supported for the following compute environments: --* Azure HDInsight -* Azure Batch -* ML Studio (classic) -* Azure Data Lake Analytics -* Azure SQL Database, Azure Synapse Analytics, SQL Server --## Azure HDInsight linked service -You can create an HDInsight linked service to register your own HDInsight cluster with Data Factory. 
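Whether you use the on-demand linked service described earlier or the bring-your-own linked service described here, pipeline activities consume the cluster the same way: the activity-level **linkedServiceName** points at the HDInsight linked service. The following Hive activity sketch is illustrative only; the dataset names, script path, and linked service names are placeholders, and the linked service definition itself is shown in the example that follows.

```json
{
    "name": "RunSampleHiveScript",
    "type": "HDInsightHive",
    "linkedServiceName": "HDInsightOnDemandLinkedService",
    "typeProperties": {
        "scriptPath": "adfscripts/samplescript.hql",
        "scriptLinkedService": "AzureStorageLinkedService"
    },
    "inputs": [ { "name": "InputDataset" } ],
    "outputs": [ { "name": "OutputDataset" } ],
    "policy": { "concurrency": 1, "retry": 2 },
    "scheduler": { "frequency": "Day", "interval": 1 }
}
```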
--### Example --```json -{ - "name": "HDInsightLinkedService", - "properties": { - "type": "HDInsight", - "typeProperties": { - "clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/", - "userName": "admin", - "password": "<password>", - "linkedServiceName": "MyHDInsightStoragelinkedService" - } - } -} -``` --### Properties -| Property | Description | Required | -| -- | - | -- | -| type | Set the type property to **HDInsight**. | Yes | -| clusterUri | The URI of the HDInsight cluster. | Yes | -| username | The name of the user account to use to connect to an existing HDInsight cluster. | Yes | -| password | The password for the user account. | Yes | -| linkedServiceName | The name of the storage linked service that refers to the Blob storage used by the HDInsight cluster. <p>Currently, you can't specify a Data Lake Store linked service for this property. If the HDInsight cluster has access to Data Lake Store, you might access data in Data Lake Store from Hive or Pig scripts. </p> | Yes | --## Azure Batch linked service -You can create a Batch linked service to register a Batch pool of virtual machines (VMs) to a data factory. You can run Microsoft .NET custom activities by using either Batch or HDInsight. --If you are new to using the Batch service: --* Learn about [Azure Batch basics](/azure/azure-sql/database/sql-database-paas-overview). -* Learn about the [New-AzureBatchAccount](/previous-versions/azure/mt125880(v=azure.100)) cmdlet. Use this cmdlet to create a Batch account. Or, you can create the Batch account by using the [Azure portal](../../batch/batch-account-create-portal.md). For detailed information about using the cmdlet, see [Using PowerShell to manage a Batch account](/archive/blogs/windowshpc/using-azure-powershell-to-manage-azure-batch-account). -* Learn about the [New-AzureBatchPool](/previous-versions/azure/mt125936(v=azure.100)) cmdlet. Use this cmdlet to create a Batch pool. --### Example --```json -{ - "name": "AzureBatchLinkedService", - "properties": { - "type": "AzureBatch", - "typeProperties": { - "accountName": "<Azure Batch account name>", - "accessKey": "<Azure Batch account key>", - "poolName": "<Azure Batch pool name>", - "linkedServiceName": "<Specify associated storage linked service reference here>" - } - } -} -``` --For the **accountName** property, append **.\<region name\>** to the name of your batch account. For example: --```json -"accountName": "mybatchaccount.eastus" -``` --Another option is to provide the **batchUri** endpoint. For example: --```json -"accountName": "adfteam", -"batchUri": "https://eastus.batch.azure.com", -``` --### Properties -| Property | Description | Required | -| -- | - | -- | -| type | Set the type property to **AzureBatch**. | Yes | -| accountName | The name of the Batch account. | Yes | -| accessKey | The access key for the Batch account. | Yes | -| poolName | The name of the pool of VMs. | Yes | -| linkedServiceName | The name of the storage linked service that's associated with this Batch linked service. This linked service is used for staging files that are required to run the activity, and to store activity execution logs. | Yes | --## ML Studio (classic) linked service ---You can create an ML Studio (classic) linked service to register a Studio (classic) batch scoring endpoint to a data factory. 
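For context, a pipeline typically consumes this linked service from an **AzureMLBatchExecution** activity, which references the linked service by name and maps datasets to the web service input and outputs. The sketch below is illustrative only; the activity, dataset, and linked service names are placeholders, and the linked service definition is shown in the example that follows.

```json
{
    "name": "ScoreWithStudioClassic",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": "AzureMLLinkedService",
    "typeProperties": {
        "webServiceInput": "ScoringInputBlob",
        "webServiceOutputs": {
            "output1": "ScoringOutputBlob"
        }
    },
    "inputs": [ { "name": "ScoringInputBlob" } ],
    "outputs": [ { "name": "ScoringOutputBlob" } ],
    "policy": { "concurrency": 1, "retry": 1 }
}
```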
--### Example --```json -{ - "name": "AzureMLLinkedService", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://[batch scoring endpoint]/jobs", - "apiKey": "<apikey>" - } - } -} -``` --### Properties -| Property | Description | Required | -| - | - | -- | -| type | Set the type property to **AzureML**. | Yes | -| mlEndpoint | The batch scoring URL. | Yes | -| apiKey | The published workspace model's API key. | Yes | --## Azure Data Lake Analytics linked service -You can create a Data Lake Analytics linked service to link a Data Lake Analytics compute service to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service. --The following table describes the generic properties that are used in the JSON definition: --| Property | Description | Required | -| | - | - | -| type | Set the type property to **AzureDataLakeAnalytics**. | Yes | -| accountName | The Data Lake Analytics account name. | Yes | -| dataLakeAnalyticsUri | The Data Lake Analytics URI. | No | -| subscriptionId | The Azure subscription ID. | No<br /><br />(If not specified, the data factory subscription is used.) | -| resourceGroupName | The Azure resource group name. | No<br /><br /> (If not specified, the data factory resource group is used.) | --### Authentication options -For your Data Lake Analytics linked service, you can choose between authentication by using a service principal or a user credential. --#### Service principal authentication (recommended) -To use service principal authentication, register an application entity in Microsoft Entra ID. Then, grant the application access to Data Lake Store. For detailed steps, see [Service-to-service authentication](../../data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory.md). Make note of the following values, which you use to define the linked service: -* Application ID -* Application key -* Tenant ID --Use service principal authentication by specifying the following properties: --| Property | Description | Required | -| :- | : | :- | -| servicePrincipalId | The application's client ID. | Yes | -| servicePrincipalKey | The application's key. | Yes | -| tenant | The tenant information (domain name or tenant ID) where your application is located. To get this information, hover your mouse in the upper-right corner of the Azure portal. | Yes | --**Example: Service principal authentication** -```json -{ - "name": "AzureDataLakeAnalyticsLinkedService", - "properties": { - "type": "AzureDataLakeAnalytics", - "typeProperties": { - "accountName": "adftestaccount", - "dataLakeAnalyticsUri": "datalakeanalyticscompute.net", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>", - "subscriptionId": "<optional, subscription id of ADLA>", - "resourceGroupName": "<optional, resource group name of ADLA>" - } - } -} -``` --#### User credential authentication -For user credential authentication for Data Lake Analytics, specify the following properties: --| Property | Description | Required | -| :- | : | :- | -| authorization | In Data Factory Editor, select the **Authorize** button. Enter your credentials; the autogenerated authorization URL is assigned to this property. | Yes | -| sessionId | The OAuth session ID from the OAuth authorization session. Each session ID is unique and can be used only once.
This setting is automatically generated when you use Data Factory Editor. | Yes | --**Example: User credential authentication** -```json -{ - "name": "AzureDataLakeAnalyticsLinkedService", - "properties": { - "type": "AzureDataLakeAnalytics", - "typeProperties": { - "accountName": "adftestaccount", - "dataLakeAnalyticsUri": "datalakeanalyticscompute.net", - "authorization": "<authcode>", - "sessionId": "<session ID>", - "subscriptionId": "<optional, subscription id of ADLA>", - "resourceGroupName": "<optional, resource group name of ADLA>" - } - } -} -``` --#### Token expiration -The authorization code that you generated by selecting the **Authorize** button expires after a set interval. --You might see the following error message when the authentication token expires: -- Credential operation error: invalid_grant - AADSTS70002: Error validating credentials. AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1 Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21:09:31Z --The following table shows expirations by user account type: --| User type | Expires after | -| : | : | -| User accounts that are *not* managed by Microsoft Entra ID (Hotmail, Live, and so on) | 12 hours. | -| User accounts that *are* managed by Microsoft Entra ID | 14 days after the last slice run. <br /><br />90 days, if a slice that's based on an OAuth-based linked service runs at least once every 14 days. | --To avoid or resolve this error, reauthorize by selecting the **Authorize** button when the token expires. Then, redeploy the linked service. You can also generate values for the **sessionId** and **authorization** properties programmatically by using the following code: --```csharp -if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService || - linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService) -{ - AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type); -- WindowsFormsWebAuthenticationDialog authenticationDialog = new WindowsFormsWebAuthenticationDialog(null); - string authorization = authenticationDialog.AuthenticateAAD(authorizationSession.AuthorizationSession.Endpoint, new Uri("urn:ietf:wg:oauth:2.0:oob")); -- AzureDataLakeStoreLinkedService azureDataLakeStoreProperties = linkedService.Properties.TypeProperties as AzureDataLakeStoreLinkedService; - if (azureDataLakeStoreProperties != null) - { - azureDataLakeStoreProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeStoreProperties.Authorization = authorization; - } -- AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties = linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService; - if (azureDataLakeAnalyticsProperties != null) - { - azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeAnalyticsProperties.Authorization = authorization; - } -} -``` --For details about the Data Factory classes that are used in this code example, see: -* [AzureDataLakeStoreLinkedService class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestorelinkedservice) -* [AzureDataLakeAnalyticsLinkedService class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakeanalyticslinkedservice) -* [AuthorizationSessionGetResponse 
class](/dotnet/api/microsoft.azure.management.datafactories.models.authorizationsessiongetresponse) --Add a reference to Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll for the **WindowsFormsWebAuthenticationDialog** class. --## Azure SQL linked service -You can create a SQL linked service and use it with the [Stored Procedure Activity](data-factory-stored-proc-activity.md) to invoke a stored procedure from a Data Factory pipeline. For more information, see [Azure SQL connector](data-factory-azure-sql-connector.md#linked-service-properties). --## Azure Synapse Analytics linked service -You can create an Azure Synapse Analytics linked service and use it with the [Stored Procedure Activity](data-factory-stored-proc-activity.md) to invoke a stored procedure from a Data Factory pipeline. For more information, see [Azure Synapse Analytics connector](data-factory-azure-sql-data-warehouse-connector.md#linked-service-properties). --## SQL Server linked service -You can create a SQL Server linked service and use it with the [Stored Procedure Activity](data-factory-stored-proc-activity.md) to invoke a stored procedure from a Data Factory pipeline. For more information, see [SQL Server connector](data-factory-sqlserver-connector.md#linked-service-properties). |
data-factory | Data Factory Copy Activity Fault Tolerance | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-fault-tolerance.md | - Title: Add fault tolerance in Azure Data Factory Copy Activity by skipping incompatible rows -description: Learn how to add fault tolerance in Azure Data Factory Copy Activity by skipping incompatible rows during copy ---- Previously updated : 04/12/2023----# Add fault tolerance in Copy Activity by skipping incompatible rows --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-copy-activity-fault-tolerance.md) -> * [Version 2 (current version)](../copy-activity-fault-tolerance.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [fault tolerance in copy activity of Data Factory](../copy-activity-fault-tolerance.md). --Azure Data Factory [Copy Activity](data-factory-data-movement-activities.md) offers you two ways to handle incompatible rows when copying data between source and sink data stores: --- You can abort and fail the copy activity when incompatible data is encountered (default behavior).-- You can continue to copy all of the data by adding fault tolerance and skipping incompatible data rows. In addition, you can log the incompatible rows in Azure Blob storage. You can then examine the log to learn the cause for the failure, fix the data on the data source, and retry the copy activity.--## Supported scenarios -Copy Activity supports three scenarios for detecting, skipping, and logging incompatible data: --- **Incompatibility between the source data type and the sink native type**-- For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that contains three **INT** type columns. The CSV file rows that contain numeric data, such as `123,456,789` are copied successfully to the sink store. However, the rows that contain non-numeric values, such as `123,456,abc` are detected as incompatible and are skipped. --- **Mismatch in the number of columns between the source and the sink**-- For example: Copy data from a CSV file in Blob storage to a SQL database with a schema definition that contains six columns. The CSV file rows that contain six columns are copied successfully to the sink store. The CSV file rows that contain more or fewer than six columns are detected as incompatible and are skipped. --- **Primary key violation when writing to SQL Server/Azure SQL Database/Azure Cosmos DB**-- For example: Copy data from a SQL server to a SQL database. A primary key is defined in the sink SQL database, but no such primary key is defined in the source SQL server. The duplicated rows that exist in the source cannot be copied to the sink. Copy Activity copies only the first row of the source data into the sink. The subsequent source rows that contain the duplicated primary key value are detected as incompatible and are skipped. -->[!NOTE] ->This feature doesn't apply when copy activity is configured to invoke external data loading mechanism including [Azure Synapse Analytics PolyBase](data-factory-azure-sql-data-warehouse-connector.md#use-polybase-to-load-data-into-azure-synapse-analytics) or [Amazon Redshift Unload](data-factory-amazon-redshift-connector.md#use-unload-to-copy-data-from-amazon-redshift). 
For loading data into Azure Synapse Analytics using PolyBase, use PolyBase's native fault tolerance support by specifying "[polyBaseSettings](data-factory-azure-sql-data-warehouse-connector.md#sqldwsink)" in copy activity. --## Configuration -The following example provides a JSON definition to configure skipping the incompatible rows in Copy Activity: --```json -"typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - }, - "enableSkipIncompatibleRow": true, - "redirectIncompatibleRowSettings": { - "linkedServiceName": "BlobStorage", - "path": "redirectcontainer/erroroutput" - } -} -``` --| Property | Description | Allowed values | Required | -| | | | | -| **enableSkipIncompatibleRow** | Enable skipping incompatible rows during copy or not. | True<br/>False (default) | No | -| **redirectIncompatibleRowSettings** | A group of properties that can be specified when you want to log the incompatible rows. | | No | -| **linkedServiceName** | The linked service of Azure Storage to store the log that contains the skipped rows. | The name of an [AzureStorage](data-factory-azure-blob-connector.md#azure-storage-linked-service) or [AzureStorageSas](data-factory-azure-blob-connector.md#azure-storage-sas-linked-service) linked service, which refers to the storage instance that you want to use to store the log file. | No | -| **path** | The path of the log file that contains the skipped rows. | Specify the Blob storage path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you. | No | --## Monitoring -After the copy activity run completes, you can see the number of skipped rows in the monitoring section: ---If you configure to log the incompatible rows, you can find the log file at this path: `https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-generated-GUID].csv` In the log file, you can see the rows that were skipped and the root cause of the incompatibility. --Both the original data and the corresponding error are logged in the file. An example of the log file content is as follows: -``` -data1, data2, data3, UserErrorInvalidDataValue,Column 'Prop_2' contains an invalid value 'data3'. Cannot convert 'data3' to type 'DateTime'., -data4, data5, data6, Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot insert duplicate key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4). -``` --## Next steps -To learn more about Azure Data Factory Copy Activity, see [Move data by using Copy Activity](data-factory-data-movement-activities.md). |
data-factory | Data Factory Copy Activity Performance | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-performance.md | - Title: Copy Activity performance and tuning guide -description: Learn about key factors that affect the performance of data movement in Azure Data Factory when you use Copy Activity. ---- Previously updated : 04/12/2023----# Copy Activity performance and tuning guide --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-copy-activity-performance.md) -> * [Version 2 (current version)](../copy-activity-performance.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Copy activity performance and tuning guide for Data Factory](../copy-activity-performance.md). --Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core "big data" problem: building advanced analytics solutions and getting deep insights from all that data. --Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a highly optimized data loading experience that is easy to configure and set up. With just a single copy activity, you can achieve: --* Loading data into **Azure Synapse Analytics** at **1.2 GBps**. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md). -* Loading data into **Azure Blob storage** at **1.0 GBps** -* Loading data into **Azure Data Lake Store** at **1.0 GBps** --This article describes: --* [Performance reference numbers](#performance-reference) for supported source and sink data stores to help you plan your project; -* Features that can boost the copy throughput in different scenarios, including [cloud data movement units](#cloud-data-movement-units), [parallel copy](#parallel-copy), and [staged Copy](#staged-copy); -* [Performance tuning guidance](#performance-tuning-steps) on how to tune the performance and the key factors that can impact copy performance. --> [!NOTE] -> If you are not familiar with Copy Activity in general, see [Move data by using Copy Activity](data-factory-data-movement-activities.md) before reading this article. -> --## Performance reference --As a reference, below table shows the copy throughput number in MBps for the given source and sink pairs based on in-house testing. For comparison, it also demonstrates how different settings of [cloud data movement units](#cloud-data-movement-units) or [Data Management Gateway scalability](data-factory-data-management-gateway-high-availability-scalability.md) (multiple gateway nodes) can help on copy performance. --->[!IMPORTANT] ->In Azure Data Factory version 1, the minimal cloud data movement units for cloud-to-cloud copy is two. If not specified, see default data movement units being used in [cloud data movement units](#cloud-data-movement-units). --**Points to note:** -* Throughput is calculated by using the following formula: [size of data read from source]/[Copy Activity run duration]. 
-* The performance reference numbers in the table were measured using [TPC-H](http://www.tpc.org/tpch/) data set in a single copy activity run. -* In Azure data stores, the source and sink are in the same Azure region. -* For hybrid copy between on-premises and cloud data stores, each gateway node was running on a machine that was separate from the on-premises data store with below specification. When a single activity was running on gateway, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth. Learn more from [consideration for Data Management Gateway](#considerations-for-data-management-gateway). - <table> - <tr> - <td>CPU</td> - <td>32 cores 2.20 GHz Intel Xeon E5-2660 v2</td> - </tr> - <tr> - <td>Memory</td> - <td>128 GB</td> - </tr> - <tr> - <td>Network</td> - <td>Internet interface: 10 Gbps; intranet interface: 40 Gbps</td> - </tr> - </table> ---> [!TIP] -> You can achieve higher throughput by leveraging more data movement units (DMUs) than the default maximum DMUs, which is 32 for a cloud-to-cloud copy activity run. For example, with 100 DMUs, you can achieve copying data from Azure Blob into Azure Data Lake Store at **1.0GBps**. See the [Cloud data movement units](#cloud-data-movement-units) section for details about this feature and the supported scenario. Contact [Azure support](https://azure.microsoft.com/support/) to request more DMUs. --## Parallel copy -You can read data from the source or write data to the destination **in parallel within a Copy Activity run**. This feature enhances the throughput of a copy operation and reduces the time it takes to move data. --This setting is different from the **concurrency** property in the activity definition. The **concurrency** property determines the number of **concurrent Copy Activity runs** to process data from different activity windows (1 AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical load. The parallel copy capability applies to a **single activity run**. --Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed. Data Factory runs an instance of Copy Activity (an activity run) for each slice: --* The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1 -* The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2 -* The data slice from the second activity window (3 AM to 4 AM) ==> Activity run 3 --And so on. --In this example, when the **concurrency** value is set to 2, **Activity run 1** and **Activity run 2** copy data from two activity windows **concurrently** to improve data movement performance. However, if multiple files are associated with Activity run 1, the data movement service copies files from the source to the destination one file at a time. --### Cloud data movement units -A **cloud data movement unit (DMU)** is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. DMU is applicable for cloud-to-cloud copy operations, but not in a hybrid copy. --**The minimal cloud data movement units to empower Copy Activity run is two.** If not specified, the following table lists the default DMUs used in different copy scenarios: --| Copy scenario | Default DMUs determined by service | -|: |: | -| Copy data between file-based stores | Between 4 and 16 depending on the number and size of the files. 
| -| All other copy scenarios | 4 | --To override this default, specify a value for the **cloudDataMovementUnits** property as follows. The **allowed values** for the **cloudDataMovementUnits** property are 2, 4, 8, 16, 32. The **actual number of cloud DMUs** that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the [performance reference](#performance-reference). --```json -"activities":[ - { - "name": "Sample copy activity", - "description": "", - "type": "Copy", - "inputs": [{ "name": "InputDataset" }], - "outputs": [{ "name": "OutputDataset" }], - "typeProperties": { - "source": { - "type": "BlobSource", - }, - "sink": { - "type": "AzureDataLakeStoreSink" - }, - "cloudDataMovementUnits": 32 - } - } -] -``` --> [!NOTE] -> If you need more cloud DMUs for a higher throughput, contact [Azure support](https://azure.microsoft.com/support/). Setting of 8 and above currently works only when you **copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database**. -> --### parallelCopies -You can use the **parallelCopies** property to indicate the parallelism that you want Copy Activity to use. You can think of this property as the maximum number of threads within Copy Activity that can read from your source or write to your sink data stores in parallel. --For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store and to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you are using. --| Source and sink | Default parallel copy count determined by service | -| | | -| Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an on-premises HDFS) |Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Gateway machine used for a hybrid copy (to copy data to or from an on-premises data store). | -| Copy data from **any source data store to Azure Table storage** |4 | -| All other source and sink pairs |1 | --Usually, the default behavior should give you the best throughput. However, to control the load on machines that host your data stores, or to tune copy performance, you may choose to override the default value and specify a value for the **parallelCopies** property. The value must be between 1 and 32 (both inclusive). At run time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set. --```json -"activities":[ - { - "name": "Sample copy activity", - "description": "", - "type": "Copy", - "inputs": [{ "name": "InputDataset" }], - "outputs": [{ "name": "OutputDataset" }], - "typeProperties": { - "source": { - "type": "BlobSource", - }, - "sink": { - "type": "AzureDataLakeStoreSink" - }, - "parallelCopies": 8 - } - } -] -``` -Points to note: --* When you copy data between file-based stores, the **parallelCopies** determine the parallelism at the file level. 
The chunking within a single file would happen underneath automatically and transparently, and it's designed to use the best suitable chunk size for a given source data store type to load data in parallel and orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is **mergeFile**, Copy Activity cannot take advantage of file-level parallelism. -* When you specify a value for the **parallelCopies** property, consider the load increase on your source and sink data stores, and to gateway if it is a hybrid copy. This happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or Gateway is overwhelmed with the load, decrease the **parallelCopies** value to relieve the load. -* When you copy data from stores that are not file-based to stores that are file-based, the data movement service ignores the **parallelCopies** property. Even if parallelism is specified, it's not applied in this case. --> [!NOTE] -> You must use Data Management Gateway version 1.11 or later to use the **parallelCopies** feature when you do a hybrid copy. -> -> --To better use these two properties, and to enhance your data movement throughput, see the sample use cases. You don't need to configure **parallelCopies** to take advantage of the default behavior. If you do configure and **parallelCopies** is too small, multiple cloud DMUs might not be fully utilized. --### Billing impact -It's **important** to remember that you are charged based on the total time of the copy operation. If a copy job used to take one hour with one cloud unit and now it takes 15 minutes with four cloud units, the overall bill remains almost the same. For example, you use four cloud units. The first cloud unit spends 10 minutes, the second one, 10 minutes, the third one, 5 minutes, and the fourth one, 5 minutes, all in one Copy Activity run. You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using **parallelCopies** does not affect billing. --## Staged copy -When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases: --1. **You want to ingest data from various data stores into Azure Synapse Analytics via PolyBase**. Azure Synapse Analytics uses PolyBase as a high-throughput mechanism to load a large amount of data into Azure Synapse Analytics. However, the source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the required data transformations to ensure that it meets the requirements of PolyBase. Then it uses PolyBase to load data into Azure Synapse Analytics. For more details, see [Use PolyBase to load data into Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#use-polybase-to-load-data-into-azure-synapse-analytics). For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md). -2. 
**Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-premises data store and a cloud data store) over a slow network connection**. To improve performance, you can compress the data on-premises so that it takes less time to move data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store. -3. **You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate IT policies**. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure Synapse Analytics sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then, load the data into SQL Database or Azure Synapse Analytics from Blob storage staging. In this flow, you don't need to enable port 1433. --### How staged copy works -When you activate the staging feature, first the data is copied from the source data store to the staging data store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the staging storage after the data movement is complete. --In the cloud copy scenario (both source and sink data stores are in the cloud), gateway is not used. The Data Factory service performs the copy operations. ---In the hybrid copy scenario (source is on-premises and sink is in the cloud), the gateway moves data from the source data store to a staging data store. Data Factory service moves data from the staging data store to the sink data store. Copying data from a cloud data store to an on-premises data store via staging also is supported with the reversed flow. ---When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before moving data from the source data store to an interim or staging data store, and then decompressed before moving data from an interim or staging data store to the sink data store. --Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this option to be available soon. --### Configuration -Configure the **enableStaging** setting in Copy Activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set **enableStaging** to TRUE, specify the additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or Storage shared access signature-linked service for staging. --| Property | Description | Default value | Required | -| | | | | -| **enableStaging** |Specify whether you want to copy data via an interim staging store. |False |No | -| **linkedServiceName** |Specify the name of an [AzureStorage](data-factory-azure-blob-connector.md#azure-storage-linked-service) or [AzureStorageSas](data-factory-azure-blob-connector.md#azure-storage-sas-linked-service) linked service, which refers to the instance of Storage that you use as an interim staging store. <br/><br/> You cannot use Storage with a shared access signature to load data into Azure Synapse Analytics via PolyBase. You can use it in all other scenarios. 
|N/A |Yes, when **enableStaging** is set to TRUE | -| **path** |Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. <br/><br/> Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. |N/A |No | -| **enableCompression** |Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. |False |No | --Here's a sample definition of Copy Activity with the properties that are described in the preceding table: --```json -"activities":[ -{ - "name": "Sample copy activity", - "type": "Copy", - "inputs": [{ "name": "OnpremisesSQLServerInput" }], - "outputs": [{ "name": "AzureSQLDBOutput" }], - "typeProperties": { - "source": { - "type": "SqlSource", - }, - "sink": { - "type": "SqlSink" - }, - "enableStaging": true, - "stagingSettings": { - "linkedServiceName": "MyStagingBlob", - "path": "stagingcontainer/path", - "enableCompression": true - } - } -} -] -``` --### Billing impact -You are charged based on two steps: copy duration and copy type. --* When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price]. -* When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store), you are charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price]. --## Performance tuning steps -We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity: --1. **Establish a baseline**. During the development phase, test your pipeline by using Copy Activity against a representative data sample. You can use the Data Factory [slicing model](data-factory-scheduling-and-execution.md) to limit the amount of data you work with. -- Collect execution time and performance characteristics by using the **Monitoring and Management App**. Choose **Monitor & Manage** on your Data Factory home page. In the tree view, choose the **output dataset**. In the **Activity Windows** list, choose the Copy Activity run. **Activity Windows** lists the Copy Activity duration and the size of the data that's copied. The throughput is listed in **Activity Window Explorer**. To learn more about the app, see [Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management App](data-factory-monitor-manage-app.md). -- :::image type="content" source="./media/data-factory-copy-activity-performance/mmapp-activity-run-details.png" alt-text="Activity run details"::: -- Later in the article, you can compare the performance and configuration of your scenario to Copy Activity's [performance reference](#performance-reference) from our tests. -2. **Diagnose and optimize performance**. If the performance you observe doesn't meet your expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks. 
A full description of performance diagnosis is beyond the scope of this article, but here are some common considerations: -- * Performance features: - * [Parallel copy](#parallel-copy) - * [Cloud data movement units](#cloud-data-movement-units) - * [Staged copy](#staged-copy) - * [Data Management Gateway scalability](data-factory-data-management-gateway-high-availability-scalability.md) - * [Data Management Gateway](#considerations-for-data-management-gateway) - * [Source](#considerations-for-the-source) - * [Sink](#considerations-for-the-sink) - * [Serialization and deserialization](#considerations-for-serialization-and-deserialization) - * [Compression](#considerations-for-compression) - * [Column mapping](#considerations-for-column-mapping) - * [Other considerations](#other-considerations) -3. **Expand the configuration to your entire data set**. When you're satisfied with the execution results and performance, you can expand the definition and pipeline active period to cover your entire data set. --## Considerations for Data Management Gateway -**Gateway setup**: We recommend that you use a dedicated machine to host Data Management Gateway. See [Considerations for using Data Management Gateway](data-factory-data-management-gateway.md#considerations-for-using-gateway). --**Gateway monitoring and scale-up/out**: A single logical gateway with one or more gateway nodes can serve multiple Copy Activity runs concurrently. In the Azure portal, you can view a near-real-time snapshot of resource utilization (CPU, memory, network in/out, and so on) on a gateway machine, as well as the number of concurrent jobs running versus the limit. For details, see [Monitor gateway in the portal](data-factory-data-management-gateway.md#monitor-gateway-in-the-portal). If you have heavy hybrid data movement needs, either with a large number of concurrent copy activity runs or with a large volume of data to copy, consider [scaling up or scaling out the gateway](data-factory-data-management-gateway-high-availability-scalability.md#scale-considerations) to better utilize your resources or to provision more resources for copying. --## Considerations for the source -### General -Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it. --For Microsoft data stores, see [monitoring and tuning topics](#performance-reference) that are specific to data stores and that help you understand data store performance characteristics, minimize response times, and maximize throughput. --If you copy data from Blob storage to Azure Synapse Analytics, consider using **PolyBase** to boost performance. See [Use PolyBase to load data into Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#use-polybase-to-load-data-into-azure-synapse-analytics) for details. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md). --### File-based data stores -*(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)* --* **Average file size and file count**: Copy Activity transfers data one file at a time. With the same amount of data to be moved, the overall throughput is lower if the data consists of many small files rather than a few large files due to the bootstrap phase for each file. Therefore, if possible, combine small files into larger files to gain higher throughput.
-* **File format and compression**: For more ways to improve performance, see the [Considerations for serialization and deserialization](#considerations-for-serialization-and-deserialization) and [Considerations for compression](#considerations-for-compression) sections. -* For the **on-premises file system** scenario, in which **Data Management Gateway** is required, see the [Considerations for Data Management Gateway](#considerations-for-data-management-gateway) section. --### Relational data stores -*(Includes SQL Database; Azure Synapse Analytics; Amazon Redshift; SQL Server databases; and Oracle, MySQL, DB2, Teradata, Sybase, and PostgreSQL databases, etc.)* --* **Data pattern**: Your table schema affects copy throughput. A large row size gives you a better performance than small row size, to copy the same amount of data. The reason is that the database can more efficiently retrieve fewer batches of data that contain fewer rows. -* **Query or stored procedure**: Optimize the logic of the query or stored procedure you specify in the Copy Activity source to fetch data more efficiently. -* For **on-premises relational databases**, such as SQL Server and Oracle, which require the use of **Data Management Gateway**, see the Considerations for Data Management Gateway section. --## Considerations for the sink -### General -Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it. --For Microsoft data stores, refer to [monitoring and tuning topics](#performance-reference) that are specific to data stores. These topics can help you understand data store performance characteristics and how to minimize response times and maximize throughput. --If you are copying data from **Blob storage** to **Azure Synapse Analytics**, consider using **PolyBase** to boost performance. See [Use PolyBase to load data into Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#use-polybase-to-load-data-into-azure-synapse-analytics) for details. For a walkthrough with a use case, see [Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory](data-factory-load-sql-data-warehouse.md). --### File-based data stores -*(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)* --* **Copy behavior**: If you copy data from a different file-based data store, Copy Activity has three options via the **copyBehavior** property. It preserves hierarchy, flattens hierarchy, or merges files. Either preserving or flattening hierarchy has little or no performance overhead, but merging files causes performance overhead to increase. -* **File format and compression**: See the [Considerations for serialization and deserialization](#considerations-for-serialization-and-deserialization) and [Considerations for compression](#considerations-for-compression) sections for more ways to improve performance. -* **Blob storage**: Currently, Blob storage supports only block blobs for optimized data transfer and throughput. -* For **on-premises file systems** scenarios that require the use of **Data Management Gateway**, see the [Considerations for Data Management Gateway](#considerations-for-data-management-gateway) section. --### Relational data stores -*(Includes SQL Database, Azure Synapse Analytics, SQL Server databases, and Oracle databases)* --* **Copy behavior**: Depending on the properties you've set for **sqlSink**, Copy Activity writes data to the destination database in different ways. 
- * By default, the data movement service uses the Bulk Copy API to insert data in append mode, which provides the best performance. - * If you configure a stored procedure in the sink, the database applies the data one row at a time instead of as a bulk load. Performance drops significantly. If your data set is large, when applicable, consider switching to using the **sqlWriterCleanupScript** property. - * If you configure the **sqlWriterCleanupScript** property for each Copy Activity run, the service triggers the script, and then you use the Bulk Copy API to insert the data. For example, to overwrite the entire table with the latest data, you can specify a script to first delete all records before bulk-loading the new data from the source. -* **Data pattern and batch size**: - * Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size because the database can more efficiently commit fewer batches of data. - * Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the **writeBatchSize** property. If your data has small rows, you can set the **writeBatchSize** property with a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase **writeBatchSize**. A high value might lead to a copy failure caused by overloading the database. -* For **on-premises relational databases** like SQL Server and Oracle, which require the use of **Data Management Gateway**, see the [Considerations for Data Management Gateway](#considerations-for-data-management-gateway) section. --### NoSQL stores -*(Includes Table storage and Azure Cosmos DB )* --* For **Table storage**: - * **Partition**: Writing data to interleaved partitions dramatically degrades performance. Sort your source data by partition key so that the data is inserted efficiently into one partition after another, or adjust the logic to write the data to a single partition. -* For **Azure Cosmos DB**: - * **Batch size**: The **writeBatchSize** property sets the number of parallel requests to the Azure Cosmos DB service to create documents. You can expect better performance when you increase **writeBatchSize** because more parallel requests are sent to Azure Cosmos DB. However, watch for throttling when you write to Azure Cosmos DB (the error message is "Request rate is large"). Various factors can cause throttling, including document size, the number of terms in the documents, and the target collection's indexing policy. To achieve higher copy throughput, consider using a better collection, for example, S3. --## Considerations for serialization and deserialization -Serialization and deserialization can occur when your input data set or output data set is a file. See [Supported file and compression formats](data-factory-supported-file-and-compression-formats.md) with details on supported file formats by Copy Activity. --**Copy behavior**: --* Copying files between file-based data stores: - * When input and output data sets both have the same or no file format settings, the data movement service executes a binary copy without any serialization or deserialization. You see a higher throughput compared to the scenario, in which the source and sink file format settings are different from each other. 
- * When input and output data sets both are in text format and only the encoding type is different, the data movement service only does encoding conversion. It doesn't do any serialization and deserialization, which causes some performance overhead compared to a binary copy. - * When input and output data sets both have different file formats or different configurations, like delimiters, the data movement service deserializes source data to stream, transform, and then serialize it into the output format you indicated. This operation results in a much more significant performance overhead compared to other scenarios. -* When you copy files to/from a data store that is not file-based (for example, from a file-based store to a relational store), the serialization or deserialization step is required. This step results in significant performance overhead. --**File format**: The file format you choose might affect copy performance. For example, Avro is a compact binary format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy throughput compared to text format. Make your choice of file format throughout the processing flow holistically. Start with what form the data is stored in, source data stores or to be extracted from external systems; the best format for storage, analytical processing, and querying; and in what format the data should be exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for read and write performance might be a good choice when you consider the overall analytical process. --## Considerations for compression -When your input or output data set is a file, you can set Copy Activity to perform compression or decompression as it writes data to the destination. When you choose compression, you make a tradeoff between input/output (I/O) and CPU. Compressing the data costs extra in compute resources. But in return, it reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput. --**Codec**: Copy Activity supports gzip, bzip2, and Deflate compression types. Azure HDInsight can consume all three types for processing. Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most balanced option, and it is used the most often. Choose the codec that best suits your end-to-end scenario. --**Level**: You can choose from two options for each compression codec: fastest compressed and optimally compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file is not optimally compressed. The optimally compressed option spends more time on compression and yields a minimal amount of data. You can test both options to see which provides better overall performance in your case. --**A consideration**: To copy a large amount of data between an on-premises store and the cloud, consider using interim blob storage with compression. Using interim storage is helpful when the bandwidth of your corporate network and your Azure services is the limiting factor, and you want the input data set and output data set both to be in uncompressed form. More specifically, you can break a single copy activity into two copy activities. 
The first copy activity copies from the source to an interim or staging blob in compressed form. The second copy activity copies the compressed data from staging, and then decompresses while it writes to the sink. --## Considerations for column mapping -You can set the **columnMappings** property in Copy Activity to map all or a subset of the input columns to the output columns. After the data movement service reads the data from the source, it needs to perform column mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput. --If your source data store is queryable, for example, if it's a relational store like SQL Database or SQL Server, or if it's a NoSQL store like Table storage or Azure Cosmos DB, consider pushing the column filtering and reordering logic to the **query** property instead of using column mapping. This way, the projection occurs while the data movement service reads data from the source data store, where it is much more efficient. --## Other considerations -If the size of data you want to copy is large, you can adjust your business logic to further partition the data using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run. --Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy job internal retries, and in some cases, execution failures. --## Sample scenario: Copy from a SQL Server database to Blob storage -**Scenario**: A pipeline is built to copy data from a SQL Server database to Blob storage in CSV format. To make the copy job faster, the CSV files should be compressed into bzip2 format. --**Test and analysis**: The throughput of Copy Activity is less than 2 MBps, which is much slower than the performance benchmark. --**Performance analysis and tuning**: -To troubleshoot the performance issue, let's look at how the data is processed and moved. --1. **Read data**: Gateway opens a connection to SQL Server and sends the query. SQL Server responds by sending the data stream to Gateway via the intranet. -2. **Serialize and compress data**: Gateway serializes the data stream to CSV format, and compresses the data to a bzip2 stream. -3. **Write data**: Gateway uploads the bzip2 stream to Blob storage via the Internet. --As you can see, the data is being processed and moved in a streaming sequential manner: SQL Server > LAN > Gateway > WAN > Blob storage. **The overall performance is gated by the minimum throughput across the pipeline**. ---One or more of the following factors might cause the performance bottleneck: --* **Source**: SQL Server itself has low throughput because of heavy loads. -* **Data Management Gateway**: - * **LAN**: Gateway is located far from the SQL Server computer and has a low-bandwidth connection. - * **Gateway**: Gateway has reached its load limits while performing the following operations: - * **Serialization**: Serializing the data stream to CSV format has slow throughput. - * **Compression**: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps with Core i7). - * **WAN**: The bandwidth between the corporate network and your Azure services is low (for example, T1 = 1,544 kbps; T2 = 6,312 kbps). -* **Sink**: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of 60 MBps.)
--In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip compression codec might ease this bottleneck. --## Sample scenarios: Use parallel copy -**Scenario I:** Copy 1,000 1-MB files from the on-premises file system to Blob storage. --**Analysis and performance tuning**: For example, if you installed the gateway on a quad-core machine, Data Factory uses 16 parallel copies to move files from the file system to Blob storage concurrently. This parallel execution should result in high throughput. You also can explicitly specify the parallel copies count. When you copy many small files, parallel copies dramatically help throughput by using resources more effectively. ---**Scenario II**: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune performance. --**Analysis and performance tuning**: In this scenario, Data Factory copies the data from Blob storage to Data Lake Store by using a single copy (**parallelCopies** set to 1) and a single cloud data movement unit. The throughput you observe will be close to that described in the [performance reference section](#performance-reference). ---**Scenario III**: Individual file size is greater than dozens of MBs and total volume is large. --**Analysis and performance tuning**: Increasing **parallelCopies** doesn't result in better copy performance because of the resource limitations of a single cloud DMU. Instead, you should specify more cloud DMUs to get more resources to perform the data movement. Do not specify a value for the **parallelCopies** property. Data Factory handles the parallelism for you. In this case, if you set **cloudDataMovementUnits** to 4, you can expect roughly four times the throughput. ---## Reference -Here are performance monitoring and tuning references for some of the supported data stores: --* Azure Blob storage: [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md) and [Performance and scalability checklist for Blob storage](../../storage/blobs/storage-performance-checklist.md). -* Azure Table storage: [Scalability and performance targets for Table storage](../../storage/tables/scalability-targets.md) and [Performance and scalability checklist for Table storage](../../storage/tables/storage-performance-checklist.md). -* Azure SQL Database: You can [monitor the performance](/azure/azure-sql/database/monitor-tune-overview) and check the database transaction unit (DTU) percentage. -* Azure Synapse Analytics: Its capability is measured in data warehouse units (DWUs); see [Manage compute power in Azure Synapse Analytics (Overview)](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md). -* Azure Cosmos DB: [Performance levels in Azure Cosmos DB](../../cosmos-db/performance-levels.md) -* On-premises SQL Server: [Monitor and tune for performance](/sql/relational-databases/performance/monitor-and-tune-for-performance) -* On-premises file server: [Performance tuning for file servers](/previous-versions//dn567661(v=vs.85)) |
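The tuning properties discussed above (**writeBatchSize**, **sqlWriterCleanupScript**, **parallelCopies**, and **cloudDataMovementUnits**) are all set on the Copy Activity definition itself. The following JSON is only a minimal sketch: the activity and dataset names, the cleanup table, and the numeric values are hypothetical placeholders, and per Scenario III you would typically set either **parallelCopies** or **cloudDataMovementUnits**, not both.

```json
{
    "name": "TunedCopyActivity",
    "type": "Copy",
    "inputs": [ { "name": "MyBlobInputDataset" } ],
    "outputs": [ { "name": "MySqlOutputDataset" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'MyStagingTable')"
        },
        "cloudDataMovementUnits": 4
    }
}
```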
data-factory | Data Factory Copy Activity Tutorial Using Azure Resource Manager Template | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md | - Title: 'Tutorial: Create a pipeline using Resource Manager Template ' -description: In this tutorial, you create an Azure Data Factory pipeline by using an Azure Resource Manager template. This pipeline copies data from an Azure blob storage to Azure SQL Database. ----- Previously updated : 04/12/2023----# Tutorial: Use Azure Resource Manager template to create a Data Factory pipeline to copy data -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) -> --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). --This tutorial shows you how to use an Azure Resource Manager template to create an Azure data factory. The data pipeline in this tutorial copies data from a source data store to a destination data store. It does not transform input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see [Data Movement Activities](data-factory-data-movement-activities.md). --A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [multiple activities in a pipeline](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --> [!NOTE] -> The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Prerequisites ---* Go through [Tutorial Overview and Prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) and complete the **prerequisite** steps. -* Follow instructions in [How to install and configure Azure PowerShell](/powershell/azure/) article to install latest version of Azure PowerShell on your computer. 
In this tutorial, you use PowerShell to deploy Data Factory entities. -* (optional) See [Authoring Azure Resource Manager Templates](../../azure-resource-manager/templates/syntax.md) to learn about Azure Resource Manager templates. --## In this tutorial -In this tutorial, you create a data factory with the following Data Factory entities: --| Entity | Description | -| | | -| Azure Storage linked service |Links your Azure Storage account to the data factory. Azure Storage is the source data store and Azure SQL Database is the sink data store for the copy activity in the tutorial. It specifies the storage account that contains the input data for the copy activity. | -| Azure SQL Database linked service |Links our database in Azure SQL Database to the data factory. It specifies the database that holds the output data for the copy activity. | -| Azure Blob input dataset |Refers to the Azure Storage linked service. The linked service refers to an Azure Storage account and the Azure Blob dataset specifies the container, folder, and file name in the storage that holds the input data. | -| Azure SQL output dataset |Refers to the Azure SQL linked service. The Azure SQL linked service refers to an logical SQL server and the Azure SQL dataset specifies the name of the table that holds the output data. | -| Data pipeline |The pipeline has one activity of type Copy that takes the Azure blob dataset as an input and the Azure SQL dataset as an output. The copy activity copies data from an Azure blob to a table in Azure SQL Database. | --A data factory can have one or more pipelines. A pipeline can have one or more activities in it. There are two types of activities: [data movement activities](data-factory-data-movement-activities.md) and [data transformation activities](data-factory-data-transformation-activities.md). In this tutorial, you create a pipeline with one activity (copy activity). ---The following section provides the complete Resource Manager template for defining Data Factory entities so that you can quickly run through the tutorial and test the template. To understand how each Data Factory entity is defined, see [Data Factory entities in the template](#data-factory-entities-in-the-template) section. --## Data Factory JSON template -The top-level Resource Manager template for defining a data factory is: --```json -{ - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", - "contentVersion": "1.0.0.0", - "parameters": { ... - }, - "variables": { ... - }, - "resources": [ - { - "name": "[parameters('dataFactoryName')]", - "apiVersion": "[variables('apiVersion')]", - "type": "Microsoft.DataFactory/datafactories", - "location": "westus", - "resources": [ - { ... }, - { ... }, - { ... }, - { ... } - ] - } - ] -} -``` -Create a JSON file named **ADFCopyTutorialARM.json** in **C:\ADFGetStarted** folder with the following content: --```json -{ - "contentVersion": "1.0.0.0", - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", - "parameters": { - "storageAccountName": { "type": "string", "metadata": { "description": "Name of the Azure storage account that contains the data to be copied." } }, - "storageAccountKey": { "type": "securestring", "metadata": { "description": "Key for the Azure storage account." } }, - "sourceBlobContainer": { "type": "string", "metadata": { "description": "Name of the blob container in the Azure Storage account." 
} }, - "sourceBlobName": { "type": "string", "metadata": { "description": "Name of the blob in the container that has the data to be copied to Azure SQL Database table" } }, - "sqlServerName": { "type": "string", "metadata": { "description": "Name of the logical SQL server that will hold the output/copied data." } }, - "databaseName": { "type": "string", "metadata": { "description": "Name of the Azure SQL Database in the logical SQL server." } }, - "sqlServerUserName": { "type": "string", "metadata": { "description": "Name of the user that has access to the logical SQL server." } }, - "sqlServerPassword": { "type": "securestring", "metadata": { "description": "Password for the user." } }, - "targetSQLTable": { "type": "string", "metadata": { "description": "Table in the Azure SQL Database that will hold the copied data." } - } - }, - "variables": { - "dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]", - "azureSqlLinkedServiceName": "AzureSqlLinkedService", - "azureStorageLinkedServiceName": "AzureStorageLinkedService", - "blobInputDatasetName": "BlobInputDataset", - "sqlOutputDatasetName": "SQLOutputDataset", - "pipelineName": "Blob2SQLPipeline" - }, - "resources": [ - { - "name": "[variables('dataFactoryName')]", - "apiVersion": "2015-10-01", - "type": "Microsoft.DataFactory/datafactories", - "location": "West US", - "resources": [ - { - "type": "linkedservices", - "name": "[variables('azureStorageLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureStorage", - "description": "Azure Storage linked service", - "typeProperties": { - "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]" - } - } - }, - { - "type": "linkedservices", - "name": "[variables('azureSqlLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureSqlDatabase", - "description": "Azure SQL linked service", - "typeProperties": { - "connectionString": "[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=', parameters('databaseName'), ';User ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False;Encrypt=True;Connection Timeout=30')]" - } - } - }, - { - "type": "datasets", - "name": "[variables('blobInputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "structure": [ - { - "name": "Column0", - "type": "String" - }, - { - "name": "Column1", - "type": "String" - } - ], - "typeProperties": { - "folderPath": "[concat(parameters('sourceBlobContainer'), '/')]", - "fileName": "[parameters('sourceBlobName')]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } - }, - { - "type": "datasets", - "name": "[variables('sqlOutputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureSqlLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "[variables('azureSqlLinkedServiceName')]", - "structure": [ - { - 
"name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "typeProperties": { - "tableName": "[parameters('targetSQLTable')]" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - }, - { - "type": "datapipelines", - "name": "[variables('pipelineName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]", - "[variables('azureSqlLinkedServiceName')]", - "[variables('blobInputDatasetName')]", - "[variables('sqlOutputDatasetName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "activities": [ - { - "name": "CopyFromAzureBlobToAzureSQL", - "description": "Copy data frm Azure blob to Azure SQL", - "type": "Copy", - "inputs": [ - { - "name": "[variables('blobInputDatasetName')]" - } - ], - "outputs": [ - { - "name": "[variables('sqlOutputDatasetName')]" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')" - }, - "translator": { - "type": "TabularTranslator", - "columnMappings": "Column0:FirstName,Column1:LastName" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 3, - "timeout": "01:00:00" - } - } - ], - "start": "2017-05-11T00:00:00Z", - "end": "2017-05-12T00:00:00Z" - } - } - ] - } - ] - } -``` --## Parameters JSON -Create a JSON file named **ADFCopyTutorialARM-Parameters.json** that contains parameters for the Azure Resource Manager template. --> [!IMPORTANT] -> Specify name and key of your Azure Storage account for storageAccountName and storageAccountKey parameters. -> -> Specify logical SQL server, database, user, and password for sqlServerName, databaseName, sqlServerUserName, and sqlServerPassword parameters. --```json -{ - "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#", - "contentVersion": "1.0.0.0", - "parameters": { - "storageAccountName": { "value": "<Name of the Azure storage account>" }, - "storageAccountKey": { - "value": "<Key for the Azure storage account>" - }, - "sourceBlobContainer": { "value": "adftutorial" }, - "sourceBlobName": { "value": "emp.txt" }, - "sqlServerName": { "value": "<Name of the logical SQL server>" }, - "databaseName": { "value": "<Name of the database>" }, - "sqlServerUserName": { "value": "<Name of the user who has access to the database>" }, - "sqlServerPassword": { "value": "<password for the user>" }, - "targetSQLTable": { "value": "emp" } - } -} -``` --> [!IMPORTANT] -> You may have separate parameter JSON files for development, testing, and production environments that you can use with the same Data Factory JSON template. By using a Power Shell script, you can automate deploying Data Factory entities in these environments. -> -> --## Create data factory -1. Start **Azure PowerShell** and run the following command: -- * Run the following command and enter the user name and password that you use to sign in to the Azure portal. - - ```powershell - Connect-AzAccount - ``` -- * Run the following command to view all the subscriptions for this account. - - ```powershell - Get-AzSubscription - ``` -- * Run the following command to select the subscription that you want to work with. - - ```powershell - Get-AzSubscription -SubscriptionName <SUBSCRIPTION NAME> | Set-AzContext - ``` --2. Run the following command to deploy Data Factory entities using the Resource Manager template you created in Step 1. 
-- ```powershell - New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile C:\ADFGetStarted\ADFCopyTutorialARM.json -TemplateParameterFile C:\ADFGetStarted\ADFCopyTutorialARM-Parameters.json - ``` --## Monitor pipeline --1. Log in to the [Azure portal](https://portal.azure.com) using your Azure account. --1. Click **Data factories** on the left menu (or) click **All services** and click **Data factories** under **INTELLIGENCE + ANALYTICS** category. - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-azure-resource-manager-template/data-factories-menu.png" alt-text="Data factories menu"::: --1. In the **Data factories** page, search for and find your data factory (AzureBlobToAzureSQLDatabaseDF). - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-azure-resource-manager-template/search-for-data-factory.png" alt-text="Search for data factory"::: --1. Click your Azure data factory. You see the home page for the data factory. - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-azure-resource-manager-template/data-factory-home-page.png" alt-text="Home page for data factory"::: --1. Follow instructions from [Monitor datasets and pipeline](data-factory-monitor-manage-pipelines.md) to monitor the pipeline and datasets you have created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines. --1. When a slice is in the **Ready** state, verify that the data is copied to the **emp** table in the Azure SQL Database. --For more information on how to use Azure portal blades to monitor pipeline and datasets you have created in this tutorial, see [Monitor datasets and pipeline](data-factory-monitor-manage-pipelines.md) . --For more information on how to use the Monitor & Manage application to monitor your data pipelines, see [Monitor and manage Azure Data Factory pipelines using Monitoring App](data-factory-monitor-manage-app.md). --## Data Factory entities in the template --### Define data factory --You define a data factory in the Resource Manager template as shown in the following sample: --```json -{ - "resources": [ - { - "name": "[variables('dataFactoryName')]", - "apiVersion": "2015-10-01", - "type": "Microsoft.DataFactory/datafactories", - "location": "West US" - } - ] -} -``` --The dataFactoryName is defined as: --```json -{ - "dataFactoryName": "[concat('AzureBlobToAzureSQLDatabaseDF', uniqueString(resourceGroup().id))]" -} -``` --It is a unique string based on the resource group ID. --### Defining Data Factory entities --The following Data Factory entities are defined in the JSON template: --1. [Azure Storage linked service](#azure-storage-linked-service) -2. [Azure SQL linked service](#azure-sql-database-linked-service) -3. [Azure blob dataset](#azure-blob-dataset) -4. [Azure SQL dataset](#azure-sql-dataset) -5. [Data pipeline with a copy activity](#data-pipeline) --#### Azure Storage linked service --The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). You specify the name and key of your Azure storage account in this section. See [Azure Storage linked service](data-factory-azure-blob-connector.md#azure-storage-linked-service) for details about JSON properties used to define an Azure Storage linked service. 
--```json -{ - "type": "linkedservices", - "name": "[variables('azureStorageLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureStorage", - "description": "Azure Storage linked service", - "typeProperties": { - "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=',parameters('storageAccountName'),';AccountKey=',parameters('storageAccountKey'))]" - } - } -} -``` --The connectionString uses the storageAccountName and storageAccountKey parameters. The values for these parameters passed by using a configuration file. The definition also uses variables: azureStorageLinkedService and dataFactoryName defined in the template. --#### Azure SQL Database linked service --AzureSqlLinkedService links your database in Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). You specify the logical SQL server name, database name, user name, and user password in this section. See [Azure SQL linked service](data-factory-azure-sql-connector.md#linked-service-properties) for details about JSON properties used to define an Azure SQL linked service. --```json -{ - "type": "linkedservices", - "name": "[variables('azureSqlLinkedServiceName')]", - "dependsOn": [ - "[variables('dataFactoryName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureSqlDatabase", - "description": "Azure SQL linked service", - "typeProperties": { - "connectionString": "[concat('Server=tcp:',parameters('sqlServerName'),'.database.windows.net,1433;Database=', parameters('databaseName'), ';User ID=',parameters('sqlServerUserName'),';Password=',parameters('sqlServerPassword'),';Trusted_Connection=False;Encrypt=True;Connection Timeout=30')]" - } - } -} -``` --The connectionString uses sqlServerName, databaseName, sqlServerUserName, and sqlServerPassword parameters whose values are passed by using a configuration file. The definition also uses the following variables from the template: azureSqlLinkedServiceName, dataFactoryName. --#### Azure blob dataset -The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. In Azure blob dataset definition, you specify names of blob container, folder, and file that contains the input data. See [Azure Blob dataset properties](data-factory-azure-blob-connector.md#dataset-properties) for details about JSON properties used to define an Azure Blob dataset. 
--```json -{ - "type": "datasets", - "name": "[variables('blobInputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "[variables('azureStorageLinkedServiceName')]", - "structure": [ - { - "name": "Column0", - "type": "String" - }, - { - "name": "Column1", - "type": "String" - } - ], - "typeProperties": { - "folderPath": "[concat(parameters('sourceBlobContainer'), '/')]", - "fileName": "[parameters('sourceBlobName')]", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --#### Azure SQL dataset -You specify the name of the table in Azure SQL Database that holds the copied data from the Azure Blob storage. See [Azure SQL dataset properties](data-factory-azure-sql-connector.md#dataset-properties) for details about JSON properties used to define an Azure SQL dataset. --```json -{ - "type": "datasets", - "name": "[variables('sqlOutputDatasetName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureSqlLinkedServiceName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "[variables('azureSqlLinkedServiceName')]", - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "typeProperties": { - "tableName": "[parameters('targetSQLTable')]" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --#### Data pipeline -You define a pipeline that copies data from the Azure blob dataset to the Azure SQL dataset. See [Pipeline JSON](data-factory-create-pipelines.md#pipeline-json) for descriptions of JSON elements used to define a pipeline in this example. --```json -{ - "type": "datapipelines", - "name": "[variables('pipelineName')]", - "dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]", - "[variables('azureSqlLinkedServiceName')]", - "[variables('blobInputDatasetName')]", - "[variables('sqlOutputDatasetName')]" - ], - "apiVersion": "2015-10-01", - "properties": { - "activities": [ - { - "name": "CopyFromAzureBlobToAzureSQL", - "description": "Copy data frm Azure blob to Azure SQL", - "type": "Copy", - "inputs": [ - { - "name": "[variables('blobInputDatasetName')]" - } - ], - "outputs": [ - { - "name": "[variables('sqlOutputDatasetName')]" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')" - }, - "translator": { - "type": "TabularTranslator", - "columnMappings": "Column0:FirstName,Column1:LastName" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 3, - "timeout": "01:00:00" - } - } - ], - "start": "2017-05-11T00:00:00Z", - "end": "2017-05-12T00:00:00Z" - } -} -``` --## Reuse the template -In the tutorial, you created a template for defining Data Factory entities and a template for passing values for parameters. The pipeline copies data from an Azure Storage account to Azure SQL Database specified via parameters. To use the same template to deploy Data Factory entities to different environments, you create a parameter file for each environment and use it when deploying to that environment. 
--Example: --```powershell -New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Dev.json -``` --```powershell -New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Test.json -``` --```powershell -New-AzResourceGroupDeployment -Name MyARMDeployment -ResourceGroupName ADFTutorialResourceGroup -TemplateFile ADFCopyTutorialARM.json -TemplateParameterFile ADFCopyTutorialARM-Parameters-Production.json -``` --Notice that the first command uses the parameter file for the development environment, the second one for the test environment, and the third one for the production environment. --You can also reuse the template to perform repeated tasks. For example, suppose you need to create many data factories with one or more pipelines that implement the same logic, but each data factory uses different Storage and SQL Database accounts. In this scenario, you use the same template in the same environment (dev, test, or production) with different parameter files to create data factories. --## Next steps -In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---To learn about how to copy data to/from a data store, click the link for the data store in the table. |
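The per-environment parameter files passed in the deployment commands above follow the same shape as the ADFCopyTutorialARM-Parameters.json file shown earlier; only the values change. A minimal sketch of a hypothetical development file (all values are placeholders):

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "storageAccountName": { "value": "<dev storage account name>" },
        "storageAccountKey": { "value": "<dev storage account key>" },
        "sourceBlobContainer": { "value": "adftutorial" },
        "sourceBlobName": { "value": "emp.txt" },
        "sqlServerName": { "value": "<dev logical SQL server name>" },
        "databaseName": { "value": "<dev database name>" },
        "sqlServerUserName": { "value": "<dev SQL user>" },
        "sqlServerPassword": { "value": "<dev SQL password>" },
        "targetSQLTable": { "value": "emp" }
    }
}
```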
data-factory | Data Factory Copy Activity Tutorial Using Dotnet Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-tutorial-using-dotnet-api.md | - Title: 'Tutorial: Create a pipeline with Copy Activity using .NET API ' -description: In this tutorial, you create an Azure Data Factory pipeline with a Copy Activity by using .NET API. ----- Previously updated : 04/12/2023----# Tutorial: Create a pipeline with Copy Activity using .NET API -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). --In this article, you learn how to use [.NET API](https://portal.azure.com) to create a data factory with a pipeline that copies data from an Azure blob storage to Azure SQL Database. If you are new to Azure Data Factory, read through the [Introduction to Azure Data Factory](data-factory-introduction.md) article before doing this tutorial. --In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see [Data Movement Activities](data-factory-data-movement-activities.md). --A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [multiple activities in a pipeline](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --> [!NOTE] -> For complete documentation on .NET API for Data Factory, see [Data Factory .NET API Reference](/dotnet/api/overview/azure/data-factory). -> -> The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Prerequisites ---* Go through [Tutorial Overview and Pre-requisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) to get an overview of the tutorial and complete the **prerequisite** steps. -* Visual Studio 2012 or 2013 or 2015 -* Download and install [Azure .NET SDK](https://azure.microsoft.com/downloads/) -* Azure PowerShell. 
Follow instructions in [How to install and configure Azure PowerShell](/powershell/azure/install-azure-powershell) article to install Azure PowerShell on your computer. You use Azure PowerShell to create a Microsoft Entra application. --<a name='create-an-application-in-azure-active-directory'></a> --### Create an application in Microsoft Entra ID -Create a Microsoft Entra application, create a service principal for the application, and assign it to the **Data Factory Contributor** role. --1. Launch **PowerShell**. -2. Run the following command and enter the user name and password that you use to sign in to the Azure portal. -- ```powershell - Connect-AzAccount - ``` -3. Run the following command to view all the subscriptions for this account. -- ```powershell - Get-AzSubscription - ``` -4. Run the following command to select the subscription that you want to work with. Replace **<NameOfAzureSubscription**> with the name of your Azure subscription. -- ```powershell - Get-AzSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzContext - ``` -- > [!IMPORTANT] - > Note down **SubscriptionId** and **TenantId** from the output of this command. --5. Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command in the PowerShell. -- ```powershell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` -- If the resource group already exists, you specify whether to update it (Y) or keep it as (N). -- If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. -6. Create a Microsoft Entra application. -- ```powershell - $azureAdApplication = New-AzADApplication -DisplayName "ADFCopyTutotiralApp" -HomePage "https://www.contoso.org" -IdentifierUris "https://www.adfcopytutorialapp.org/example" -Password "Pass@word1" - ``` -- If you get the following error, specify a different URL and run the command again. - - ```powershell - Another object with the same value for property identifierUris already exists. - ``` -7. Create the AD service principal. -- ```powershell - New-AzADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId - ``` -8. Add service principal to the **Data Factory Contributor** role. -- ```powershell - New-AzRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName $azureAdApplication.ApplicationId.Guid - ``` -9. Get the application ID. -- ```powershell - $azureAdApplication - ``` - Note down the application ID (applicationID) from the output. --You should have following four values from these steps: --* Tenant ID -* Subscription ID -* Application ID -* Password (specified in the first command) --## Walkthrough -1. Using Visual Studio 2012/2013/2015, create a C# .NET console application. - 1. Launch **Visual Studio** 2012/2013/2015. - 2. Click **File**, point to **New**, and click **Project**. - 3. Expand **Templates**, and select **Visual C#**. In this walkthrough, you use C#, but you can use any .NET language. - 4. Select **Console Application** from the list of project types on the right. - 5. Enter **DataFactoryAPITestApp** for the Name. - 6. Select **C:\ADFGetStarted** for the Location. - 7. Click **OK** to create the project. -2. Click **Tools**, point to **NuGet Package Manager**, and click **Package Manager Console**. -3. In the **Package Manager Console**, do the following steps: - 1. 
Run the following command to install Data Factory package: `Install-Package Microsoft.Azure.Management.DataFactories` - 2. Run the following command to install Microsoft Entra ID package (you use Active Directory API in the code): `Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.19.208020213` -4. Add the following **appSetttings** section to the **App.config** file. These settings are used by the helper method: **GetAuthorizationHeader**. -- Replace values for **<Application ID>**, **<Password>**, **<Subscription ID>**, and **<tenant ID>** with your own values. -- ```xml - <?xml version="1.0" encoding="utf-8" ?> - <configuration> - <appSettings> - <add key="ActiveDirectoryEndpoint" value="https://login.microsoftonline.com/" /> - <add key="ResourceManagerEndpoint" value="https://management.azure.com/" /> - <add key="WindowsManagementUri" value="https://management.core.windows.net/" /> -- <add key="ApplicationId" value="your application ID" /> - <add key="Password" value="Password you used while creating the AAD application" /> - <add key="SubscriptionId" value= "Subscription ID" /> - <add key="ActiveDirectoryTenantId" value="Tenant ID" /> - </appSettings> - </configuration> - ``` --5. Add the following **using** statements to the source file (Program.cs) in the project. -- ```csharp - using System.Configuration; - using System.Collections.ObjectModel; - using System.Threading; - using System.Threading.Tasks; -- using Microsoft.Azure; - using Microsoft.Azure.Management.DataFactories; - using Microsoft.Azure.Management.DataFactories.Models; - using Microsoft.Azure.Management.DataFactories.Common.Models; -- using Microsoft.IdentityModel.Clients.ActiveDirectory; -- ``` --6. Add the following code that creates an instance of **DataPipelineManagementClient** class to the **Main** method. You use this object to create a data factory, a linked service, input and output datasets, and a pipeline. You also use this object to monitor slices of a dataset at runtime. -- ```csharp - // create data factory management client - string resourceGroupName = "ADFTutorialResourceGroup"; - string dataFactoryName = "APITutorialFactory"; -- TokenCloudCredentials aadTokenCredentials = new TokenCloudCredentials( - ConfigurationManager.AppSettings["SubscriptionId"], - GetAuthorizationHeader().Result); -- Uri resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]); -- DataFactoryManagementClient client = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri); - ``` -- > [!IMPORTANT] - > Replace the value of **resourceGroupName** with the name of your Azure resource group. - > - > Update name of the data factory (dataFactoryName) to be unique. Name of the data factory must be globally unique. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. --7. Add the following code that creates a **data factory** to the **Main** method. -- ```csharp - // create a data factory - Console.WriteLine("Creating a data factory"); - client.DataFactories.CreateOrUpdate(resourceGroupName, - new DataFactoryCreateOrUpdateParameters() - { - DataFactory = new DataFactory() - { - Name = dataFactoryName, - Location = "westus", - Properties = new DataFactoryProperties() - } - } - ); - ``` -- A data factory can have one or more pipelines. A pipeline can have one or more activities in it. 
For example, a Copy Activity to copy data from a source to a destination data store and a HDInsight Hive activity to run a Hive script to transform input data to product output data. Let's start with creating the data factory in this step. -8. Add the following code that creates an **Azure Storage linked service** to the **Main** method. -- > [!IMPORTANT] - > Replace **storageaccountname** and **accountkey** with name and key of your Azure Storage account. -- ```csharp - // create a linked service for input data store: Azure Storage - Console.WriteLine("Creating Azure Storage linked service"); - client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName, - new LinkedServiceCreateOrUpdateParameters() - { - LinkedService = new LinkedService() - { - Name = "AzureStorageLinkedService", - Properties = new LinkedServiceProperties - ( - new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>") - ) - } - } - ); - ``` -- You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). -- Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types: AzureStorage and AzureSqlDatabase. -- The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -9. Add the following code that creates an **Azure SQL linked service** to the **Main** method. -- > [!IMPORTANT] - > Replace **servername**, **databasename**, **username**, and **password** with names of your server, database, user, and password. -- ```csharp - // create a linked service for output data store: Azure SQL Database - Console.WriteLine("Creating Azure SQL Database linked service"); - client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName, - new LinkedServiceCreateOrUpdateParameters() - { - LinkedService = new LinkedService() - { - Name = "AzureSqlLinkedService", - Properties = new LinkedServiceProperties - ( - new AzureSqlDatabaseLinkedService("Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30") - ) - } - } - ); - ``` -- AzureSqlLinkedService links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -10. Add the following code that creates **input and output datasets** to the **Main** method. 
-- ```csharp - // create input and output datasets - Console.WriteLine("Creating input and output datasets"); - string Dataset_Source = "InputDataset"; - string Dataset_Destination = "OutputDataset"; -- Console.WriteLine("Creating input dataset of type: Azure Blob"); - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, -- new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Source, - Properties = new DatasetProperties() - { - Structure = new List<DataElement>() - { - new DataElement() { Name = "FirstName", Type = "String" }, - new DataElement() { Name = "LastName", Type = "String" } - }, - LinkedServiceName = "AzureStorageLinkedService", - TypeProperties = new AzureBlobDataset() - { - FolderPath = "adftutorial/", - FileName = "emp.txt" - }, - External = true, - Availability = new Availability() - { - Frequency = SchedulePeriod.Hour, - Interval = 1, - }, -- Policy = new Policy() - { - Validation = new ValidationPolicy() - { - MinimumRows = 1 - } - } - } - } - }); -- Console.WriteLine("Creating output dataset of type: Azure SQL"); - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, - new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Destination, - Properties = new DatasetProperties() - { - Structure = new List<DataElement>() - { - new DataElement() { Name = "FirstName", Type = "String" }, - new DataElement() { Name = "LastName", Type = "String" } - }, - LinkedServiceName = "AzureSqlLinkedService", - TypeProperties = new AzureSqlTableDataset() - { - TableName = "emp" - }, - Availability = new Availability() - { - Frequency = SchedulePeriod.Hour, - Interval = 1, - }, - } - } - }); - ``` - - In the previous step, you created linked services to link your Azure Storage account and Azure SQL Database to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent input and output data that is stored in the data stores referred by AzureStorageLinkedService and AzureSqlLinkedService respectively. -- The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data. -- Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to your database. And, the output SQL table dataset (OututDataset) specifies the table in the database to which the data from the blob storage is copied. -- In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName. -- In this step, you create an output dataset named **OutputDataset**. This dataset points to a SQL table in the database represented by **AzureSqlLinkedService**. -11. Add the following code that **creates and activates a pipeline** to the **Main** method. In this step, you create a pipeline with a **copy activity** that uses **InputDataset** as an input and **OutputDataset** as an output. 
-- ```csharp - // create a pipeline - Console.WriteLine("Creating a pipeline"); - DateTime PipelineActivePeriodStartTime = new DateTime(2017, 5, 11, 0, 0, 0, 0, DateTimeKind.Utc); - DateTime PipelineActivePeriodEndTime = new DateTime(2017, 5, 12, 0, 0, 0, 0, DateTimeKind.Utc); - string PipelineName = "ADFTutorialPipeline"; -- client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName, - new PipelineCreateOrUpdateParameters() - { - Pipeline = new Pipeline() - { - Name = PipelineName, - Properties = new PipelineProperties() - { - Description = "Demo Pipeline for data transfer from Azure blob to Azure SQL", -- // Initial value for pipeline's active period. With this, you won't need to set slice status - Start = PipelineActivePeriodStartTime, - End = PipelineActivePeriodEndTime, -- Activities = new List<Activity>() - { - new Activity() - { - Name = "BlobToAzureSql", - Inputs = new List<ActivityInput>() - { - new ActivityInput() { - Name = Dataset_Source - } - }, - Outputs = new List<ActivityOutput>() - { - new ActivityOutput() - { - Name = Dataset_Destination - } - }, - TypeProperties = new CopyActivity() - { - Source = new BlobSource(), - Sink = new SqlSink() - { - WriteBatchSize = 10000, - WriteBatchTimeout = TimeSpan.FromMinutes(10) - } - } - } - } - } - } - }); - ``` -- Note the following points: - - - In the activities section, there is only one activity whose **type** is set to **Copy**. For more information about the copy activity, see [data movement activities](data-factory-data-movement-activities.md). In Data Factory solutions, you can also use [data transformation activities](data-factory-data-transformation-activities.md). - - Input for the activity is set to **InputDataset** and output for the activity is set to **OutputDataset**. - - In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). To learn how to use a specific supported data store as a source/sink, click the link in the table. - - Currently, the output dataset is what drives the schedule. In this tutorial, the output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of the output dataset are produced by the pipeline. -12. Add the following code to the **Main** method to get the status of a data slice of the output dataset. There is only one slice expected in this sample.
-- ```csharp - // Pulling status within a timeout threshold - DateTime start = DateTime.Now; - bool done = false; -- while (DateTime.Now - start < TimeSpan.FromMinutes(5) && !done) - { - Console.WriteLine("Pulling the slice status"); - // wait before the next status check - Thread.Sleep(1000 * 12); -- var datalistResponse = client.DataSlices.List(resourceGroupName, dataFactoryName, Dataset_Destination, - new DataSliceListParameters() - { - DataSliceRangeStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString(), - DataSliceRangeEndTime = PipelineActivePeriodEndTime.ConvertToISO8601DateTimeString() - }); -- foreach (DataSlice slice in datalistResponse.DataSlices) - { - if (slice.State == DataSliceState.Failed || slice.State == DataSliceState.Ready) - { - Console.WriteLine("Slice execution is done with status: {0}", slice.State); - done = true; - break; - } - else - { - Console.WriteLine("Slice status is: {0}", slice.State); - } - } - } - ``` --13. Add the following code to get run details for a data slice to the **Main** method. -- ```csharp - Console.WriteLine("Getting run details of a data slice"); -- // give it a few minutes for the output slice to be ready - Console.WriteLine("\nGive it a few minutes for the output slice to be ready and press any key."); - Console.ReadKey(); -- var datasliceRunListResponse = client.DataSliceRuns.List( - resourceGroupName, - dataFactoryName, - Dataset_Destination, - new DataSliceRunListParameters() - { - DataSliceStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString() - } - ); -- foreach (DataSliceRun run in datasliceRunListResponse.DataSliceRuns) - { - Console.WriteLine("Status: \t\t{0}", run.Status); - Console.WriteLine("DataSliceStart: \t{0}", run.DataSliceStart); - Console.WriteLine("DataSliceEnd: \t\t{0}", run.DataSliceEnd); - Console.WriteLine("ActivityId: \t\t{0}", run.ActivityName); - Console.WriteLine("ProcessingStartTime: \t{0}", run.ProcessingStartTime); - Console.WriteLine("ProcessingEndTime: \t{0}", run.ProcessingEndTime); - Console.WriteLine("ErrorMessage: \t{0}", run.ErrorMessage); - } -- Console.WriteLine("\nPress any key to exit."); - Console.ReadKey(); - ``` --14. Add the following helper method used by the **Main** method to the **Program** class. -- > [!NOTE] - > When you copy and paste the following code, make sure that the copied code is at the same level as the Main method. -- ```csharp - public static async Task<string> GetAuthorizationHeader() - { - AuthenticationContext context = new AuthenticationContext(ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] + ConfigurationManager.AppSettings["ActiveDirectoryTenantId"]); - ClientCredential credential = new ClientCredential( - ConfigurationManager.AppSettings["ApplicationId"], - ConfigurationManager.AppSettings["Password"]); - AuthenticationResult result = await context.AcquireTokenAsync( - resource: ConfigurationManager.AppSettings["WindowsManagementUri"], - clientCredential: credential); -- if (result != null) - return result.AccessToken; -- throw new InvalidOperationException("Failed to acquire token"); - } - ``` --15. In the Solution Explorer, expand the project (DataFactoryAPITestApp), right-click **References**, and click **Add Reference**. Select check box for **System.Configuration** assembly. and click **OK**. -16. Build the console application. Click **Build** on the menu and click **Build Solution**. -17. Confirm that there is at least one file in the **adftutorial** container in your Azure blob storage. 
If not, create an **Emp.txt** file in Notepad with the following content and upload it to the adftutorial container. -- ``` - John, Doe - Jane, Doe - ``` -18. Run the sample by clicking **Debug** -> **Start Debugging** on the menu. When you see the **Getting run details of a data slice** message, wait for a few minutes, and press **ENTER**. -19. Use the Azure portal to verify that the data factory **APITutorialFactory** is created with the following artifacts: - * Linked services: **AzureStorageLinkedService** and **AzureSqlLinkedService** - * Datasets: **InputDataset** and **OutputDataset** - * Pipeline: **ADFTutorialPipeline** -20. Verify that the two employee records are created in the **emp** table in the specified database. --## Next steps -For complete documentation on .NET API for Data Factory, see [Data Factory .NET API Reference](/dotnet/api/overview/azure/data-factory). --In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---To learn about how to copy data to/from a data store, click the link for the data store in the table. |
data-factory | Data Factory Copy Activity Tutorial Using Powershell | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-tutorial-using-powershell.md | - Title: 'Tutorial: Create a pipeline to move data by using Azure PowerShell ' -description: In this tutorial, you create an Azure Data Factory pipeline with Copy Activity by using Azure PowerShell. ----- Previously updated : 04/12/2023----# Tutorial: Create a Data Factory pipeline that moves data by using Azure PowerShell -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-powershell.md). --In this article, you learn how to use PowerShell to create a data factory with a pipeline that copies data from an Azure blob storage to Azure SQL Database. If you are new to Azure Data Factory, read through the [Introduction to Azure Data Factory](data-factory-introduction.md) article before doing this tutorial. --In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see [Data Movement Activities](data-factory-data-movement-activities.md). --A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [multiple activities in a pipeline](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --> [!NOTE] -> This article does not cover all the Data Factory cmdlets. See [Data Factory Cmdlet Reference](/powershell/module/az.datafactory) for comprehensive documentation on these cmdlets. -> -> The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Prerequisites ---- Complete prerequisites listed in the [tutorial prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) article.-- Install **Azure PowerShell**. Follow the instructions in [How to install and configure Azure PowerShell](/powershell/azure/install-azure-powershell).--## Steps -Here are the steps you perform as part of this tutorial: --1. Create an Azure **data factory**. 
In this step, you create a data factory named ADFTutorialDataFactoryPSH. -1. Create **linked services** in the data factory. In this step, you create two linked services of types: Azure Storage and Azure SQL Database. -- The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -- AzureSqlLinkedService links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created a SQL table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -1. Create input and output **datasets** in the data factory. -- The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset specifies the container and the folder that contains the input data. -- Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to your database. And, the output SQL table dataset specifies the table in the database to which the data from the blob storage is copied. -1. Create a **pipeline** in the data factory. In this step, you create a pipeline with a copy activity. -- The copy activity copies data from a blob in the Azure blob storage to a table in Azure SQL Database. You can use a copy activity in a pipeline to copy data from any supported source to any supported destination. For a list of supported data stores, see [data movement activities](data-factory-data-movement-activities.md#supported-data-stores-and-formats) article. -1. Monitor the pipeline. In this step, you **monitor** the slices of input and output datasets by using PowerShell. --## Create a data factory -> [!IMPORTANT] -> Complete [prerequisites for the tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) if you haven't already done so. --A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store and an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Let's start with creating the data factory in this step. --1. Launch **PowerShell**. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again. -- Run the following command, and enter the user name and password that you use to sign in to the Azure portal: -- ```powershell - Connect-AzAccount - ``` -- Run the following command to view all the subscriptions for this account: -- ```powershell - Get-AzSubscription - ``` -- Run the following command to select the subscription that you want to work with. Replace **<NameOfAzureSubscription>** with the name of your Azure subscription: -- ```powershell - Get-AzSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzContext - ``` -1. Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command: -- ```powershell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` -- Some of the steps in this tutorial assume that you use the resource group named **ADFTutorialResourceGroup**. 
If you use a different resource group, you need to use it in place of ADFTutorialResourceGroup in this tutorial. -1. Run the **New-AzDataFactory** cmdlet to create a data factory named **ADFTutorialDataFactoryPSH**: -- ```powershell - $df=New-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name ADFTutorialDataFactoryPSH -Location "West US" - ``` - This name may already have been taken. Therefore, make the name of the data factory unique by adding a prefix or suffix (for example: ADFTutorialDataFactoryPSH05152017) and run the command again. --Note the following points: --* The name of the Azure data factory must be globally unique. If you receive the following error, change the name (for example, yournameADFTutorialDataFactoryPSH). Use this name in place of ADFTutorialFactoryPSH while performing steps in this tutorial. See [Data Factory - Naming Rules](data-factory-naming-rules.md) for Data Factory artifacts. -- ``` - Data factory name "ADFTutorialDataFactoryPSH" is not available - ``` -* To create Data Factory instances, you must be a contributor or administrator of the Azure subscription. -* The name of the data factory may be registered as a DNS name in the future, and hence become publicly visible. -* You may receive the following error: "**This subscription is not registered to use namespace Microsoft.DataFactory.**" Do one of the following, and try publishing again: -- * In Azure PowerShell, run the following command to register the Data Factory provider: -- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` -- Run the following command to confirm that the Data Factory provider is registered: -- ```powershell - Get-AzResourceProvider - ``` - * Sign in by using the Azure subscription to the [Azure portal](https://portal.azure.com). Go to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you. --## Create linked services -You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). --Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types: AzureStorage and AzureSqlDatabase. --The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --AzureSqlLinkedService links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --### Create a linked service for an Azure storage account -In this step, you link your Azure storage account to your data factory. --1. Create a JSON file named **AzureStorageLinkedService.json** in **C:\ADFGetStartedPSH** folder with the following content: (Create the folder ADFGetStartedPSH if it does not already exist.) -- > [!IMPORTANT] - > Replace <accountname> and <accountkey> with name and key of your Azure storage account before saving the file. 
-- ```json - { - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } - } - ``` -1. In **Azure PowerShell**, switch to the **ADFGetStartedPSH** folder. -1. Run the **New-AzDataFactoryLinkedService** cmdlet to create the linked service: **AzureStorageLinkedService**. This cmdlet, and other Data Factory cmdlets you use in this tutorial requires you to pass values for the **ResourceGroupName** and **DataFactoryName** parameters. Alternatively, you can pass the DataFactory object returned by the New-AzDataFactory cmdlet without typing ResourceGroupName and DataFactoryName each time you run a cmdlet. -- ```powershell - New-AzDataFactoryLinkedService $df -File .\AzureStorageLinkedService.json - ``` - Here is the sample output: -- ``` - LinkedServiceName : AzureStorageLinkedService - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties - ProvisioningState : Succeeded - ``` -- Other way of creating this linked service is to specify resource group name and data factory name instead of specifying the DataFactory object. -- ```powershell - New-AzDataFactoryLinkedService -ResourceGroupName ADFTutorialResourceGroup -DataFactoryName <Name of your data factory> -File .\AzureStorageLinkedService.json - ``` --### Create a linked service for Azure SQL Database -In this step, you link Azure SQL Database to your data factory. --1. Create a JSON file named AzureSqlLinkedService.json in C:\ADFGetStartedPSH folder with the following content: -- > [!IMPORTANT] - > Replace <servername>, <databasename>, <username@servername>, and <password> with names of your server, database, user account, and password. -- ```json - { - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<databasename>;User ID=<user>@<server>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } - } - ``` -1. Run the following command to create a linked service: -- ```powershell - New-AzDataFactoryLinkedService $df -File .\AzureSqlLinkedService.json - ``` -- Here is the sample output: -- ``` - LinkedServiceName : AzureSqlLinkedService - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - Properties : Microsoft.Azure.Management.DataFactories.Models.LinkedServiceProperties - ProvisioningState : Succeeded - ``` -- Confirm that **Allow access to Azure services** setting is turned on for your server. To verify and turn it on, do the following steps: -- 1. Log in to the [Azure portal](https://portal.azure.com) - 1. Click **More services >** on the left, and click **SQL servers** in the **DATABASES** category. - 1. Select your server in the list of SQL servers. - 1. On the SQL server blade, click **Show firewall settings** link. - 1. In the **Firewall settings** blade, click **ON** for **Allow access to Azure services**. - 1. Click **Save** on the toolbar. --## Create datasets -In the previous step, you created linked services to link your Azure Storage account and Azure SQL Database to your data factory. 
In this step, you define two datasets named InputDataset and OutputDataset that represent input and output data that is stored in the data stores referred by AzureStorageLinkedService and AzureSqlLinkedService respectively. --The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data. --Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to your database. And, the output SQL table dataset (OututDataset) specifies the table in the database to which the data from the blob storage is copied. --### Create an input dataset -In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName. --1. Create a JSON file named **InputDataset.json** in the **C:\ADFGetStartedPSH** folder, with the following content: -- ```json - { - "name": "InputDataset", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "emp.txt", - "folderPath": "adftutorial/", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | type | The type property is set to **AzureBlob** because data resides in an Azure blob storage. | - | linkedServiceName | Refers to the **AzureStorageLinkedService** that you created earlier. | - | folderPath | Specifies the blob **container** and the **folder** that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder. | - | fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, **emp.txt** is specified for the fileName, so only that file is picked up for processing. | - | format -> type |The input file is in the text format, so we use **TextFormat**. | - | columnDelimiter | The columns in the input file are delimited by **comma character (`,`)**. | - | frequency/interval | The frequency is set to **Hour** and interval is set to **1**, which means that the input slices are available **hourly**. In other words, the Data Factory service looks for input data every hour in the root folder of blob container (**adftutorial**) you specified. It looks for the data within the pipeline start and end times, not before or after these times. | - | external | This property is set to **true** if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. | -- For more information about these JSON properties, see [Azure Blob connector article](data-factory-azure-blob-connector.md#dataset-properties). -1. 
Run the following command to create the Data Factory dataset. -- ```powershell - New-AzDataFactoryDataset $df -File .\InputDataset.json - ``` - Here is the sample output: -- ``` - DatasetName : InputDataset - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - Availability : Microsoft.Azure.Management.DataFactories.Common.Models.Availability - Location : Microsoft.Azure.Management.DataFactories.Models.AzureBlobDataset - Policy : Microsoft.Azure.Management.DataFactories.Common.Models.Policy - Structure : {FirstName, LastName} - Properties : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties - ProvisioningState : Succeeded - ``` --### Create an output dataset -In this part of the step, you create an output dataset named **OutputDataset**. This dataset points to a SQL table in Azure SQL Database represented by **AzureSqlLinkedService**. --1. Create a JSON file named **OutputDataset.json** in the **C:\ADFGetStartedPSH** folder with the following content: -- ```json - { - "name": "OutputDataset", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "emp" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | type | The type property is set to **AzureSqlTable** because data is copied to a table in Azure SQL Database. | - | linkedServiceName | Refers to the **AzureSqlLinkedService** that you created earlier. | - | tableName | Specified the **table** to which the data is copied. | - | frequency/interval | The frequency is set to **Hour** and interval is **1**, which means that the output slices are produced **hourly** between the pipeline start and end times, not before or after these times. | -- There are three columns - **ID**, **FirstName**, and **LastName** - in the emp table in the database. ID is an identity column, so you need to specify only **FirstName** and **LastName** here. -- For more information about these JSON properties, see [Azure SQL connector article](data-factory-azure-sql-connector.md#dataset-properties). -1. Run the following command to create the data factory dataset. -- ```powershell - New-AzDataFactoryDataset $df -File .\OutputDataset.json - ``` -- Here is the sample output: -- ``` - DatasetName : OutputDataset - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - Availability : Microsoft.Azure.Management.DataFactories.Common.Models.Availability - Location : Microsoft.Azure.Management.DataFactories.Models.AzureSqlTableDataset - Policy : - Structure : {FirstName, LastName} - Properties : Microsoft.Azure.Management.DataFactories.Models.DatasetProperties - ProvisioningState : Succeeded - ``` --## Create a pipeline -In this step, you create a pipeline with a **copy activity** that uses **InputDataset** as an input and **OutputDataset** as an output. --Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of output dataset are produced by the pipeline. --1. 
Create a JSON file named **ADFTutorialPipeline.json** in the **C:\ADFGetStartedPSH** folder, with the following content: -- ```json - { - "name": "ADFTutorialPipeline", - "properties": { - "description": "Copy data from a blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "type": "Copy", - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "writeBatchSize": 10000, - "writeBatchTimeout": "60:00:00" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2017-05-11T00:00:00Z", - "end": "2017-05-12T00:00:00Z" - } - } - ``` - Note the following points: -- - In the activities section, there is only one activity whose **type** is set to **Copy**. For more information about the copy activity, see [data movement activities](data-factory-data-movement-activities.md). In Data Factory solutions, you can also use [data transformation activities](data-factory-data-transformation-activities.md). - - Input for the activity is set to **InputDataset** and output for the activity is set to **OutputDataset**. - - In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). To learn how to use a specific supported data store as a source/sink, click the link in the table. -- Replace the value of the **start** property with the current day and **end** value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2016-02-03", which is equivalent to "2016-02-03T00:00:00Z" -- Both start and end datetimes must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2016-10-14T16:32:41Z. The **end** time is optional, but we use it in this tutorial. -- If you do not specify value for the **end** property, it is calculated as "**start + 48 hours**". To run the pipeline indefinitely, specify **9999-09-09** as the value for the **end** property. -- In the preceding example, there are 24 data slices as each data slice is produced hourly. -- For descriptions of JSON properties in a pipeline definition, see [create pipelines](data-factory-create-pipelines.md) article. For descriptions of JSON properties in a copy activity definition, see [data movement activities](data-factory-data-movement-activities.md). For descriptions of JSON properties supported by BlobSource, see [Azure Blob connector article](data-factory-azure-blob-connector.md). For descriptions of JSON properties supported by SqlSink, see [Azure SQL Database connector article](data-factory-azure-sql-connector.md). -1. Run the following command to create the data factory table. 
-- ```powershell - New-AzDataFactoryPipeline $df -File .\ADFTutorialPipeline.json - ``` -- Here is the sample output: -- ``` - PipelineName : ADFTutorialPipeline - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - Properties : Microsoft.Azure.Management.DataFactories.Models.PipelinePropertie - ProvisioningState : Succeeded - ``` --**Congratulations!** You have successfully created an Azure data factory with a pipeline to copy data from an Azure blob storage to Azure SQL Database. --## Monitor the pipeline -In this step, you use Azure PowerShell to monitor what's going on in an Azure data factory. --1. Replace <DataFactoryName> with the name of your data factory and run **Get-AzDataFactory**, and assign the output to a variable $df. -- ```powershell - $df=Get-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name <DataFactoryName> - ``` -- For example: - ```powershell - $df=Get-AzDataFactory -ResourceGroupName ADFTutorialResourceGroup -Name ADFTutorialDataFactoryPSH0516 - ``` -- Then, run print the contents of $df to see the following output: -- ``` - PS C:\ADFGetStartedPSH> $df -- DataFactoryName : ADFTutorialDataFactoryPSH0516 - DataFactoryId : 6f194b34-03b3-49ab-8f03-9f8a7b9d3e30 - ResourceGroupName : ADFTutorialResourceGroup - Location : West US - Tags : {} - Properties : Microsoft.Azure.Management.DataFactories.Models.DataFactoryProperties - ProvisioningState : Succeeded - ``` -1. Run **Get-AzDataFactorySlice** to get details about all slices of the **OutputDataset**, which is the output dataset of the pipeline. -- ```powershell - Get-AzDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z - ``` -- This setting should match the **Start** value in the pipeline JSON. You should see 24 slices, one for each hour from 12 AM of the current day to 12 AM of the next day. -- Here are three sample slices from the output: -- ``` - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - DatasetName : OutputDataset - Start : 5/11/2017 11:00:00 PM - End : 5/12/2017 12:00:00 AM - RetryCount : 0 - State : Ready - SubState : - LatencyStatus : - LongRetryCount : 0 -- ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - DatasetName : OutputDataset - Start : 5/11/2017 9:00:00 PM - End : 5/11/2017 10:00:00 PM - RetryCount : 0 - State : InProgress - SubState : - LatencyStatus : - LongRetryCount : 0 -- ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - DatasetName : OutputDataset - Start : 5/11/2017 8:00:00 PM - End : 5/11/2017 9:00:00 PM - RetryCount : 0 - State : Waiting - SubState : ConcurrencyLimit - LatencyStatus : - LongRetryCount : 0 - ``` -1. Run **Get-AzDataFactoryRun** to get the details of activity runs for a **specific** slice. Copy the date-time value from the output of the previous command to specify the value for the StartDateTime parameter. 
-- ```powershell - Get-AzDataFactoryRun $df -DatasetName OutputDataset -StartDateTime "5/11/2017 09:00:00 PM" - ``` -- Here is the sample output: -- ``` - Id : c0ddbd75-d0c7-4816-a775-704bbd7c7eab_636301332000000000_636301368000000000_OutputDataset - ResourceGroupName : ADFTutorialResourceGroup - DataFactoryName : ADFTutorialDataFactoryPSH0516 - DatasetName : OutputDataset - ProcessingStartTime : 5/16/2017 8:00:33 PM - ProcessingEndTime : 5/16/2017 8:01:36 PM - PercentComplete : 100 - DataSliceStart : 5/11/2017 9:00:00 PM - DataSliceEnd : 5/11/2017 10:00:00 PM - Status : Succeeded - Timestamp : 5/16/2017 8:00:33 PM - RetryAttempt : 0 - Properties : {} - ErrorMessage : - ActivityName : CopyFromBlobToSQL - PipelineName : ADFTutorialPipeline - Type : Copy - ``` --For comprehensive documentation on Data Factory cmdlets, see [Data Factory Cmdlet Reference](/powershell/module/az.datafactory). --## Summary -In this tutorial, you created an Azure data factory to copy data from an Azure blob to Azure SQL Database. You used PowerShell to create the data factory, linked services, datasets, and a pipeline. Here are the high-level steps you performed in this tutorial: --1. Created an Azure **data factory**. -1. Created **linked services**: -- a. An **Azure Storage** linked service to link your Azure storage account that holds input data. - b. An **Azure SQL** linked service to link your SQL database that holds the output data. -1. Created **datasets** that describe input data and output data for pipelines. -1. Created a **pipeline** with **Copy Activity**, with **BlobSource** as the source and **SqlSink** as the sink. --## Next steps -In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---To learn about how to copy data to/from a data store, click the link for the data store in the table. |
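As an optional addition to the monitoring steps above, a small polling loop saves re-running **Get-AzDataFactorySlice** by hand. This is a sketch only: it assumes the `$df` variable from the monitoring section is still set and reuses the tutorial's example start time.

```powershell
# Sketch: poll the OutputDataset slices until none are Waiting or InProgress (assumes $df is set).
do {
    $slices  = Get-AzDataFactorySlice $df -DatasetName OutputDataset -StartDateTime 2017-05-11T00:00:00Z
    $pending = @($slices | Where-Object { $_.State -in 'Waiting', 'InProgress' })
    Write-Host ("{0} of {1} slices still pending" -f $pending.Count, $slices.Count)
    if ($pending.Count -gt 0) { Start-Sleep -Seconds 60 }
} while ($pending.Count -gt 0)

# Summarize the final state of each slice (for example, Ready or Failed).
$slices | Group-Object State | Select-Object Name, Count
```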
data-factory | Data Factory Copy Activity Tutorial Using Rest Api | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-tutorial-using-rest-api.md | - Title: 'Tutorial: Use REST API to create an Azure Data Factory pipeline ' -description: In this tutorial, you use REST API to create an Azure Data Factory pipeline with a Copy Activity to copy data from an Azure blob storage to Azure SQL Database. ---- Previously updated : 04/12/2023-----# Tutorial: Use REST API to create an Azure Data Factory pipeline to copy data -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) -> -> --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-rest-api.md). --In this article, you learn how to use REST API to create a data factory with a pipeline that copies data from an Azure blob storage to Azure SQL Database. If you are new to Azure Data Factory, read through the [Introduction to Azure Data Factory](data-factory-introduction.md) article before doing this tutorial. --In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see [Data Movement Activities](data-factory-data-movement-activities.md). --A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [multiple activities in a pipeline](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --> [!NOTE] -> This article does not cover all the Data Factory REST API. See [Data Factory REST API Reference](/rest/api/datafactory/) for comprehensive documentation on Data Factory cmdlets. -> -> The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Prerequisites ---* Go through [Tutorial Overview](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) and complete the **prerequisite** steps. -* Install [Curl](https://curl.haxx.se/dlwiz/) on your machine. You use the Curl tool with REST commands to create a data factory. 
-* Follow instructions from [this article](../../active-directory/develop/howto-create-service-principal-portal.md) to: - 1. Create a Web application named **ADFCopyTutorialApp** in Microsoft Entra ID. - 2. Get **client ID** and **secret key**. - 3. Get **tenant ID**. - 4. Assign the **ADFCopyTutorialApp** application to the **Data Factory Contributor** role. -* Install [Azure PowerShell](/powershell/azure/). -* Launch **PowerShell** and do the following steps. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again. - - 1. Run the following command and enter the user name and password that you use to sign in to the Azure portal: - - ```PowerShell - Connect-AzAccount - ``` - 2. Run the following command to view all the subscriptions for this account: -- ```PowerShell - Get-AzSubscription - ``` - 3. Run the following command to select the subscription that you want to work with. Replace **<NameOfAzureSubscription**> with the name of your Azure subscription. - - ```PowerShell - Get-AzSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzContext - ``` - 4. Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command in the PowerShell: -- ```PowerShell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` - - If the resource group already exists, you specify whether to update it (Y) or keep it as (N). - - Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. --## Create JSON definitions -Create following JSON files in the folder where curl.exe is located. --### datafactory.json -> [!IMPORTANT] -> Name must be globally unique, so you may want to prefix/suffix ADFCopyTutorialDF to make it a unique name. -> -> --```JSON -{ - "name": "ADFCopyTutorialDF", - "location": "WestUS" -} -``` --### azurestoragelinkedservice.json -> [!IMPORTANT] -> Replace **accountname** and **accountkey** with name and key of your Azure storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --For details about JSON properties, see [Azure Storage linked service](data-factory-azure-blob-connector.md#azure-storage-linked-service). --### azuresqllinkedservice.json -> [!IMPORTANT] -> Replace **servername**, **databasename**, **username**, and **password** with name of your server, name of SQL database, user account, and password for the account. -> -> --```JSON -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "description": "", - "typeProperties": { - "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30" - } - } -} -``` --For details about JSON properties, see [Azure SQL linked service](data-factory-azure-sql-connector.md#linked-service-properties). 
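If you would rather not paste the storage key into **azurestoragelinkedservice.json** by hand, a sketch along the following lines fills it in from your signed-in Azure PowerShell session. It is only an illustration: the resource group name of the storage account is a placeholder, and the sketch assumes the file sits in the current folder (next to curl.exe) exactly as created above.

```PowerShell
# Sketch (assumptions noted above): look up the storage key and patch azurestoragelinkedservice.json in place.
$storageRg      = "<resource group of the storage account>"   # placeholder
$storageAccount = "<accountname>"                             # placeholder

$key = (Get-AzStorageAccountKey -ResourceGroupName $storageRg -Name $storageAccount)[0].Value

$ls = Get-Content .\azurestoragelinkedservice.json -Raw | ConvertFrom-Json
$ls.properties.typeProperties.connectionString = "DefaultEndpointsProtocol=https;AccountName=$storageAccount;AccountKey=$key"
$ls | ConvertTo-Json -Depth 10 | Set-Content .\azurestoragelinkedservice.json
```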
--### inputdataset.json --```JSON -{ - "name": "AzureBlobInput", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "adftutorial/", - "fileName": "emp.txt", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --The following table provides descriptions for the JSON properties used in the snippet: --| Property | Description | -|: |: | -| type | The type property is set to **AzureBlob** because data resides in an Azure blob storage. | -| linkedServiceName | Refers to the **AzureStorageLinkedService** that you created earlier. | -| folderPath | Specifies the blob **container** and the **folder** that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder. | -| fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, **emp.txt** is specified for the fileName, so only that file is picked up for processing. | -| format -> type |The input file is in the text format, so we use **TextFormat**. | -| columnDelimiter | The columns in the input file are delimited by **comma character (`,`)**. | -| frequency/interval | The frequency is set to **Hour** and interval is set to **1**, which means that the input slices are available **hourly**. In other words, the Data Factory service looks for input data every hour in the root folder of blob container (**adftutorial**) you specified. It looks for the data within the pipeline start and end times, not before or after these times. | -| external | This property is set to **true** if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. | --For more information about these JSON properties, see [Azure Blob connector article](data-factory-azure-blob-connector.md#dataset-properties). --### outputdataset.json --```JSON -{ - "name": "AzureSqlOutput", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "emp" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -The following table provides descriptions for the JSON properties used in the snippet: --| Property | Description | -|: |: | -| type | The type property is set to **AzureSqlTable** because data is copied to a table in Azure SQL Database. | -| linkedServiceName | Refers to the **AzureSqlLinkedService** that you created earlier. | -| tableName | Specified the **table** to which the data is copied. | -| frequency/interval | The frequency is set to **Hour** and interval is **1**, which means that the output slices are produced **hourly** between the pipeline start and end times, not before or after these times. | --There are three columns - **ID**, **FirstName**, and **LastName** - in the emp table in the database. ID is an identity column, so you need to specify only **FirstName** and **LastName** here. --For more information about these JSON properties, see [Azure SQL connector article](data-factory-azure-sql-connector.md#dataset-properties). 
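Before moving on to the pipeline definition, it can be worth confirming that every JSON file you have created so far parses cleanly, since a stray comma otherwise only surfaces later as a REST error. A minimal sketch, run from the folder that contains the files:

```PowerShell
# Sketch: parse each definition file created so far and report any JSON syntax errors.
foreach ($file in Get-ChildItem .\*.json) {
    try {
        Get-Content $file.FullName -Raw | ConvertFrom-Json | Out-Null
        Write-Host "$($file.Name): OK"
    }
    catch {
        Write-Warning "$($file.Name): $($_.Exception.Message)"
    }
}
```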
--### pipeline.json --```JSON -{ - "name": "ADFTutorialPipeline", - "properties": { - "description": "Copy data from a blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "description": "Push Regional Effectiveness Campaign data to Azure SQL Database", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureSqlOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "writeBatchSize": 10000, - "writeBatchTimeout": "60:00:00" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2017-05-11T00:00:00Z", - "end": "2017-05-12T00:00:00Z" - } -} -``` --Note the following points: --- In the activities section, there is only one activity whose **type** is set to **Copy**. For more information about the copy activity, see [data movement activities](data-factory-data-movement-activities.md). In Data Factory solutions, you can also use [data transformation activities](data-factory-data-transformation-activities.md).-- Input for the activity is set to **AzureBlobInput** and output for the activity is set to **AzureSqlOutput**. -- In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). To learn how to use a specific supported data store as a source/sink, click the link in the table. - -Replace the value of the **start** property with the current day and **end** value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2017-02-03", which is equivalent to "2017-02-03T00:00:00Z" - -Both start and end datetimes must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2016-10-14T16:32:41Z. The **end** time is optional, but we use it in this tutorial. - -If you do not specify value for the **end** property, it is calculated as "**start + 48 hours**". To run the pipeline indefinitely, specify **9999-09-09** as the value for the **end** property. - -In the preceding example, there are 24 data slices as each data slice is produced hourly. --For descriptions of JSON properties in a pipeline definition, see [create pipelines](data-factory-create-pipelines.md) article. For descriptions of JSON properties in a copy activity definition, see [data movement activities](data-factory-data-movement-activities.md). For descriptions of JSON properties supported by BlobSource, see [Azure Blob connector article](data-factory-azure-blob-connector.md). For descriptions of JSON properties supported by SqlSink, see [Azure SQL Database connector article](data-factory-azure-sql-connector.md). --## Set global variables -In Azure PowerShell, execute the following commands after replacing the values with your own: --> [!IMPORTANT] -> See [Prerequisites](#prerequisites) section for instructions on getting client ID, client secret, tenant ID, and subscription ID. 
-> -> --```PowerShell -$client_id = "<client ID of application in AAD>" -$client_secret = "<client key of application in AAD>" -$tenant = "<Azure tenant ID>"; -$subscription_id="<Azure subscription ID>"; --$rg = "ADFTutorialResourceGroup" -``` --Run the following command after updating the name of the data factory you are using: --``` -$adf = "ADFCopyTutorialDF" -``` --<a name='authenticate-with-aad'></a> --## Authenticate with Microsoft Entra ID -Run the following command to authenticate with Microsoft Entra ID: --```PowerShell -$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token -F grant_type=client_credentials -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret }; -$responseToken = Invoke-Command -scriptblock $cmd; -$accessToken = (ConvertFrom-Json $responseToken).access_token; --(ConvertFrom-Json $responseToken) -``` --## Create data factory -In this step, you create an Azure Data Factory named **ADFCopyTutorialDF**. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, or an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Run the following commands to create the data factory: --1. Assign the command to a variable named **cmd**. - - > [!IMPORTANT] - > Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in the **datafactory.json**. - - ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@datafactory.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/ADFCopyTutorialDF0411?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. - - ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in the **results**; otherwise, you see an error message. - - ``` - Write-Host $results - ``` --Note the following points: --* The name of the Azure Data Factory must be globally unique. If you see the error in results: **Data factory name "ADFCopyTutorialDF" is not available**, do the following steps: - - 1. Change the name (for example, yournameADFCopyTutorialDF) in the **datafactory.json** file. - 2. In the first command where the **$cmd** variable is assigned a value, replace ADFCopyTutorialDF with the new name and run the command. - 3. Run the next two commands to invoke the REST API to create the data factory and print the results of the operation. - - See the [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. -* To create Data Factory instances, you need to be a contributor/administrator of the Azure subscription. -* The name of the data factory may be registered as a DNS name in the future and hence become publicly visible. -* If you receive the error: "**This subscription is not registered to use namespace Microsoft.DataFactory**", do one of the following and try publishing again: - - * In Azure PowerShell, run the following command to register the Data Factory provider: -- ```PowerShell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` - You can run the following command to confirm that the Data Factory provider is registered. 
- - ```PowerShell - Get-AzResourceProvider - ``` - * Login using the Azure subscription into the [Azure portal](https://portal.azure.com) and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. --Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link source and destination data stores to your data store. Then, define input and output datasets to represent data in linked data stores. Finally, create the pipeline with an activity that uses these datasets. --## Create linked services -You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types: AzureStorage and AzureSqlDatabase. --The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --AzureSqlLinkedService links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --### Create Azure Storage linked service -In this step, you link your Azure storage account to your data factory. You specify the name and key of your Azure storage account in this section. See [Azure Storage linked service](data-factory-azure-blob-connector.md#azure-storage-linked-service) for details about JSON properties used to define an Azure Storage linked service. --1. Assign the command to variable named **cmd**. -- ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azurestoragelinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the **results**; otherwise, you see an error message. -- ```PowerShell - Write-Host $results - ``` --### Create Azure SQL linked service -In this step, you link Azure SQL Database to your data factory. You specify the logical SQL server name, database name, user name, and user password in this section. See [Azure SQL linked service](data-factory-azure-sql-connector.md#linked-service-properties) for details about JSON properties used to define an Azure SQL linked service. --1. Assign the command to variable named **cmd**. 
- - ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azuresqllinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureSqlLinkedService?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. - - ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the **results**; otherwise, you see an error message. - - ```PowerShell - Write-Host $results - ``` --## Create datasets -In the previous step, you created linked services to link your Azure Storage account and Azure SQL Database to your data factory. In this step, you define two datasets named AzureBlobInput and AzureSqlOutput that represent input and output data that is stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService, respectively. --The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (AzureBlobInput) specifies the container and the folder that contains the input data. --Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to Azure SQL Database. And, the output SQL table dataset (AzureSqlOutput) specifies the table in the database to which the data from the blob storage is copied. --### Create input dataset -In this step, you create a dataset named AzureBlobInput that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName. --1. Assign the command to a variable named **cmd**. -- ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. - - ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. - - ```PowerShell - Write-Host $results - ``` --### Create output dataset -The Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to Azure SQL Database. The output SQL table dataset (AzureSqlOutput) you create in this step specifies the table in the database to which the data from the blob storage is copied. --1. Assign the command to a variable named **cmd**. -- ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureSqlOutput?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. 
- - ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. - - ```PowerShell - Write-Host $results - ``` --## Create pipeline -In this step, you create a pipeline with a **copy activity** that uses **AzureBlobInput** as an input and **AzureSqlOutput** as an output. --Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of output dataset are produced by the pipeline. --1. Assign the command to variable named **cmd**. -- ```PowerShell - $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/MyFirstPipeline?api-version=2015-10-01}; - ``` -2. Run the command by using **Invoke-Command**. -- ```PowerShell - $results = Invoke-Command -scriptblock $cmd; - ``` -3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the **results**; otherwise, you see an error message. -- ```PowerShell - Write-Host $results - ``` --**Congratulations!** You have successfully created an Azure data factory, with a pipeline that copies data from Azure Blob Storage to Azure SQL Database. --## Monitor pipeline -In this step, you use Data Factory REST API to monitor slices being produced by the pipeline. --```PowerShell -$ds ="AzureSqlOutput" -``` --> [!IMPORTANT] -> Make sure that the start and end times specified in the following command match the start and end times of the pipeline. --```PowerShell -$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=2017-05-11T00%3a00%3a00.0000000Z"&"end=2017-05-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01}; -``` --```PowerShell -$results2 = Invoke-Command -scriptblock $cmd; -``` --```PowerShell -IF ((ConvertFrom-Json $results2).value -ne $NULL) { - ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table -} else { - (convertFrom-Json $results2).RemoteException -} -``` --Run the Invoke-Command and the next one until you see a slice in **Ready** state or **Failed** state. When the slice is in Ready state, check the **emp** table in Azure SQL Database for the output data. --For each slice, two rows of data from the source file are copied to the emp table in Azure SQL Database. Therefore, you see 24 new records in the emp table when all the slices are successfully processed (in Ready state). --## Summary -In this tutorial, you used REST API to create an Azure data factory to copy data from an Azure blob to Azure SQL Database. Here are the high-level steps you performed in this tutorial: --1. Created an Azure **data factory**. -2. Created **linked services**: - 1. An Azure Storage linked service to link your Azure Storage account that holds input data. - 2. An Azure SQL linked service to link your database that holds the output data. -3. Created **datasets**, which describe input data and output data for pipelines. -4. 
Created a **pipeline** with a Copy Activity with BlobSource as source and SqlSink as sink. --## Next steps -In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---To learn about how to copy data to/from a data store, click the link for the data store in the table. |
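One practical note for the monitoring loop above: the bearer token issued in the authentication step expires (typically after about an hour), after which every call returns an authorization error. The response stored in `$responseToken` includes an `expires_on` value (seconds since the Unix epoch), so a quick check like the sketch below tells you whether to repeat the authentication step. The property name is taken from the token response shown earlier; adjust if your output differs.

```PowerShell
# Sketch: check whether the bearer token from the authentication step has expired.
# expires_on is epoch seconds (UTC); re-run the authentication commands if this reports expired.
$expiresOn = [DateTimeOffset]::FromUnixTimeSeconds([int64](ConvertFrom-Json $responseToken).expires_on)
if ($expiresOn -lt [DateTimeOffset]::UtcNow) {
    Write-Warning "Access token expired at $expiresOn (UTC). Re-run the authentication step."
} else {
    Write-Host "Access token is valid until $expiresOn (UTC)."
}
```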
data-factory | Data Factory Copy Activity Tutorial Using Visual Studio | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-activity-tutorial-using-visual-studio.md | - Title: 'Tutorial: Create a pipeline with Copy Activity using Visual Studio ' -description: In this tutorial, you create an Azure Data Factory pipeline with a Copy Activity by using Visual Studio. ----- Previously updated : 04/12/2023----# Tutorial: Create a pipeline with Copy Activity using Visual Studio -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) -> -> --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). --In this article, you learn how to use the Microsoft Visual Studio to create a data factory with a pipeline that copies data from an Azure blob storage to Azure SQL Database. If you are new to Azure Data Factory, read through the [Introduction to Azure Data Factory](data-factory-introduction.md) article before doing this tutorial. --In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported data store to a supported sink data store. For a list of data stores supported as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see [Data Movement Activities](data-factory-data-movement-activities.md). --A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see [multiple activities in a pipeline](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). --> [!NOTE] -> The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build a pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Prerequisites ---1. Read through [Tutorial Overview](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) article and complete the **prerequisite** steps. -2. To create Data Factory instances, you must be a member of the [Data Factory Contributor](../../role-based-access-control/built-in-roles.md#data-factory-contributor) role at the subscription/resource group level. -3. You must have the following installed on your computer: - * Visual Studio 2013 or Visual Studio 2015 - * Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. 
Navigate to [Azure Download Page](https://azure.microsoft.com/downloads/) and click **VS 2013** or **VS 2015** in the **.NET** section. - * Download the latest Azure Data Factory plugin for Visual Studio: [VS 2013](https://visualstudiogallery.msdn.microsoft.com/754d998c-8f92-4aa7-835b-e89c8c954aa5) or [VS 2015](https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005). You can also update the plugin by doing the following steps: On the menu, click **Tools** -> **Extensions and Updates** -> **Online** -> **Visual Studio Gallery** -> **Microsoft Azure Data Factory Tools for Visual Studio** -> **Update**. --## Steps --Here are the steps you perform as part of this tutorial: --1. Create **linked services** in the data factory. In this step, you create two linked services of types: Azure Storage and Azure SQL Database. -- The AzureStorageLinkedService links your Azure storage account to the data factory. You created a container and uploaded data to this storage account as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -- AzureSqlLinkedService links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created a SQL table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --2. Create input and output **datasets** in the data factory. -- The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset specifies the container and the folder that contains the input data. -- Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to Azure SQL Database. And, the output SQL table dataset specifies the table in the database to which the data from the blob storage is copied. --3. Create a **pipeline** in the data factory. In this step, you create a pipeline with a copy activity. -- The copy activity copies data from a blob in the Azure blob storage to a table in Azure SQL Database. You can use a copy activity in a pipeline to copy data from any supported source to any supported destination. For a list of supported data stores, see [data movement activities](data-factory-data-movement-activities.md#supported-data-stores-and-formats) article. --4. Create an Azure **data factory** when deploying Data Factory entities (linked services, datasets/tables, and pipelines). --## Create Visual Studio project --1. Launch **Visual Studio 2015**. Click **File**, point to **New**, and click **Project**. You should see the **New Project** dialog box. --2. In the **New Project** dialog, select the **DataFactory** template, and click **Empty Data Factory Project**. -- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/new-project-dialog.png" alt-text="New project dialog box"::: --3. Specify the name of the project, location for the solution, and name of the solution, and then click **OK**. -- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/solution-explorer.png" alt-text="Solution Explorer"::: --## Create linked services --You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. 
You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). --Therefore, you create two linked services of types: AzureStorage and AzureSqlDatabase. --The Azure Storage linked service links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --Azure SQL linked service links Azure SQL Database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of [prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --Linked services link data stores or compute services to an Azure data factory. See [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) for all the sources and sinks supported by the Copy Activity. See [compute linked services](data-factory-compute-linked-services.md) for the list of compute services supported by Data Factory. In this tutorial, you do not use any compute service. --### Create the Azure Storage linked service --1. In **Solution Explorer**, right-click **Linked Services**, point to **Add**, and click **New Item**. --2. In the **Add New Item** dialog box, select **Azure Storage Linked Service** from the list, and click **Add**. -- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/new-linked-service-dialog.png" alt-text="New Linked Service"::: --3. Replace `<accountname>` and `<accountkey>`* with the name of your Azure storage account and its key. -- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/azure-storage-linked-service.png" alt-text="Azure Storage Linked Service"::: --4. Save the **AzureStorageLinkedService1.json** file. -- For more information about JSON properties in the linked service definition, see [Azure Blob Storage connector](data-factory-azure-blob-connector.md#linked-service-properties) article. --### Create the Azure SQL linked service --1. Right-click on **Linked Services** node in the **Solution Explorer** again, point to **Add**, and click **New Item**. -2. This time, select **Azure SQL Linked Service**, and click **Add**. -3. In the **AzureSqlLinkedService1.json file**, replace `<servername>`, `<databasename>`, `<username@servername>`, and `<password>` with names of your server, database, user account, and password. -4. Save the **AzureSqlLinkedService1.json** file. -- For more information about these JSON properties, see [Azure SQL Database connector](data-factory-azure-sql-connector.md#linked-service-properties). --## Create datasets --In the previous step, you created linked services to link your Azure Storage account and Azure SQL Database to your data factory. In this step, you define two datasets named InputDataset and OutputDataset that represent input and output data that is stored in the data stores referred by AzureStorageLinkedService1 and AzureSqlLinkedService1 respectively. --The Azure storage linked service specifies the connection string that Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (InputDataset) specifies the container and the folder that contains the input data. 
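If you want to confirm that the container and file that InputDataset will reference actually exist before you author the dataset JSON, you can check them from PowerShell. The following is only a minimal sketch; it assumes the Az.Storage module is installed, and the account name and key are placeholders that you replace with the values used in your Azure Storage linked service.

```powershell
# Minimal sketch: confirm the blob container and input file exist before defining the dataset.
# <accountname> and <accountkey> are placeholders - use the values from AzureStorageLinkedService1.
$context = New-AzStorageContext -StorageAccountName "<accountname>" -StorageAccountKey "<accountkey>"

# Verify the container used in folderPath (adftutorial in this tutorial).
Get-AzStorageContainer -Name "adftutorial" -Context $context

# Verify that the input blob (emp.txt) is present in the root folder of the container.
Get-AzStorageBlob -Container "adftutorial" -Blob "emp.txt" -Context $context
```

If either command returns an error, create the container or upload the file (as described in the prerequisites) before you deploy the dataset.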
--Similarly, the Azure SQL Database linked service specifies the connection string that Data Factory service uses at run time to connect to Azure SQL Database. And, the output SQL table dataset (OutputDataset) specifies the table in the database to which the data from the blob storage is copied. --### Create input dataset --In this step, you create a dataset named InputDataset that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService1 linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder is copied to the destination. In this tutorial, you specify a value for the fileName. --Here, you use the term "tables" rather than "datasets". A table is a rectangular dataset and is the only type of dataset supported right now. --1. Right-click **Tables** in the **Solution Explorer**, point to **Add**, and click **New Item**. -2. In the **Add New Item** dialog box, select **Azure Blob**, and click **Add**. -3. Replace the JSON text with the following text and save the **AzureBlobLocation1.json** file. -- ```json - { - "name": "InputDataset", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService1", - "typeProperties": { - "folderPath": "adftutorial/", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` - The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | type | The type property is set to **AzureBlob** because data resides in an Azure blob storage. | - | linkedServiceName | Refers to the **AzureStorageLinkedService** that you created earlier. | - | folderPath | Specifies the blob **container** and the **folder** that contains the input blobs. In this tutorial, adftutorial is the blob container and the folder is the root folder. | - | fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, **emp.txt** is specified for the fileName, so only that file is picked up for processing. | - | format -> type | The input file is in the text format, so we use **TextFormat**. | - | columnDelimiter | The columns in the input file are delimited by a **comma character (`,`)**. | - | frequency/interval | The frequency is set to **Hour** and interval is set to **1**, which means that the input slices are available **hourly**. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (**adftutorial**) you specified. It looks for the data within the pipeline start and end times, not before or after these times. | - | external | This property is set to **true** if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. | -- For more information about these JSON properties, see [Azure Blob connector article](data-factory-azure-blob-connector.md#dataset-properties). --### Create output dataset --In this step, you create an output dataset named **OutputDataset**. This dataset points to a SQL table in Azure SQL Database represented by **AzureSqlLinkedService1**. --1.
Right-click **Tables** in the **Solution Explorer** again, point to **Add**, and click **New Item**. -2. In the **Add New Item** dialog box, select **Azure SQL**, and click **Add**. -3. Replace the JSON text with the following JSON and save the **AzureSqlTableLocation1.json** file. -- ```json - { - "name": "OutputDataset", - "properties": { - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService1", - "typeProperties": { - "tableName": "emp" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- The following table provides descriptions for the JSON properties used in the snippet: -- | Property | Description | - |: |: | - | type | The type property is set to **AzureSqlTable** because data is copied to a table in Azure SQL Database. | - | linkedServiceName | Refers to the **AzureSqlLinkedService** that you created earlier. | - | tableName | Specified the **table** to which the data is copied. | - | frequency/interval | The frequency is set to **Hour** and interval is **1**, which means that the output slices are produced **hourly** between the pipeline start and end times, not before or after these times. | -- There are three columns - **ID**, **FirstName**, and **LastName** - in the emp table in the database. ID is an identity column, so you need to specify only **FirstName** and **LastName** here. -- For more information about these JSON properties, see [Azure SQL connector article](data-factory-azure-sql-connector.md#dataset-properties). --## Create pipeline --In this step, you create a pipeline with a **copy activity** that uses **InputDataset** as an input and **OutputDataset** as an output. --Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of output dataset are produced by the pipeline. --1. Right-click **Pipelines** in the **Solution Explorer**, point to **Add**, and click **New Item**. -2. Select **Copy Data Pipeline** in the **Add New Item** dialog box and click **Add**. -3. Replace the JSON with the following JSON and save the **CopyActivity1.json** file. -- ```json - { - "name": "ADFTutorialPipeline", - "properties": { - "description": "Copy data from a blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "type": "Copy", - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "writeBatchSize": 10000, - "writeBatchTimeout": "60:00:00" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "style": "StartOfInterval", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2017-05-11T00:00:00Z", - "end": "2017-05-12T00:00:00Z", - "isPaused": false - } - } - ``` -- - In the activities section, there is only one activity whose **type** is set to **Copy**. For more information about the copy activity, see [data movement activities](data-factory-data-movement-activities.md). In Data Factory solutions, you can also use [data transformation activities](data-factory-data-transformation-activities.md). - - Input for the activity is set to **InputDataset** and output for the activity is set to **OutputDataset**. 
- - In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). To learn how to use a specific supported data store as a source/sink, click the link in the table. - - Replace the value of the **start** property with the current day and **end** value with the next day. You can specify only the date part and skip the time part of the date time. For example, "2016-02-03", which is equivalent to "2016-02-03T00:00:00Z" - - Both start and end datetimes must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2016-10-14T16:32:41Z. The **end** time is optional, but we use it in this tutorial. - - If you do not specify value for the **end** property, it is calculated as "**start + 48 hours**". To run the pipeline indefinitely, specify **9999-09-09** as the value for the **end** property. - - In the preceding example, there are 24 data slices as each data slice is produced hourly. -- For descriptions of JSON properties in a pipeline definition, see [create pipelines](data-factory-create-pipelines.md) article. For descriptions of JSON properties in a copy activity definition, see [data movement activities](data-factory-data-movement-activities.md). For descriptions of JSON properties supported by BlobSource, see [Azure Blob connector article](data-factory-azure-blob-connector.md). For descriptions of JSON properties supported by SqlSink, see [Azure SQL Database connector article](data-factory-azure-sql-connector.md). --## Publish/deploy Data Factory entities -In this step, you publish Data Factory entities (linked services, datasets, and pipeline) you created earlier. You also specify the name of the new data factory to be created to hold these entities. --1. Right-click project in the Solution Explorer, and click **Publish**. -2. If you see **Sign in to your Microsoft account** dialog box, enter your credentials for the account that has Azure subscription, and click **sign in**. -3. You should see the following dialog box: - - :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/publish.png" alt-text="Publish dialog box"::: -4. In the Configure data factory page, do the following steps: - - 1. select **Create New Data Factory** option. - 2. Enter **VSTutorialFactory** for **Name**. - - > [!IMPORTANT] - > The name of the Azure data factory must be globally unique. If you receive an error about the name of data factory when publishing, change the name of the data factory (for example, yournameVSTutorialFactory) and try publishing again. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. - > - > - 3. Select your Azure subscription for the **Subscription** field. - - > [!IMPORTANT] - > If you do not see any subscription, ensure that you logged in using an account that is an admin or co-admin of the subscription. - > - > - 4. Select the **resource group** for the data factory to be created. - 5. Select the **region** for the data factory. Only regions supported by the Data Factory service are shown in the drop-down list. - 6. Click **Next** to switch to the **Publish Items** page. 
- - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/configure-data-factory-page.png" alt-text="Configure data factory page"::: -5. In the **Publish Items** page, ensure that all the Data Factories entities are selected, and click **Next** to switch to the **Summary** page. - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/publish-items-page.png" alt-text="Publish items page"::: -6. Review the summary and click **Next** to start the deployment process and view the **Deployment Status**. - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/publish-summary-page.png" alt-text="Publish summary page"::: -7. In the **Deployment Status** page, you should see the status of the deployment process. Click Finish after the deployment is done. - - :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/deployment-status.png" alt-text="Deployment status page"::: --Note the following points: --* If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again: - - * In Azure PowerShell, run the following command to register the Data Factory provider. -- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` - You can run the following command to confirm that the Data Factory provider is registered. - - ```powershell - Get-AzResourceProvider - ``` - * Login using the Azure subscription into the [Azure portal](https://portal.azure.com) and navigate to a Data Factory blade (or) create a data factory in the Azure portal. This action automatically registers the provider for you. -* The name of the data factory may be registered as a DNS name in the future and hence become publicly visible. --> [!IMPORTANT] -> To create Data Factory instances, you need to be an admin/co-admin of the Azure subscription --## Monitor pipeline -Navigate to the home page for your data factory: --1. Sign in to the [Azure portal](https://portal.azure.com). -2. Click **More services** on the left menu, and click **Data factories**. -- :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/browse-data-factories.png" alt-text="Browse data factories"::: -3. Start typing the name of your data factory. -- :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/enter-data-factory-name.png" alt-text="Name of data factory"::: -4. Click your data factory in the results list to see the home page for your data factory. -- :::image type="content" source="media/data-factory-copy-activity-tutorial-using-visual-studio/data-factory-home-page.png" alt-text="Data factory home page"::: -5. Follow instructions from [Monitor datasets and pipeline](data-factory-monitor-manage-pipelines.md) to monitor the pipeline and datasets you have created in this tutorial. Currently, Visual Studio does not support monitoring Data Factory pipelines. --## Summary -In this tutorial, you created an Azure data factory to copy data from an Azure blob to Azure SQL Database. You used Visual Studio to create the data factory, linked services, datasets, and a pipeline. Here are the high-level steps you performed in this tutorial: --1. Created an Azure **data factory**. -2. Created **linked services**: - 1. An **Azure Storage** linked service to link your Azure Storage account that holds input data. - 2. 
An **Azure SQL** linked service to link your database that holds the output data. -3. Created **datasets**, which describe input data and output data for pipelines. -4. Created a **pipeline** with a **Copy Activity** with **BlobSource** as source and **SqlSink** as sink. --To see how to use a HDInsight Hive Activity to transform data by using Azure HDInsight cluster, see [Tutorial: Build your first pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See [Scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md) for detailed information. --## View all data factories in Server Explorer -This section describes how to use the Server Explorer in Visual Studio to view all the data factories in your Azure subscription and create a Visual Studio project based on an existing data factory. --1. In **Visual Studio**, click **View** on the menu, and click **Server Explorer**. -2. In the Server Explorer window, expand **Azure** and expand **Data Factory**. If you see **Sign in to Visual Studio**, enter the **account** associated with your Azure subscription and click **Continue**. Enter **password**, and click **Sign in**. Visual Studio tries to get information about all Azure data factories in your subscription. You see the status of this operation in the **Data Factory Task List** window. -- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/server-explorer.png" alt-text="Server Explorer"::: --## Create a Visual Studio project for an existing data factory --- Right-click a data factory in Server Explorer, and select **Export Data Factory to New Project** to create a Visual Studio project based on an existing data factory.-- :::image type="content" source="./media/data-factory-copy-activity-tutorial-using-visual-studio/export-data-factory-menu.png" alt-text="Export data factory to a VS project"::: --## Update Data Factory tools for Visual Studio -To update Azure Data Factory tools for Visual Studio, do the following steps: --1. Click **Tools** on the menu and select **Extensions and Updates**. -2. Select **Updates** in the left pane and then select **Visual Studio Gallery**. -3. Select **Azure Data Factory tools for Visual Studio** and click **Update**. If you do not see this entry, you already have the latest version of the tools. --## Use configuration files -You can use configuration files in Visual Studio to configure properties for linked services/tables/pipelines differently for each environment. --Consider the following JSON definition for an Azure Storage linked service. To specify **connectionString** with different values for accountname and accountkey based on the environment (Dev/Test/Production) to which you are deploying Data Factory entities. You can achieve this behavior by using separate configuration file for each environment. --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "description": "", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### Add a configuration file -Add a configuration file for each environment by performing the following steps: --1. Right-click the Data Factory project in your Visual Studio solution, point to **Add**, and click **New item**. -2. 
Select **Config** from the list of installed templates on the left, select **Configuration File**, enter a **name** for the configuration file, and click **Add**. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/add-config-file.png" alt-text="Add configuration file"::: -3. Add configuration parameters and their values in the following format: -- ```json - { - "$schema": "http://datafactories.schema.management.azure.com/vsschemas/V1/Microsoft.DataFactory.Config.json", - "AzureStorageLinkedService1": [ - { - "name": "$.properties.typeProperties.connectionString", - "value": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - ], - "AzureSqlLinkedService1": [ - { - "name": "$.properties.typeProperties.connectionString", - "value": "Server=tcp:<logical SQL server name>.database.windows.net,1433;Database=<Azure SQL datbase>;User ID=<Username>;Password=<Password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - ] - } - ``` -- This example configures connectionString property of an Azure Storage linked service and an Azure SQL linked service. Notice that the syntax for specifying name is [JsonPath](https://goessner.net/articles/JsonPath/). -- If JSON has a property that has an array of values as shown in the following code: -- ```json - "structure": [ - { - "name": "FirstName", - "type": "String" - }, - { - "name": "LastName", - "type": "String" - } - ], - ``` -- Configure properties as shown in the following configuration file (use zero-based indexing): -- ```json - { - "name": "$.properties.structure[0].name", - "value": "FirstName" - } - { - "name": "$.properties.structure[0].type", - "value": "String" - } - { - "name": "$.properties.structure[1].name", - "value": "LastName" - } - { - "name": "$.properties.structure[1].type", - "value": "String" - } - ``` --### Property names with spaces -If a property name has spaces in it, use square brackets as shown in the following example (Database server name): --```json - { - "name": "$.properties.activities[1].typeProperties.webServiceParameters.['Database server name']", - "value": "MyAsqlServer.database.windows.net" - } -``` --### Deploy solution using a configuration -When you are publishing Azure Data Factory entities in VS, you can specify the configuration that you want to use for that publishing operation. --To publish entities in an Azure Data Factory project using configuration file: --1. Right-click Data Factory project and click **Publish** to see the **Publish Items** dialog box. -2. Select an existing data factory or specify values for creating a data factory on the **Configure data factory** page, and click **Next**. -3. On the **Publish Items** page: you see a drop-down list with available configurations for the **Select Deployment Config** field. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/select-config-file.png" alt-text="Select config file"::: -4. Select the **configuration file** that you would like to use and click **Next**. -5. Confirm that you see the name of JSON file in the **Summary** page and click **Next**. -6. Click **Finish** after the deployment operation is finished. --When you deploy, the values from the configuration file are used to set values for properties in the JSON files before the entities are deployed to Azure Data Factory service. 
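Before you publish with a configuration file, it can be useful to confirm that each environment file is well-formed JSON and to list the JsonPath overrides it defines. The following PowerShell snippet is a minimal sketch and not part of the Data Factory tooling; the folder path and file-name pattern are assumptions, so adjust them to wherever you keep your configuration files in the project.

```powershell
# Minimal sketch: validate deployment configuration files and list their overrides.
# The path and filter below are assumptions - point them at your own config files.
$configFiles = Get-ChildItem -Path "." -Filter "*Config*.json"

foreach ($file in $configFiles) {
    try {
        # ConvertFrom-Json fails if the file is not well-formed JSON.
        $config = Get-Content -Path $file.FullName -Raw | ConvertFrom-Json

        # Each top-level property (other than $schema) is an entity with an array of overrides.
        foreach ($entity in $config.PSObject.Properties | Where-Object { $_.Name -ne '$schema' }) {
            foreach ($override in $entity.Value) {
                Write-Output ("{0}: {1} = {2}" -f $entity.Name, $override.name, $override.value)
            }
        }
    }
    catch {
        Write-Warning ("{0} is not valid JSON: {1}" -f $file.Name, $_.Exception.Message)
    }
}
```

Listing the overrides this way makes it easier to spot a JsonPath entry that no longer matches the entity JSON in the project before you pick the configuration on the **Publish Items** page.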
--## Use Azure Key Vault -It is not advisable, and often against security policy, to commit sensitive data such as connection strings to the code repository. See the [ADF Secure Publish](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ADFSecurePublish) sample on GitHub to learn about storing sensitive information in Azure Key Vault and using it while publishing Data Factory entities. The Secure Publish extension for Visual Studio allows secrets to be stored in Key Vault, with only references to them specified in linked services and deployment configurations. These references are resolved when you publish Data Factory entities to Azure. These files can then be committed to the source repository without exposing any secrets. ---## Next steps -In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---To learn how to copy data to/from a data store, click the link for the data store in the table. |
data-factory | Data Factory Copy Data From Azure Blob Storage To Sql Database | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-data-from-azure-blob-storage-to-sql-database.md | - Title: Copy data from Blob Storage to SQL Database - Azure -description: This tutorial shows you how to use Copy Activity in an Azure Data Factory pipeline to copy data from Blob storage to SQL database. ---- Previously updated : 04/12/2023----# Tutorial: Copy data from Blob Storage to SQL Database using Data Factory -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). --In this tutorial, you create a data factory with a pipeline to copy data from Blob storage to SQL Database. --The Copy Activity performs the data movement in Azure Data Factory. It is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. See [Data Movement Activities](data-factory-data-movement-activities.md) article for details about the Copy Activity. --> [!NOTE] -> For a detailed overview of the Data Factory service, see the [Introduction to Azure Data Factory](data-factory-introduction.md) article. -> -> --## Prerequisites for the tutorial -Before you begin this tutorial, you must have the following prerequisites: --* **Azure subscription**. If you don't have a subscription, you can create a free trial account in just a couple of minutes. See the [Free Trial](https://azure.microsoft.com/pricing/free-trial/) article for details. -* **Azure Storage Account**. You use the blob storage as a **source** data store in this tutorial. if you don't have an Azure storage account, see the [Create a storage account](../../storage/common/storage-account-create.md) article for steps to create one. -* **Azure SQL Database**. You use Azure SQL Database as a **destination** data store in this tutorial. If you don't have a database in Azure SQL Database that you can use in the tutorial, See [How to create and configure a database in Azure SQL Database](/azure/azure-sql/database/single-database-create-quickstart) to create one. -* **SQL Server 2012/2014 or Visual Studio 2013**. You use SQL Server Management Studio or Visual Studio to create a sample database and to view the result data in the database. --## Collect blob storage account name and key -You need the account name and account key of your Azure storage account to do this tutorial. Note down **account name** and **account key** for your Azure storage account. --1. Log in to the [Azure portal](https://portal.azure.com/). -2. Click **All services** on the left menu and select **Storage Accounts**. 
-- :::image type="content" source="media/data-factory-copy-data-from-azure-blob-storage-to-sql-database/browse-storage-accounts.png" alt-text="Browse - Storage accounts"::: -3. In the **Storage Accounts** blade, select the **Azure storage account** that you want to use in this tutorial. -4. Select **Access keys** link under **SETTINGS**. -5. Click **copy** (image) button next to **Storage account name** text box and save/paste it somewhere (for example: in a text file). -6. Repeat the previous step to copy or note down the **key1**. -- :::image type="content" source="media/data-factory-copy-data-from-azure-blob-storage-to-sql-database/storage-access-key.png" alt-text="Storage access key"::: -7. Close all the blades by clicking **X**. --## Collect SQL server, database, user names -You need the names of logical SQL server, database, and user to do this tutorial. Note down names of **server**, **database**, and **user** for Azure SQL Database. --1. In the **Azure portal**, click **All services** on the left and select **SQL databases**. -2. In the **SQL databases blade**, select the **database** that you want to use in this tutorial. Note down the **database name**. -3. In the **SQL database** blade, click **Properties** under **SETTINGS**. -4. Note down the values for **SERVER NAME** and **SERVER ADMIN LOGIN**. -5. Close all the blades by clicking **X**. --## Allow Azure services to access SQL server -Ensure that **Allow access to Azure services** setting turned **ON** for your server so that the Data Factory service can access your server. To verify and turn on this setting, do the following steps: --1. Click **All services** hub on the left and click **SQL servers**. -2. Select your server, and click **Firewall** under **SETTINGS**. -3. In the **Firewall settings** blade, click **ON** for **Allow access to Azure services**. -4. Close all the blades by clicking **X**. --## Prepare Blob Storage and SQL Database -Now, prepare your Azure blob storage and Azure SQL Database for the tutorial by performing the following steps: --1. Launch Notepad. Copy the following text and save it as **emp.txt** to **C:\ADFGetStarted** folder on your hard drive. -- ``` - John, Doe - Jane, Doe - ``` -2. Use tools such as [Azure Storage Explorer](https://storageexplorer.com/) to create the **adftutorial** container and to upload the **emp.txt** file to the container. --3. Use the following SQL script to create the **emp** table in your Azure SQL Database. -- ```SQL - CREATE TABLE dbo.emp - ( - ID int IDENTITY(1,1) NOT NULL, - FirstName varchar(50), - LastName varchar(50), - ) - GO -- CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID); - ``` -- **If you have SQL Server 2012/2014 installed on your computer:** follow instructions from [Managing Azure SQL Database using SQL Server Management Studio](/azure/azure-sql/database/single-database-manage) to connect to your server and run the SQL script. -- If your client is not allowed to access the logical SQL server, you need to configure firewall for your server to allow access from your machine (IP Address). See [this article](/azure/azure-sql/database/firewall-configure) for steps to configure the firewall for your server. --## Create a data factory -You have completed the prerequisites. You can create a data factory using one of the following ways. Click one of the options in the drop-down list at the top or the following links to perform the tutorial. 
--* [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -* [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -* [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -* [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -* [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -* [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) --> [!NOTE] -> The data pipeline in this tutorial copies data from a source data store to a destination data store. It does not transform input data to produce output data. For a tutorial on how to transform data using Azure Data Factory, see [Tutorial: Build your first pipeline to transform data using Hadoop cluster](data-factory-build-your-first-pipeline.md). -> -> You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. See [Scheduling and execution in Data Factory](data-factory-scheduling-and-execution.md) for detailed information. |
data-factory | Data Factory Copy Data Wizard Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-data-wizard-tutorial.md | - Title: 'Tutorial: Create a pipeline using Copy Wizard ' -description: In this tutorial, you create an Azure Data Factory pipeline with a Copy Activity by using the Copy Wizard supported by Data Factory ---- Previously updated : 04/12/2023----# Tutorial: Create a pipeline with Copy Activity using Data Factory Copy Wizard -> [!div class="op_single_selector"] -> * [Overview and prerequisites](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) -> * [Copy Wizard](data-factory-copy-data-wizard-tutorial.md) -> * [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) -> * [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md) -> * [Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -> * [REST API](data-factory-copy-activity-tutorial-using-rest-api.md) -> * [.NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). ---This tutorial shows you how to use the **Copy Wizard** to copy data from an Azure blob storage to Azure SQL Database. --The Azure Data Factory **Copy Wizard** allows you to quickly create a data pipeline that copies data from a supported source data store to a supported destination data store. Therefore, we recommend that you use the wizard as a first step to create a sample pipeline for your data movement scenario. For a list of data stores supported as sources and as destinations, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). --This tutorial shows you how to create an Azure data factory, launch the Copy Wizard, go through a series of steps to provide details about your data ingestion/movement scenario. When you finish steps in the wizard, the wizard automatically creates a pipeline with a Copy Activity to copy data from an Azure blob storage to Azure SQL Database. For more information about Copy Activity, see [data movement activities](data-factory-data-movement-activities.md). --## Prerequisites -Complete prerequisites listed in the [Tutorial Overview](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) article before performing this tutorial. --## Create data factory -In this step, you use the Azure portal to create an Azure data factory named **ADFTutorialDataFactory**. --1. Log in to [Azure portal](https://portal.azure.com). -2. Click **Create a resource** from the top-left corner, click **Data + analytics**, and click **Data Factory**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/new-data-factory-menu.png" alt-text="New->DataFactory"::: -2. In the **New data factory** blade: - - 1. Enter **ADFTutorialDataFactory** for the **name**. - The name of the Azure data factory must be globally unique. If you receive the error: `Data factory name "ADFTutorialDataFactory" is not available`, change the name of the data factory (for example, yournameADFTutorialDataFactoryYYYYMMDD) and try creating again. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. 
- - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/getstarted-data-factory-not-available.png" alt-text="Data Factory name not available"::: - 2. Select your Azure **subscription**. - 3. For Resource Group, do one of the following steps: - - - Select **Use existing** to select an existing resource group. - - Select **Create new** to enter a name for a resource group. - - Some of the steps in this tutorial assume that you use the name: **ADFTutorialResourceGroup** for the resource group. To learn about resource groups, see [Using resource groups to manage your Azure resources](../../azure-resource-manager/management/overview.md). - 4. Select a **location** for the data factory. - 5. Select **Pin to dashboard** check box at the bottom of the blade. - 6. Click **Create**. - - :::image type="content" source="media/data-factory-copy-data-wizard-tutorial/new-data-factory-blade.png" alt-text="New data factory blade"::: -3. After the creation is complete, you see the **Data Factory** blade as shown in the following image: - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/getstarted-data-factory-home-page.png" alt-text="Data factory home page"::: --## Launch Copy Wizard -1. On the Data Factory blade, click **Copy data** to launch the **Copy Wizard**. - - > [!NOTE] - > If you see that the web browser is stuck at "Authorizing...", disable/uncheck **Block third-party cookies and site data** setting in the browser settings (or) keep it enabled and create an exception for **login.microsoftonline.com** and then try launching the wizard again. -2. In the **Properties** page: - - 1. Enter **CopyFromBlobToAzureSql** for **Task name** - 2. Enter **description** (optional). - 3. Change the **Start date time** and the **End date time** so that the end date is set to today and start date to five days earlier. - 4. Click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-properties-page.png" alt-text="Copy Tool - Properties page"::: -3. On the **Source data store** page, click **Azure Blob Storage** tile. You use this page to specify the source data store for the copy task. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-source-data-store-page.png" alt-text="Copy Tool - Source data store page"::: -4. On the **Specify the Azure Blob storage account** page: - - 1. Enter **AzureStorageLinkedService** for **Linked service name**. - 2. Confirm that **From Azure subscriptions** option is selected for **Account selection method**. - 3. Select your Azure **subscription**. - 4. Select an **Azure storage account** from the list of Azure storage accounts available in the selected subscription. You can also choose to enter storage account settings manually by selecting **Enter manually** option for the **Account selection method**, and then click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-specify-azure-blob-storage-account.png" alt-text="Copy Tool - Specify the Azure Blob storage account"::: -5. On **Choose the input file or folder** page: - - 1. Double-click **adftutorial** (folder). - 2. Select **emp.txt**, and click **Choose** - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-choose-input-file-or-folder.png" alt-text="Screenshot shows the Choose option for your input file."::: -6. On the **Choose the input file or folder** page, click **Next**. 
Do not select **Binary copy**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/chose-input-file-folder.png" alt-text="Screenshot shows the Binary copy option for your input."::: -7. On the **File format settings** page, you see the delimiters and the schema that is auto-detected by the wizard by parsing the file. You can also enter the delimiters manually for the copy wizard to stop auto-detecting or to override. Click **Next** after you review the delimiters and preview data. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-file-format-settings.png" alt-text="Copy Tool - File format settings"::: -8. On the Destination data store page, select **Azure SQL Database**, and click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/choose-destination-store.png" alt-text="Copy Tool - Choose destination store"::: -9. On **Specify the Azure SQL database** page: - - 1. Enter **AzureSqlLinkedService** for the **Connection name** field. - 2. Confirm that **From Azure subscriptions** option is selected for **Server / database selection method**. - 3. Select your Azure **subscription**. - 4. Select **Server name** and **Database**. - 5. Enter **User name** and **Password**. - 6. Click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/specify-azure-sql-database.png" alt-text="Copy Tool - specify Azure SQL Database"::: -10. On the **Table mapping** page, select **emp** for the **Destination** field from the drop-down list, click **down arrow** (optional) to see the schema and to preview the data. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-table-mapping-page.png" alt-text="Copy Tool - Table mapping"::: -11. On the **Schema mapping** page, click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/schema-mapping-page.png" alt-text="Copy Tool - schema mapping"::: -12. On the **Performance settings** page, click **Next**. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/performance-settings.png" alt-text="Screenshot shows the Performance settings page where you can select Next."::: -13. Review information in the **Summary** page, and click **Finish**. The wizard creates two linked services, two datasets (input and output), and one pipeline in the data factory (from where you launched the Copy Wizard). - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/summary-page.png" alt-text="Screenshot shows the Summary page where you can select Next."::: --## Launch Monitor and Manage application -1. On the **Deployment** page, click the link: `Click here to monitor copy pipeline`. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/copy-tool-deployment-succeeded.png" alt-text="Copy Tool - Deployment succeeded"::: -2. The monitoring application is launched in a separate tab in your web browser. - - :::image type="content" source="./media/data-factory-copy-data-wizard-tutorial/monitoring-app.png" alt-text="Monitoring App"::: -3. To see the latest status of hourly slices, click **Refresh** button in the **ACTIVITY WINDOWS** list at the bottom. You see five activity windows for five days between start and end times for the pipeline. 
The list is not automatically refreshed, so you may need to click Refresh a couple of times before you see all the activity windows in the Ready state. -4. Select an activity window in the list. See the details about it in the **Activity Window Explorer** on the right. -- :::image type="content" source="media/data-factory-copy-data-wizard-tutorial/activity-window-details.png" alt-text="Activity window details"::: -- Notice that the dates 11, 12, 13, 14, and 15 are shown in green, which means that the daily output slices for these dates have already been produced. You also see this color coding on the pipeline and the output dataset in the diagram view. In the previous step, notice that two slices have already been produced, one slice is currently being processed, and the other two are waiting to be processed (based on the color coding). -- For more information on using this application, see the [Monitor and manage pipeline using Monitoring App](data-factory-monitor-manage-app.md) article. --## Next steps -In this tutorial, you used Azure blob storage as a source data store and Azure SQL Database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity: ---For details about fields/properties that you see in the copy wizard for a data store, click the link for the data store in the table. |
data-factory | Data Factory Copy Wizard | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-copy-wizard.md | - Title: Copy data easily with Copy Wizard - Azure -description: Learn how to use the Data Factory Copy Wizard to copy data from supported data sources to sinks. ---- Previously updated : 04/12/2023----# Copy or move data easily with Azure Data Factory Copy Wizard -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). ---The Azure Data Factory Copy Wizard eases the process of ingesting data, which is usually a first step in an end-to-end data integration scenario. When going through the Azure Data Factory Copy Wizard, you do not need to understand any JSON definitions for linked services, datasets, and pipelines. After you complete all the steps in the wizard, the wizard automatically creates a pipeline to copy data from the selected data source to the selected destination. In addition, the Copy Wizard helps you validate the data being ingested at authoring time, which saves you time, especially when you are ingesting data from the data source for the first time. To start the Copy Wizard, click the **Copy data** tile on the home page of your data factory. ---## An intuitive wizard for copying data -This wizard allows you to easily move data from a wide variety of sources to destinations in minutes. After going through the wizard, a pipeline with a copy activity is automatically created for you, along with dependent Data Factory entities (linked services and datasets). No additional steps are required to create the pipeline. ---> [!NOTE] -> See the [Copy Wizard tutorial](data-factory-copy-data-wizard-tutorial.md) article for step-by-step instructions to create a sample pipeline to copy data from an Azure blob to an Azure SQL Database table. -> -> --The wizard is designed with big data in mind from the start. It is simple and efficient to author Data Factory pipelines that move hundreds of folders, files, or tables by using the Copy Data wizard. The wizard supports the following three features: automatic data preview, schema capture and mapping, and filtering data. --## Automatic data preview -The copy wizard allows you to review part of the data from the selected data source so that you can validate whether it is the data you want to copy. In addition, if the source data is in a text file, the copy wizard automatically parses the text file to detect the row and column delimiters and the schema. ---## Schema capture and mapping -The schema of input data may not match the schema of output data in some cases. In this scenario, you need to map columns from the source schema to columns from the destination schema. --The copy wizard automatically maps columns in the source schema to columns in the destination schema. You can override the mappings by using the drop-down lists, or specify whether a column needs to be skipped while copying the data. ---## Filtering data -The wizard allows you to filter source data to select only the data that needs to be copied to the destination/sink data store. Filtering reduces the volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy operation.
It provides a flexible way to filter data in a relational database by using SQL query language (or) files in an Azure blob folder by using [Data Factory functions and variables](data-factory-functions-variables.md). --### Filtering of data in a database -In the example, the SQL query uses the `Text.Format` function and `WindowStart` variable. ---### Filtering of data in an Azure blob folder -You can use variables in the folder path to copy data from a folder that is determined at runtime based on [system variables](data-factory-functions-variables.md#data-factory-system-variables). The supported variables are: **{year}**, **{month}**, **{day}**, **{hour}**, **{minute}**, and **{custom}**. Example: inputfolder/{year}/{month}/{day}. --Suppose that you have input folders in the following format: --```text -2016/03/01/01 -2016/03/01/02 -2016/03/01/03 -... -``` --Click the **Browse** button for **File or folder**, browse to one of these folders (for example, 2016->03->01->02), and click **Choose**. You should see `2016/03/01/02` in the text box. Now, replace **2016** with **{year}**, **03** with **{month}**, **01** with **{day}**, and **02** with **{hour}**, and press Tab. You should see drop-down lists to select the format for these four variables: ---As shown in the following screenshot, you can also use a **custom** variable and any [supported format strings](/dotnet/standard/base-types/custom-date-and-time-format-strings). To select a folder with that structure, use the **Browse** button first. Then replace a value with **{custom}**, and press Tab to see the text box where you can type the format string. ---## Support for diverse data and object types -By using the Copy Wizard, you can efficiently move hundreds of folders, files, or tables. ---## Scheduling options -You can run the copy operation once or on a schedule (hourly, daily, and so on). Both of these options can be used for the breadth of the connectors across on-premises, cloud, and local desktop copy. --A one-time copy operation enables data movement from a source to a destination only once. It applies to data of any size and any supported format. The scheduled copy allows you to copy data on a prescribed recurrence. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy. ---## Next steps -For a quick walkthrough of using the Data Factory Copy Wizard to create a pipeline with Copy Activity, see [Tutorial: Create a pipeline using the Copy Wizard](data-factory-copy-data-wizard-tutorial.md). |
data-factory | Data Factory Create Data Factories Programmatically | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-create-data-factories-programmatically.md | - Title: Create data pipelines by using Azure .NET SDK -description: Learn how to programmatically create, monitor, and manage Azure data factories by using Data Factory SDK. ------ Previously updated : 04/12/2023----# Create, monitor, and manage Azure data factories using Azure Data Factory .NET SDK -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy activity tutorial](../quickstart-create-data-factory-dot-net.md). --## Overview -You can create, monitor, and manage Azure data factories programmatically using Data Factory .NET SDK. This article contains a walkthrough that you can follow to create a sample .NET console application that creates and monitors a data factory. --> [!NOTE] -> This article does not cover all the Data Factory .NET API. See [Data Factory .NET API Reference](/dotnet/api/overview/azure/data-factory) for comprehensive documentation on .NET API for Data Factory. --## Prerequisites ---* Visual Studio 2012 or 2013 or 2015 -* Download and install [Azure .NET SDK](https://azure.microsoft.com/downloads/). -* Azure PowerShell. Follow instructions in [How to install and configure Azure PowerShell](/powershell/azure/) article to install Azure PowerShell on your computer. You use Azure PowerShell to create a Microsoft Entra application. --<a name='create-an-application-in-azure-active-directory'></a> --### Create an application in Microsoft Entra ID -Create a Microsoft Entra application, create a service principal for the application, and assign it to the **Data Factory Contributor** role. --1. Launch **PowerShell**. -2. Run the following command and enter the user name and password that you use to sign in to the Azure portal. -- ```powershell - Connect-AzAccount - ``` -3. Run the following command to view all the subscriptions for this account. -- ```powershell - Get-AzSubscription - ``` -4. Run the following command to select the subscription that you want to work with. Replace **<NameOfAzureSubscription**> with the name of your Azure subscription. -- ```powershell - Get-AzSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzContext - ``` -- > [!IMPORTANT] - > Note down **SubscriptionId** and **TenantId** from the output of this command. --5. Create an Azure resource group named **ADFTutorialResourceGroup** by running the following command in the PowerShell. -- ```powershell - New-AzResourceGroup -Name ADFTutorialResourceGroup -Location "West US" - ``` -- If the resource group already exists, you specify whether to update it (Y) or keep it as (N). -- If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial. -6. Create a Microsoft Entra application. -- ```powershell - $azureAdApplication = New-AzADApplication -DisplayName "ADFDotNetWalkthroughApp" -HomePage "https://www.contoso.org" -IdentifierUris "https://www.adfdotnetwalkthroughapp.org/example" -Password "Pass@word1" - ``` -- If you get the following error, specify a different URL and run the command again. - - ```powershell - Another object with the same value for property identifierUris already exists. - ``` -7. Create the AD service principal. 
-- ```powershell - New-AzADServicePrincipal -ApplicationId $azureAdApplication.ApplicationId - ``` -8. Add service principal to the **Data Factory Contributor** role. -- ```powershell - New-AzRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ServicePrincipalName $azureAdApplication.ApplicationId.Guid - ``` -9. Get the application ID. -- ```powershell - $azureAdApplication - ``` - Note down the application ID (applicationID) from the output. --You should have following four values from these steps: --* Tenant ID -* Subscription ID -* Application ID -* Password (specified in the first command) --## Walkthrough -In the walkthrough, you create a data factory with a pipeline that contains a copy activity. The copy activity copies data from a folder in your Azure blob storage to another folder in the same blob storage. --The Copy Activity performs the data movement in Azure Data Factory. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. See [Data Movement Activities](data-factory-data-movement-activities.md) article for details about the Copy Activity. --> [!IMPORTANT] -> The [Microsoft.IdentityModel.Clients.ActiveDirectory](https://www.nuget.org/packages/Microsoft.IdentityModel.Clients.ActiveDirectory) NuGet package and Azure AD Authentication Library (ADAL) have been deprecated. No new features have been added since June 30, 2020. We strongly encourage you to upgrade, see the [migration guide](../../active-directory/develop/msal-migration.md) for more details. --1. Using Visual Studio 2012/2013/2015, create a C# .NET console application. - 1. Launch **Visual Studio** 2012/2013/2015. - 2. Click **File**, point to **New**, and click **Project**. - 3. Expand **Templates**, and select **Visual C#**. In this walkthrough, you use C#, but you can use any .NET language. - 4. Select **Console Application** from the list of project types on the right. - 5. Enter **DataFactoryAPITestApp** for the Name. - 6. Select **C:\ADFGetStarted** for the Location. - 7. Click **OK** to create the project. -2. Click **Tools**, point to **NuGet Package Manager**, and click **Package Manager Console**. -3. In the **Package Manager Console**, do the following steps: - 1. Run the following command to install Data Factory package: `Install-Package Microsoft.Azure.Management.DataFactories` - 2. Run the following command to install Microsoft Entra ID package (you use Active Directory API in the code): `Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory -Version 2.19.208020213` -4. Replace the contents of **App.config** file in the project with the following content: - - ```xml - <?xml version="1.0" encoding="utf-8" ?> - <configuration> - <appSettings> - <add key="ActiveDirectoryEndpoint" value="https://login.microsoftonline.com/" /> - <add key="ResourceManagerEndpoint" value="https://management.azure.com/" /> - <add key="WindowsManagementUri" value="https://management.core.windows.net/" /> -- <add key="ApplicationId" value="your application ID" /> - <add key="Password" value="Password you used while creating the AAD application" /> - <add key="SubscriptionId" value= "Subscription ID" /> - <add key="ActiveDirectoryTenantId" value="Tenant ID" /> - </appSettings> - </configuration> - ``` -5. In the App.Config file, update values for **<Application ID>**, **<Password>**, **<Subscription ID>**, and **<tenant ID>** with your own values. -6. 
Add the following **using** statements to the **Program.cs** file in the project. -- ```csharp - using System.Configuration; - using System.Collections.ObjectModel; - using System.Threading; - using System.Threading.Tasks; -- using Microsoft.Azure; - using Microsoft.Azure.Management.DataFactories; - using Microsoft.Azure.Management.DataFactories.Models; - using Microsoft.Azure.Management.DataFactories.Common.Models; -- using Microsoft.IdentityModel.Clients.ActiveDirectory; -- ``` -6. Add the following code that creates an instance of **DataPipelineManagementClient** class to the **Main** method. You use this object to create a data factory, a linked service, input and output datasets, and a pipeline. You also use this object to monitor slices of a dataset at runtime. -- ```csharp - // create data factory management client -- //IMPORTANT: specify the name of Azure resource group here - string resourceGroupName = "ADFTutorialResourceGroup"; -- //IMPORTANT: the name of the data factory must be globally unique. - // Therefore, update this value. For example:APITutorialFactory05122017 - string dataFactoryName = "APITutorialFactory"; -- TokenCloudCredentials aadTokenCredentials = new TokenCloudCredentials( - ConfigurationManager.AppSettings["SubscriptionId"], - GetAuthorizationHeader().Result); -- Uri resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]); -- DataFactoryManagementClient client = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri); - ``` -- > [!IMPORTANT] - > Replace the value of **resourceGroupName** with the name of your Azure resource group. You can create a resource group using the [New-AzureResourceGroup](/powershell/module/az.resources/new-azresourcegroup) cmdlet. - > - > Update name of the data factory (dataFactoryName) to be unique. Name of the data factory must be globally unique. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. -7. Add the following code that creates a **data factory** to the **Main** method. -- ```csharp - // create a data factory - Console.WriteLine("Creating a data factory"); - client.DataFactories.CreateOrUpdate(resourceGroupName, - new DataFactoryCreateOrUpdateParameters() - { - DataFactory = new DataFactory() - { - Name = dataFactoryName, - Location = "westus", - Properties = new DataFactoryProperties() - } - } - ); - ``` -8. Add the following code that creates an **Azure Storage linked service** to the **Main** method. -- > [!IMPORTANT] - > Replace **storageaccountname** and **accountkey** with name and key of your Azure Storage account. -- ```csharp - // create a linked service for input data store: Azure Storage - Console.WriteLine("Creating Azure Storage linked service"); - client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName, - new LinkedServiceCreateOrUpdateParameters() - { - LinkedService = new LinkedService() - { - Name = "AzureStorageLinkedService", - Properties = new LinkedServiceProperties - ( - new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=<storageaccountname>;AccountKey=<accountkey>") - ) - } - } - ); - ``` -9. Add the following code that creates **input and output datasets** to the **Main** method. -- The **FolderPath** for the input blob is set to **adftutorial/** where **adftutorial** is the name of the container in your blob storage. 
If this container does not exist in your Azure blob storage, create a container with this name: **adftutorial** and upload a text file to the container. -- The FolderPath for the output blob is set to: **adftutorial/apifactoryoutput/{Slice}** where **Slice** is dynamically calculated based on the value of **SliceStart** (start date-time of each slice.) -- ```csharp - // create input and output datasets - Console.WriteLine("Creating input and output datasets"); - string Dataset_Source = "DatasetBlobSource"; - string Dataset_Destination = "DatasetBlobDestination"; - - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, - new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Source, - Properties = new DatasetProperties() - { - LinkedServiceName = "AzureStorageLinkedService", - TypeProperties = new AzureBlobDataset() - { - FolderPath = "adftutorial/", - FileName = "emp.txt" - }, - External = true, - Availability = new Availability() - { - Frequency = SchedulePeriod.Hour, - Interval = 1, - }, - - Policy = new Policy() - { - Validation = new ValidationPolicy() - { - MinimumRows = 1 - } - } - } - } - }); - - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, - new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Destination, - Properties = new DatasetProperties() - { - - LinkedServiceName = "AzureStorageLinkedService", - TypeProperties = new AzureBlobDataset() - { - FolderPath = "adftutorial/apifactoryoutput/{Slice}", - PartitionedBy = new Collection<Partition>() - { - new Partition() - { - Name = "Slice", - Value = new DateTimePartitionValue() - { - Date = "SliceStart", - Format = "yyyyMMdd-HH" - } - } - } - }, - - Availability = new Availability() - { - Frequency = SchedulePeriod.Hour, - Interval = 1, - }, - } - } - }); - ``` -10. Add the following code that **creates and activates a pipeline** to the **Main** method. This pipeline has a **CopyActivity** that takes **BlobSource** as a source and **BlobSink** as a sink. -- The Copy Activity performs the data movement in Azure Data Factory. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. See [Data Movement Activities](data-factory-data-movement-activities.md) article for details about the Copy Activity. -- ```csharp - // create a pipeline - Console.WriteLine("Creating a pipeline"); - DateTime PipelineActivePeriodStartTime = new DateTime(2014, 8, 9, 0, 0, 0, 0, DateTimeKind.Utc); - DateTime PipelineActivePeriodEndTime = PipelineActivePeriodStartTime.AddMinutes(60); - string PipelineName = "PipelineBlobSample"; - - client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName, - new PipelineCreateOrUpdateParameters() - { - Pipeline = new Pipeline() - { - Name = PipelineName, - Properties = new PipelineProperties() - { - Description = "Demo Pipeline for data transfer between blobs", - - // Initial value for pipeline's active period. 
With this, you won't need to set slice status - Start = PipelineActivePeriodStartTime, - End = PipelineActivePeriodEndTime, - - Activities = new List<Activity>() - { - new Activity() - { - Name = "BlobToBlob", - Inputs = new List<ActivityInput>() - { - new ActivityInput() - { - Name = Dataset_Source - } - }, - Outputs = new List<ActivityOutput>() - { - new ActivityOutput() - { - Name = Dataset_Destination - } - }, - TypeProperties = new CopyActivity() - { - Source = new BlobSource(), - Sink = new BlobSink() - { - WriteBatchSize = 10000, - WriteBatchTimeout = TimeSpan.FromMinutes(10) - } - } - } - - }, - } - } - }); - ``` -12. Add the following code to the **Main** method to get the status of a data slice of the output dataset. There is only one slice expected in this sample. -- ```csharp - // Pulling status within a timeout threshold - DateTime start = DateTime.Now; - bool done = false; - - while (DateTime.Now - start < TimeSpan.FromMinutes(5) && !done) - { - Console.WriteLine("Pulling the slice status"); - // wait before the next status check - Thread.Sleep(1000 * 12); - - var datalistResponse = client.DataSlices.List(resourceGroupName, dataFactoryName, Dataset_Destination, - new DataSliceListParameters() - { - DataSliceRangeStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString(), - DataSliceRangeEndTime = PipelineActivePeriodEndTime.ConvertToISO8601DateTimeString() - }); - - foreach (DataSlice slice in datalistResponse.DataSlices) - { - if (slice.State == DataSliceState.Failed || slice.State == DataSliceState.Ready) - { - Console.WriteLine("Slice execution is done with status: {0}", slice.State); - done = true; - break; - } - else - { - Console.WriteLine("Slice status is: {0}", slice.State); - } - } - } - ``` -13. **(optional)** Add the following code to get run details for a data slice to the **Main** method. -- ```csharp - Console.WriteLine("Getting run details of a data slice"); - - // give it a few minutes for the output slice to be ready - Console.WriteLine("\nGive it a few minutes for the output slice to be ready and press any key."); - Console.ReadKey(); - - var datasliceRunListResponse = client.DataSliceRuns.List( - resourceGroupName, - dataFactoryName, - Dataset_Destination, - new DataSliceRunListParameters() - { - DataSliceStartTime = PipelineActivePeriodStartTime.ConvertToISO8601DateTimeString() - }); - - foreach (DataSliceRun run in datasliceRunListResponse.DataSliceRuns) - { - Console.WriteLine("Status: \t\t{0}", run.Status); - Console.WriteLine("DataSliceStart: \t{0}", run.DataSliceStart); - Console.WriteLine("DataSliceEnd: \t\t{0}", run.DataSliceEnd); - Console.WriteLine("ActivityId: \t\t{0}", run.ActivityName); - Console.WriteLine("ProcessingStartTime: \t{0}", run.ProcessingStartTime); - Console.WriteLine("ProcessingEndTime: \t{0}", run.ProcessingEndTime); - Console.WriteLine("ErrorMessage: \t{0}", run.ErrorMessage); - } - - Console.WriteLine("\nPress any key to exit."); - Console.ReadKey(); - ``` -14. Add the following helper method used by the **Main** method to the **Program** class. This method pops a dialog box that lets you provide **user name** and **password** that you use to log in to Azure portal. 
-- ```csharp - public static async Task<string> GetAuthorizationHeader() - { - AuthenticationContext context = new AuthenticationContext(ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] + ConfigurationManager.AppSettings["ActiveDirectoryTenantId"]); - ClientCredential credential = new ClientCredential( - ConfigurationManager.AppSettings["ApplicationId"], - ConfigurationManager.AppSettings["Password"]); - AuthenticationResult result = await context.AcquireTokenAsync( - resource: ConfigurationManager.AppSettings["WindowsManagementUri"], - clientCredential: credential); -- if (result != null) - return result.AccessToken; -- throw new InvalidOperationException("Failed to acquire token"); - } - ``` --15. In the Solution Explorer, expand the project: **DataFactoryAPITestApp**, right-click **References**, and click **Add Reference**. Select check box for `System.Configuration` assembly and click **OK**. -15. Build the console application. Click **Build** on the menu and click **Build Solution**. -16. Confirm that there is at least one file in the adftutorial container in your Azure blob storage. If not, create Emp.txt file in Notepad with the following content and upload it to the adftutorial container. -- ``` - John, Doe - Jane, Doe - ``` -17. Run the sample by clicking **Debug** -> **Start Debugging** on the menu. When you see the **Getting run details of a data slice**, wait for a few minutes, and press **ENTER**. -18. Use the Azure portal to verify that the data factory **APITutorialFactory** is created with the following artifacts: - * Linked service: **AzureStorageLinkedService** - * Dataset: **DatasetBlobSource** and **DatasetBlobDestination**. - * Pipeline: **PipelineBlobSample** -19. Verify that an output file is created in the **apifactoryoutput** folder in the **adftutorial** container. --## Get a list of failed data slices --```csharp -// Parse the resource path -var ResourceGroupName = "ADFTutorialResourceGroup"; -var DataFactoryName = "DataFactoryAPITestApp"; --var parameters = new ActivityWindowsByDataFactoryListParameters(ResourceGroupName, DataFactoryName); -parameters.WindowState = "Failed"; -var response = dataFactoryManagementClient.ActivityWindows.List(parameters); -do -{ - foreach (var activityWindow in response.ActivityWindowListResponseValue.ActivityWindows) - { - var row = string.Join( - "\t", - activityWindow.WindowStart.ToString(), - activityWindow.WindowEnd.ToString(), - activityWindow.RunStart.ToString(), - activityWindow.RunEnd.ToString(), - activityWindow.DataFactoryName, - activityWindow.PipelineName, - activityWindow.ActivityName, - string.Join(",", activityWindow.OutputDatasets)); - Console.WriteLine(row); - } -- if (response.NextLink != null) - { - response = dataFactoryManagementClient.ActivityWindows.ListNext(response.NextLink, parameters); - } - else - { - response = null; - } -} -while (response != null); -``` --## Next steps -See the following example for creating a pipeline using .NET SDK that copies data from an Azure blob storage to Azure SQL Database: --- [Create a pipeline to copy data from Blob Storage to SQL Database](data-factory-copy-activity-tutorial-using-dotnet-api.md) |
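As the important note earlier in this walkthrough explains, the `Microsoft.IdentityModel.Clients.ActiveDirectory` (ADAL) package used by `GetAuthorizationHeader` is deprecated. Note also that the helper does not prompt for credentials; it acquires a token silently with the application ID and password stored in App.config. The following is a hedged sketch only (not part of the original walkthrough) of how the same client-credential token acquisition might look with MSAL, using the `Microsoft.Identity.Client` NuGet package and the App.config keys defined earlier. Verify the details against the MSAL migration guide linked above before using it.

```csharp
// Hedged sketch: an MSAL-based alternative to GetAuthorizationHeader.
// Requires the Microsoft.Identity.Client NuGet package and
// "using Microsoft.Identity.Client;" at the top of Program.cs.
public static async Task<string> GetAuthorizationHeaderMsal()
{
    // Build a confidential client for the Microsoft Entra application created earlier.
    IConfidentialClientApplication app = ConfidentialClientApplicationBuilder
        .Create(ConfigurationManager.AppSettings["ApplicationId"])
        .WithClientSecret(ConfigurationManager.AppSettings["Password"])
        .WithAuthority(ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] +
                       ConfigurationManager.AppSettings["ActiveDirectoryTenantId"])
        .Build();

    // MSAL uses scopes instead of ADAL's "resource" parameter; the ".default"
    // scope requests the same access to the management endpoint.
    string[] scopes = { ConfigurationManager.AppSettings["WindowsManagementUri"] + ".default" };

    // var avoids ambiguity with ADAL's own AuthenticationResult type if both packages are referenced.
    var result = await app.AcquireTokenForClient(scopes).ExecuteAsync();
    return result.AccessToken;
}
```

Usage is unchanged: pass `GetAuthorizationHeaderMsal().Result` to `TokenCloudCredentials` when you construct the `DataFactoryManagementClient`.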
data-factory | Data Factory Create Datasets | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-create-datasets.md | - Title: Create datasets in Azure Data Factory -description: Learn how to create datasets in Azure Data Factory, with examples that use properties such as offset and anchorDateTime. ------ Previously updated : 04/12/2023---# Datasets in Azure Data Factory (version 1) -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-create-datasets.md) -> * [Version 2 (current version)](../concepts-datasets-linked-services.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Datasets in V2](../concepts-datasets-linked-services.md). --This article describes what datasets are, how they are defined in JSON format, and how they are used in Azure Data Factory pipelines. It provides details about each section (for example, structure, availability, and policy) in the dataset JSON definition. The article also provides examples for using the **offset**, **anchorDateTime**, and **style** properties in a dataset JSON definition. --> [!NOTE] -> If you are new to Data Factory, see [Introduction to Azure Data Factory](data-factory-introduction.md) for an overview. If you do not have hands-on experience with creating data factories, you can gain a better understanding by reading the [data transformation tutorial](data-factory-build-your-first-pipeline.md) and the [data movement tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## Overview -A data factory can have one or more pipelines. A **pipeline** is a logical grouping of **activities** that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a copy activity to copy data from a SQL Server database to Azure Blob storage. Then, you might use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you might use a second copy activity to copy the output data to Azure Synapse Analytics, on top of which business intelligence (BI) reporting solutions are built. For more information about pipelines and activities, see [Pipelines and activities in Azure Data Factory](data-factory-create-pipelines.md). --An activity can take zero or more input **datasets**, and produce one or more output datasets. An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob storage from which the pipeline should read the data. --Before you create a dataset, create a **linked service** to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Datasets identify data within the linked data stores, such as SQL tables, files, folders, and documents. For example, an Azure Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob container and the folder that contains the input blobs to be processed. --Here is a sample scenario. 
To copy data from Blob storage to SQL Database, you create two linked --The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data Factory: ---## Dataset JSON -A dataset in Data Factory is defined in JSON format as follows: --```json -{ - "name": "<name of dataset>", - "properties": { - "type": "<type of dataset: AzureBlob, AzureSql etc...>", - "external": "<boolean flag to indicate external data. only for input datasets>", - "linkedServiceName": "<Name of the linked service that refers to a data store.>", - "structure": [ - { - "name": "<Name of the column>", - "type": "<Name of the type>" - } - ], - "typeProperties": { - "<type specific property>": "<value>", - "<type specific property 2>": "<value 2>", - }, - "availability": { - "frequency": "<Specifies the time unit for data slice production. Supported frequency: Minute, Hour, Day, Week, Month>", - "interval": "<Specifies the interval within the defined frequency. For example, frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced hourly>" - }, - "policy": - { - } - } -} -``` --The following table describes properties in the above JSON: --| Property | Description | Required | Default | -| | | | | -| name |Name of the dataset. See [Azure Data Factory - Naming rules](data-factory-naming-rules.md) for naming rules. |Yes |NA | -| type |Type of the dataset. Specify one of the types supported by Data Factory (for example: AzureBlob, AzureSqlTable). <br/><br/>For details, see [Dataset type](#Type). |Yes |NA | -| structure |Schema of the dataset.<br/><br/>For details, see [Dataset structure](#Structure). |No |NA | -| typeProperties | The type properties are different for each type (for example: Azure Blob, Azure SQL table). For details on the supported types and their properties, see [Dataset type](#Type). |Yes |NA | -| external | Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. If the input dataset for an activity is not produced by the current pipeline, set this flag to true. Set this flag to true for the input dataset of the first activity in the pipeline. |No |false | -| availability | Defines the processing window (for example, hourly or daily) or the slicing model for the dataset production. Each unit of data consumed and produced by an activity run is called a data slice. If the availability of an output dataset is set to daily (frequency - Day, interval - 1), a slice is produced daily. <br/><br/>For details, see Dataset availability. <br/><br/>For details on the dataset slicing model, see the [Scheduling and execution](data-factory-scheduling-and-execution.md) article. |Yes |NA | -| policy |Defines the criteria or the condition that the dataset slices must fulfill. <br/><br/>For details, see the [Dataset policy](#Policy) section. |No |NA | --## Dataset example -In the following example, the dataset represents a table named **MyTable** in a SQL database. --```json -{ - "name": "DatasetSample", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": - { - "tableName": "MyTable" - }, - "availability": - { - "frequency": "Day", - "interval": 1 - } - } -} -``` --Note the following points: --* **type** is set to AzureSqlTable. -* **tableName** type property (specific to AzureSqlTable type) is set to MyTable. -* **linkedServiceName** refers to a linked service of type AzureSqlDatabase, which is defined in the next JSON snippet. 
-* **availability frequency** is set to Day, and **interval** is set to 1. This means that the dataset slice is produced daily. --**AzureSqlLinkedService** is defined as follows: --```json -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "description": "", - "typeProperties": { - "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30" - } - } -} -``` --In the preceding JSON snippet: --* **type** is set to AzureSqlDatabase. -* **connectionString** type property specifies information to connect to a SQL database. --As you can see, the linked service defines how to connect to a SQL database. The dataset defines what table is used as an input and output for the activity in a pipeline. --> [!IMPORTANT] -> Unless a dataset is being produced by the pipeline, it should be marked as **external**. This setting generally applies to inputs of first activity in a pipeline. --## <a name="Type"></a> Dataset type -The type of the dataset depends on the data store you use. See the following table for a list of data stores supported by Data Factory. Click a data store to learn how to create a linked service and a dataset for that data store. ---> [!NOTE] -> Data stores with * can be on-premises or on Azure infrastructure as a service (IaaS). These data stores require you to install [Data Management Gateway](data-factory-data-management-gateway.md). --In the example in the previous section, the type of the dataset is set to **AzureSqlTable**. Similarly, for an Azure Blob dataset, the type of the dataset is set to **AzureBlob**, as shown in the following JSON: --```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` --## <a name="Structure"></a>Dataset structure -The **structure** section is optional. It defines the schema of the dataset by containing a collection of names and data types of columns. You use the structure section to provide type information that is used to convert types and map columns from the source to the destination. In the following example, the dataset has three columns: `slicetimestamp`, `projectname`, and `pageviews`. They are of type String, String, and Decimal, respectively. --```json -structure: -[ - { "name": "slicetimestamp", "type": "String"}, - { "name": "projectname", "type": "String"}, - { "name": "pageviews", "type": "Decimal"} -] -``` --Each column in the structure contains the following properties: --| Property | Description | Required | -| | | | -| name |Name of the column. |Yes | -| type |Data type of the column. |No | -| culture |.NET-based culture to be used when the type is a .NET type: `Datetime` or `Datetimeoffset`. The default is `en-us`. |No | -| format |Format string to be used when the type is a .NET type: `Datetime` or `Datetimeoffset`. |No | --The following guidelines help you determine when to include structure information, and what to include in the **structure** section. 
--* **For structured data sources**, specify the structure section only if you want map source columns to sink columns, and their names are not the same. This kind of structured data source stores data schema and type information along with the data itself. Examples of structured data sources include SQL Server, Oracle, and Azure table. - - As type information is already available for structured data sources, you should not include type information when you do include the structure section. -* **For schema on read data sources (specifically Blob storage)**, you can choose to store data without storing any schema or type information with the data. For these types of data sources, include structure when you want to map source columns to sink columns. Also include structure when the dataset is an input for a copy activity, and data types of source dataset should be converted to native types for the sink. - - Data Factory supports the following values for providing type information in structure: **Int16, Int32, Int64, Single, Double, Decimal, Byte[], Boolean, String, Guid, Datetime, Datetimeoffset, and Timespan**. These values are Common Language Specification (CLS)-compliant, .NET-based type values. --Data Factory automatically performs type conversions when moving data from a source data store to a sink data store. --## Dataset availability -The **availability** section in a dataset defines the processing window (for example, hourly, daily, or weekly) for the dataset. For more information about activity windows, see [Scheduling and execution](data-factory-scheduling-and-execution.md). --The following availability section specifies that the output dataset is either produced hourly, or the input dataset is available hourly: --```json -"availability": -{ - "frequency": "Hour", - "interval": 1 -} -``` --If the pipeline has the following start and end times: --```json - "start": "2016-08-25T00:00:00Z", - "end": "2016-08-25T05:00:00Z", -``` --The output dataset is produced hourly within the pipeline start and end times. Therefore, there are five dataset slices produced by this pipeline, one for each activity window (12 AM - 1 AM, 1 AM - 2 AM, 2 AM - 3 AM, 3 AM - 4 AM, 4 AM - 5 AM). --The following table describes properties you can use in the availability section: --| Property | Description | Required | Default | -| | | | | -| frequency |Specifies the time unit for dataset slice production.<br/><br/><b>Supported frequency</b>: Minute, Hour, Day, Week, Month |Yes |NA | -| interval |Specifies a multiplier for frequency.<br/><br/>"Frequency x interval" determines how often the slice is produced. For example, if you need the dataset to be sliced on an hourly basis, you set <b>frequency</b> to <b>Hour</b>, and <b>interval</b> to <b>1</b>.<br/><br/>Note that if you specify **frequency** as **Minute**, you should set the interval to no less than 15. |Yes |NA | -| style |Specifies whether the slice should be produced at the start or end of the interval.<ul><li>StartOfInterval</li><li>EndOfInterval</li></ul>If **frequency** is set to **Month**, and **style** is set to **EndOfInterval**, the slice is produced on the last day of month. If **style** is set to **StartOfInterval**, the slice is produced on the first day of month.<br/><br/>If **frequency** is set to **Day**, and **style** is set to **EndOfInterval**, the slice is produced in the last hour of the day.<br/><br/>If **frequency** is set to **Hour**, and **style** is set to **EndOfInterval**, the slice is produced at the end of the hour. 
For example, for a slice for the 1 PM - 2 PM period, the slice is produced at 2 PM. |No |EndOfInterval | -| anchorDateTime |Defines the absolute position in time used by the scheduler to compute dataset slice boundaries. <br/><br/>Note that if this property has date parts that are more granular than the specified frequency, the more granular parts are ignored. For example, if the **interval** is **hourly** (frequency: hour and interval: 1), and the **anchorDateTime** contains **minutes and seconds**, then the minutes and seconds parts of **anchorDateTime** are ignored. |No |01/01/0001 | -| offset |Timespan by which the start and end of all dataset slices are shifted. <br/><br/>Note that if both **anchorDateTime** and **offset** are specified, the result is the combined shift. |No |NA | --### offset example -By default, daily (`"frequency": "Day", "interval": 1`) slices start at 12 AM (midnight) Coordinated Universal Time (UTC). If you want the start time to be 6 AM UTC time instead, set the offset as shown in the following snippet: --```json -"availability": -{ - "frequency": "Day", - "interval": 1, - "offset": "06:00:00" -} -``` -### anchorDateTime example -In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified by **anchorDateTime**, which is set to `2017-04-19T08:00:00` (UTC). --```json -"availability": -{ - "frequency": "Hour", - "interval": 23, - "anchorDateTime":"2017-04-19T08:00:00" -} -``` --### offset/style example -The following dataset is monthly, and is produced on the 3rd of every month at 8:00 AM (`3.08:00:00`): --```json -"availability": { - "frequency": "Month", - "interval": 1, - "offset": "3.08:00:00", - "style": "StartOfInterval" -} -``` --## <a name="Policy"></a>Dataset policy -The **policy** section in the dataset definition defines the criteria or the condition that the dataset slices must fulfill. --### Validation policies -| Policy name | Description | Applied to | Required | Default | -| | | | | | -| minimumSizeMB |Validates that the data in **Azure Blob storage** meets the minimum size requirements (in megabytes). |Azure Blob storage |No |NA | -| minimumRows |Validates that the data in an **Azure SQL database** or an **Azure table** contains the minimum number of rows. |<ul><li>Azure SQL Database</li><li>Azure table</li></ul> |No |NA | --#### Examples -**minimumSizeMB:** --```json -"policy": --{ - "validation": - { - "minimumSizeMB": 10.0 - } -} -``` --**minimumRows:** --```json -"policy": -{ - "validation": - { - "minimumRows": 100 - } -} -``` --### External datasets -External datasets are the ones that are not produced by a running pipeline in the data factory. If the dataset is marked as **external**, the **ExternalData** policy may be defined to influence the behavior of the dataset slice availability. --Unless a dataset is being produced by Data Factory, it should be marked as **external**. This setting generally applies to the inputs of first activity in a pipeline, unless activity or pipeline chaining is being used. --| Name | Description | Required | Default value | -| | | | | -| dataDelay |The time to delay the check on the availability of the external data for the given slice. For example, you can delay an hourly check by using this setting.<br/><br/>The setting only applies to the present time. For example, if it is 1:00 PM right now and this value is 10 minutes, the validation starts at 1:10 PM.<br/><br/>Note that this setting does not affect slices in the past. 
Slices with **Slice End Time** + **dataDelay** < **Now** are processed without any delay.<br/><br/>Times greater than 23:59 hours should be specified by using the `day.hours:minutes:seconds` format. For example, to specify 24 hours, don't use 24:00:00. Instead, use 1.00:00:00. If you use 24:00:00, it is treated as 24 days (24.00:00:00). For 1 day and 4 hours, specify 1:04:00:00. |No |0 | -| retryInterval |The wait time between a failure and the next attempt. This setting applies to present time. If the previous try failed, the next try is after the **retryInterval** period. <br/><br/>If it is 1:00 PM right now, we begin the first try. If the duration to complete the first validation check is 1 minute and the operation failed, the next retry is at 1:00 + 1min (duration) + 1min (retry interval) = 1:02 PM. <br/><br/>For slices in the past, there is no delay. The retry happens immediately. |No |00:01:00 (1 minute) | -| retryTimeout |The timeout for each retry attempt.<br/><br/>If this property is set to 10 minutes, the validation should be completed within 10 minutes. If it takes longer than 10 minutes to perform the validation, the retry times out.<br/><br/>If all attempts for the validation time out, the slice is marked as **TimedOut**. |No |00:10:00 (10 minutes) | -| maximumRetry |The number of times to check for the availability of the external data. The maximum allowed value is 10. |No |3 | ---## Create datasets -You can create datasets by using one of these tools or SDKs: --- Copy Wizard-- Visual Studio-- PowerShell-- Azure Resource Manager template-- REST API-- .NET API--See the following tutorials for step-by-step instructions for creating pipelines and datasets by using one of these tools or SDKs: --- [Build a pipeline with a data transformation activity](data-factory-build-your-first-pipeline.md)-- [Build a pipeline with a data movement activity](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md)--After a pipeline is created and deployed, you can manage and monitor your pipelines by using the Azure portal blades, or the Monitoring and Management app. See the following topics for step-by-step instructions: --- [Monitor and manage pipelines by using Azure portal blades](data-factory-monitor-manage-pipelines.md)-- [Monitor and manage pipelines by using the Monitoring and Management app](data-factory-monitor-manage-app.md)--## Scoped datasets -You can create datasets that are scoped to a pipeline by using the **datasets** property. These datasets can only be used by activities within this pipeline, not by activities in other pipelines. The following example defines a pipeline with two datasets (InputDataset-rdc and OutputDataset-rdc) to be used within the pipeline. --> [!IMPORTANT] -> Scoped datasets are supported only with one-time pipelines (where **pipelineMode** is set to **OneTime**). See [Onetime pipeline](data-factory-create-pipelines.md#onetime-pipeline) for details. 
-> -> --```json -{ - "name": "CopyPipeline-rdc", - "properties": { - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource", - "recursive": false - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "InputDataset-rdc" - } - ], - "outputs": [ - { - "name": "OutputDataset-rdc" - } - ], - "scheduler": { - "frequency": "Day", - "interval": 1, - "style": "StartOfInterval" - }, - "name": "CopyActivity-0" - } - ], - "start": "2016-02-28T00:00:00Z", - "end": "2016-02-28T00:00:00Z", - "isPaused": false, - "pipelineMode": "OneTime", - "expirationTime": "15.00:00:00", - "datasets": [ - { - "name": "InputDataset-rdc", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "InputLinkedService-rdc", - "typeProperties": { - "fileName": "emp.txt", - "folderPath": "adftutorial/input", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "external": true, - "policy": {} - } - }, - { - "name": "OutputDataset-rdc", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "OutputLinkedService-rdc", - "typeProperties": { - "fileName": "emp.txt", - "folderPath": "adftutorial/output", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "external": false, - "policy": {} - } - } - ] - } -} -``` --## Next steps -- For more information about pipelines, see [Create pipelines](data-factory-create-pipelines.md).-- For more information about how pipelines are scheduled and executed, see [Scheduling and execution in Azure Data Factory](data-factory-scheduling-and-execution.md). |
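The **Create datasets** section above lists the .NET API as one of the ways to create datasets. As a hedged sketch only (adapted from the patterns shown in the Data Factory .NET SDK walkthrough, not a complete program), the following shows how the **AzureBlobInput** dataset from the earlier JSON example might be created with the `Microsoft.Azure.Management.DataFactories` package. It assumes that a `DataFactoryManagementClient` named `client` and the `resourceGroupName` and `dataFactoryName` variables already exist, and that the **AzureStorageLinkedService** linked service has already been created; the `TextFormat` settings from the JSON example are omitted for brevity.

```csharp
// Hedged sketch: create the AzureBlobInput dataset from the JSON example above
// by using the Data Factory .NET SDK.
client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new DatasetCreateOrUpdateParameters()
    {
        Dataset = new Dataset()
        {
            Name = "AzureBlobInput",
            Properties = new DatasetProperties()
            {
                LinkedServiceName = "AzureStorageLinkedService",
                TypeProperties = new AzureBlobDataset()
                {
                    FolderPath = "adfgetstarted/inputdata",
                    FileName = "input.log"
                },
                // Marked as external because this input is not produced by a pipeline.
                External = true,
                // Equivalent to "frequency": "Month", "interval": 1 in the JSON definition.
                Availability = new Availability()
                {
                    Frequency = SchedulePeriod.Month,
                    Interval = 1
                }
            }
        }
    });
```

The JSON and .NET representations are equivalent; the **availability** and **policy** sections described in this article map to the `Availability` and `Policy` properties on `DatasetProperties`.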
data-factory | Data Factory Create Pipelines | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-create-pipelines.md | - Title: Create/Schedule Pipelines, Chain Activities in Data Factory -description: Learn to create a data pipeline in Azure Data Factory to move and transform data. Create a data driven workflow to produce ready to use information. ------ Previously updated : 04/12/2023----# Pipelines and Activities in Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-create-pipelines.md) -> * [Version 2 (current version)](../concepts-pipelines-activities.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Pipelines in V2](../concepts-pipelines-activities.md). --This article helps you understand pipelines and activities in Azure Data Factory and use them to construct end-to-end data-driven workflows for your data movement and data processing scenarios. --> [!NOTE] -> This article assumes that you have gone through [Introduction to Azure Data Factory](data-factory-introduction.md). If you do not have hands-on-experience with creating data factories, going through [data transformation tutorial](data-factory-build-your-first-pipeline.md) and/or [data movement tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) would help you understand this article better. ---## Overview -A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to copy data from a SQL Server database to an Azure Blob Storage. Then, use a Hive activity that runs a Hive script on an Azure HDInsight cluster to process/transform data from the blob storage to produce output data. Finally, use a second copy activity to copy the output data to an Azure Synapse Analytics on top of which business intelligence (BI) reporting solutions are built. --An activity can take zero or more input [datasets](data-factory-create-datasets.md) and produce one or more output [datasets](data-factory-create-datasets.md). The following diagram shows the relationship between pipeline, activity, and dataset in Data Factory: ---A pipeline allows you to manage activities as a set instead of each one individually. For example, you can deploy, schedule, suspend, and resume a pipeline, instead of dealing with activities in the pipeline independently. --Data Factory supports two types of activities: data movement activities and data transformation activities. Each activity can have zero or more input [datasets](data-factory-create-datasets.md) and produce one or more output datasets. --An input dataset represents the input for an activity in the pipeline and an output dataset represents the output for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents. After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an input/output dataset of a Copy Activity or an HDInsightHive Activity. For more information about datasets, see [Datasets in Azure Data Factory](data-factory-create-datasets.md) article. --### Data movement activities -Copy Activity in Data Factory copies data from a source data store to a sink data store. 
Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store. ---> [!NOTE] -> Data stores with * can be on-premises or on Azure IaaS, and require you to install [Data Management Gateway](data-factory-data-management-gateway.md) on an on-premises/Azure IaaS machine. --For more information, see [Data Movement Activities](data-factory-data-movement-activities.md) article. --### Data transformation activities --For more information, see [Data Transformation Activities](data-factory-data-transformation-activities.md) article. --### Custom .NET activities -If you need to move data to/from a data store that the Copy Activity doesn't support, or transform data using your own logic, create a **custom .NET activity**. For details on creating and using a custom activity, see [Use custom activities in an Azure Data Factory pipeline](data-factory-use-custom-activities.md). --## Schedule pipelines -A pipeline is active only between its **start** time and **end** time. It is not executed before the start time or after the end time. If the pipeline is paused, it does not get executed irrespective of its start and end time. For a pipeline to run, it should not be paused. See [Scheduling and Execution](data-factory-scheduling-and-execution.md) to understand how scheduling and execution works in Azure Data Factory. --## Pipeline JSON -Let us take a closer look on how a pipeline is defined in JSON format. The generic structure for a pipeline looks as follows: --```json -{ - "name": "PipelineName", - "properties": - { - "description" : "pipeline description", - "activities": - [ -- ], - "start": "<start date-time>", - "end": "<end date-time>", - "isPaused": true/false, - "pipelineMode": "scheduled/onetime", - "expirationTime": "15.00:00:00", - "datasets": - [ - ] - } -} -``` --| Tag | Description | Required | -| | | | -| name |Name of the pipeline. Specify a name that represents the action that the pipeline performs. <br/><ul><li>Maximum number of characters: 260</li><li>Must start with a letter number, or an underscore (\_)</li><li>Following characters are not allowed: ΓÇ£.ΓÇ¥, ΓÇ£+ΓÇ¥, ΓÇ£?ΓÇ¥, ΓÇ£/ΓÇ¥, ΓÇ£<ΓÇ¥,ΓÇ¥>ΓÇ¥,ΓÇ¥\*ΓÇ¥,ΓÇ¥%ΓÇ¥,ΓÇ¥&ΓÇ¥,ΓÇ¥:ΓÇ¥,ΓÇ¥\\ΓÇ¥</li></ul> |Yes | -| description | Specify the text describing what the pipeline is used for. |Yes | -| activities | The **activities** section can have one or more activities defined within it. See the next section for details about the activities JSON element. | Yes | -| start | Start date-time for the pipeline. Must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: `2016-10-14T16:32:41Z`. <br/><br/>It is possible to specify a local time, for example an EST time. Here is an example: `2016-02-27T06:00:00-05:00`", which is 6 AM EST.<br/><br/>The start and end properties together specify active period for the pipeline. Output slices are only produced with in this active period. |No<br/><br/>If you specify a value for the end property, you must specify value for the start property.<br/><br/>The start and end times can both be empty to create a pipeline. You must specify both values to set an active period for the pipeline to run. If you do not specify start and end times when creating a pipeline, you can set them using the Set-AzDataFactoryPipelineActivePeriod cmdlet later. | -| end | End date-time for the pipeline. If specified must be in ISO format. 
For example: `2016-10-14T17:32:41Z` <br/><br/>It is possible to specify a local time, for example an EST time. Here is an example: `2016-02-27T06:00:00-05:00`, which is 6 AM EST.<br/><br/>To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property. <br/><br/> A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it does not get executed irrespective of its start and end time. For a pipeline to run, it should not be paused. See [Scheduling and Execution](data-factory-scheduling-and-execution.md) to understand how scheduling and execution works in Azure Data Factory. |No <br/><br/>If you specify a value for the start property, you must specify value for the end property.<br/><br/>See notes for the **start** property. | -| isPaused | If set to true, the pipeline does not run. It's in the paused state. Default value = false. You can use this property to enable or disable a pipeline. |No | -| pipelineMode | The method for scheduling runs for the pipeline. Allowed values are: scheduled (default), onetime.<br/><br/>ΓÇÿScheduledΓÇÖ indicates that the pipeline runs at a specified time interval according to its active period (start and end time). ΓÇÿOnetimeΓÇÖ indicates that the pipeline runs only once. Onetime pipelines once created cannot be modified/updated currently. See [Onetime pipeline](#onetime-pipeline) for details about onetime setting. |No | -| expirationTime | Duration of time after creation for which the [one-time pipeline](#onetime-pipeline) is valid and should remain provisioned. If it does not have any active, failed, or pending runs, the pipeline is automatically deleted once it reaches the expiration time. The default value: `"expirationTime": "3.00:00:00"`|No | -| datasets |List of datasets to be used by activities defined in the pipeline. This property can be used to define datasets that are specific to this pipeline and not defined within the data factory. Datasets defined within this pipeline can only be used by this pipeline and cannot be shared. See [Scoped datasets](data-factory-create-datasets.md#scoped-datasets) for details. |No | --## Activity JSON -The **activities** section can have one or more activities defined within it. Each activity has the following top-level structure: --```json -{ - "name": "ActivityName", - "description": "description", - "type": "<ActivityType>", - "inputs": "[]", - "outputs": "[]", - "linkedServiceName": "MyLinkedService", - "typeProperties": - { -- }, - "policy": - { - }, - "scheduler": - { - } -} -``` --Following table describes properties in the activity JSON definition: --| Tag | Description | Required | -| | | | -| name | Name of the activity. Specify a name that represents the action that the activity performs. <br/><ul><li>Maximum number of characters: 260</li><li>Must start with a letter number, or an underscore (\_)</li><li>Following characters are not allowed: ΓÇ£.ΓÇ¥, ΓÇ£+ΓÇ¥, ΓÇ£?ΓÇ¥, ΓÇ£/ΓÇ¥, ΓÇ£<ΓÇ¥,ΓÇ¥>ΓÇ¥,ΓÇ¥*ΓÇ¥,ΓÇ¥%ΓÇ¥,ΓÇ¥&ΓÇ¥,ΓÇ¥:ΓÇ¥,ΓÇ¥\\ΓÇ¥</li></ul> |Yes | -| description | Text describing what the activity or is used for |Yes | -| type | Type of the activity. See the [Data Movement Activities](#data-movement-activities) and [Data Transformation Activities](#data-transformation-activities) sections for different types of activities. 
|Yes | -| inputs |Input tables used by the activity<br/><br/>`// one input table`<br/>`"inputs": [ { "name": "inputtable1" } ],`<br/><br/>`// two input tables` <br/>`"inputs": [ { "name": "inputtable1" }, { "name": "inputtable2" } ],` |Yes | -| outputs |Output tables used by the activity.<br/><br/>`// one output table`<br/>`"outputs": [ { "name": "outputtable1" } ],`<br/><br/>`//two output tables`<br/>`"outputs": [ { "name": "outputtable1" }, { "name": "outputtable2" } ],` |Yes | -| linkedServiceName |Name of the linked service used by the activity. <br/><br/>An activity may require that you specify the linked service that links to the required compute environment. |Yes for HDInsight Activity and ML Studio (classic) Batch Scoring Activity <br/><br/>No for all others | -| typeProperties |Properties in the **typeProperties** section depend on type of the activity. To see type properties for an activity, click links to the activity in the previous section. | No | -| policy |Policies that affect the run-time behavior of the activity. If it is not specified, default policies are used. |No | -| scheduler | ΓÇ£schedulerΓÇ¥ property is used to define desired scheduling for the activity. Its subproperties are the same as the ones in the [availability property in a dataset](data-factory-create-datasets.md#dataset-availability). |No | --### Policies -Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following table provides the details. --| Property | Permitted values | Default Value | Description | -| | | | | -| concurrency |Integer <br/><br/>Max value: 10 |1 |Number of concurrent executions of the activity.<br/><br/>It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. | -| executionPriorityOrder |NewestFirst<br/><br/>OldestFirst |OldestFirst |Determines the ordering of data slices that are being processed.<br/><br/>For example, if you have 2 slices (one happening at 4pm, and another one at 5pm), and both are pending execution. If you set the executionPriorityOrder to be NewestFirst, the slice at 5 PM is processed first. Similarly if you set the executionPriorityORder to be OldestFIrst, then the slice at 4 PM is processed. | -| retry |Integer<br/><br/>Max value can be 10 |0 |Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure. | -| timeout |TimeSpan |00:00:00 |Timeout for the activity. Example: 00:10:00 (implies timeout 10 mins)<br/><br/>If a value is not specified or is 0, the timeout is infinite.<br/><br/>If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut. | -| delay |TimeSpan |00:00:00 |Specify the delay before data processing of the slice starts.<br/><br/>The execution of activity for a data slice is started after the Delay is past the expected execution time.<br/><br/>Example: 00:10:00 (implies delay of 10 mins) | -| longRetry |Integer<br/><br/>Max value: 10 |1 |The number of long retry attempts before the slice execution is failed.<br/><br/>longRetry attempts are spaced by longRetryInterval. 
So if you need to specify a time between retry attempts, use longRetry. If both Retry and longRetry are specified, each longRetry attempt includes Retry attempts and the max number of attempts is Retry * longRetry.<br/><br/>For example, if we have the following settings in the activity policy:<br/>Retry: 3<br/>longRetry: 2<br/>longRetryInterval: 01:00:00<br/><br/>Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts. After each attempt, the slice status would be Retry. After first 3 attempts are over, the slice status would be LongRetry.<br/><br/>After an hour (that is, longRetryIntevalΓÇÖs value), there would be another set of 3 consecutive execution attempts. After that, the slice status would be Failed and no more retries would be attempted. Hence overall 6 attempts were made.<br/><br/>If any execution succeeds, the slice status would be Ready and no more retries are attempted.<br/><br/>longRetry may be used in situations where dependent data arrives at non-deterministic times or the overall environment is flaky under which data processing occurs. In such cases, doing retries one after another may not help and doing so after an interval of time results in the desired output.<br/><br/>Word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues. | -| longRetryInterval |TimeSpan |00:00:00 |The delay between long retry attempts | --## Sample copy pipeline -In the following sample pipeline, there is one activity of type **Copy** in the **activities** section. In this sample, the [copy activity](data-factory-data-movement-activities.md) copies data from an Azure Blob storage to Azure SQL Database. --```json -{ - "name": "CopyPipeline", - "properties": { - "description": "Copy data from a blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "type": "Copy", - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "writeBatchSize": 10000, - "writeBatchTimeout": "60:00:00" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2016-07-12T00:00:00Z", - "end": "2016-07-13T00:00:00Z" - } -} -``` --Note the following points: --* In the activities section, there is only one activity whose **type** is set to **Copy**. -* Input for the activity is set to **InputDataset** and output for the activity is set to **OutputDataset**. See [Datasets](data-factory-create-datasets.md) article for defining datasets in JSON. -* In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. In the [Data movement activities](#data-movement-activities) section, click the data store that you want to use as a source or a sink to learn more about moving data to/from that data store. --For a complete walkthrough of creating this pipeline, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## Sample transformation pipeline -In the following sample pipeline, there is one activity of type **HDInsightHive** in the **activities** section. 
In this sample, the [HDInsight Hive activity](data-factory-hive-activity.md) transforms data from an Azure Blob storage by running a Hive script file on an Azure HDInsight Hadoop cluster. --```json -{ - "name": "TransformPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "AzureStorageLinkedService", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - } - ], - "start": "2016-04-01T00:00:00Z", - "end": "2016-04-02T00:00:00Z", - "isPaused": false - } -} -``` --Note the following points: --* In the activities section, there is only one activity whose **type** is set to **HDInsightHive**. -* The Hive script file, **partitionweblogs.hql**, is stored in the Azure storage account (specified by the scriptLinkedService, called **AzureStorageLinkedService**), and in **script** folder in the container **adfgetstarted**. -* The `defines` section is used to specify the runtime settings that are passed to the hive script as Hive configuration values (e.g `${hiveconf:inputtable}`, `${hiveconf:partitionedtable}`). --The **typeProperties** section is different for each transformation activity. To learn about type properties supported for a transformation activity, click the transformation activity in the [Data transformation activities](#data-transformation-activities) table. --For a complete walkthrough of creating this pipeline, see [Tutorial: Build your first pipeline to process data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Multiple activities in a pipeline -The previous two sample pipelines have only one activity in them. You can have more than one activity in a pipeline. --If you have multiple activities in a pipeline and output of an activity is not an input of another activity, the activities may run in parallel if input data slices for the activities are ready. --You can chain two activities by having the output dataset of one activity as the input dataset of the other activity. The second activity executes only when the first one completes successfully. ---In this sample, the pipeline has two activities: Activity1 and Activity2. The Activity1 takes Dataset1 as an input and produces an output Dataset2. The Activity takes Dataset2 as an input and produces an output Dataset3. Since the output of Activity1 (Dataset2) is the input of Activity2, the Activity2 runs only after the Activity completes successfully and produces the Dataset2 slice. If the Activity1 fails for some reason and does not produce the Dataset2 slice, the Activity 2 does not run for that slice (for example: 9 AM to 10 AM). --You can also chain activities that are in different pipelines. ---In this sample, Pipeline1 has only one activity that takes Dataset1 as an input and produces Dataset2 as an output. The Pipeline2 also has only one activity that takes Dataset2 as an input and Dataset3 as an output. 
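In both cases, the chaining is expressed the same way: the shared dataset (Dataset2) appears in the **outputs** of the upstream activity and in the **inputs** of the downstream activity. As a hedged sketch only (it uses the .NET SDK classes from the Data Factory .NET walkthrough rather than this article's JSON, shows both activities as copy activities purely for illustration, and assumes that `client`, `resourceGroupName`, `dataFactoryName`, and the three datasets already exist), the single-pipeline case with Activity1 and Activity2 might look like this:

```csharp
// Hedged sketch: two chained copy activities in one pipeline. Activity2 runs for a
// slice only after Activity1 completes successfully and produces the Dataset2 slice.
client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName,
    new PipelineCreateOrUpdateParameters()
    {
        Pipeline = new Pipeline()
        {
            Name = "ChainedActivitiesPipeline",
            Properties = new PipelineProperties()
            {
                Description = "Activity1 produces Dataset2, which Activity2 consumes",
                Start = new DateTime(2016, 7, 12, 0, 0, 0, DateTimeKind.Utc),
                End = new DateTime(2016, 7, 13, 0, 0, 0, DateTimeKind.Utc),
                Activities = new List<Activity>()
                {
                    new Activity()
                    {
                        Name = "Activity1",
                        Inputs = new List<ActivityInput>() { new ActivityInput() { Name = "Dataset1" } },
                        Outputs = new List<ActivityOutput>() { new ActivityOutput() { Name = "Dataset2" } },
                        TypeProperties = new CopyActivity() { Source = new BlobSource(), Sink = new BlobSink() }
                    },
                    new Activity()
                    {
                        Name = "Activity2",
                        // Dataset2 is Activity1's output and Activity2's input; this is the chain.
                        Inputs = new List<ActivityInput>() { new ActivityInput() { Name = "Dataset2" } },
                        Outputs = new List<ActivityOutput>() { new ActivityOutput() { Name = "Dataset3" } },
                        TypeProperties = new CopyActivity() { Source = new BlobSource(), Sink = new BlobSink() }
                    }
                }
            }
        }
    });
```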
--For more information, see [scheduling and execution](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline). -## Create and monitor pipelines -You can create pipelines by using one of these tools or SDKs. --- Copy Wizard-- Visual Studio-- Azure PowerShell-- Azure Resource Manager template-- REST API-- .NET API--See the following tutorials for step-by-step instructions for creating pipelines by using one of these tools or SDKs. --- [Build a pipeline with a data transformation activity](data-factory-build-your-first-pipeline.md)-- [Build a pipeline with a data movement activity](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md)--Once a pipeline is created/deployed, you can manage and monitor your pipelines by using the Azure portal blades or Monitor and Manage App. See the following topics for step-by-step instructions. --- [Monitor and manage pipelines by using Azure portal blades](data-factory-monitor-manage-pipelines.md).-- [Monitor and manage pipelines by using Monitor and Manage App](data-factory-monitor-manage-app.md)--## Onetime pipeline -You can create and schedule a pipeline to run periodically (for example: hourly or daily) within the start and end times you specify in the pipeline definition. See Scheduling activities for details. You can also create a pipeline that runs only once. To do so, you set the **pipelineMode** property in the pipeline definition to **onetime** as shown in the following JSON sample. The default value for this property is **scheduled**. --```json -{ - "name": "CopyPipeline", - "properties": { - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource", - "recursive": false - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "name": "CopyActivity-0" - } - ], - "pipelineMode": "OneTime" - } -} -``` --Note the following: --* **Start** and **end** times for the pipeline are not specified. -* **Availability** of input and output datasets is specified (**frequency** and **interval**), even though Data Factory does not use the values. -* Diagram view does not show one-time pipelines. This behavior is by design. -* One-time pipelines cannot be updated. You can clone a one-time pipeline, rename it, update properties, and deploy it to create another one. --## Next steps -- For more information about datasets, see [Create datasets](data-factory-create-datasets.md) article.-- For more information about how pipelines are scheduled and executed, see [Scheduling and execution in Azure Data Factory](data-factory-scheduling-and-execution.md) article. |
data-factory | Data Factory Customer Case Studies | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-customer-case-studies.md | - Title: Azure Data Factory - Customer case studies -description: Learn about how some of our customers have been using Azure Data Factory. ------ Previously updated : 04/12/2023---# Azure Data Factory - Customer case studies -Data Factory is a cloud-based information management service that automates the movement and transformation of data. Customers across many industries use Data Factory and other Azure services to build their analytics pipelines and solve their business problems. Learn directly from our customers how and why they are using Data Factory. --## Milliman -[Top Actuarial firm transforms the insurance industry](https://customers.microsoft.com/en-us/story/732463-top-actuarial-firm-transforms-the-insurance-industry-w) --## Rockwell Automation -[Industrial Automation Firm Cuts Costs up to 90 Percent with big data Solutions](https://customers.microsoft.com/en-us/story/726192-industrial-automation-firm-cuts-costs-up-to-90-percent) --## Microsoft Studio -[Delivering epic Xbox experiences by analyzing hundreds of billions of game events each day](https://customers.microsoft.com/en-us/story/xboxunlimited) |
data-factory | Data Factory Customer Profiling Usecase | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-customer-profiling-usecase.md | - Title: Use Case - Customer Profiling -description: Learn how Azure Data Factory is used to create a data-driven workflow (pipeline) to profile gaming customers. ------ Previously updated : 04/12/2023---# Use Case - Customer Profiling -Azure Data Factory is one of many services used to implement the Cortana Intelligence Suite of solution accelerators. For more information about Cortana Intelligence, visit [Cortana Intelligence Suite](https://www.microsoft.com/cortanaanalytics). In this document, we describe a simple use case to help you get started with understanding how Azure Data Factory can solve common analytics problems. --## Scenario -Contoso is a gaming company that creates games for multiple platforms: game consoles, handheld devices, and personal computers (PCs). As players play these games, a large volume of log data is produced that tracks the usage patterns, gaming style, and preferences of the user. When combined with demographic, regional, and product data, Contoso can perform analytics to guide decisions about how to enhance players' experience and target them for upgrades and in-game purchases. --Contoso's goal is to identify up-sell/cross-sell opportunities based on the gaming history of its players and add compelling features to drive business growth and provide a better experience to customers. For this use case, we use a gaming company as an example of a business. The company wants to optimize its games based on players' behavior. These principles apply to any business that wants to engage its customers around its goods and services and enhance their customers' experience. --In this solution, Contoso wants to evaluate the effectiveness of a marketing campaign it has recently launched. We start with the raw gaming logs, process and enrich them with geolocation data, join them with advertising reference data, and finally copy them into an Azure SQL Database to analyze the campaign's impact. --## Deploy Solution -All you need to access and try out this simple use case is an [Azure subscription](https://azure.microsoft.com/pricing/free-trial/), an [Azure Blob storage account](../../storage/common/storage-account-create.md), and an [Azure SQL Database](/azure/azure-sql/database/single-database-create-quickstart). You deploy the customer profiling pipeline from the **Sample pipelines** tile on the home page of your data factory. --1. Create a data factory or open an existing data factory. See [Copy data from Blob Storage to SQL Database using Data Factory](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for steps to create a data factory. -2. In the **DATA FACTORY** blade for the data factory, click the **Sample pipelines** tile. -- :::image type="content" source="./media/data-factory-samples/SamplePipelinesTile.png" alt-text="Sample pipelines tile"::: -3. In the **Sample pipelines** blade, click the **Customer profiling** sample that you want to deploy. -- :::image type="content" source="./media/data-factory-samples/SampleTile.png" alt-text="Sample pipelines blade"::: -4. Specify configuration settings for the sample. For example, your Azure storage account name and key, logical SQL server name, database, User ID, and password. -- :::image type="content" source="./media/data-factory-samples/SampleBlade.png" alt-text="Sample blade"::: -5.
After you finish specifying the configuration settings, click **Create** to create/deploy the sample pipelines and linked services/tables used by the pipelines. -6. You see the status of deployment on the sample tile you clicked earlier on the **Sample pipelines** blade. -- :::image type="content" source="./media/data-factory-samples/DeploymentStatus.png" alt-text="Deployment status"::: -7. When you see the **Deployment succeeded** message on the tile for the sample, close the **Sample pipelines** blade. -8. On the **DATA FACTORY** blade, you see that linked services, data sets, and pipelines are added to your data factory. -- :::image type="content" source="./media/data-factory-samples/DataFactoryBladeAfter.png" alt-text="Data Factory blade"::: --## Solution Overview -This simple use case can be used as an example of how you can use Azure Data Factory to ingest, prepare, transform, analyze, and publish data. ---This figure depicts how the data pipelines appear in the Azure portal after they have been deployed. --1. The **PartitionGameLogsPipeline** reads the raw game events from blob storage and creates partitions based on year, month, and day. -2. The **EnrichGameLogsPipeline** joins partitioned game events with geo code reference data and enriches the data by mapping IP addresses to the corresponding geo-locations. -3. The **AnalyzeMarketingCampaignPipeline** uses the enriched data and processes it with the advertising data to create the final output that contains the marketing campaign effectiveness. --In this example, Data Factory is used to orchestrate activities that copy the input data, transform and process it, and output the final data to an Azure SQL Database. You can also visualize the network of data pipelines, manage them, and monitor their status from the UI. --## Benefits -By optimizing its user profile analytics and aligning them with business goals, the gaming company can quickly collect usage patterns and analyze the effectiveness of its marketing campaigns. |
data-factory | Data Factory Data Management Gateway High Availability Scalability | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-management-gateway-high-availability-scalability.md | - Title: High availability with data management gateway in Azure Data Factory -description: This article explains how you can scale out a data management gateway by adding more nodes and scale up by increasing the number of concurrent jobs that can run on a node. ---- Previously updated : 04/12/2023----# Data Management Gateway - high availability and scalability (Preview) -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [self-hosted integration runtime](../create-self-hosted-integration-runtime.md). ---This article helps you configure a high availability and scalability solution with Data Management Gateway (Integration Runtime). --> [!NOTE] -> This article assumes that you are already familiar with the basics of Integration Runtime (earlier Data Management Gateway). If you are not, see [Data Management Gateway](data-factory-data-management-gateway.md). -> -> **This preview feature is officially supported on Data Management Gateway version 2.12.xxxx.x and above**. Please make sure you are using version 2.12.xxxx.x or above. Download the latest version of Data Management Gateway [here](https://www.microsoft.com/download/details.aspx?id=39717). --## Overview -You can associate data management gateways that are installed on multiple on-premises machines with a single logical gateway from the portal. These machines are called **nodes**. You can have up to **four nodes** associated with a logical gateway. The benefits of having multiple nodes (on-premises machines with the gateway installed) for a logical gateway are: --- Improved performance of data movement between on-premises and cloud data stores. -- If one of the nodes goes down for some reason, the other nodes are still available for moving the data. -- If one of the nodes needs to be taken offline for maintenance, the other nodes are still available for moving the data.--You can also configure the number of **concurrent data movement jobs** that can run on a node to scale up the capability of moving data between on-premises and cloud data stores. --Using the Azure portal, you can monitor the status of these nodes, which helps you decide whether to add or remove a node from the logical gateway. --## Architecture -The following diagram provides an architecture overview of the scalability and availability feature of the Data Management Gateway: ---A **logical gateway** is the gateway you add to a data factory in the Azure portal. Earlier, you could associate only one on-premises Windows machine (with Data Management Gateway installed) with a logical gateway. This on-premises gateway machine is called a node. Now, you can associate up to **four physical nodes** with a logical gateway. A logical gateway with multiple nodes is called a **multi-node gateway**. --All these nodes are **active**. They all can process data movement jobs to move data between on-premises and cloud data stores. One of the nodes acts as both dispatcher and worker. The other nodes in the group are worker nodes. A **dispatcher** node pulls data movement tasks/jobs from the cloud service and dispatches them to worker nodes (including itself). A **worker** node executes data movement jobs to move data between on-premises and cloud data stores. All nodes are workers.
Only one node can be both dispatch and worker. --You may typically start with one node and **scale out** to add more nodes as the existing node(s) are overwhelmed with the data movement load. You can also **scale up** the data movement capability of a gateway node by increasing the number of concurrent jobs that are allowed to run on the node. This capability is also available with a single-node gateway (even when the scalability and availability feature is not enabled). --A gateway with multiple nodes keeps the data store credentials in sync across all nodes. If there is a node-to-node connectivity issue, the credentials may be out of sync. When you set credentials for an on-premises data store that uses a gateway, it saves credentials on the dispatcher/worker node. The dispatcher node syncs with other worker nodes. This process is known as **credentials sync**. The communication channel between nodes can be **encrypted** by a public SSL/TLS certificate. --## Set up a multi-node gateway -This section assumes that you have gone through the following two articles or familiar with concepts in these articles: --- [Data Management Gateway](data-factory-data-management-gateway.md) - provides a detailed overview of the gateway.-- [Move data between on-premises and cloud data stores](data-factory-move-data-between-onprem-and-cloud.md) - contains a walkthrough with step-by-step instructions for using a gateway with a single node. --> [!NOTE] -> Before you install a data management gateway on an on-premises Windows machine, see prerequisites listed in [the main article](data-factory-data-management-gateway.md#prerequisites). --1. In the [walkthrough](data-factory-move-data-between-onprem-and-cloud.md#create-gateway), while creating a logical gateway, enable the **High Availability & Scalability** feature. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-enable-high-availability-scalability.png" alt-text="Data Management Gateway - enable high availability and scalability"::: -2. In the **Configure** page, use either **Express Setup** or **Manual Setup** link to install a gateway on the first node (an on-premises Windows machine). -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-express-manual-setup.png" alt-text="Data Management Gateway - express or manual setup"::: -- > [!NOTE] - > If you use the express setup option, the node-to-node communication is done without encryption. The node name is same as the machine name. Use manual setup if the node-node communication needs to be encrypted or you want to specify a node name of your choice. Node names cannot be edited later. -3. If you choose **express setup** - 1. You see the following message after the gateway is successfully installed: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/express-setup-success.png" alt-text="Data Management Gateway - express setup success"::: - 2. Launch Data Management Configuration Manager for the gateway by following [these instructions](data-factory-data-management-gateway.md#configuration-manager). You see the gateway name, node name, status, etc. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-installation-success.png" alt-text="Screenshot that shows where you can see the gateway name, node name, and status."::: -4. 
If you choose **manual setup**: - 1. Download the installation package from the Microsoft Download Center, run it to install gateway on your machine. - 2. Use the **authentication key** from the **Configure** page to register the gateway. - - :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-authentication-key.png" alt-text="Screenshot that shows where to use the authentication key."::: - 3. In the **New gateway node** page, you can provide a custom **name** to the gateway node. By default, node name is same as the machine name. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-name.png" alt-text="Data Management Gateway - specify name"::: - 4. In the next page, you can choose whether to **enable encryption for node-to-node communication**. Click **Skip** to disable encryption (default). -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-node-encryption.png" alt-text="Data Management Gateway - enable encryption"::: - - > [!NOTE] - > Changing of encryption mode is only supported when you have a single gateway node in the logical gateway. To change the encryption mode when a gateway has multiple nodes, do the following steps: delete all the nodes except one node, change the encryption mode, and then add the nodes again. - > - > See [TLS/SSL certificate requirements](#tlsssl-certificate-requirements) section for a list of requirements for using an TLS/SSL certificate. - 5. After the gateway is successfully installed, click Launch Configuration - - :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/manual-setup-launch-configuration-manager.png" alt-text="Manual setup - launch configuration manager"::: - 6. you see Data Management Gateway Configuration Manager on the node (on-premises Windows machine), which shows connectivity status, **gateway name**, and **node name**. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-installation-success.png" alt-text="Data Management Gateway - installation successful"::: -- > [!NOTE] - > If you are provisioning the gateway on an Azure VM, you can use [this Azure Resource Manager template](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/mutiple-vms-with-data-management-gateway). This script creates a logical gateway, sets up VMs with Data Management Gateway software installed, and registers them with the logical gateway. -6. In Azure portal, launch the **Gateway** page: - 1. On the data factory home page in the portal, click **Linked Services**. - - :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-home-page.png" alt-text="Screenshot that highlights the Linked services tile."::: - 2. select the **gateway** to see the **Gateway** page: - - :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/linked-services-gateway.png" alt-text="Data factory home page"::: - 4. You see the **Gateway** page: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/gateway-first-node-portal-view.png" alt-text="Gateway with single node view"::: -7. 
Click **Add Node** on the toolbar to add a node to the logical gateway. If you are planning to use express setup, do this step from the on-premises machine that will be added as a node to the gateway. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-add-node-menu.png" alt-text="Data Management Gateway - add node menu"::: -8. Steps are similar to setting up the first node. The Configuration Manager UI lets you set the node name if you choose the manual installation option: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/install-second-gateway.png" alt-text="Configuration Manager - install second gateway"::: -9. After the gateway is installed successfully on the node, the Configuration Manager tool displays the following screen: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/second-gateway-installation-successful.png" alt-text="Configuration Manager - install second gateway successful"::: -10. If you open the **Gateway** page in the portal, you see two gateway nodes now: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-multi-node-monitoring.png" alt-text="Gateway with two nodes in the portal"::: -11. To delete a gateway node, click **Delete Node** on the toolbar, select the node you want to delete, and then click **Delete** from the toolbar. This action deletes the selected node from the group. Note that this action does not uninstall the data management gateway software from the node (on-premises Windows machine). Use **Add or remove programs** in Control Panel on the on-premises machine to uninstall the gateway. When you uninstall the gateway from the node, it's automatically deleted in the portal. --## Upgrade an existing gateway -You can upgrade an existing gateway to use the high availability and scalability feature. This feature works only with nodes that have the data management gateway of version >= 2.12.xxxx. You can see the version of the data management gateway installed on a machine in the **Help** tab of the Data Management Gateway Configuration Manager. --1. Update the gateway on the on-premises machine to the latest version by downloading and running an MSI setup package from the [Microsoft Download Center](https://www.microsoft.com/download/details.aspx?id=39717). See the [installation](data-factory-data-management-gateway.md#installation) section for details. -2. Navigate to the Azure portal. Launch the **Data Factory page** for your data factory. Click the **Linked services** tile to launch the **Linked services** page. Select the gateway to launch the **Gateway** page. Click and enable **Preview Feature** as shown in the following image: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-existing-gateway-enable-high-availability.png" alt-text="Data Management Gateway - enable preview feature"::: -2. Once the preview feature is enabled in the portal, close all pages. Reopen the **gateway page** to see the new preview user interface (UI).
- - :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-preview-success.png" alt-text="Data Management Gateway - enable preview feature success"::: -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-preview.png" alt-text="Data Management Gateway - preview UI"::: -- > [!NOTE] - > During the upgrade, name of the first node is the name of the machine. -3. Now, add a node. In the **Gateway** page, click **Add Node**. -- :::image type="content" source="media/data-factory-data-management-gateway-high-availability-scalability/data-factory-gateway-add-node-menu.png" alt-text="Data Management Gateway - add node menu"::: -- Follow instructions from the previous section to set up the node. --### Installation best practices --- Configure power plan on the host machine for the gateway so that the machine does not hibernate. If the host machine hibernates, the gateway does not respond to data requests.-- Back up the certificate associated with the gateway.-- Ensure all nodes are of similar configuration (recommended) for ideal performance. -- Add at least two nodes to ensure high availability. --### TLS/SSL certificate requirements -Here are the requirements for the TLS/SSL certificate that is used for securing communications between integration runtime nodes: --- The certificate must be a publicly trusted X509 v3 certificate. We recommend that you use certificates that are issued by a public (third-party) certification authority (CA).-- Each integration runtime node must trust this certificate, as well as the client machine that is running the credential manager application. - > [!NOTE] - > Credential manager application is used while securely setting credential from Copy Wizard/ Azure Portal. And this can be fired from any machine within the same network as the on-premises/ private data store. -- Wild card certificates are supported. If your FQDN name is **node1.domain.contoso.com**, you can use ***.domain.contoso.com** as subject name of the certificate.-- SAN certificates are not recommended since only the last item of the Subject Alternative Names will be used and all others will be ignored due to current limitation. E.g. you have a SAN certificate whose SAN are **node1.domain.contoso.com** and **node2.domain.contoso.com**, you can only use this cert on machine whose FQDN is **node2.domain.contoso.com**.-- Supports any key size supported by Windows Server 2012 R2 for TLS/SSL certificates.-- Certificate using CNG keys are not supported.--#### FAQ: When would I not enable this encryption? -Enabling encryption can add certain cost to your infrastructure (owning public certificate) hence you may skip enabling encryption in the below cases: -- When the integration runtime is running on a trusted network, or a network with transparent encryption like IP/SEC. Since this channel communication is only limited within your trusted network, you may not need additional encryption.-- When the integration runtime is not running in a production environment. This can help in reducing TLS/SSL certificate cost.---## Monitor a multi-node gateway -### Multi-node gateway monitoring -In the Azure portal, you can view near-real time snapshot of resource utilization (CPU, memory, network(in/out), etc.) on each node along with statuses of gateway nodes. 
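If you prefer to check the same information from a script, the logical gateway can also be retrieved with the Data Factory version 1 PowerShell cmdlets. The following is only a sketch with placeholder resource group, data factory, and gateway names; it assumes the Az.DataFactory module is installed and that you have already signed in with Connect-AzAccount.

```powershell
# Sketch: retrieve the logical gateway (placeholder names) and inspect its properties,
# such as status and version, from the command line instead of the portal.
Get-AzDataFactoryGateway -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "MyLogicalGateway"
```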
---You can enable **Advanced Settings** in the **Gateway** page to see advanced metrics like **Network**(in/out), **Role & Credential Status**, which is helpful in debugging gateway issues, and **Concurrent Jobs** (Running/ Limit) which can be modified/ changed accordingly during performance tuning. The following table provides descriptions of columns in the **Gateway Nodes** list: --Monitoring Property | Description -: | :- -Name | Name of the logical gateway and nodes associated with the gateway. -Status | Status of the logical gateway and the gateway nodes. Example: Online/Offline/Limited/etc. For information about these statuses, See [Gateway status](#gateway-status) section. -Version | Shows the version of the logical gateway and each gateway node. The version of the logical gateway is determined based on version of majority of nodes in the group. If there are nodes with different versions in the logical gateway setup, only the nodes with the same version number as the logical gateway function properly. Others are in the limited mode and need to be manually updated (only in case auto-update fails). -Available memory | Available memory on a gateway node. This value is a near real-time snapshot. -CPU utilization | CPU utilization of a gateway node. This value is a near real-time snapshot. -Networking (In/Out) | Network utilization of a gateway node. This value is a near real-time snapshot. -Concurrent Jobs (Running/ Limit) | Number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit signifies the maximum concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios, where CPU/ memory/ network is under-utilized, but activities are timing out. This capability is also available with a single-node gateway (even when the scalability and availability feature is not enabled). For more information, see [scale considerations](#scale-considerations) section. -Role | There are two types of roles - Dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from cloud services and dispatch them to different worker nodes (including itself). ---### Gateway status --The following table provides possible statuses of a **gateway node**: --Status | Comments/Scenarios -:- | : -Online | Node connected to Data Factory service. -Offline | Node is offline. -Upgrading | The node is being auto-updated. -Limited | Due to Connectivity issue. May be due to HTTP port 8050 issue, service bus connectivity issue, or credential sync issue. -Inactive | Node is in a configuration different from the configuration of other majority nodes.<br/><br/> A node can be inactive when it cannot connect to other nodes. ---The following table provides possible statuses of a **logical gateway**. The gateway status depends on statuses of the gateway nodes. --Status | Comments -:-- | :- -Needs Registration | No node is yet registered to this logical gateway -Online | Gateway Nodes are online -Offline | No node in online status. -Limited | Not all nodes in this gateway are in healthy state. This status is a warning that some node might be down! <br/><br/>Could be due to credential sync issue on dispatcher/worker node. --### Pipeline/ activities monitoring -The Azure portal provides a pipeline monitoring experience with granular node level details. 
For example, it shows which activities ran on which node. This information can be helpful in understanding performance issues on a particular node, say due to network throttling. ----## Scale considerations --### Scale out -When the **available memory is low** and the **CPU usage is high**, adding a new node helps scale out the load across machines. If activities are failing due to time-out or gateway node being offline, it helps if you add a node to the gateway. - -### Scale up -When the available memory and CPU are not utilized well, but the idle capacity is 0, you should scale up by increasing the number of concurrent jobs that can run on a node. You may also want to scale up when activities are timing out because the gateway is overloaded. As shown in the following image, you can increase the maximum capacity for a node. We suggest doubling it to start with. ----## Known issues/breaking changes --- Currently, you can have up to four physical gateway nodes for a single logical gateway. If you need more than four nodes for performance reasons, send an email to [DMGHelp@microsoft.com](mailto:DMGHelp@microsoft.com).-- You cannot re-register a gateway node with the authentication key from another logical gateway to switch from the current logical gateway. To re-register, uninstall the gateway from the node, reinstall the gateway, and register it with the authentication key for the other logical gateway. -- If HTTP proxy is required for all your gateway nodes, set the proxy in diahost.exe.config and diawp.exe.config, and use the server manager to make sure all nodes have the same diahost.exe.config and diawip.exe.config. See [configure proxy settings](data-factory-data-management-gateway.md#configure-proxy-server-settings) section for details. -- To change encryption mode for node-to-node communication in Gateway Configuration Manager, delete all the nodes in the portal except one. Then, add nodes back after changing the encryption mode.-- Use an official TLS certificate if you choose to encrypt the node-to-node communication channel. Self-signed certificate may cause connectivity issues as the same certificate may not be trusted in certifying authority list on other machines. -- You cannot register a gateway node to a logical gateway when the node version is lower than the logical gateway version. Delete all nodes of the logical gateway from portal so that you can register a lower version node(downgrade) it. If you delete all nodes of a logical gateway, manually install and register new nodes to that logical gateway. Express setup is not supported in this case.-- You cannot use express setup to install nodes to an existing logical gateway, which is still using cloud credentials. You can check where the credentials are stored from the Gateway Configuration Manager on the Settings tab.-- You cannot use express setup to install nodes to an existing logical gateway, which has node-to-node encryption enabled. As setting the encryption mode involves manually adding certificates, express install is no more an option. -- For a file copy from on-premises environment, you should not use \\localhost or C:\files anymore since localhost or local drive might not be accessible via all nodes. Instead, use \\ServerName\files to specify files' location.---## Rolling back from the preview -To roll back from the preview, delete all nodes but one node. It doesn't matter which nodes you delete, but ensure you have at least one node in the logical gateway. 
You can delete a node either by uninstalling the gateway on the machine or by using the Azure portal. In the Azure portal, in the **Data Factory** page, click **Linked services** to launch the **Linked services** page. Select the gateway to launch the **Gateway** page. In the **Gateway** page, you can see the nodes associated with the gateway. The page lets you delete a node from the gateway. - After deleting, click **preview features** in the same Azure portal page, and disable the preview feature. This resets your gateway to a single-node GA (general availability) gateway. ---## Next steps -Review the following articles: -- [Data Management Gateway](data-factory-data-management-gateway.md) - provides a detailed overview of the gateway.-- [Move data between on-premises and cloud data stores](data-factory-move-data-between-onprem-and-cloud.md) - contains a walkthrough with step-by-step instructions for using a gateway with a single node. |
data-factory | Data Factory Data Management Gateway | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-management-gateway.md | - Title: Data Management Gateway for Data Factory -description: Use Data Management Gateway in Azure Data Factory to move your data. ---- Previously updated : 04/12/2023-----# Data Management Gateway -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [self-hosted integration runtime in](../create-self-hosted-integration-runtime.md). --> [!NOTE] -> Data Management Gateway has now been rebranded as Self-hosted Integration Runtime. --The Data management gateway is a client agent that you must install in your on-premises environment to copy data between cloud and on-premises data stores. The on-premises data stores supported by Data Factory are listed in the [Supported data sources](data-factory-data-movement-activities.md#supported-data-stores-and-formats) section. --This article complements the walkthrough in the [Move data between on-premises and cloud data stores](data-factory-move-data-between-onprem-and-cloud.md) article. In the walkthrough, you create a pipeline that uses the gateway to move data from a SQL Server database to an Azure blob. This article provides detailed in-depth information about the data management gateway. --You can scale out a data management gateway by associating multiple on-premises machines with the gateway. You can scale up by increasing number of data movement jobs that can run concurrently on a node. This feature is also available for a logical gateway with a single node. See [Scaling data management gateway in Azure Data Factory](data-factory-data-management-gateway-high-availability-scalability.md) article for details. --> [!NOTE] -> Currently, gateway supports only the copy activity and stored procedure activity in Data Factory. It is not possible to use the gateway from a custom activity to access on-premises data sources. ---## Overview -### Capabilities of data management gateway -Data management gateway provides the following capabilities: --* Model on-premises data sources and cloud data sources within the same data factory and move data. -* Have a single pane of glass for monitoring and management with visibility into gateway status from the Data Factory page. -* Manage access to on-premises data sources securely. - * No changes required to corporate firewall. Gateway only makes outbound HTTP-based connections to open internet. - * Encrypt credentials for your on-premises data stores with your certificate. -* Move data efficiently - data is transferred in parallel, resilient to intermittent network issues with auto retry logic. --### Command flow and data flow -When you use a copy activity to copy data between on-premises and cloud, the activity uses a gateway to transfer data from on-premises data source to cloud and vice versa. --Here is the high-level data flow for and summary of steps for copy with data gateway: --1. Data developer creates a gateway for an Azure Data Factory using either the [Azure portal](https://portal.azure.com) or [PowerShell Cmdlet](/powershell/module/az.datafactory/). -2. Data developer creates a linked service for an on-premises data store by specifying the gateway. As part of setting up the linked service, data developer uses the Setting Credentials application to specify authentication types and credentials. 
The Setting Credentials application dialog communicates with the data store to test connection and the gateway to save credentials. -3. Gateway encrypts the credentials with the certificate associated with the gateway (supplied by data developer), before saving the credentials in the cloud. -4. Data Factory service communicates with the gateway for scheduling & management of jobs via a control channel that uses a shared Azure service bus queue. When a copy activity job needs to be kicked off, Data Factory queues the request along with credential information. Gateway kicks off the job after polling the queue. -5. The gateway decrypts the credentials with the same certificate and then connects to the on-premises data store with proper authentication type and credentials. -6. The gateway copies data from an on-premises store to a cloud storage, or vice versa depending on how the Copy Activity is configured in the data pipeline. For this step, the gateway directly communicates with cloud-based storage services such as Azure Blob Storage over a secure (HTTPS) channel. --### Considerations for using gateway -* A single instance of data management gateway can be used for multiple on-premises data sources. However, **a single gateway instance is tied to only one Azure data factory** and cannot be shared with another data factory. -* You can have **only one instance of data management gateway** installed on a single machine. Suppose, you have two data factories that need to access on-premises data sources, you need to install gateways on two on-premises computers. In other words, a gateway is tied to a specific data factory -* The **gateway does not need to be on the same machine as the data source**. However, having gateway closer to the data source reduces the time for the gateway to connect to the data source. We recommend that you install the gateway on a machine that is different from the one that hosts on-premises data source. When the gateway and data source are on different machines, the gateway does not compete for resources with data source. -* You can have **multiple gateways on different machines connecting to the same on-premises data source**. For example, you may have two gateways serving two data factories but the same on-premises data source is registered with both the data factories. -* If you already have a gateway installed on your computer serving a **Power BI** scenario, install a **separate gateway for Azure Data Factory** on another machine. -* Gateway must be used even when you use **ExpressRoute**. -* Treat your data source as an on-premises data source (that is behind a firewall) even when you use **ExpressRoute**. Use the gateway to establish connectivity between the service and the data source. -* You must **use the gateway** even if the data store is in the cloud on an **Azure IaaS VM**. --## Installation -### Prerequisites -* The supported **Operating System** versions are Windows 7, Windows 8/8.1, Windows 10, Windows Server 2008 R2, Windows Server 2012, Windows Server 2012 R2. Installation of the data management gateway on a domain controller is currently not supported. -* .NET Framework 4.5.1 or above is required. If you are installing gateway on a Windows 7 machine, install .NET Framework 4.5 or later. See [.NET Framework System Requirements](/dotnet/framework/get-started/system-requirements) for details. -* The recommended **configuration** for the gateway machine is at least 2 GHz, 4 cores, 8-GB RAM, and 80-GB disk. 
-* If the host machine hibernates, the gateway does not respond to data requests. Therefore, configure an appropriate **power plan** on the computer before installing the gateway. If the machine is configured to hibernate, the gateway installation prompts a message. -* You must be an administrator on the machine to install and configure the data management gateway successfully. You can add additional users to the **data management gateway Users** local Windows group. The members of this group are able to use the **Data Management Gateway Configuration Manager** tool to configure the gateway. --As copy activity runs happen on a specific frequency, the resource usage (CPU, memory) on the machine also follows the same pattern with peak and idle times. Resource utilization also depends heavily on the amount of data being moved. When multiple copy jobs are in progress, you see resource usage go up during peak times. --### Installation options -Data management gateway can be installed in the following ways: --* By downloading an MSI setup package from the [Microsoft Download Center](https://www.microsoft.com/download/details.aspx?id=39717). The MSI can also be used to upgrade existing data management gateway to the latest version, with all settings preserved. -* By clicking **Download and install data gateway** link under MANUAL SETUP or **Install directly on this computer** under EXPRESS SETUP. See [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions on using express setup. The manual step takes you to the download center. The instructions for downloading and installing the gateway from download center are in the next section. --### Installation best practices: -1. Configure power plan on the host machine for the gateway so that the machine does not hibernate. If the host machine hibernates, the gateway does not respond to data requests. -2. Back up the certificate associated with the gateway. --### Install the gateway from download center -1. Navigate to [Microsoft Data Management Gateway download page](https://www.microsoft.com/download/details.aspx?id=39717). -2. Click **Download**, select the **64-bit** version (32-bit is no more supported), and click **Next**. -3. Run the **MSI** directly or save it to your hard disk and run. -4. On the **Welcome** page, select a **language** click **Next**. -5. **Accept** the End-User License Agreement and click **Next**. -6. Select **folder** to install the gateway and click **Next**. -7. On the **Ready to install** page, click **Install**. -8. Click **Finish** to complete installation. -9. Get the key from the Azure portal. See the next section for step-by-step instructions. -10. On the **Register gateway** page of **Data Management Gateway Configuration Manager** running on your machine, do the following steps: - 1. Paste the key in the text. - 2. Optionally, click **Show gateway key** to see the key text. - 3. Click **Register**. --### Register gateway using key -#### If you haven't already created a logical gateway in the portal -To create a gateway in the portal and get the key from the **Configure** page, Follow steps from walkthrough in the [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --#### If you have already created the logical gateway in the portal -1. In Azure portal, navigate to the **Data Factory** page, and click **Linked Services** tile. 
-- :::image type="content" source="media/data-factory-data-management-gateway/data-factory-blade.png" alt-text="Data Factory page"::: -2. In the **Linked Services** page, select the logical **gateway** you created in the portal. -- :::image type="content" source="media/data-factory-data-management-gateway/data-factory-select-gateway.png" alt-text="logical gateway"::: -3. In the **Data Gateway** page, click **Download and install data gateway**. -- :::image type="content" source="media/data-factory-data-management-gateway/download-and-install-link-on-portal.png" alt-text="Download link in the portal"::: -4. In the **Configure** page, click **Recreate key**. Click Yes on the warning message after reading it carefully. -- :::image type="content" source="media/data-factory-data-management-gateway/recreate-key-button.png" alt-text="Recreate key button"::: -5. Click Copy button next to the key. The key is copied to the clipboard. -- :::image type="content" source="media/data-factory-data-management-gateway/copy-gateway-key.png" alt-text="Copy key"::: --### System tray icons/ notifications -The following image shows some of the tray icons that you see. ---If you move cursor over the system tray icon/notification message, you see details about the state of the gateway/update operation in a popup window. --### Ports and firewall -There are two firewalls you need to consider: **corporate firewall** running on the central router of the organization, and **Windows firewall** configured as a daemon on the local machine where the gateway is installed. ---At corporate firewall level, you need configure the following domains and outbound ports: --| Domain names | Ports | Description | -| | | | -| *.servicebus.windows.net |443 |Used for communication with Data Movement Service backend | -| *.core.windows.net |443 |Used for Staged copy using Azure Blob (if configured)| -| *.frontend.clouddatahub.net |443 |Used for communication with Data Movement Service backend | -| *.servicebus.windows.net |9350-9354, 5671 |Optional service bus relay over TCP used by the Copy Wizard | --At Windows firewall level, these outbound ports are normally enabled. If not, you can configure the domains and ports accordingly on gateway machine. --> [!NOTE] -> 1. Based on your source/ sinks, you may have to allow additional domains and outbound ports in your corporate/Windows firewall. -> 2. For some Cloud Databases (For example: [Azure SQL Database](/azure/azure-sql/database/firewall-configure), [Azure Data Lake](../../data-lake-store/data-lake-store-secure-data.md#set-ip-address-range-for-data-access), etc.), you may need to allow IP address of Gateway machine on their firewall configuration. -> -> --#### Copy data from a source data store to a sink data store -Ensure that the firewall rules are enabled properly on the corporate firewall, Windows firewall on the gateway machine, and the data store itself. Enabling these rules allows the gateway to connect to both source and sink successfully. Enable rules for each data store that is involved in the copy operation. --For example, to copy from **an on-premises data store to an Azure SQL Database sink or an Azure Synapse Analytics sink**, do the following steps: --* Allow outbound **TCP** communication on port **1433** for both Windows firewall and corporate firewall. -* Configure the firewall settings of logical SQL server to add the IP address of the gateway machine to the list of allowed IP addresses. 
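As an illustration of the second point, the gateway machine's public IP address can be added to the logical SQL server's firewall with Azure PowerShell. This is only a sketch with placeholder resource group, server, rule name, and IP address values; it assumes the Az.Sql module is installed and that you are signed in to the subscription that hosts the server.

```powershell
# Sketch: allow the gateway machine's public IP (placeholder value) on the logical SQL server.
New-AzSqlServerFirewallRule -ResourceGroupName "MyResourceGroup" `
    -ServerName "my-sql-server" `
    -FirewallRuleName "AllowGatewayMachine" `
    -StartIpAddress "203.0.113.10" `
    -EndIpAddress "203.0.113.10"
```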
--> [!NOTE] -> If your firewall does not allow outbound port 1433, Gateway can't access Azure SQL directly. In this case, you may use [Staged Copy](./data-factory-copy-activity-performance.md#staged-copy) to SQL Database / SQL Managed Instance / SQL Azure DW. In this scenario, you would only require HTTPS (port 443) for the data movement. -> -> --### Proxy server considerations -If your corporate network environment uses a proxy server to access the internet, configure data management gateway to use appropriate proxy settings. You can set the proxy during the initial registration phase. ---Gateway uses the proxy server to connect to the cloud service. Click **Change** link during initial setup. You see the **proxy setting** dialog. ---There are three configuration options: --* **Do not use proxy**: Gateway does not explicitly use any proxy to connect to cloud services. -* **Use system proxy**: Gateway uses the proxy setting that is configured in diahost.exe.config and diawp.exe.config. If no proxy is configured in diahost.exe.config and diawp.exe.config, gateway connects to cloud service directly without going through proxy. -* **Use custom proxy**: Configure the HTTP proxy setting to use for gateway, instead of using configurations in diahost.exe.config and diawp.exe.config. Address and Port are required. User Name and Password are optional depending on your proxy's authentication setting. All settings are encrypted with the credential certificate of the gateway and stored locally on the gateway host machine. --The data management gateway Host Service restarts automatically after you save the updated proxy settings. --After gateway has been successfully registered, if you want to view or update proxy settings, use Data Management Gateway Configuration Manager. --1. Launch **Data Management Gateway Configuration Manager**. -2. Switch to the **Settings** tab. -3. Click **Change** link in **HTTP Proxy** section to launch the **Set HTTP Proxy** dialog. -4. After you click the **Next** button, you see a warning dialog asking for your permission to save the proxy setting and restart the Gateway Host Service. --You can view and update HTTP proxy by using Configuration Manager tool. ---> [!NOTE] -> If you set up a proxy server with NTLM authentication, Gateway Host Service runs under the domain account. If you change the password for the domain account later, remember to update configuration settings for the service and restart it accordingly. Due to this requirement, we suggest you use a dedicated domain account to access the proxy server that does not require you to update the password frequently. -> -> --### Configure proxy server settings -If you select **Use system proxy** setting for the HTTP proxy, gateway uses the proxy setting in diahost.exe.config and diawp.exe.config. If no proxy is specified in diahost.exe.config and diawp.exe.config, gateway connects to cloud service directly without going through proxy. The following procedure provides instructions for updating the diahost.exe.config file. --1. In File Explorer, make a safe copy of *C:\\Program Files\\Microsoft Integration Runtime\\5.0\\Shared\\diahost.exe.config* to back up the original file. -2. Launch Notepad.exe running as administrator, and open text file *C:\\Program Files\\Microsoft Integration Runtime\\5.0\\Shared\\diahost.exe.config*. 
You find the default tag for system.net as shown in the following code: -- ``` - <system.net> - <defaultProxy useDefaultCredentials="true" /> - </system.net> - ``` -- You can then add proxy server details as shown in the following example: -- ``` - <system.net> - <defaultProxy enabled="true"> - <proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" /> - </defaultProxy> - </system.net> - ``` -- Additional properties are allowed inside the proxy tag to specify the required settings like scriptLocation. Refer to [proxy Element (Network Settings)](/dotnet/framework/configure-apps/file-schema/network/proxy-element-network-settings) on syntax. -- ``` - <proxy autoDetect="true|false|unspecified" bypassonlocal="true|false|unspecified" proxyaddress="uriString" scriptLocation="uriString" usesystemdefault="true|false|unspecified "/> - ``` -3. Save the configuration file into the original location, then restart the Data Management Gateway Host service, which picks up the changes. To restart the service: use services applet from the control panel, or from the **Data Management Gateway Configuration Manager** > click the **Stop Service** button, then click the **Start Service**. If the service does not start, it is likely that an incorrect XML tag syntax has been added into the application configuration file that was edited. --> [!IMPORTANT] -> Do not forget to update **both** diahost.exe.config and diawp.exe.config. --In addition to these points, you also need to make sure Microsoft Azure is in your company's allowed list. The list of valid Microsoft Azure IP addresses can be downloaded from the [Microsoft Download Center](https://www.microsoft.com/download/details.aspx?id=41653). --#### Possible symptoms for firewall and proxy server-related issues -If you encounter errors similar to the following ones, it is likely due to improper configuration of the firewall or proxy server, which blocks gateway from connecting to Data Factory to authenticate itself. Refer to previous section to ensure your firewall and proxy server are properly configured. --1. When you try to register the gateway, you receive the following error: "Failed to register the gateway key. Before trying to register the gateway key again, confirm that the data management gateway is in a connected state and the Data Management Gateway Host Service is Started." -2. When you open Configuration Manager, you see status as "Disconnected" or "Connecting." When viewing Windows event logs, under "Event Viewer" > "Application and Services Logs" > "Data Management Gateway", you see error messages such as the following error: - `Unable to connect to the remote server` - `A component of Data Management Gateway has become unresponsive and restarts automatically. Component name: Gateway.` --### Open port 8050 for credential encryption -The **Setting Credentials** application uses the inbound port **8050** to relay credentials to the gateway when you set up an on-premises linked service in the Azure portal. During gateway setup, by default, the gateway installation opens it on the gateway machine. --If you are using a third-party firewall, you can manually open the port 8050. If you run into firewall issue during gateway setup, you can try using the following command to install the gateway without configuring the firewall. 
--```cmd -msiexec /q /i DataManagementGateway.msi NOFIREWALL=1 -``` --If you choose not to open the port 8050 on the gateway machine, use mechanisms other than using the **Setting Credentials** application to configure data store credentials. For example, you could use [New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) PowerShell cmdlet. See Setting Credentials and Security section on how data store credentials can be set. --## Update -By default, data management gateway is automatically updated when a newer version of the gateway is available. The gateway is not updated until all the scheduled tasks are done. No further tasks are processed by the gateway until the update operation is completed. If the update fails, gateway is rolled back to the old version. --You see the scheduled update time in the following places: --* The gateway properties page in the Azure portal. -* Home page of the Data Management Gateway Configuration Manager -* System tray notification message. --The Home tab of the Data Management Gateway Configuration Manager displays the update schedule and the last time the gateway was installed/updated. ---You can install the update right away or wait for the gateway to be automatically updated at the scheduled time. For example, the following image shows you the notification message shown in the Gateway Configuration Manager along with the Update button that you can click to install it immediately. ---The notification message in the system tray would look as shown in the following image: ---You see the status of update operation (manual or automatic) in the system tray. When you launch Gateway Configuration Manager next time, you see a message on the notification bar that the gateway has been updated along with a link to [what's new topic](data-factory-gateway-release-notes.md). --### To disable/enable auto-update feature -You can disable/enable the auto-update feature by doing the following steps: --[For single node gateway] --1. Launch Windows PowerShell on the gateway machine. -2. Switch to the *C:\\\\Program Files\\Microsoft Integration Runtime\\5.0\\PowerShellScript\\* folder. -3. Run the following command to turn the auto-update feature OFF (disable). -- ```powershell - .\IntegrationRuntimeAutoUpdateToggle.ps1 -off - ``` --4. To turn it back on: -- ```powershell - .\IntegrationRuntimeAutoUpdateToggle.ps1 -on - ``` -[For multi-node highly available and scalable gateway](data-factory-data-management-gateway-high-availability-scalability.md) --1. Launch Windows PowerShell on the gateway machine. --2. Switch to the *C:\\\\Program Files\\Microsoft Integration Runtime\\5.0\\PowerShellScript\\* folder. --3. Run the following command to turn the auto-update feature OFF (disable). -- For gateway with high availability feature, an extra AuthKey param is required. -- ```powershell - .\IntegrationRuntimeAutoUpdateToggle.ps1 -off -AuthKey <your auth key> - ``` --4. To turn it back on: -- ```powershell - .\IntegrationRuntimeAutoUpdateToggle.ps1 -on -AuthKey <your auth key> - ``` --## Configuration Manager --Once you install the gateway, you can launch Data Management Gateway Configuration Manager in one of the following ways: --1. In the **Search** window, type **Data Management Gateway** to access this utility. -2. Run the executable *ConfigManager.exe* in the folder: *C:\\Program Files\\Microsoft Integration Runtime\\5.0\\Shared\\*. 
--### Home page -The Home page allows you to do the following actions: --* View status of the gateway (connected to the cloud service etc.). -* **Register** using a key from the portal. -* **Stop** and start the **Integration Runtime service** on the gateway machine. -* **Schedule updates** at a specific time of the days. -* View the date when the gateway was **last updated**. --### Settings page -The Settings page allows you to do the following actions: --* View, change, and export **certificate** used by the gateway. This certificate is used to encrypt data source credentials. -* Change **HTTPS port** for the endpoint. The gateway opens a port for setting the data source credentials. -* **Status** of the endpoint -* View **SSL certificate** is used for TLS/SSL communication between portal and the gateway to set credentials for data sources. --### Remote access from intranet -You can enable/ disable any remote connectivity that today happens using port 8050 (see section above) while using PowerShell or Credential Manager application for encrypting credentials. --### Diagnostics page -The Diagnostics page allows you to do the following actions: --* Enable verbose **logging**, view logs in event viewer, and send logs to Microsoft if there was a failure. -* **Test connection** to a data source. --### Help page -The Help page displays the following information: --* Brief description of the gateway -* Version number -* Links to online help, privacy statement, and license agreement. --## Monitor gateway in the portal -In the Azure portal, you can view near-real time snapshot of resource utilization (CPU, memory, network(in/out), etc.) on a gateway machine. --1. In Azure portal, navigate to the home page for your data factory, and click **Linked services** tile. -- :::image type="content" source="./media/data-factory-data-management-gateway/monitor-data-factory-home-page.png" alt-text="Data factory home page"::: -2. Select the **gateway** in the **Linked services** page. -- :::image type="content" source="./media/data-factory-data-management-gateway/monitor-linked-services-blade.png" alt-text="Linked services page"::: -3. In the **Gateway** page, you can see the memory and CPU usage of the gateway. -- :::image type="content" source="./media/data-factory-data-management-gateway/gateway-simple-monitoring.png" alt-text="CPU and memory usage of gateway"::: -4. Enable **Advanced settings** to see more details such as network usage. - - :::image type="content" source="./media/data-factory-data-management-gateway/gateway-advanced-monitoring.png" alt-text="Advanced monitoring of gateway"::: --The following table provides descriptions of columns in the **Gateway Nodes** list: --Monitoring Property | Description -: | :- -Name | Name of the logical gateway and nodes associated with the gateway. Node is an on-premises Windows machine that has the gateway installed on it. For information on having more than one node (up to four nodes) in a single logical gateway, see [Data Management Gateway - high availability and scalability](data-factory-data-management-gateway-high-availability-scalability.md). -Status | Status of the logical gateway and the gateway nodes. Example: Online/Offline/Limited/etc. For information about these statuses, See [Gateway status](#gateway-status) section. -Version | Shows the version of the logical gateway and each gateway node. The version of the logical gateway is determined based on version of majority of nodes in the group. 
If there are nodes with different versions in the logical gateway setup, only the nodes with the same version number as the logical gateway function properly. Others are in the limited mode and need to be manually updated (only in case auto-update fails). -Available memory | Available memory on a gateway node. This value is a near real-time snapshot. -CPU utilization | CPU utilization of a gateway node. This value is a near real-time snapshot. -Networking (In/Out) | Network utilization of a gateway node. This value is a near real-time snapshot. -Concurrent Jobs (Running/ Limit) | Number of jobs or tasks running on each node. This value is a near real-time snapshot. Limit signifies the maximum concurrent jobs for each node. This value is defined based on the machine size. You can increase the limit to scale up concurrent job execution in advanced scenarios, where CPU/memory/network is under-utilized, but activities are timing out. This capability is also available with a single-node gateway (even when the scalability and availability feature is not enabled). -Role | There are two types of roles in a multi-node gateway - Dispatcher and worker. All nodes are workers, which means they can all be used to execute jobs. There is only one dispatcher node, which is used to pull tasks/jobs from cloud services and dispatch them to different worker nodes (including itself). --In this page, you see some settings that make more sense when there are two or more nodes (scale out scenario) in the gateway. See [Data Management Gateway - high availability and scalability](data-factory-data-management-gateway-high-availability-scalability.md) for details about setting up a multi-node gateway. --### Gateway status -The following table provides possible statuses of a **gateway node**: --Status | Comments/Scenarios -:- | : -Online | Node connected to Data Factory service. -Offline | Node is offline. -Upgrading | The node is being auto-updated. -Limited | Due to Connectivity issue. May be due to HTTP port 8050 issue, service bus connectivity issue, or credential sync issue. -Inactive | Node is in a configuration different from the configuration of other majority nodes.<br/><br/> A node can be inactive when it cannot connect to other nodes. --The following table provides possible statuses of a **logical gateway**. The gateway status depends on statuses of the gateway nodes. --Status | Comments -:-- | :- -Needs Registration | No node is yet registered to this logical gateway -Online | Gateway Nodes are online -Offline | No node in online status. -Limited | Not all nodes in this gateway are in healthy state. This status is a warning that some node might be down! <br/><br/>Could be due to credential sync issue on dispatcher/worker node. --## Scale up gateway -You can configure the number of **concurrent data movement jobs** that can run on a node to scale up the capability of moving data between on-premises and cloud data stores. --When the available memory and CPU are not utilized well, but the idle capacity is 0, you should scale up by increasing the number of concurrent jobs that can run on a node. You may also want to scale up when activities are timing out because the gateway is overloaded. In the advanced settings of a gateway node, you can increase the maximum capacity for a node. --## Troubleshooting gateway issues -See [Troubleshooting gateway issues](data-factory-troubleshoot-gateway-issues.md) article for information/tips for troubleshooting issues with using the data management gateway. 
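If you want to check the gateway status from a script instead of the portal, the following is a minimal PowerShell sketch built on the `Get-AzDataFactoryGateway` cmdlet that is covered later in this article. The resource group, data factory, and gateway names are placeholders for your own values.

```powershell
# Report the logical gateway status and warn when it is not Online.
# The names below are placeholders; replace them with your own resources.
$gateway = Get-AzDataFactoryGateway -ResourceGroupName 'ADF_ResourceGroup' `
    -DataFactoryName 'MyDataFactory' -Name 'MyGateway'

if ($gateway.Status -ne 'Online') {
    Write-Warning ("Gateway '{0}' is in status '{1}'. Check the gateway nodes." -f $gateway.Name, $gateway.Status)
}
else {
    Write-Output ("Gateway '{0}' is Online and ready to use." -f $gateway.Name)
}
```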
--## Move gateway from one machine to another -This section provides steps for moving gateway client from one machine to another machine. --1. In the portal, navigate to the **Data Factory home page**, and click the **Linked Services** tile. -- :::image type="content" source="./media/data-factory-data-management-gateway/DataGatewaysLink.png" alt-text="Data Gateways Link"::: -2. Select your gateway in the **DATA GATEWAYS** section of the **Linked Services** page. -- :::image type="content" source="./media/data-factory-data-management-gateway/LinkedServiceBladeWithGateway.png" alt-text="Linked Services page with gateway selected"::: -3. In the **Data gateway** page, click **Download and install data gateway**. -- :::image type="content" source="./media/data-factory-data-management-gateway/DownloadGatewayLink.png" alt-text="Download gateway link"::: -4. In the **Configure** page, click **Download and install data gateway**, and follow instructions to install the data gateway on the machine. -- :::image type="content" source="./media/data-factory-data-management-gateway/ConfigureBlade.png" alt-text="Configure page"::: -5. Keep the **Microsoft Data Management Gateway Configuration Manager** open. -- :::image type="content" source="./media/data-factory-data-management-gateway/ConfigurationManager.png" alt-text="Configuration Manager"::: -6. In the **Configure** page in the portal, click **Recreate key** on the command bar, and click **Yes** for the warning message. Click **copy button** next to key text that copies the key to the clipboard. The gateway on the old machine stops functioning as soon you recreate the key. -- :::image type="content" source="./media/data-factory-data-management-gateway/RecreateKey.png" alt-text="Recreate key 2"::: -7. Paste the **key** into text box in the **Register Gateway** page of the **Data Management Gateway Configuration Manager** on your machine. (optional) Click **Show gateway key** check box to see the key text. -- :::image type="content" source="./media/data-factory-data-management-gateway/CopyKeyAndRegister.png" alt-text="Copy key and Register"::: -8. Click **Register** to register the gateway with the cloud service. -9. On the **Settings** tab, click **Change** to select the same certificate that was used with the old gateway, enter the **password**, and click **Finish**. -- :::image type="content" source="./media/data-factory-data-management-gateway/SpecifyCertificate.png" alt-text="Specify Certificate"::: -- You can export a certificate from the old gateway by doing the following steps: launch Data Management Gateway Configuration Manager on the old machine, switch to the **Certificate** tab, click **Export** button and follow the instructions. -10. After successful registration of the gateway, you should see the **Registration** set to **Registered** and **Status** set to **Started** on the Home page of the Gateway Configuration Manager. --## Encrypting credentials -To encrypt credentials in the Data Factory Editor, do the following steps: --1. Launch web browser on the **gateway machine**, navigate to [Azure portal](https://portal.azure.com). Search for your data factory if needed, open data factory in the **DATA FACTORY** page and then click **Author & Deploy** to launch Data Factory Editor. -2. Click an existing **linked service** in the tree view to see its JSON definition or create a linked service that requires a data management gateway (for example: SQL Server or Oracle). -3. 
In the JSON editor, for the **gatewayName** property, enter the name of the gateway. -4. Enter server name for the **Data Source** property in the **connectionString**. -5. Enter database name for the **Initial Catalog** property in the **connectionString**. -6. Click **Encrypt** button on the command bar that launches the click-once **Credential Manager** application. You should see the **Setting Credentials** dialog box. -- :::image type="content" source="./media/data-factory-data-management-gateway/setting-credentials-dialog.png" alt-text="Setting credentials dialog"::: -7. In the **Setting Credentials** dialog box, do the following steps: - 1. Select **authentication** that you want the Data Factory service to use to connect to the database. - 2. Enter name of the user who has access to the database for the **USERNAME** setting. - 3. Enter password for the user for the **PASSWORD** setting. - 4. Click **OK** to encrypt credentials and close the dialog box. -8. You should see a **encryptedCredential** property in the **connectionString** now. -- ```json - { - "name": "SqlServerLinkedService", - "properties": { - "type": "OnPremisesSqlServer", - "description": "", - "typeProperties": { - "connectionString": "data source=myserver;initial catalog=mydatabase;Integrated Security=False;EncryptedCredential=eyJDb25uZWN0aW9uU3R", - "gatewayName": "adftutorialgateway" - } - } - } - ``` -- If you access the portal from a machine that is different from the gateway machine, you must make sure that the Credentials Manager application can connect to the gateway machine. If the application cannot reach the gateway machine, it does not allow you to set credentials for the data source and to test connection to the data source. --When you use the **Setting Credentials** application, the portal encrypts the credentials with the certificate specified in the **Certificate** tab of the **Gateway Configuration Manager** on the gateway machine. --If you are looking for an API-based approach for encrypting the credentials, you can use the [New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) PowerShell cmdlet to encrypt credentials. The cmdlet uses the certificate that gateway is configured to use to encrypt the credentials. You add encrypted credentials to the **EncryptedCredential** element of the **connectionString** in the JSON. You use the JSON with the [New-AzDataFactoryLinkedService](/powershell/module/az.datafactory/new-azdatafactorylinkedservice) cmdlet or in the Data Factory Editor. --```JSON -"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>", -``` --There is one more approach for setting credentials using Data Factory Editor. If you create a SQL Server linked service by using the editor and you enter credentials in plain text, the credentials are encrypted using a certificate that the Data Factory service owns. It does NOT use the certificate that gateway is configured to use. While this approach might be a little faster in some cases, it is less secure. Therefore, we recommend that you follow this approach only for development/testing purposes. --## PowerShell cmdlets -This section describes how to create and register a gateway using Azure PowerShell cmdlets. --1. Launch **Azure PowerShell** in administrator mode. -2. Log in to your Azure account by running the following command and entering your Azure credentials. -- ```powershell - Connect-AzAccount - ``` -3. 
Use the **New-AzDataFactoryGateway** cmdlet to create a logical gateway as follows: -- ```powershell - $MyDMG = New-AzDataFactoryGateway -Name <gatewayName> -DataFactoryName <dataFactoryName> -ResourceGroupName ADF -Description <desc> - ``` - **Example command and output**: -- ``` - PS C:\> $MyDMG = New-AzDataFactoryGateway -Name MyGateway -DataFactoryName $df -ResourceGroupName ADF -Description "gateway for walkthrough" -- Name : MyGateway - Description : gateway for walkthrough - Version : - Status : NeedRegistration - VersionStatus : None - CreateTime : 9/28/2014 10:58:22 - RegisterTime : - LastConnectTime : - ExpiryTime : - ProvisioningState : Succeeded - Key : ADF#00000000-0000-4fb8-a867-947877aef6cb@fda06d87-f446-43b1-9485-78af26b8bab0@4707262b-dc25-4fe5-881c-c8a7c3c569fe@wu#nfU4aBlq/heRyYFZ2Xt/CD+7i73PEO521Sj2AFOCmiI - ``` --1. In Azure PowerShell, switch to the folder: *C:\\\\Program Files\\Microsoft Integration Runtime\\5.0\\PowerShellScript\\*. Run *RegisterGateway.ps1* associated with the local variable **$Key** as shown in the following command. This script registers the client agent installed on your machine with the logical gateway you create earlier. -- ```powershell - PS C:\> .\RegisterGateway.ps1 $MyDMG.Key - ``` - ``` - Agent registration is successful! - ``` - You can register the gateway on a remote machine by using the IsRegisterOnRemoteMachine parameter. Example: -- ```powershell - .\RegisterGateway.ps1 $MyDMG.Key -IsRegisterOnRemoteMachine true - ``` -2. You can use the **Get-AzDataFactoryGateway** cmdlet to get the list of Gateways in your data factory. When the **Status** shows **online**, it means your gateway is ready to use. -- ```powershell - Get-AzDataFactoryGateway -DataFactoryName <dataFactoryName> -ResourceGroupName ADF - ``` -- You can remove a gateway using the **Remove-AzDataFactoryGateway** cmdlet and update description for a gateway using the **Set-AzDataFactoryGateway** cmdlets. For syntax and other details about these cmdlets, see Data Factory Cmdlet Reference. --### List gateways using PowerShell --```powershell -Get-AzDataFactoryGateway -DataFactoryName jasoncopyusingstoredprocedure -ResourceGroupName ADF_ResourceGroup -``` --### Remove gateway using PowerShell --```powershell -Remove-AzDataFactoryGateway -Name JasonHDMG_byPSRemote -ResourceGroupName ADF_ResourceGroup -DataFactoryName jasoncopyusingstoredprocedure -Force -``` --## Next steps -* See [Move data between on-premises and cloud data stores](data-factory-move-data-between-onprem-and-cloud.md) article. In the walkthrough, you create a pipeline that uses the gateway to move data from a SQL Server database to an Azure blob. |
data-factory | Data Factory Data Movement Activities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-movement-activities.md | - Title: Move data by using Copy Activity -description: 'Learn about data movement in Data Factory pipelines: data migration between cloud stores, and between an on-premises store and a cloud store. Use Copy Activity.' ---- Previously updated : 04/12/2023----# Move data by using Copy Activity -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-data-movement-activities.md) -> * [Version 2 (current version)](../copy-activity-overview.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Copy Activity in V2](../copy-activity-overview.md). --## Overview -In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption. ---Copy Activity is powered by a secure, reliable, scalable, and [globally available service](#global). This article provides details on data movement in Data Factory and Copy Activity. --First, let's see how data migration occurs between two cloud data stores, and between an on-premises data store and a cloud data store. --> [!NOTE] -> To learn about activities in general, see [Understanding pipelines and activities](data-factory-create-pipelines.md). -> -> --### Copy data between two cloud data stores -When both source and sink data stores are in the cloud, Copy Activity goes through the following stages to copy data from the source to the sink. The service that powers Copy Activity: --1. Reads data from the source data store. -2. Performs serialization/deserialization, compression/decompression, column mapping, and type conversion. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity. -3. Writes data to the destination data store. --The service automatically chooses the optimal region to perform the data movement. This region is usually the one closest to the sink data store. ---### Copy data between an on-premises data store and a cloud data store -To securely move data between an on-premises data store and a cloud data store, install Data Management Gateway on your on-premises machine. Data Management Gateway is an agent that enables hybrid data movement and processing. You can install it on the same machine as the data store itself, or on a separate machine that has access to the data store. --In this scenario, Data Management Gateway performs the serialization/deserialization, compression/decompression, column mapping, and type conversion. Data does not flow through the Azure Data Factory service. Instead, Data Management Gateway directly writes the data to the destination store. ---See [Move data between on-premises and cloud data stores](data-factory-move-data-between-onprem-and-cloud.md) for an introduction and walkthrough. See [Data Management Gateway](data-factory-data-management-gateway.md) for detailed information about this agent. --You can also move data from/to supported data stores that are hosted on Azure IaaS virtual machines (VMs) by using Data Management Gateway. 
In this case, you can install Data Management Gateway on the same VM as the data store itself, or on a separate VM that has access to the data store. --## Supported data stores and formats -Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store. --> [!NOTE] -> If you need to move data to/from a data store that Copy Activity doesn't support, use a **custom activity** in Data Factory with your own logic for copying/moving data. For details on creating and using a custom activity, see [Use custom activities in an Azure Data Factory pipeline](data-factory-use-custom-activities.md). ---> [!NOTE] -> Data stores with * can be on-premises or on Azure IaaS, and require you to install [Data Management Gateway](data-factory-data-management-gateway.md) on an on-premises/Azure IaaS machine. --### Supported file formats -You can use Copy Activity to **copy files as-is** between two file-based data stores. In that case, you can skip the [format section](data-factory-create-datasets.md) in both the input and output dataset definitions. The data is copied efficiently without any serialization/deserialization. --Copy Activity also reads from and writes to files in specified formats (**Text, JSON, Avro, ORC, and Parquet**), and the compression codecs **GZip, Deflate, BZip2, and ZipDeflate** are supported. See [Supported file and compression formats](data-factory-supported-file-and-compression-formats.md) for details. --For example, you can perform the following copy operations: --* Copy data from a SQL Server database and write it to Azure Data Lake Store in ORC format. -* Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format. -* Copy zipped files from an on-premises file system, decompress them, and then land the files in Azure Data Lake Store. -* Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database. --## <a name="global"></a>Globally available data movement -Azure Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers Copy Activity is available globally in the following regions and geographies. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See [Services by region](https://azure.microsoft.com/regions/#services) for availability of Data Factory and Data Movement in a region. --### Copy data between cloud data stores -When both source and sink data stores are in the cloud, Data Factory uses a service deployment in the region that is closest to the sink in the same geography to move the data.
Refer to the following table for mapping: --| Geography of the destination data stores | Region of the destination data store | Region used for data movement | -|: |: |: | -| United States | East US | East US | -| | East US 2 | East US 2 | -| | Central US | Central US | -| | North Central US | North Central US | -| | South Central US | South Central US | -| | West Central US | West Central US | -| | West US | West US | -| | West US 2 | West US 2 | -| Canada | Canada East | Canada Central | -| | Canada Central | Canada Central | -| Brazil | Brazil South | Brazil South | -| Europe | North Europe | North Europe | -| | West Europe | West Europe | -| United Kingdom | UK West | UK South | -| | UK South | UK South | -| Asia Pacific | Southeast Asia | Southeast Asia | -| | East Asia | Southeast Asia | -| Australia | Australia East | Australia East | -| | Australia Southeast | Australia Southeast | -| India | Central India | Central India | -| | West India | Central India | -| | South India | Central India | -| Japan | Japan East | Japan East | -| | Japan West | Japan East | -| Korea | Korea Central | Korea Central | -| | Korea South | Korea Central | --Alternatively, you can explicitly indicate the region of Data Factory service to be used to perform the copy by specifying `executionLocation` property under Copy Activity `typeProperties`. Supported values for this property are listed in above **Region used for data movement** column. Note your data goes through that region over the wire during copy. For example, to copy between Azure stores in Korea, you can specify `"executionLocation": "Japan East"` to route through Japan region (see [sample JSON](#by-using-json-scripts) as reference). --> [!NOTE] -> If the region of the destination data store is not in preceding list or undetectable, by default Copy Activity fails instead of going through an alternative region, unless `executionLocation` is specified. The supported region list will be expanded over time. -> --### Copy data between an on-premises data store and a cloud data store -When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores, [Data Management Gateway](data-factory-data-management-gateway.md) performs data movement on an on-premises machine or virtual machine. The data does not flow through the service in the cloud, unless you use the [staged copy](data-factory-copy-activity-performance.md#staged-copy) capability. In this case, data flows through the staging Azure Blob storage before it is written into the sink data store. --## Create a pipeline with Copy Activity -You can create a pipeline with Copy Activity in a couple of ways: --### By using the Copy Wizard -The Data Factory Copy Wizard helps you to create a pipeline with Copy Activity. This pipeline allows you to copy data from supported sources to destinations *without writing JSON* definitions for linked services, datasets, and pipelines. See [Data Factory Copy Wizard](data-factory-copy-wizard.md) for details about the wizard. --### By using JSON scripts -You can use Data Factory Editor in Visual Studio, or Azure PowerShell to create a JSON definition for a pipeline (by using Copy Activity). Then, you can deploy it to create the pipeline in Data Factory. See [Tutorial: Use Copy Activity in an Azure Data Factory pipeline](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for a tutorial with step-by-step instructions. 
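As a rough sketch of the PowerShell deployment path, you author the pipeline JSON in a file and then deploy it with the V1 Data Factory cmdlets. The resource group, data factory name, and file path below are placeholders, and the exact parameter set may vary with your Az.DataFactory module version.

```powershell
# Deploy a pipeline from a JSON definition file to a V1 data factory.
# All names and the file path are placeholders for your own values.
Connect-AzAccount

New-AzDataFactoryPipeline -ResourceGroupName 'ADF' `
    -DataFactoryName 'MyDataFactory' `
    -File 'C:\ADF\CopyFromBlobToSqlPipeline.json'
```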
--JSON properties (such as name, description, input and output tables, and policies) are available for all types of activities. Properties that are available in the `typeProperties` section of the activity vary with each activity type. --For Copy Activity, the `typeProperties` section varies depending on the types of sources and sinks. Click a source/sink in the [Supported sources and sinks](#supported-data-stores-and-formats) section to learn about the type properties that Copy Activity supports for that data store. --Here's a sample JSON definition: --```json -{ - "name": "ADFTutorialPipeline", - "properties": { - "description": "Copy data from Azure blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "type": "Copy", - "inputs": [ - { - "name": "InputBlobTable" - } - ], - "outputs": [ - { - "name": "OutputSQLTable" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink" - }, - "executionLocation": "Japan East" - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2016-07-12T00:00:00Z", - "end": "2016-07-13T00:00:00Z" - } -} -``` -The schedule that is defined in the output dataset determines when the activity runs (for example: **daily**, with the frequency set to **day** and the interval set to **1**). The activity copies data from an input dataset (**source**) to an output dataset (**sink**). --You can specify more than one input dataset to Copy Activity. The additional datasets are used to verify the dependencies before the activity is run. However, only the data from the first dataset is copied to the destination dataset. For more information, see [Scheduling and execution](data-factory-scheduling-and-execution.md). --## Performance and tuning -See the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md), which describes key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed performance during internal testing and discusses various ways to optimize the performance of Copy Activity. --## Fault tolerance -By default, Copy Activity stops copying data and returns a failure when it encounters incompatible data between the source and the sink. You can explicitly configure it to skip and log the incompatible rows and copy only the compatible data so that the copy succeeds. See [Copy Activity fault tolerance](data-factory-copy-activity-fault-tolerance.md) for more details. --## Security considerations -See the [Security considerations](data-factory-data-movement-security-considerations.md) article, which describes the security infrastructure that data movement services in Azure Data Factory use to secure your data. --## Scheduling and sequential copy -See [Scheduling and execution](data-factory-scheduling-and-execution.md) for detailed information about how scheduling and execution works in Data Factory. It is possible to run multiple copy operations one after another in a sequential/ordered manner. See the [Copy sequentially](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline) section. --## Type conversions -Different data stores have different native type systems. Copy Activity performs automatic type conversions from source types to sink types with the following two-step approach: --1. Convert from native source types to a .NET type. -2. Convert from a .NET type to a native sink type.
--The mapping from a native type system to a .NET type for a data store is in the respective data store article. (Click the specific link in the Supported data stores table.) You can use these mappings to determine appropriate types while creating your tables, so that Copy Activity performs the right conversions. --## Next steps -* To learn more about Copy Activity, see [Copy data from Azure Blob storage to Azure SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). -* To learn about moving data from an on-premises data store to a cloud data store, see [Move data from on-premises to cloud data stores](data-factory-move-data-between-onprem-and-cloud.md). |
data-factory | Data Factory Data Movement Security Considerations | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-movement-security-considerations.md | - Title: Security considerations for data movement in Azure Data Factory -description: 'Learn about securing data movement in Azure Data Factory.' ---- Previously updated : 04/12/2023-----# Azure Data Factory - Security considerations for data movement --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [data movement security considerations for Data Factory](../data-movement-security-considerations.md). --## Introduction -This article describes basic security infrastructure that data movement services in Azure Data Factory use to secure your data. Azure Data Factory management resources are built on Azure security infrastructure and use all possible security measures offered by Azure. --In a Data Factory solution, you create one or more data [pipelines](data-factory-create-pipelines.md). A pipeline is a logical grouping of activities that together perform a task. These pipelines reside in the region where the data factory was created. --Even though Data Factory is available in only **West US**, **East US**, and **North Europe** regions, the data movement service is available [globally in several regions](data-factory-data-movement-activities.md#global). Data Factory service ensures that data does not leave a geographical area/ region unless you explicitly instruct the service to use an alternate region if the data movement service is not yet deployed to that region. --Azure Data Factory itself does not store any data except for linked service credentials for cloud data stores, which are encrypted using certificates. It lets you create data-driven workflows to orchestrate movement of data between [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) and processing of data using [compute services](data-factory-compute-linked-services.md) in other regions or in an on-premises environment. It also allows you to [monitor and manage workflows](data-factory-monitor-manage-pipelines.md) using both programmatic and UI mechanisms. --Data movement using Azure Data Factory has been **certified** for: -- [HIPAA/HITECH](/compliance/regulatory/offering-hipaa-hitech) -- [ISO/IEC 27001](https://www.microsoft.com/en-us/trustcenter/Compliance/ISO-IEC-27001) -- [ISO/IEC 27018](https://www.microsoft.com/en-us/trustcenter/Compliance/ISO-IEC-27018) -- [CSA STAR](https://www.microsoft.com/en-us/trustcenter/Compliance/CSA-STAR-Certification)- -If you are interested in Azure compliance and how Azure secures its own infrastructure, visit the [Microsoft Trust Center](https://microsoft.com/en-us/trustcenter/default.aspx). --In this article, we review security considerations in the following two data movement scenarios: --- **Cloud scenario**- In this scenario, both your source and destination are publicly accessible through internet. These include managed cloud storage services like Azure Storage, Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce, and web protocols such as FTP and OData. 
You can find a complete list of supported data sources [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats).-- **Hybrid scenario**- In this scenario, either your source or destination is behind a firewall or inside an on-premises corporate network or the data store is in a private network/ virtual network (most often the source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this scenario.---## Cloud scenarios -### Securing data store credentials -Azure Data Factory protects your data store credentials by **encrypting** them by using **certificates managed by Microsoft**. These certificates are rotated every **two years** (which includes renewal of certificate and migration of credentials). These encrypted credentials are securely stored in an **Azure Storage managed by Azure Data Factory management services**. For more information about Azure Storage security, refer [Azure Storage Security Overview](../../storage/blobs/security-recommendations.md). --### Data encryption in transit -If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data Factory and a cloud data store are via secure channel HTTPS or TLS. --> [!NOTE] -> All connections to **Azure SQL Database** and **Azure Synapse Analytics** always require encryption (SSL/TLS) while data is in transit to and from the database. While authoring a pipeline using a JSON editor, add the **encryption** property and set it to **true** in the **connection string**. When you use the [Copy Wizard](data-factory-azure-copy-wizard.md), the wizard sets this property by default. For **Azure Storage**, you can use **HTTPS** in the connection string. --### Data encryption at rest -Some data stores support encryption of data at rest. We suggest that you enable data encryption mechanism for those data stores. --#### Azure Synapse Analytics -Transparent Data Encryption (TDE) in Azure Synapse Analytics helps with protecting against the threat of malicious activity by performing real-time encryption and decryption of your data at rest. This behavior is transparent to the client. For more information, see [Secure a database in Azure Synapse Analytics](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-manage-security.md). --#### Azure SQL Database -Azure SQL Database also supports transparent data encryption (TDE), which helps with protecting against the threat of malicious activity by performing real-time encryption and decryption of the data without requiring changes to the application. This behavior is transparent to the client. For more information, see [Transparent Data Encryption with Azure SQL Database](/sql/relational-databases/security/encryption/transparent-data-encryption-with-azure-sql-database). --#### Azure Data Lake Store -Azure Data Lake store also provides encryption for data stored in the account. When enabled, Data Lake store automatically encrypts data before persisting and decrypts before retrieval, making it transparent to the client accessing the data. For more information, see [Security in Azure Data Lake Store](../../data-lake-store/data-lake-store-security-overview.md). --#### Azure Blob Storage and Azure Table Storage -Azure Blob Storage and Azure Table storage supports Storage Service Encryption (SSE), which automatically encrypts your data before persisting to storage and decrypts before retrieval. 
For more information, see [Azure Storage Service Encryption for Data at Rest](../../storage/common/storage-service-encryption.md). --#### Amazon S3 -Amazon S3 supports both client and server encryption of data at Rest. For more information, see [Protecting Data Using Encryption](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingEncryption.html). Currently, Data Factory does not support Amazon S3 inside a virtual private cloud (VPC). --#### Amazon Redshift -Amazon Redshift supports cluster encryption for data at rest. For more information, see [Amazon Redshift Database Encryption](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html). Currently, Data Factory does not support Amazon Redshift inside a VPC. --#### Salesforce -Salesforce supports Shield Platform Encryption that allows encryption of all files, attachments, custom fields. For more information, see [Understanding the Web Server OAuth Authentication Flow](https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/intro_understanding_web_server_oauth_flow.htm). --## Hybrid Scenarios (using Data Management Gateway) -Hybrid scenarios require Data Management Gateway to be installed in an on-premises network or inside a virtual network (Azure) or a virtual private cloud (Amazon). The gateway must be able to access the local data stores. For more information about the gateway, see [Data Management Gateway](data-factory-data-management-gateway.md). ---The **command channel** allows communication between data movement services in Data Factory and Data Management Gateway. The communication contains information related to the activity. The data channel is used for transferring data between on-premises data stores and cloud data stores. --### On-premises data store credentials -The credentials for your on-premises data stores are stored locally (not in the cloud). They can be set in three different ways. --- Using **plain-text** (less secure) via HTTPS from Azure Portal/ Copy Wizard. The credentials are passed in plain-text to the on-premises gateway.-- Using **JavaScript Cryptography library from Copy Wizard**.-- Using **click-once based credentials manager app**. The click-once application executes on the on-premises machine that has access to the gateway and sets credentials for the data store. This option and the next one are the most secure options. The credential manager app, by default, uses the port 8050 on the machine with gateway for secure communication. -- Use [New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/New-azDataFactoryEncryptValue) PowerShell cmdlet to encrypt credentials. The cmdlet uses the certificate that gateway is configured to use to encrypt the credentials. You can use the encrypted credentials returned by this cmdlet and add it to **EncryptedCredential** element of the **connectionString** in the JSON file that you use with the [New-AzDataFactoryLinkedService](/powershell/module/az.datafactory/new-azdatafactorylinkedservice) cmdlet or in the JSON snippet in the Data Factory Editor in the portal. This option and the click-once application are the most secure options. --#### JavaScript cryptography library-based encryption -You can encrypt data store credentials using [JavaScript Cryptography library](https://www.microsoft.com/download/details.aspx?id=52439) from the [Copy Wizard](data-factory-copy-wizard.md). When you select this option, the Copy Wizard retrieves the public key of gateway and uses it to encrypt the data store credentials. 
The credentials are decrypted by the gateway machine and protected by Windows [DPAPI](/previous-versions/ms995355(v=msdn.10)). --**Supported browsers:** IE8, IE9, IE10, IE11, Microsoft Edge, and latest Firefox, Chrome, Opera, Safari browsers. --#### Click-once credentials manager app -You can launch the click-once based credential manager app from Azure portal/Copy Wizard when authoring pipelines. This application ensures that credentials are not transferred in plain text over the wire. By default, it uses the port **8050** on the machine with gateway for secure communication. If necessary, this port can be changed. - --Currently, Data Management Gateway uses a single **certificate**. This certificate is created during the gateway installation (applies to Data Management Gateway created after November 2016 and version 2.4.xxxx.x or later). You can replace this certificate with your own SSL/TLS certificate. This certificate is used by the click-once credential manager application to securely connect to the gateway machine for setting data store credentials. It stores data store credentials securely on-premises by using the Windows [DPAPI](/previous-versions/ms995355(v=msdn.10)) on the machine with gateway. --> [!NOTE] -> Older gateways that were installed before November 2016 or of version 2.3.xxxx.x continue to use credentials encrypted and stored on cloud. Even if you upgrade the gateway to the latest version, the credentials are not migrated to an on-premises machine - -| Gateway version (during creation) | Credentials Stored | Credential encryption/ security | -| | | | -| < = 2.3.xxxx.x | On cloud | Encrypted using certificate (different from the one used by Credential manager app) | -| > = 2.4.xxxx.x | On premises | Secured via DPAPI | - --### Encryption in transit -All data transfers are via secure channel **HTTPS** and **TLS over TCP** to prevent man-in-the-middle attacks during communication with Azure services. - -You can also use [IPSec VPN](../../vpn-gateway/vpn-gateway-about-vpn-devices.md) or [Express Route](../../expressroute/expressroute-introduction.md) to further secure the communication channel between your on-premises network and Azure. --Virtual network is a logical representation of your network in the cloud. You can connect an on-premises network to your Azure virtual network (VNet) by setting up IPSec VPN (site-to-site) or Express Route (Private Peering) --The following table summarizes the network and gateway configuration recommendations based on different combinations of source and destination locations for hybrid data movement. 
--| Source | Destination | Network configuration | Gateway setup | -| | -- | | - | -| On-premises | Virtual machines and cloud services deployed in virtual networks | IPSec VPN (point-to-site or site-to-site) | Gateway can be installed either on-premises or on an Azure virtual machine (VM) in VNet | -| On-premises | Virtual machines and cloud services deployed in virtual networks | ExpressRoute (Private Peering) | Gateway can be installed either on-premises or on an Azure VM in VNet | -| On-premises | Azure-based services that have a public endpoint | ExpressRoute (Public Peering) | Gateway must be installed on-premises | --The following images show the usage of Data Management Gateway for moving data between an on-premises database and Azure services using Express route and IPSec VPN (with Virtual Network): --**Express Route:** - --**IPSec VPN:** ---### Firewall configurations and filtering IP address of gateway --#### Firewall requirements for on-premises/private network -In an enterprise, a **corporate firewall** runs on the central router of the organization. And, **Windows firewall** runs as a daemon on the local machine on which the gateway is installed. --The following table provides **outbound port** and domain requirements for the **corporate firewall**. --| Domain names | Outbound ports | Description | -| | -- | -- | -| `*.servicebus.windows.net` | 443, 80 | Required by the gateway to connect to data movement services in Data Factory | -| `*.core.windows.net` | 443 | Used by the gateway to connect to Azure Storage Account when you use the [staged copy](data-factory-copy-activity-performance.md#staged-copy) feature. | -| `*.frontend.clouddatahub.net` | 443 | Required by the gateway to connect to the Azure Data Factory service. | -| `*.database.windows.net` | 1433 | (OPTIONAL) needed when your destination is Azure SQL Database/ Azure Synapse Analytics. Use the staged copy feature to copy data to Azure SQL Database/Azure Synapse Analytics without opening the port 1433. | -| `*.azuredatalakestore.net` | 443 | (OPTIONAL) needed when your destination is Azure Data Lake store | --> [!NOTE] -> You may have to manage ports/filtering domains at the corporate firewall level as required by respective data sources. This table only uses Azure SQL Database, Azure Synapse Analytics, Azure Data Lake Store as examples. --The following table provides **inbound port** requirements for the **windows firewall**. --| Inbound ports | Description | -| - | -- | -| 8050 (TCP) | Required by the credential manager application to securely set credentials for on-premises data stores on the gateway. | ---#### IP configurations/filtering in data store -Some data stores in the cloud also require approving of IP address of the machine accessing them. Ensure that the IP address of the gateway machine is approved/configured in firewall appropriately. --The following cloud data stores require approving of IP address of the gateway machine. Some of these data stores, by default, may not require approving of the IP address. 
--- [Azure SQL Database](/azure/azure-sql/database/firewall-configure) -- [Azure Synapse Analytics](../../synapse-analytics/sql-data-warehouse/create-data-warehouse-portal.md)-- [Azure Data Lake Store](../../data-lake-store/data-lake-store-secure-data.md#set-ip-address-range-for-data-access)-- [Azure Cosmos DB](../../cosmos-db/how-to-configure-firewall.md)-- [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-authorize-cluster-access.html) --## Frequently asked questions --**Question:** Can the Gateway be shared across different data factories? -**Answer:** We do not support this feature yet. We are actively working on it. --**Question:** What are the port requirements for the gateway to work? -**Answer:** Gateway makes HTTP-based connections to open internet. The **outbound ports 443 and 80** must be opened for gateway to make this connection. Open **inbound port 8050** only at the machine level (not at corporate firewall level) for Credential Manager application. If Azure SQL Database or Azure Synapse Analytics is used as source or destination, then you need to open **port 1433** as well. For more information, see [Firewall configurations and filtering IP addresses](#firewall-configurations-and-filtering-ip-address-of-gateway) section. --**Question:** What are certificate requirements for Gateway? -**Answer:** Current gateway requires a certificate that is used by the credential manager application for securely setting data store credentials. This certificate is a self-signed certificate created and configured by the gateway setup. You can use your own TLS/SSL certificate instead. For more information, see [click-once credential manager application](#click-once-credentials-manager-app) section. --## Next steps -For information about performance of copy activity, see [Copy activity performance and tuning guide](data-factory-copy-activity-performance.md). |
data-factory | Data Factory Data Processing Using Batch | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-processing-using-batch.md | - Title: Process large-scale datasets by using Data Factory and Batch -description: Describes how to process huge amounts of data in an Azure Data Factory pipeline by using the parallel processing capability of Azure Batch. ------ Previously updated : 04/12/2023----# Process large-scale datasets by using Data Factory and Batch -> [!NOTE] -> This article applies to version 1 of Azure Data Factory, which is generally available. If you use the current version of the Data Factory service, see [Custom activities in Data Factory](../transform-data-using-dotnet-custom-activity.md). --This article describes an architecture of a sample solution that moves and processes large-scale datasets in an automatic and scheduled manner. It also provides an end-to-end walkthrough to implement the solution by using Data Factory and Azure Batch. ---This article is longer than a typical article because it contains a walkthrough of an entire sample solution. If you're new to Batch and Data Factory, you can learn about these services and how they work together. If you know something about the services and are designing/architecting a solution, you can focus on the architecture section of the article. If you're developing a prototype or a solution, you might want to try out the step-by-step instructions in the walkthrough. We invite your comments about this content and how you use it. --First, let's look at how Data Factory and Batch services can help you process large datasets in the cloud. ---## Why Azure Batch? - You can use Batch to run large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. It's a platform service that schedules compute-intensive work to run on a managed collection of virtual machines (VMs). It can automatically scale compute resources to meet the needs of your jobs. --With the Batch service, you define Azure compute resources to execute your applications in parallel, and at scale. You can run on-demand or scheduled jobs. You don't need to manually create, configure, and manage an HPC cluster, individual VMs, virtual networks, or a complex job and task-scheduling infrastructure. -- If you aren't familiar with Batch, the following articles help you understand the architecture/implementation of the solution described in this article: --* [Basics of Batch](/azure/azure-sql/database/sql-database-paas-overview) -* [Batch feature overview](../../batch/batch-service-workflow-features.md) --Optionally, to learn more about Batch, see [the Batch documentation](../../batch/index.yml). --## Why Azure Data Factory? -Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. You can use Data Factory to create managed data pipelines that move data from on-premises and cloud data stores to a centralized data store. An example is Azure Blob storage. You can use Data Factory to process/transform data by using services such as Azure HDInsight and Azure Machine Learning. You also can schedule data pipelines to run in a scheduled manner (for example, hourly, daily, and weekly). You can monitor and manage the pipelines at a glance to identify issues and take action. 
-- If you aren't familiar with Data Factory, the following articles help you understand the architecture/implementation of the solution described in this article: --* [Introduction to Data Factory](data-factory-introduction.md) -* [Build your first data pipeline](data-factory-build-your-first-pipeline.md) --Optionally, to learn more about Data Factory, see [the Data Factory documentation](/rest/api/datafactory/v1/data-factory-data-factory). --## Data Factory and Batch together -Data Factory includes built-in activities. For example, the Copy activity is used to copy/move data from a source data store to a destination data store. The Hive activity is used to process data by using Hadoop clusters (HDInsight) on Azure. For a list of supported transformation activities, see [Data transformation activities](data-factory-data-transformation-activities.md). --You also can create custom .NET activities to move or process data with your own logic. You can run these activities on an HDInsight cluster or on a Batch pool of VMs. When you use Batch, you can configure the pool to autoscale (add or remove VMs based on the workload) based on a formula you provide. --## Architecture of a sample solution - The architecture described in this article is for a simple solution. It's also relevant to complex scenarios, such as risk modeling by financial services, image processing and rendering, and genomic analysis. --The diagram illustrates how Data Factory orchestrates data movement and processing. It also shows how Batch processes the data in a parallel manner. Download and print the diagram for easy reference (11 x 17 inches or A3 size). To access the diagram so that you can print it, see [HPC and data orchestration by using Batch and Data Factory](https://go.microsoft.com/fwlink/?LinkId=717686). --[:::image type="content" alt-text="Large-scale data processing diagram." source="./media/data-factory-data-processing-using-batch/large-scale-data-processing-diagram.png"::: --The following list provides the basic steps of the process. The solution includes code and explanations to build the end-to-end solution. --* **Configure Batch with a pool of compute nodes (VMs).** You can specify the number of nodes and the size of each node. --* **Create a Data Factory instance** that is configured with entities that represent blob storage, the Batch compute service, input/output data, and a workflow/pipeline with activities that move and transform data. --* **Create a custom .NET activity in the Data Factory pipeline.** The activity is your user code that runs on the Batch pool. --* **Store large amounts of input data as blobs in Azure Storage.** Data is divided into logical slices (usually by time). --* **Data Factory copies data that is processed in parallel** to the secondary location. --* **Data Factory runs the custom activity by using the pool allocated by Batch.** Data Factory can run activities concurrently. Each activity processes a slice of data. The results are stored in storage. --* **Data Factory moves the final results to a third location,** either for distribution via an app or for further processing by other tools. --## Implementation of the sample solution -The sample solution is intentionally simple. It's designed to show you how to use Data Factory and Batch together to process datasets. The solution counts the number of occurrences of the search term "Microsoft" in input files that are organized in a time series. It then outputs the count to output files. 
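The core transformation is simple enough to prototype locally before you build the full solution. The following is a minimal PowerShell sketch of the same counting logic; the file path is a placeholder, and the real solution implements this logic in the .NET custom activity shown later in this walkthrough.

```powershell
# Count occurrences of a search term in a local input file,
# mimicking what the custom activity does for each data slice.
$searchTerm = 'Microsoft'
$inputFile  = 'C:\ADF\file.txt'   # placeholder path

$text  = Get-Content -Path $inputFile -Raw
$count = ([regex]::Matches($text, [regex]::Escape($searchTerm))).Count

Write-Output ("{0} occurrences of '{1}' found in {2}" -f $count, $searchTerm, $inputFile)
```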
--**Time:** If you're familiar with the basics of Azure, Data Factory, and Batch and have completed the following prerequisites, this solution takes one to two hours to complete. --### Prerequisites -#### Azure subscription -If you don't have an Azure subscription, you can create a free trial account quickly. For more information, see [Free trial](https://azure.microsoft.com/pricing/free-trial/). --#### Azure storage account -You use a storage account to store the data in this tutorial. If you don't have a storage account, see [Create a storage account](../../storage/common/storage-account-create.md). The sample solution uses blob storage. --#### Azure Batch account -Create a Batch account by using the [Azure portal](https://portal.azure.com/). For more information, see [Create and manage a Batch account](../../batch/batch-account-create-portal.md). Note the Batch account name and account key. You also can use the [New-AzBatchAccount](/powershell/module/az.batch/new-azbatchaccount) cmdlet to create a Batch account. For instructions on how to use this cmdlet, see [Get started with Batch PowerShell cmdlets](../../batch/batch-powershell-cmdlets-get-started.md). --The sample solution uses Batch (indirectly via a data factory pipeline) to process data in a parallel manner on a pool of compute nodes (a managed collection of VMs). --#### Azure Batch pool of virtual machines -Create a Batch pool with at least two compute nodes. --1. In the [Azure portal](https://portal.azure.com), select **Browse** in the left menu, and select **Batch Accounts**. --1. Select your Batch account to open the **Batch Account** blade. --1. Select the **Pools** tile. --1. On the **Pools** blade, select the **Add** button on the toolbar to add a pool. -- a. Enter an ID for the pool (**Pool ID**). Note the ID of the pool. You need it when you create the data factory solution. -- b. Specify **Windows Server 2012 R2** for the **Operating System Family** setting. -- c. Select a **node pricing tier**. -- d. Enter **2** as the value for the **Target Dedicated** setting. -- e. Enter **2** as the value for the **Max tasks per node** setting. -- f. Select **OK** to create the pool. --#### Azure Storage Explorer -You use [Azure Storage Explorer 6](https://azure.microsoft.com/features/storage-explorer/) or [CloudXplorer](https://clumsyleaf.com/products/cloudxplorer) (from ClumsyLeaf Software) to inspect and alter the data in your Storage projects. You also can inspect and alter the data in the logs of your cloud-hosted applications. --1. Create a container named **mycontainer** with private access (no anonymous access). --1. If you use CloudXplorer, create folders and subfolders with the following structure: -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image3.png" alt-text="Folder and subfolder structure"::: -- `Inputfolder` and `outputfolder` are top-level folders in `mycontainer`. The `inputfolder` folder has subfolders with date-time stamps (YYYY-MM-DD-HH). -- If you use Storage Explorer, in the next step, you upload files with the following names: `inputfolder/2015-11-16-00/file.txt`, `inputfolder/2015-11-16-01/file.txt`, and so on. This step automatically creates the folders. --1. Create a text file **file.txt** on your machine with content that has the keyword **Microsoft**. An example is "test custom activity Microsoft test custom activity Microsoft." --1. 
Upload the file to the following input folders in blob storage: -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image4.png" alt-text="Input folders"::: -- If you use Storage Explorer, upload the **file.txt** file to **mycontainer**. Select **Copy** on the toolbar to create a copy of the blob. In the **Copy Blob** dialog box, change the **destination blob name** to `inputfolder/2015-11-16-00/file.txt`. Repeat this step to create `inputfolder/2015-11-16-01/file.txt`, `inputfolder/2015-11-16-02/file.txt`, `inputfolder/2015-11-16-03/file.txt`, `inputfolder/2015-11-16-04/file.txt`, and so on. This action automatically creates the folders. --1. Create another container named `customactivitycontainer`. Upload the custom activity zip file to this container. --#### Visual Studio -Install Visual Studio 2012 or later to create the custom Batch activity to be used in the data factory solution. --### High-level steps to create the solution -1. Create a custom activity that contains the data processing logic. --1. Create a data factory that uses the custom activity. --### Create the custom activity -The data factory custom activity is the heart of this sample solution. The sample solution uses Batch to run the custom activity. For information about how to develop custom activities and use them in data factory pipelines, see [Use custom activities in a data factory pipeline](data-factory-use-custom-activities.md). --To create a .NET custom activity that you can use in a data factory pipeline, you create a .NET class library project with a class that implements the IDotNetActivity interface. This interface has only one method: Execute. Here is the signature of the method: --```csharp -public IDictionary<string, string> Execute( - IEnumerable<LinkedService> linkedServices, - IEnumerable<Dataset> datasets, - Activity activity, - IActivityLogger logger) -``` --The method has a few key components that you need to understand: --* The method takes four parameters: -- * **linkedServices**. This parameter is an enumerable list of linked services that link input/output data sources (for example, blob storage) to the data factory. In this sample, there is only one linked service of the type Azure Storage used for both input and output. - * **datasets**. This parameter is an enumerable list of datasets. You can use this parameter to get the locations and schemas defined by input and output datasets. - * **activity**. This parameter represents the current compute entity. In this case, it's a Batch service. - * **logger**. You can use the logger to write debug comments that surface as the "User" log for the pipeline. -* The method returns a dictionary that can be used to chain custom activities together in the future. This feature isn't implemented yet, so just return an empty dictionary from the method. --#### Procedure: Create the custom activity -1. Create a .NET class library project in Visual Studio. -- a. Start Visual Studio 2012/2013/2015. -- b. Select **File** > **New** > **Project**. -- c. Expand **Templates**, and select **Visual C\#**. In this walkthrough, you use C\#, but you can use any .NET language to develop the custom activity. -- d. Select **Class Library** from the list of project types on the right. -- e. Enter **MyDotNetActivity** for the **Name**. -- f. Select **C:\\ADF** for the **Location**. Create the folder **ADF** if it doesn't exist. -- g. Select **OK** to create the project. --1. Select **Tools** > **NuGet Package Manager** > **Package Manager Console**. 
--1. In the Package Manager Console, execute the following command to import Microsoft.Azure.Management.DataFactories: -- ```powershell - Install-Package Microsoft.Azure.Management.DataFactories - ``` -1. Import the **Azure Storage** NuGet package into the project. You need this package because you use the Blob Storage API in this sample: -- ```powershell - Install-Package WindowsAzure.Storage - ``` -1. Add the following using directives to the source file in the project: -- ```csharp - using System.IO; - using System.Globalization; - using System.Diagnostics; - using System.Linq; -- using Microsoft.Azure.Management.DataFactories.Models; - using Microsoft.Azure.Management.DataFactories.Runtime; -- using Microsoft.WindowsAzure.Storage; - using Microsoft.WindowsAzure.Storage.Blob; - ``` -1. Change the name of the namespace to **MyDotNetActivityNS**. -- ```csharp - namespace MyDotNetActivityNS - ``` -1. Change the name of the class to **MyDotNetActivity**, and derive it from the **IDotNetActivity** interface as shown: -- ```csharp - public class MyDotNetActivity : IDotNetActivity - ``` -1. Implement (add) the **Execute** method of the **IDotNetActivity** interface to the **MyDotNetActivity** class. Copy the following sample code to the method. For an explanation of the logic used in this method, see the [Execute method](#execute-method) section. -- ```csharp - /// <summary> - /// The Execute method is the only method of IDotNetActivity interface you must implement. - /// In this sample, the method invokes the Calculate method to perform the core logic. - /// </summary> - public IDictionary<string, string> Execute( - IEnumerable<LinkedService> linkedServices, - IEnumerable<Dataset> datasets, - Activity activity, - IActivityLogger logger) - { -- // Declare types for the input and output data stores. - AzureStorageLinkedService inputLinkedService; -- Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name); -- foreach (LinkedService ls in linkedServices) - logger.Write("linkedService.Name {0}", ls.Name); -- // Use the First method instead of Single because we are using the same - // Azure Storage linked service for input and output. - inputLinkedService = linkedServices.First( - linkedService => - linkedService.Name == - inputDataset.Properties.LinkedServiceName).Properties.TypeProperties - as AzureStorageLinkedService; -- string connectionString = inputLinkedService.ConnectionString; // To create an input storage client. - string folderPath = GetFolderPath(inputDataset); - string output = string.Empty; // for use later. -- // Create the storage client for input. Pass the connection string. - CloudStorageAccount inputStorageAccount = CloudStorageAccount.Parse(connectionString); - CloudBlobClient inputClient = inputStorageAccount.CreateCloudBlobClient(); -- // Initialize the continuation token before using it in the do-while loop. - BlobContinuationToken continuationToken = null; - do - { // get the list of input blobs from the input storage client object. - BlobResultSegment blobList = inputClient.ListBlobsSegmented(folderPath, - true, - BlobListingDetails.Metadata, - null, - continuationToken, - null, - null); -- // The Calculate method returns the number of occurrences of - // the search term "Microsoft" in each blob associated - // with the data slice. - // - // The definition of the method is shown in the next step.
- output = Calculate(blobList, logger, folderPath, ref continuationToken, "Microsoft"); -- } while (continuationToken != null); -- // Get the output dataset by using the name of the dataset matched to a name in the Activity output collection. - Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name); -- folderPath = GetFolderPath(outputDataset); -- logger.Write("Writing blob to the folder: {0}", folderPath); -- // Create a storage object for the output blob. - CloudStorageAccount outputStorageAccount = CloudStorageAccount.Parse(connectionString); - // Write the name of the file. - Uri outputBlobUri = new Uri(outputStorageAccount.BlobEndpoint, folderPath + "/" + GetFileName(outputDataset)); -- logger.Write("output blob URI: {0}", outputBlobUri.ToString()); - // Create a blob and upload the output text. - CloudBlockBlob outputBlob = new CloudBlockBlob(outputBlobUri, outputStorageAccount.Credentials); - logger.Write("Writing {0} to the output blob", output); - outputBlob.UploadText(output); -- // The dictionary can be used to chain custom activities together in the future. - // This feature is not implemented yet, so just return an empty dictionary. - return new Dictionary<string, string>(); - } - ``` -1. Add the following helper methods to the class. These methods are invoked by the **Execute** method. Most important, the **Calculate** method isolates the code that iterates through each blob. -- ```csharp - /// <summary> - /// Gets the folderPath value from the input/output dataset. - /// </summary> - private static string GetFolderPath(Dataset dataArtifact) - { - if (dataArtifact == null || dataArtifact.Properties == null) - { - return null; - } -- AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; - if (blobDataset == null) - { - return null; - } -- return blobDataset.FolderPath; - } -- /// <summary> - /// Gets the fileName value from the input/output dataset. - /// </summary> -- private static string GetFileName(Dataset dataArtifact) - { - if (dataArtifact == null || dataArtifact.Properties == null) - { - return null; - } -- AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; - if (blobDataset == null) - { - return null; - } -- return blobDataset.FileName; - } -- /// <summary> - /// Iterates through each blob (file) in the folder, counts the number of instances of the search term in the file, - /// and prepares the output text that is written to the output blob. 
- /// </summary> -- public static string Calculate(BlobResultSegment Bresult, IActivityLogger logger, string folderPath, ref BlobContinuationToken token, string searchTerm) - { - string output = string.Empty; - logger.Write("number of blobs found: {0}", Bresult.Results.Count<IListBlobItem>()); - foreach (IListBlobItem listBlobItem in Bresult.Results) - { - CloudBlockBlob inputBlob = listBlobItem as CloudBlockBlob; - if ((inputBlob != null) && (inputBlob.Name.IndexOf("$$$.$$$") == -1)) - { - string blobText = inputBlob.DownloadText(Encoding.ASCII, null, null, null); - logger.Write("input blob text: {0}", blobText); - string[] source = blobText.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' }, StringSplitOptions.RemoveEmptyEntries); - var matchQuery = from word in source - where word.ToLowerInvariant() == searchTerm.ToLowerInvariant() - select word; - int wordCount = matchQuery.Count(); - output += string.Format("{0} occurrences(s) of the search term \"{1}\" were found in the file {2}.\r\n", wordCount, searchTerm, inputBlob.Name); - } - } - return output; - } - ``` - The GetFolderPath method returns the path to the folder that the dataset points to and the GetFileName method returns the name of the blob/file that the dataset points to. -- ```csharp -- "name": "InputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "fileName": "file.txt", - "folderPath": "mycontainer/inputfolder/{Year}-{Month}-{Day}-{Hour}", - ``` -- The Calculate method calculates the number of instances of the keyword "Microsoft" in the input files (blobs in the folder). The search term "Microsoft" is hard-coded in the code. --1. Compile the project. Select **Build** from the menu, and then select **Build Solution**. --1. Start Windows Explorer, and go to the **bin\\debug** or **bin\\release** folder. The folder choice depends on the type of build. --1. Create a zip file **MyDotNetActivity.zip** that contains all the binaries in the **\\bin\\Debug** folder. You might want to include the MyDotNetActivity.**pdb** file so that you get additional details such as the line number in the source code that caused the issue when a failure occurs. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image5.png" alt-text="The bin\Debug folder list"::: --1. Upload **MyDotNetActivity.zip** as a blob to the blob container `customactivitycontainer` in the blob storage that the StorageLinkedService linked service in ADFTutorialDataFactory uses. Create the blob container `customactivitycontainer` if it doesn't already exist. --#### Execute method -This section provides more details about the code in the Execute method. --1. The members for iterating through the input collection are found in the [Microsoft.WindowsAzure.Storage.Blob](/jav). A basic loop is shown here: -- ```csharp - // Initialize the continuation token. - BlobContinuationToken continuationToken = null; - do - { - // Get the list of input blobs from the input storage client object. - BlobResultSegment blobList = inputClient.ListBlobsSegmented(folderPath, - true, - BlobListingDetails.Metadata, - null, - continuationToken, - null, - null); - // Return a string derived from parsing each blob. 
-- output = Calculate(blobList, logger, folderPath, ref continuationToken, "Microsoft"); -- } while (continuationToken != null); - ``` -- For more information, see the documentation for the [ListBlobsSegmented](/java/api/com.microsoft.azure.storage.blob.cloudblobcontainer.listblobssegmented) method. --1. The code for working through the set of blobs logically goes within the do-while loop. In the **Execute** method, the do-while loop passes the list of blobs to a method named **Calculate**. The method returns a string variable named **output** that is the result of having iterated through all the blobs in the segment. -- It returns the number of occurrences of the search term "Microsoft" in the blob passed to the **Calculate** method. -- ```csharp - output += string.Format("{0} occurrences of the search term \"{1}\" were found in the file {2}.\r\n", wordCount, searchTerm, inputBlob.Name); - ``` -1. After the **Calculate** method is finished, it must be written to a new blob. For every set of blobs processed, a new blob can be written with the results. To write to a new blob, first find the output dataset. -- ```csharp - // Get the output dataset by using the name of the dataset matched to a name in the Activity output collection. - Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name); - ``` -1. The code also calls the helper method **GetFolderPath** to retrieve the folder path (the storage container name). -- ```csharp - folderPath = GetFolderPath(outputDataset); - ``` -- The GetFolderPath method casts the DataSet object to an AzureBlobDataSet, which has a property named FolderPath. -- ```csharp - AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; -- return blobDataset.FolderPath; - ``` -1. The code calls the **GetFileName** method to retrieve the file name (blob name). The code is similar to the previous code that was used to get the folder path. -- ```csharp - AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; -- return blobDataset.FileName; - ``` -1. The name of the file is written by creating a URI object. The URI constructor uses the **BlobEndpoint** property to return the container name. The folder path and file name are added to construct the output blob URI. -- ```csharp - // Write the name of the file. - Uri outputBlobUri = new Uri(outputStorageAccount.BlobEndpoint, folderPath + "/" + GetFileName(outputDataset)); - ``` -1. After the name of the file is written, you can write the output string from the **Calculate** method to a new blob: -- ```csharp - // Create a blob and upload the output text. - CloudBlockBlob outputBlob = new CloudBlockBlob(outputBlobUri, outputStorageAccount.Credentials); - logger.Write("Writing {0} to the output blob", output); - outputBlob.UploadText(output); - ``` --### Create the data factory -In the [Create the custom activity](#create-the-custom-activity) section, you created a custom activity and uploaded the zip file with binaries and the PDB file to a blob container. In this section, you create a data factory with a pipeline that uses the custom activity. --The input dataset for the custom activity represents the blobs (files) in the input folder (`mycontainer\\inputfolder`) in blob storage. The output dataset for the activity represents the output blobs in the output folder (`mycontainer\\outputfolder`) in blob storage. 
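
The next few paragraphs describe staging the test data by copying blobs by hand. If you prefer to stage the hourly input blobs from code instead, a minimal console sketch like the following also works. This helper is not part of the sample; it assumes the connection string of the storage account that the Storage linked service points to, and it reuses the same WindowsAzure.Storage client library that the custom activity uses:

```csharp
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class StageInputBlobs
{
    static void Main()
    {
        // Assumption: replace with the connection string of the storage account
        // that your Storage linked service points to.
        string connectionString = "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>";

        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("mycontainer");
        container.CreateIfNotExists();

        // The same test content used throughout the walkthrough: two occurrences of "Microsoft".
        string content = "test custom activity Microsoft test custom activity Microsoft";

        // One hourly input folder per slice: inputfolder/2015-11-16-00 through 2015-11-16-04.
        for (int hour = 0; hour <= 4; hour++)
        {
            string blobName = string.Format("inputfolder/2015-11-16-{0:D2}/file.txt", hour);
            CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
            blob.UploadText(content);
        }
    }
}
```

Either way, the goal is the folder layout shown next.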
--Drop one or more files into the input folders: --``` -mycontainer -\> inputfolder - 2015-11-16-00 - 2015-11-16-01 - 2015-11-16-02 - 2015-11-16-03 - 2015-11-16-04 -``` --For example, drop one file (file.txt) with the following content into each of the folders: --``` -test custom activity Microsoft test custom activity Microsoft -``` --Each input folder corresponds to a slice in the data factory even if the folder has two or more files. When each slice is processed by the pipeline, the custom activity iterates through all the blobs in the input folder for that slice. --You see five output files with the same content. For example, the output file from processing the file in the 2015-11-16-00 folder has the following content: --``` -2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file.txt. -``` --If you drop multiple files (file.txt, file2.txt, file3.txt) with the same content into the input folder, you see the following content in the output file. Each folder (2015-11-16-00, etc.) corresponds to a slice in this sample even though the folder has multiple input files. --```csharp -2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file.txt. -2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file2.txt. -2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file3.txt. -``` --The output file has three lines now, one for each input file (blob) in the folder associated with the slice (2015-11-16-00). --A task is created for each activity run. In this sample, there is only one activity in the pipeline. When a slice is processed by the pipeline, the custom activity runs on Batch to process the slice. Because there are five slices (each slice can have multiple blobs or file), five tasks are created in Batch. When a task runs on Batch, it's the custom activity that is running. --The following walkthrough provides additional details. --#### Step 1: Create the data factory -1. After you sign in to the [Azure portal](https://portal.azure.com/), take the following steps: -- a. Select **NEW** on the left menu. -- b. Select **Data + Analytics** on the **New** blade. -- c. Select **Data Factory** on the **Data analytics** blade. --1. On the **New data factory** blade, enter **CustomActivityFactory** for the name. The name of the data factory must be globally unique. If you receive the error "Data factory name CustomActivityFactory is not available," change the name of the data factory. For example, use yournameCustomActivityFactory, and create the data factory again. --1. Select **RESOURCE GROUP NAME**, and select an existing resource group or create a resource group. --1. Verify that the subscription and region where you want the data factory to be created are correct. --1. Select **Create** on the **New data factory** blade. --1. The data factory is created in the dashboard of the portal. --1. After the data factory is created successfully, you see the **Data factory** page, which shows you the contents of the data factory. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image6.png" alt-text="Data factory page"::: --#### Step 2: Create linked services -Linked services link data stores or compute services to a data factory. In this step, you link your storage account and Batch account to your data factory. --#### Create an Azure Storage linked service -1. 
Select the **Author and deploy** tile on the **Data factory** blade for **CustomActivityFactory**. The Data Factory Editor appears. --1. Select **New data store** on the command bar, and choose **Azure storage.** The JSON script you use to create a Storage linked service in the editor appears. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image7.png" alt-text="New data store"::: --1. Replace **account name** with the name of your storage account. Replace **account key** with the access key of the storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). --1. Select **Deploy** on the command bar to deploy the linked service. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image8.png" alt-text="Deploy"::: --#### Create an Azure Batch linked service -In this step, you create a linked service for your Batch account that is used to run the data factory custom activity. --1. Select **New compute** on the command bar, and choose **Azure Batch.** The JSON script you use to create a Batch linked service in the editor appears. --1. In the JSON script: -- a. Replace **account name** with the name of your Batch account. -- b. Replace **access key** with the access key of the Batch account. -- c. Enter the ID of the pool for the **poolName** property. For this property, you can specify either the pool name or the pool ID. -- d. Enter the batch URI for the **batchUri** JSON property. -- > [!IMPORTANT] - > The URL from the **Batch Account** blade is in the following format: `<accountname>.<region>.batch.azure.com`. For the `batchUri` property in the JSON script, you need to remove `<accountname>.` from the URL. An example is `"batchUri": "https://eastus.batch.azure.com"`. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image9.png" alt-text="Batch Account blade"::: -- For the **poolName** property, you also can specify the ID of the pool instead of the name of the pool. -- > [!NOTE] - > The Data Factory service doesn't support an on-demand option for Batch as it does for HDInsight. You can use only your own Batch pool in a data factory. - > - > -- e. Specify **StorageLinkedService** for the **linkedServiceName** property. You created this linked service in the previous step. This storage is used as a staging area for files and logs. --1. Select **Deploy** on the command bar to deploy the linked service. --#### Step 3: Create datasets -In this step, you create datasets to represent input and output data. --#### Create the input dataset -1. In the Data Factory Editor, select the **New dataset** button on the toolbar. Select **Azure Blob storage** from the drop-down list. --1. 
Replace the JSON script in the right pane with the following JSON snippet: -- ```json - { - "name": "InputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/inputfolder/{Year}-{Month}-{Day}-{Hour}", - "format": { - "type": "TextFormat" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": {} - } - } - ``` -- You create a pipeline later in this walkthrough with the start time 2015-11-16T00:00:00Z and the end time 2015-11-16T05:00:00Z. It's scheduled to produce data hourly, so there are five input/output slices (between **00**:00:00 -\> **05**:00:00). -- The **frequency** and **interval** for the input dataset are set to **Hour** and **1**, which means that the input slice is available hourly. -- The start time for each slice is represented by the **SliceStart** system variable in the previous JSON snippet. Here are the start times for each slice. -- | **Slice** | **Start time** | - |--|-| - | 1 | 2015-11-16T**00**:00:00 | - | 2 | 2015-11-16T**01**:00:00 | - | 3 | 2015-11-16T**02**:00:00 | - | 4 | 2015-11-16T**03**:00:00 | - | 5 | 2015-11-16T**04**:00:00 | -- The **folderPath** is calculated by using the year, month, day, and hour part of the slice start time (**SliceStart**). Here is how an input folder is mapped to a slice. -- | **Slice** | **Start time** | **Input folder** | - |--|-|-| - | 1 | 2015-11-16T**00**:00:00 | 2015-11-16-**00** | - | 2 | 2015-11-16T**01**:00:00 | 2015-11-16-**01** | - | 3 | 2015-11-16T**02**:00:00 | 2015-11-16-**02** | - | 4 | 2015-11-16T**03**:00:00 | 2015-11-16-**03** | - | 5 | 2015-11-16T**04**:00:00 | 2015-11-16-**04** | --1. Select **Deploy** on the toolbar to create and deploy the **InputDataset** table. --#### Create the output dataset -In this step, you create another dataset of the type AzureBlob to represent the output data. --1. In the Data Factory Editor, select the **New dataset** button on the toolbar. Select **Azure Blob storage** from the drop-down list. --1. Replace the JSON script in the right pane with the following JSON snippet: -- ```json - { - "name": "OutputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "{slice}.txt", - "folderPath": "mycontainer/outputfolder", - "partitionedBy": [ - { - "name": "slice", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy-MM-dd-HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- An output blob/file is generated for each input slice. Here is how an output file is named for each slice. All the output files are generated in one output folder, `mycontainer\\outputfolder`. 
-- | **Slice** | **Start time** | **Output file** | - |--|-|--| - | 1 | 2015-11-16T**00**:00:00 | 2015-11-16-**00.txt** | - | 2 | 2015-11-16T**01**:00:00 | 2015-11-16-**01.txt** | - | 3 | 2015-11-16T**02**:00:00 | 2015-11-16-**02.txt** | - | 4 | 2015-11-16T**03**:00:00 | 2015-11-16-**03.txt** | - | 5 | 2015-11-16T**04**:00:00 | 2015-11-16-**04.txt** | -- Remember that all the files in an input folder (for example, 2015-11-16-00) are part of a slice with the start time 2015-11-16-00. When this slice is processed, the custom activity scans through each file and produces a line in the output file with the number of occurrences of the search term "Microsoft." If there are three files in the folder 2015-11-16-00, there are three lines in the output file 2015-11-16-00.txt. --1. Select **Deploy** on the toolbar to create and deploy the **OutputDataset**. --#### Step 4: Create and run the pipeline with a custom activity -In this step, you create a pipeline with one activity, the custom activity you created previously. --> [!IMPORTANT] -> If you haven't uploaded **file.txt** to input folders in the blob container, do so before you create the pipeline. The **isPaused** property is set to false in the pipeline JSON, so the pipeline runs immediately because the **start** date is in the past. --1. In the Data Factory Editor, select **New pipeline** on the command bar. If you don't see the command, select the ellipsis symbol to display it. --1. Replace the JSON script in the right pane with the following JSON snippet: -- ```json - { - "name": "PipelineCustom", - "properties": { - "description": "Use custom activity", - "activities": [ - { - "type": "DotNetActivity", - "typeProperties": { - "assemblyName": "MyDotNetActivity.dll", - "entryPoint": "MyDotNetActivityNS.MyDotNetActivity", - "packageLinkedService": "AzureStorageLinkedService", - "packageFile": "customactivitycontainer/MyDotNetActivity.zip" - }, - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "policy": { - "timeout": "00:30:00", - "concurrency": 5, - "retry": 3 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MyDotNetActivity", - "linkedServiceName": "AzureBatchLinkedService" - } - ], - "start": "2015-11-16T00:00:00Z", - "end": "2015-11-16T05:00:00Z", - "isPaused": false - } - } - ``` -- Note the following points: -- * Only one activity is in the pipeline, and it's of the type **DotNetActivity**. - * **AssemblyName** is set to the name of the DLL **MyDotNetActivity.dll**. - * **EntryPoint** is set to **MyDotNetActivityNS.MyDotNetActivity**. It's basically \<namespace\>.\<classname\> in your code. - * **PackageLinkedService** is set to **StorageLinkedService**, which points to the blob storage that contains the custom activity zip file. If you use different storage accounts for input/output files and the custom activity zip file, you have to create another Storage linked service. This article assumes that you use the same storage account. - * **PackageFile** is set to **customactivitycontainer/MyDotNetActivity.zip**. It's in the format \<containerforthezip\>/\<nameofthezip.zip\>. - * The custom activity takes **InputDataset** as input and **OutputDataset** as output. - * The **linkedServiceName** property of the custom activity points to **AzureBatchLinkedService**, which tells Data Factory that the custom activity needs to run on Batch. - * The **concurrency** setting is important. 
If you use the default value, which is 1, even if you have two or more compute nodes in the Batch pool, the slices are processed one after another. Therefore, you aren't taking advantage of the parallel processing capability of Batch. If you set **concurrency** to a higher value, say 2, it means that two slices (corresponding to two tasks in Batch) can be processed at the same time. In this case, both the VMs in the Batch pool are utilized. Set the concurrency property appropriately. - * Only one task (slice) is executed on a VM at any point by default. By default, **Maximum tasks per VM** is set to 1 for a Batch pool. As part of the prerequisites, you created a pool with this property set to 2. Therefore, two data factory slices can run on a VM at the same time. - - The **isPaused** property is set to false by default. The pipeline runs immediately in this example because the slices start in the past. You can set this property to **true** to pause the pipeline and set it back to **false** to restart. - - The **start** and **end** times are five hours apart. Slices are produced hourly, so five slices are produced by the pipeline. --1. Select **Deploy** on the command bar to deploy the pipeline. --#### Step 5: Test the pipeline -In this step, you test the pipeline by dropping files into the input folders. Start by testing the pipeline with one file for each input folder. --1. On the **Data factory** blade in the Azure portal, select **Diagram**. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image10.png" alt-text="Diagram"::: --1. In the **Diagram** view, double-click the input dataset **InputDataset**. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image11.png" alt-text="InputDataset"::: --1. The **InputDataset** blade appears with all five slices ready. Notice the **SLICE START TIME** and **SLICE END TIME** for each slice. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image12.png" alt-text="Input slice start and end times"::: --1. In the **Diagram** view, select **OutputDataset**. --1. The five output slices appear in the **Ready** state if they were produced. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image13.png" alt-text="Output slice start and end times"::: --1. Use the portal to view the tasks associated with the slices and see what VM each slice ran on. For more information, see the [Data Factory and Batch integration](#data-factory-and-batch-integration) section. --1. The output files appear under `mycontainer` in `outputfolder` in your blob storage. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image15.png" alt-text="Output files in storage"::: -- Five output files are listed, one for each input slice. Each of the output files has content similar to the following output: -- ```output - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-00/file.txt. - ``` -- The following diagram illustrates how the data factory slices map to tasks in Batch. In this example, a slice has only one run. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image16.png" alt-text="Slice mapping diagram"::: --1. Now try with multiple files in a folder. Create the files **file2.txt**, **file3.txt**, **file4.txt**, and **file5.txt** with the same content as in file.txt in the folder **2015-11-16-01**. --1. 
In the output folder, delete the output file **2015-11-16-01.txt**. --1. On the **OutputDataset** blade, right-click the slice with **SLICE START TIME** set to **11/16/2015 01:00:00 AM**. Select **Run** to rerun/reprocess the slice. The slice now has five files instead of one file. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image17.png" alt-text="Run"::: --1. After the slice runs and its status is **Ready**, verify the content in the output file for this slice (**2015-11-16-01.txt**). The output file appears under `mycontainer` in `outputfolder` in your blob storage. There should be a line for each file of the slice. -- ``` - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-01/file.txt. - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-01/file2.txt. - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-01/file3.txt. - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-01/file4.txt. - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2015-11-16-01/file5.txt. - ``` --> [!NOTE] -> If you didn't delete the output file 2015-11-16-01.txt before you tried with five input files, you see one line from the previous slice run and five lines from the current slice run. By default, the content is appended to the output file if it already exists. -> -> --#### Data Factory and Batch integration -The Data Factory service creates a job in Batch with the name `adf-poolname:job-xxx`. ---A task in the job is created for each activity run of a slice. If 10 slices are ready to be processed, 10 tasks are created in the job. You can have more than one slice running in parallel if you have multiple compute nodes in the pool. If the maximum number of tasks per compute node is set to greater than one, more than one slice can run on the same compute. --In this example, there are five slices, so there are five tasks in Batch. With **concurrency** set to **5** in the pipeline JSON in the data factory and **Maximum tasks per VM** set to **2** in the Batch pool with **2** VMs, the tasks run fast. (Check the start and end times for tasks.) --Use the portal to view the Batch job and its tasks that are associated with the slices and see what VM each slice ran on. ---### Debug the pipeline -Debugging consists of a few basic techniques. --1. If the input slice isn't set to **Ready**, confirm that the input folder structure is correct and that file.txt exists in the input folders. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image3.png" alt-text="Input folder structure"::: --1. In the **Execute** method of your custom activity, use the **IActivityLogger** object to log information that helps you troubleshoot issues. The logged messages show up in the user\_0.log file. -- On the **OutputDataset** blade, select the slice to see the **Data slice** blade for that slice. Under **Activity runs**, you see one activity run for the slice. If you select **Run** in the command bar, you can start another activity run for the same slice. -- When you select the activity run, you see the **Activity run details** blade with a list of log files. You see logged messages in the user\_0.log file. When an error occurs, you see three activity runs because the retry count is set to 3 in the pipeline/activity JSON. 
When you select the activity run, you see the log files that you can review to troubleshoot the error. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image18.png" alt-text="OutputDataset and Data slice blades"::: -- In the list of log files, select **user-0.log**. In the right panel, the results of using the **IActivityLogger.Write** method appear. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image19.png" alt-text="Activity run details blade"::: -- Check the system-0.log for any system error messages and exceptions. -- ``` - Trace\_T\_D\_12/6/2015 1:43:35 AM\_T\_D\_\_T\_D\_Verbose\_T\_D\_0\_T\_D\_Loading assembly file MyDotNetActivity... -- Trace\_T\_D\_12/6/2015 1:43:35 AM\_T\_D\_\_T\_D\_Verbose\_T\_D\_0\_T\_D\_Creating an instance of MyDotNetActivityNS.MyDotNetActivity from assembly file MyDotNetActivity... -- Trace\_T\_D\_12/6/2015 1:43:35 AM\_T\_D\_\_T\_D\_Verbose\_T\_D\_0\_T\_D\_Executing Module -- Trace\_T\_D\_12/6/2015 1:43:38 AM\_T\_D\_\_T\_D\_Information\_T\_D\_0\_T\_D\_Activity e3817da0-d843-4c5c-85c6-40ba7424dce2 finished successfully - ``` -1. Include the **PDB** file in the zip file so that the error details have information such as call stack when an error occurs. --1. All the files in the zip file for the custom activity must be at the top level with no subfolders. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image20.png" alt-text="Custom activity zip file list"::: --1. Ensure that **assemblyName** (MyDotNetActivity.dll), **entryPoint** (MyDotNetActivityNS.MyDotNetActivity), **packageFile** (customactivitycontainer/MyDotNetActivity.zip), and **packageLinkedService** (should point to the blob storage that contains the zip file) are set to the correct values. --1. If you fixed an error and want to reprocess the slice, right-click the slice in the **OutputDataset** blade and select **Run**. -- :::image type="content" source="./media/data-factory-data-processing-using-batch/image21.png" alt-text="OutputDataset blade Run option"::: -- > [!NOTE] - > A container is in your blob storage named `adfjobs`. This container isn't automatically deleted, but you can safely delete it after you finish testing the solution. Similarly, the data factory solution creates a Batch job named `adf-\<pool ID/name\>:job-0000000001`. You can delete this job after you test the solution if you like. - > - > -1. The custom activity doesn't use the **app.config** file from your package. Therefore, if your code reads any connection strings from the configuration file, it doesn't work at runtime. The best practice when you use Batch is to hold any secrets in Azure Key Vault. Then use a certificate-based service principal to protect the key vault and distribute the certificate to the Batch pool. The .NET custom activity can access secrets from the key vault at runtime. This generic solution can scale to any type of secret, not just a connection string. -- There is an easier workaround, but it's not a best practice. You can create a SQL Database linked service with connection string settings. Then you can create a dataset that uses the linked service and chain the dataset as a dummy input dataset to the custom .NET activity. You can then access the linked service's connection string in the custom activity code. It should work fine at runtime. --#### Extend the sample -You can extend this sample to learn more about Data Factory and Batch features. 
For example, to process slices in a different time range, take the following steps: --1. Add the following subfolders in `inputfolder`: 2015-11-16-05, 2015-11-16-06, 2015-11-16-07, 2015-11-16-08, and 2015-11-16-09. Place input files in those folders. Change the end time for the pipeline from `2015-11-16T05:00:00Z` to `2015-11-16T10:00:00Z`. In the **Diagram** view, double-click **InputDataset** and confirm that the input slices are ready. Double-click **OutputDataset** to see the state of the output slices. If they're in the **Ready** state, check the output folder for the output files. --1. Increase or decrease the **concurrency** setting to understand how it affects the performance of your solution, especially the processing that occurs on Batch. For more information on the **concurrency** setting, see "Step 4: Create and run the pipeline with a custom activity." --1. Create a pool with higher/lower **Maximum tasks per VM**. To use the new pool you created, update the Batch linked service in the data factory solution. For more information on the **Maximum tasks per VM** setting, see "Step 4: Create and run the pipeline with a custom activity." --1. Create a Batch pool with the **autoscale** feature. Automatically scaling compute nodes in a Batch pool is the dynamic adjustment of processing power used by your application. -- The sample formula here achieves the following behavior. When the pool is initially created, it starts with one VM. The $PendingTasks metric defines the number of tasks in the running and active (queued) states. The formula finds the average number of pending tasks in the last 180 seconds and sets TargetDedicated accordingly. It ensures that TargetDedicated never goes beyond 25 VMs. As new tasks are submitted, the pool automatically grows. As tasks complete, VMs become free one by one and the autoscaling shrinks those VMs. You can adjust startingNumberOfVMs and maxNumberofVMs to your needs. -- Autoscale formula: -- ``` - startingNumberOfVMs = 1; - maxNumberofVMs = 25; - pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second); - pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second)); - $TargetDedicated=min(maxNumberofVMs,pendingTaskSamples); - ``` -- For more information, see [Automatically scale compute nodes in a Batch pool](../../batch/batch-automatic-scaling.md). -- If the pool uses the default [autoScaleEvaluationInterval](/rest/api/batchservice/pool/enableautoscale), the Batch service might take 15 to 30 minutes to prepare the VM before running the custom activity. If the pool uses a different autoScaleEvaluationInterval, the Batch service might take autoScaleEvaluationInterval plus 10 minutes. --1. In the sample solution, the **Execute** method invokes the **Calculate** method that processes an input data slice to produce an output data slice. You can write your own method to process input data and replace the **Calculate** method call in the **Execute** method with a call to your method. --### Next steps: Consume the data -After you process data, you can consume it with online tools such as Power BI. 
Here are links to help you understand Power BI and how to use it in Azure: --* [Explore a dataset in Power BI](https://powerbi.microsoft.com/documentation/powerbi-service-get-data/) -* [Get started with Power BI Desktop](/power-bi/fundamentals/desktop-getting-started) -* [Refresh data in Power BI](https://powerbi.microsoft.com/documentation/powerbi-refresh-data/) -* [Azure and Power BI: Basic overview](https://powerbi.microsoft.com/documentation/powerbi-azure-and-power-bi/) --## References -* [Azure Data Factory](../index.yml) -- * [Introduction to the Data Factory service](data-factory-introduction.md) - * [Get started with Data Factory](data-factory-build-your-first-pipeline.md) - * [Use custom activities in a Data Factory pipeline](data-factory-use-custom-activities.md) -* [Azure Batch](../../batch/index.yml) -- * [Basics of Batch](../../batch/batch-technical-overview.md) - * [Overview of Batch features](../../batch/batch-service-workflow-features.md) - * [Create and manage a Batch account in the Azure portal](../../batch/batch-account-create-portal.md) - * [Get started with the Batch client library for .NET](../../batch/quick-run-dotnet.md) --[batch-explorer]: https://github.com/Azure/azure-batch-samples/tree/master/CSharp/BatchExplorer -[batch-explorer-walkthrough]: /archive/blogs/windowshpc/azure-batch-explorer-sample-walkthrough |
data-factory | Data Factory Data Transformation Activities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-data-transformation-activities.md | - Title: 'Data Transformation: Process & transform data ' -description: Learn how to transform data or process data in Azure Data Factory using Hadoop, ML Studio (classic), or Azure Data Lake Analytics. ------- Previously updated : 04/12/2023---# Transform data in Azure Data Factory version 1 ---> [!div class="op_single_selector"] -> * [Hive](data-factory-hive-activity.md) -> * [Pig](data-factory-pig-activity.md) -> * [MapReduce](data-factory-map-reduce.md) -> * [Hadoop Streaming](data-factory-hadoop-streaming-activity.md) -> * [ML Studio (classic)](data-factory-azure-ml-batch-execution-activity.md) -> * [Stored Procedure](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL](data-factory-usql-activity.md) -> * [.NET custom](data-factory-use-custom-activities.md) --## Overview -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [data transformation activities in Data Factory](../transform-data.md). --This article explains data transformation activities in Azure Data Factory that you can use to transform and process your raw data into predictions and insights. A transformation activity executes in a computing environment such as an Azure HDInsight cluster or Azure Batch. This article also provides links to detailed information on each transformation activity. --Data Factory supports the following data transformation activities that can be added to [pipelines](data-factory-create-pipelines.md) either individually or chained with another activity. --> [!NOTE] -> For a walkthrough with step-by-step instructions, see the [Create a pipeline with Hive transformation](data-factory-build-your-first-pipeline.md) article. -> -> --## HDInsight Hive activity -The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the [Hive Activity](data-factory-hive-activity.md) article for details about this activity. --## HDInsight Pig activity -The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the [Pig Activity](data-factory-pig-activity.md) article for details about this activity. --## HDInsight MapReduce activity -The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the [MapReduce Activity](data-factory-map-reduce.md) article for details about this activity. --## HDInsight Streaming activity -The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. See [HDInsight Streaming activity](data-factory-hadoop-streaming-activity.md) for details about this activity. --## HDInsight Spark activity -The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. For details, see [Invoke Spark programs from Azure Data Factory](data-factory-spark.md). --## ML Studio (classic) activities ---Azure Data Factory enables you to easily create pipelines that use a published ML Studio (classic) web service for predictive analytics. 
Using the [Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md#invoking-a-web-service-using-batch-execution-activity) in an Azure Data Factory pipeline, you can invoke a Studio (classic) web service to make predictions on the data in batch. --Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new input datasets. After you are done with retraining, you want to update the scoring web service with the retrained machine learning model. You can use the [Update Resource Activity](data-factory-azure-ml-batch-execution-activity.md#updating-models-using-update-resource-activity) to update the web service with the newly trained model. --See [Use ML Studio (classic) activities](data-factory-azure-ml-batch-execution-activity.md) for details about these Studio (classic) activities. --## Stored procedure activity -You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server Database in your enterprise or an Azure VM. See [Stored Procedure Activity](data-factory-stored-proc-activity.md) article for details. --## Data Lake Analytics U-SQL activity -Data Lake Analytics U-SQL Activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See [Data Analytics U-SQL Activity](data-factory-usql-activity.md) article for details. --## .NET custom activity -If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. See [Use custom activities](data-factory-use-custom-activities.md) article for details. --You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See [Run R Script using Azure Data Factory](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample). --## Compute environments -You create a linked service for the compute environment and then use the linked service when defining a transformation activity. There are two types of compute environments supported by Data Factory. --1. **On-Demand**: In this case, the computing environment is fully managed by Data Factory. It is automatically created by the Data Factory service before a job is submitted to process data and removed when the job is completed. You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions. -2. **Bring Your Own**: In this case, you can register your own computing environment (for example HDInsight cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data Factory service uses it to execute the activities. --See [Compute Linked Services](data-factory-compute-linked-services.md) article to learn about compute services supported by Data Factory. --## Summary -Azure Data Factory supports the following data transformation activities and the compute environments for the activities. The transformation activities can be added to pipelines either individually or chained with another activity. 
--| Data transformation activity | Compute environment | -|: |: | -| [Hive](data-factory-hive-activity.md) |HDInsight [Hadoop] | -| [Pig](data-factory-pig-activity.md) |HDInsight [Hadoop] | -| [MapReduce](data-factory-map-reduce.md) |HDInsight [Hadoop] | -| [Hadoop Streaming](data-factory-hadoop-streaming-activity.md) |HDInsight [Hadoop] | -| [ML Studio (classic) activities: Batch Execution and Update Resource](data-factory-azure-ml-batch-execution-activity.md) |Azure VM | -| [Stored Procedure](data-factory-stored-proc-activity.md) |Azure SQL, Azure Synapse Analytics, or SQL Server | -| [Data Lake Analytics U-SQL](data-factory-usql-activity.md) |Azure Data Lake Analytics | -| [DotNet](data-factory-use-custom-activities.md) |HDInsight [Hadoop] or Azure Batch | |
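
For the **DotNet** row in the preceding table, the custom logic is packaged as a class library that implements the `IDotNetActivity` interface. The following skeleton is only an illustrative sketch (the namespace and class names are placeholders); see the [Use custom activities](data-factory-use-custom-activities.md) article for the complete walkthrough:

```csharp
using System.Collections.Generic;

using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

namespace MyCustomActivityNS
{
    // Minimal .NET custom activity: Data Factory loads this class from the uploaded
    // zip package and calls Execute once per slice, on Azure Batch or HDInsight.
    public class MyCustomActivity : IDotNetActivity
    {
        public IDictionary<string, string> Execute(
            IEnumerable<LinkedService> linkedServices,
            IEnumerable<Dataset> datasets,
            Activity activity,
            IActivityLogger logger)
        {
            // Your data processing logic goes here. Messages written through the
            // logger surface in the "User" log for the activity run.
            logger.Write("Processing slice for activity {0}", activity.Name);

            // Chaining custom activities through this dictionary isn't implemented yet,
            // so return an empty dictionary.
            return new Dictionary<string, string>();
        }
    }
}
```

You compile the class library, zip the binaries, upload the zip to blob storage, and reference it from a **DotNetActivity** in the pipeline JSON.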
data-factory | Data Factory Ftp Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-ftp-connector.md | - Title: Move data from an FTP server by using Azure Data Factory -description: Learn about how to move data from an FTP server using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from an FTP server by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-ftp-connector.md) -> * [Version 2 (current version)](../connector-ftp.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [FTP connector in V2](../connector-ftp.md). --This article explains how to use the copy activity in Azure Data Factory to move data from an FTP server. It builds on the [Data movement activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an FTP server to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from an FTP server to other data stores, but not moving data from other data stores to an FTP server. It supports both on-premises and cloud FTP servers. --> [!NOTE] -> The copy activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file, and use the activity in the pipeline. --## Enable connectivity -If you are moving data from an **on-premises** FTP server to a cloud data store (for example, to Azure Blob storage), install and use Data Management Gateway. The Data Management Gateway is a client agent that is installed on your on-premises machine, and it allows cloud services to connect to an on-premises resource. For details, see [Data Management Gateway](data-factory-data-management-gateway.md). For step-by-step instructions on setting up the gateway and using it, see [Moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md). You use the gateway to connect to an FTP server, even if the server is on an Azure infrastructure as a service (IaaS) virtual machine (VM). --It is possible to install the gateway on the same on-premises machine or IaaS VM as the FTP server. However, we recommend that you install the gateway on a separate machine or IaaS VM to avoid resource contention, and for better performance. When you install the gateway on a separate machine, the machine should be able to access the FTP server. --## Get started -You can create a pipeline with a copy activity that moves data from an FTP source by using different tools or APIs. --The easiest way to create a pipeline is to use the **Data Factory Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough. --You can also use the following tools to create a pipeline: **Visual Studio**, **PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. 
See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools or APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an FTP data store, see the [JSON example: Copy data from FTP server to Azure blob](#json-example-copy-data-from-ftp-server-to-azure-blob) section of this article. --> [!NOTE] -> For details about supported file and compression formats to use, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md). --The following sections provide details about JSON properties that are used to define Data Factory entities specific to FTP. --## Linked service properties -The following table describes JSON elements specific to an FTP linked service. --| Property | Description | Required | Default | -| | | | | -| type |Set this to FtpServer. |Yes | | -| host |Specify the name or IP address of the FTP server. |Yes | | -| authenticationType |Specify the authentication type. |Yes |Basic, Anonymous | -| username |Specify the user who has access to the FTP server. |No | | -| password |Specify the password for the user (username). |No | | -| encryptedCredential |Specify the encrypted credential to access the FTP server. |No | | -| gatewayName |Specify the name of the gateway in Data Management Gateway to connect to an on-premises FTP server. |No | | -| port |Specify the port on which the FTP server is listening. |No |21 | -| enableSsl |Specify whether to use FTP over an SSL/TLS channel. |No |true | -| enableServerCertificateValidation |Specify whether to enable server TLS/SSL certificate validation when you are using FTP over SSL/TLS channel. |No |true | -->[!NOTE] ->The FTP connector supports accessing FTP server with either no encryption or explicit SSL/TLS encryption; it doesn't support implicit SSL/TLS encryption. 
--### Use Anonymous authentication --```JSON -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "authenticationType": "Anonymous", - "host": "myftpserver.com" - } - } -} -``` --### Use username and password in plain text for basic authentication --```JSON -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "username": "Admin", - "password": "123456" - } - } -} -``` --### Use port, enableSsl, enableServerCertificateValidation --```JSON -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "username": "Admin", - "password": "123456", - "port": "21", - "enableSsl": true, - "enableServerCertificateValidation": true - } - } -} -``` --### Use encryptedCredential for authentication and gateway --```JSON -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "encryptedCredential": "xxxxxxxxxxxxxxxxx", - "gatewayName": "mygateway" - } - } -} -``` --## Dataset properties -For a full list of sections and properties available for defining datasets, see [Creating datasets](data-factory-create-datasets.md). Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. --The **typeProperties** section is different for each type of dataset. It provides information that is specific to the dataset type. The **typeProperties** section for a dataset of type **FileShare** has the following properties: --| Property | Description | Required | -| | | | -| folderPath |Subpath to the folder. Use escape character ' \ ' for special characters in the string. See Sample linked service and dataset definitions for examples.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start and end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When **fileName** is not specified for an output dataset, the name of the generated file is in the following format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the **folderPath**, rather than all files.<br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Example 1: `"fileFilter": "*.log"`<br/>Example 2: `"fileFilter": 2014-1-?.txt"`<br/><br/> **fileFilter** is applicable for an input FileShare dataset. This property is not supported with Hadoop Distributed File System (HDFS). |No | -| partitionedBy |Used to specify a dynamic **folderPath** and **fileName** for time series data. For example, you can specify a **folderPath** that is parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. 
For more information, see the [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to copy files as they are between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**, and supported levels are **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | -| useBinaryTransfer |Specify whether to use the binary transfer mode. The values are true for binary mode (this is the default value), and false for ASCII. This property can only be used when the associated linked service type is of type: FtpServer. |No | --> [!NOTE] -> **fileName** and **fileFilter** cannot be used simultaneously. --### Use the partionedBy property -As mentioned in the previous section, you can specify a dynamic **folderPath** and **fileName** for time series data with the **partitionedBy** property. --To learn about time series datasets, scheduling, and slices, see [Creating datasets](data-factory-create-datasets.md), [Scheduling and execution](data-factory-scheduling-and-execution.md), and [Creating pipelines](data-factory-create-pipelines.md). --#### Sample 1 --```json -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` -In this example, {Slice} is replaced with the value of Data Factory system variable SliceStart, in the format specified (YYYYMMDDHH). The SliceStart refers to start time of the slice. The folder path is different for each slice. (For example, wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.) --#### Sample 2 --```json -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": -[ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` -In this example, the year, month, day, and time of SliceStart are extracted into separate variables that are used by the **folderPath** and **fileName** properties. --## Copy activity properties -For a full list of sections and properties available for defining activities, see [Creating pipelines](data-factory-create-pipelines.md). Properties such as name, description, input and output tables, and policies are available for all types of activities. --Properties available in the **typeProperties** section of the activity, on the other hand, vary with each activity type. For the copy activity, the type properties vary depending on the types of sources and sinks. 
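Returning to the **partitionedBy** samples above, the following standalone Python sketch shows how the `{Year}`, `{Month}`, `{Day}`, and `{Hour}` tokens resolve for a single hourly slice. The mapping from the .NET-style format strings to `strftime` patterns is an approximation meant for reasoning about folder layouts, not the service's own implementation.

```python
# Sketch of how partitionedBy tokens in folderPath/fileName resolve for a slice.
# The .NET format strings from the JSON ("yyyy", "MM", "dd", "HH", "yyyyMMddHH")
# are mapped to rough strftime equivalents; illustration only.
from datetime import datetime

DOTNET_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "yyyyMMddHH": "%Y%m%d%H"}

def resolve(template: str, partitioned_by: list[dict], slice_start: datetime) -> str:
    values = {}
    for entry in partitioned_by:
        fmt = DOTNET_TO_STRFTIME[entry["value"]["format"]]
        values[entry["name"]] = slice_start.strftime(fmt)
    return template.format(**values)

partitioned_by = [
    {"name": "Year",  "value": {"type": "DateTime", "date": "SliceStart", "format": "yyyy"}},
    {"name": "Month", "value": {"type": "DateTime", "date": "SliceStart", "format": "MM"}},
    {"name": "Day",   "value": {"type": "DateTime", "date": "SliceStart", "format": "dd"}},
    {"name": "Hour",  "value": {"type": "DateTime", "date": "SliceStart", "format": "HH"}},
]

slice_start = datetime(2014, 10, 1, 3, 0)
print(resolve("wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", partitioned_by, slice_start))
# wikidatagateway/wikisampledataout/2014/10/01
print(resolve("{Hour}.csv", partitioned_by, slice_start))
# 03.csv
```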
--In copy activity, when the source is of type **FileSystemSource**, the following property is available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the subfolders, or only from the specified folder. |True, False (default) |No | --## JSON example: Copy data from FTP server to Azure Blob -This sample shows how to copy data from an FTP server to Azure Blob storage. However, data can be copied directly to any of the sinks stated in the [supported data stores and formats](data-factory-data-movement-activities.md#supported-data-stores-and-formats), by using the copy activity in Data Factory. --The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md), or [PowerShell](data-factory-copy-activity-tutorial-using-powershell.md): --* A linked service of type [FtpServer](#linked-service-properties) -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties) -* An input [dataset](data-factory-create-datasets.md) of type [FileShare](#dataset-properties) -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties) -* A [pipeline](data-factory-create-pipelines.md) with copy activity that uses [FileSystemSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) --The sample copies data from an FTP server to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --### FTP linked service --This example uses basic authentication, with the user name and password in plain text. You can also use one of the following ways: --* Anonymous authentication -* Basic authentication with encrypted credentials -* FTP over SSL/TLS (FTPS) --See the [FTP linked service](#linked-service-properties) section for different types of authentication you can use. --```JSON -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "username": "Admin", - "password": "123456" - } - } -} -``` -### Azure Storage linked service --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -### FTP input dataset --This dataset refers to the FTP folder `mysharedfolder` and file `test.csv`. The pipeline copies the file to the destination. --Setting **external** to **true** informs the Data Factory service that the dataset is external to the data factory, and is not produced by an activity in the data factory. --```JSON -{ - "name": "FTPFileInput", - "properties": { - "type": "FileShare", - "linkedServiceName": "FTPLinkedService", - "typeProperties": { - "folderPath": "mysharedfolder", - "fileName": "test.csv", - "useBinaryTransfer": true - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --### Azure Blob output dataset --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated, based on the start time of the slice that is being processed. 
The folder path uses the year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/ftp/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ---### A copy activity in a pipeline with file system source and blob sink --The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **FileSystemSource**, and the **sink** type is set to **BlobSink**. --```JSON -{ - "name": "pipeline", - "properties": { - "activities": [{ - "name": "FTPToBlobCopy", - "inputs": [{ - "name": "FtpFileInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - }], - "start": "2016-08-24T18:00:00Z", - "end": "2016-08-24T19:00:00Z" - } -} -``` -> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Next steps -See the following articles: --* To learn about key factors that impact performance of data movement (copy activity) in Data Factory, and various ways to optimize it, see the [Copy activity performance and tuning guide](data-factory-copy-activity-performance.md). --* For step-by-step instructions for creating a pipeline with a copy activity, see the [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). |
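If you use the optional **fileFilter** property described in the dataset properties section above instead of a fixed **fileName**, it can help to check the pattern against the actual file names on the server first. The Python sketch below reproduces the documented `*` and `?` wildcard semantics with the standard library's fnmatch; the file names are made up.

```python
# Sanity-check an intended fileFilter pattern against a list of file names.
# fnmatch supports the same "*" (many characters) and "?" (one character)
# wildcards that the fileFilter property documents; file names are made up.
from fnmatch import fnmatch

def apply_file_filter(file_names: list[str], file_filter: str) -> list[str]:
    return [name for name in file_names if fnmatch(name, file_filter)]

names = ["2014-1-1.txt", "2014-1-9.txt", "2014-1-10.txt", "app.log", "trace.log"]

print(apply_file_filter(names, "*.log"))         # ['app.log', 'trace.log']
print(apply_file_filter(names, "2014-1-?.txt"))
# ['2014-1-1.txt', '2014-1-9.txt']  ('?' matches exactly one character, so 2014-1-10.txt is excluded)
```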
data-factory | Data Factory Functions Variables | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-functions-variables.md | - Title: Data Factory Functions and System Variables -description: Provides a list of Azure Data Factory functions and system variables ------ Previously updated : 04/12/2023---# Azure Data Factory - Functions and System Variables -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [System variables in Data Factory](../control-flow-system-variables.md). --This article provides information about functions and variables supported by Azure Data Factory. --## Data Factory system variables --| Variable Name | Description | Object Scope | JSON Scope and Use Cases | -| | | | | -| WindowStart |Start of time interval for current activity run window |activity |<ol><li>Specify data selection queries. See connector articles referenced in the [Data Movement Activities](data-factory-data-movement-activities.md) article.</li> | -| WindowEnd |End of time interval for current activity run window |activity |same as WindowStart. | -| SliceStart |Start of time interval for data slice being produced |activity<br/>dataset |<ol><li>Specify dynamic folder paths and file names while working with [Azure Blob](data-factory-azure-blob-connector.md) and [File System datasets](data-factory-onprem-file-system-connector.md).</li><li>Specify input dependencies with data factory functions in activity inputs collection.</li></ol> | -| SliceEnd |End of time interval for current data slice. |activity<br/>dataset |same as SliceStart. | --> [!NOTE] -> Currently data factory requires that the schedule specified in the activity exactly matches the schedule specified in availability of the output dataset. Therefore, WindowStart, WindowEnd, and SliceStart and SliceEnd always map to the same time period and a single output slice. -> --### Example for using a system variable -In the following example, year, month, day, and time of **SliceStart** are extracted into separate variables that are used by **folderPath** and **fileName** properties. --```json -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": - [ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` --## Data Factory functions -You can use functions in data factory along with system variables for the following purposes: --1. Specifying data selection queries (see connector articles referenced by the [Data Movement Activities](data-factory-data-movement-activities.md) article. - - The syntax to invoke a data factory function is: **$$\<function>** for data selection queries and other properties in the activity and datasets. -2. Specifying input dependencies with data factory functions in activity inputs collection. - - $$ is not needed for specifying input dependency expressions. --In the following sample, **sqlReaderQuery** property in a JSON file is assigned to a value returned by the `Text.Format` function. This sample also uses a system variable named **WindowStart**, which represents the start time of the activity run window. 
--```json -{ - "Type": "SqlSource", - "sqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE StartTime = \\'{0:yyyyMMdd-HH}\\'', WindowStart)" -} -``` --See [Custom Date and Time Format Strings](/dotnet/standard/base-types/custom-date-and-time-format-strings) topic that describes different formatting options you can use (for example: ay vs. yyyy). --### Functions -The following tables list all the functions in Azure Data Factory: --| Category | Function | Parameters | Description | -| | | | | -| Time |AddHours(X,Y) |X: DateTime <br/><br/>Y: int |Adds Y hours to the given time X. <br/><br/>Example: `9/5/2013 12:00:00 PM + 2 hours = 9/5/2013 2:00:00 PM` | -| Time |AddMinutes(X,Y) |X: DateTime <br/><br/>Y: int |Adds Y minutes to X.<br/><br/>Example: `9/15/2013 12: 00:00 PM + 15 minutes = 9/15/2013 12: 15:00 PM` | -| Time |StartOfHour(X) |X: Datetime |Gets the starting time for the hour represented by the hour component of X. <br/><br/>Example: `StartOfHour of 9/15/2013 05: 10:23 PM is 9/15/2013 05: 00:00 PM` | -| Date |AddDays(X,Y) |X: DateTime<br/><br/>Y: int |Adds Y days to X. <br/><br/>Example: 9/15/2013 12:00:00 PM + 2 days = 9/17/2013 12:00:00 PM.<br/><br/>You can subtract days too by specifying Y as a negative number.<br/><br/>Example: `9/15/2013 12:00:00 PM - 2 days = 9/13/2013 12:00:00 PM`. | -| Date |AddMonths(X,Y) |X: DateTime<br/><br/>Y: int |Adds Y months to X.<br/><br/>`Example: 9/15/2013 12:00:00 PM + 1 month = 10/15/2013 12:00:00 PM`.<br/><br/>You can subtract months too by specifying Y as a negative number.<br/><br/>Example: `9/15/2013 12:00:00 PM - 1 month = 8/15/2013 12:00:00 PM`.| -| Date |AddQuarters(X,Y) |X: DateTime <br/><br/>Y: int |Adds Y * 3 months to X.<br/><br/>Example: `9/15/2013 12:00:00 PM + 1 quarter = 12/15/2013 12:00:00 PM` | -| Date |AddWeeks(X,Y) |X: DateTime<br/><br/>Y: int |Adds Y * 7 days to X<br/><br/>Example: 9/15/2013 12:00:00 PM + 1 week = 9/22/2013 12:00:00 PM<br/><br/>You can subtract weeks too by specifying Y as a negative number.<br/><br/>Example: `9/15/2013 12:00:00 PM - 1 week = 9/7/2013 12:00:00 PM`. | -| Date |AddYears(X,Y) |X: DateTime<br/><br/>Y: int |Adds Y years to X.<br/><br/>`Example: 9/15/2013 12:00:00 PM + 1 year = 9/15/2014 12:00:00 PM`<br/><br/>You can subtract years too by specifying Y as a negative number.<br/><br/>Example: `9/15/2013 12:00:00 PM - 1 year = 9/15/2012 12:00:00 PM`. | -| Date |Day(X) |X: DateTime |Gets the day component of X.<br/><br/>Example: `Day of 9/15/2013 12:00:00 PM is 9`. | -| Date |DayOfWeek(X) |X: DateTime |Gets the day of week component of X.<br/><br/>Example: `DayOfWeek of 9/15/2013 12:00:00 PM is Sunday`. | -| Date |DayOfYear(X) |X: DateTime |Gets the day in the year represented by the year component of X.<br/><br/>Examples:<br/>`12/1/2015: day 335 of 2015`<br/>`12/31/2015: day 365 of 2015`<br/>`12/31/2016: day 366 of 2016 (Leap Year)` | -| Date |DaysInMonth(X) |X: DateTime |Gets the days in the month represented by the month component of parameter X.<br/><br/>Example: `DaysInMonth of 9/15/2013 are 30 since there are 30 days in the September month`. | -| Date |EndOfDay(X) |X: DateTime |Gets the date-time that represents the end of the day (day component) of X.<br/><br/>Example: `EndOfDay of 9/15/2013 05:10:23 PM is 9/15/2013 11:59:59 PM`. | -| Date |EndOfMonth(X) |X: DateTime |Gets the end of the month represented by month component of parameter X. 
<br/><br/>Example: `EndOfMonth of 9/15/2013 05:10:23 PM is 9/30/2013 11:59:59 PM` (date time that represents the end of September month) | -| Date |StartOfDay(X) |X: DateTime |Gets the start of the day represented by the day component of parameter X.<br/><br/>Example: `StartOfDay of 9/15/2013 05:10:23 PM is 9/15/2013 12:00:00 AM`. | -| DateTime |From(X) |X: String |Parse string X to a date time. | -| DateTime |Ticks(X) |X: DateTime |Gets the ticks property of the parameter X. One tick equals 100 nanoseconds. The value of this property represents the number of ticks that have elapsed since 12:00:00 midnight, January 1, 0001. | -| Text |Format(X) |X: String variable |Formats the text (use `\\'` combination to escape `'` character).| --> [!IMPORTANT] -> When using a function within another function, you do not need to use **$$** prefix for the inner function. For example: $$Text.Format('PartitionKey eq \\'my_pkey_filter_value\\' and RowKey ge \\'{0: yyyy-MM-dd HH:mm:ss}\\'', Time.AddHours(SliceStart, -6)). In this example, notice that **$$** prefix is not used for the **Time.AddHours** function. --#### Example -In the following example, input and output parameters for the Hive activity are determined by using the `Text.Format` function and SliceStart system variable. --```json -{ - "name": "HiveActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "HiveActivitySample", - "type": "HDInsightHive", - "inputs": [ - { - "name": "HiveSampleIn" - } - ], - "outputs": [ - { - "name": "HiveSampleOut" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "typeproperties": { - "scriptPath": "adfwalkthrough\\scripts\\samplehive.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Input": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)", - "Output": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - } - } - } - ] - } -} -``` --### Example 2 --In the following example, the DateTime parameter for the Stored Procedure Activity is determined by using the Text. Format function and the SliceStart variable. 
--```json -{ - "name": "SprocActivitySamplePipeline", - "properties": { - "activities": [ - { - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "usp_sample", - "storedProcedureParameters": { - "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)" - } - }, - "outputs": [ - { - "name": "sprocsampleout" - } - ], - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SprocActivitySample" - } - ], - "start": "2016-08-02T00:00:00Z", - "end": "2016-08-02T05:00:00Z", - "isPaused": false - } -} -``` --### Example 3 -To read data from the previous day instead of the day represented by SliceStart, use the AddDays function as shown in the following example: --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-01-01T08:00:00", - "end": "2017-01-01T11:00:00", - "description": "hive activity", - "activities": [ - { - "name": "SampleHiveActivity", - "inputs": [ - { - "name": "MyAzureBlobInput", - "startTime": "Date.AddDays(SliceStart, -1)", - "endTime": "Date.AddDays(SliceEnd, -1)" - } - ], - "outputs": [ - { - "name": "MyAzureBlobOutput" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adftutorial\\hivequery.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Year": "$$Text.Format('{0:yyyy}',WindowStart)", - "Month": "$$Text.Format('{0:MM}',WindowStart)", - "Day": "$$Text.Format('{0:dd}',WindowStart)" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 2, - "timeout": "01:00:00" - } - } - ] - } -} -``` --See the [Custom Date and Time Format Strings](/dotnet/standard/base-types/custom-date-and-time-format-strings) topic, which describes different formatting options you can use (for example: yy vs. yyyy). |
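To get a feel for the date math in the function table above, the following standalone Python sketch approximates a few of the Time and Date functions (AddHours, StartOfHour, AddMonths, AddQuarters, EndOfMonth) with the standard library. It is an aid for reasoning about slice boundaries, not the service's implementation; the sample values mirror the 9/15/2013 examples in the table.

```python
# Rough Python equivalents of a few Data Factory v1 functions from the table
# above, useful for reasoning about slice boundaries. Approximation only.
from calendar import monthrange
from datetime import datetime, timedelta

def add_hours(x: datetime, y: int) -> datetime:
    return x + timedelta(hours=y)

def start_of_hour(x: datetime) -> datetime:
    return x.replace(minute=0, second=0, microsecond=0)

def add_months(x: datetime, y: int) -> datetime:
    month_index = x.month - 1 + y
    year, month = x.year + month_index // 12, month_index % 12 + 1
    day = min(x.day, monthrange(year, month)[1])   # clamp, e.g. Jan 31 + 1 month
    return x.replace(year=year, month=month, day=day)

def add_quarters(x: datetime, y: int) -> datetime:
    return add_months(x, 3 * y)

def end_of_month(x: datetime) -> datetime:
    return x.replace(day=monthrange(x.year, x.month)[1], hour=23, minute=59, second=59)

slice_start = datetime(2013, 9, 15, 17, 10, 23)
print(add_hours(datetime(2013, 9, 5, 12, 0), 2))  # 2013-09-05 14:00:00
print(start_of_hour(slice_start))                 # 2013-09-15 17:00:00
print(add_quarters(slice_start, 1))               # 2013-12-15 17:10:23
print(end_of_month(slice_start))                  # 2013-09-30 23:59:59
```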
data-factory | Data Factory Gateway Release Notes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-gateway-release-notes.md | - Title: Release notes for Data Management Gateway -description: Data Management Gateway release notes ---- Previously updated : 04/12/2023----# Release notes for Data Management Gateway -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [self-hosted integration runtime in V2](../create-self-hosted-integration-runtime.md). --One of the challenges of modern data integration is moving data between on-premises locations and the cloud. Data Factory addresses this with Data Management Gateway, an agent that you can install on-premises to enable hybrid data movement. --See the following articles for detailed information about Data Management Gateway and how to use it: --* [Data Management Gateway](data-factory-data-management-gateway.md) -* [Move data between on-premises and cloud using Azure Data Factory](data-factory-move-data-between-onprem-and-cloud.md) ---## CURRENT VERSION -Release notes are no longer maintained here. Get the latest release notes [here](https://go.microsoft.com/fwlink/?linkid=853077) -----## Earlier versions -## 2.10.6347.7 -### Enhancements- -- You can add DNS entries to allow Service Bus rather than allowing all Azure IP addresses from your firewall (if needed). You can find the respective DNS entry in the Azure portal (Data Factory -> 'Author and Deploy' -> 'Gateways' -> "serviceUrls" (in JSON))-- HDFS connector now supports self-signed public certificate by letting you skip TLS validation.-- Fixed: Issue with gateway offline during update (due to clock skew)---## 2.9.6313.2 -### Enhancements- -- You can add DNS entries to allow Service Bus rather than allowing all Azure IP addresses from your firewall (if needed). More details here.-- You can now copy data to/from a single block blob up to 4.75 TB, which is the max supported size of block blob. (The earlier limit was 195 GB.)-- Fixed: Out of memory issue while unzipping several small files during copy activity.-- Fixed: Index out of range issue while copying from Document DB to SQL Server with idempotency feature.-- Fixed: SQL cleanup script doesn't work with SQL Server from Copy Wizard.-- Fixed: Column name with space at the end does not work in copy activity.--## 2.8.66283.3 -### Enhancements- -- Fixed: Issue with missing credentials on gateway machine reboot.-- Fixed: Issue with registration during gateway restore using a backup file.---## 2.7.6240.1 -### Enhancements- -- Fixed: Incorrect read of Decimal null value from Oracle as source.--## 2.6.6192.2 -### What's new -- Customers can provide feedback on gateway registering experience.-- Support a new compression format: ZIP (Deflate)--### Enhancements- -- Performance improvement for Oracle Sink, HDFS source.-- Bug fix for gateway auto update, gateway parallel processing capacity.---## 2.5.6164.1 -### Enhancements -- Improved and more robust Gateway registration experience- Now you can track progress status during the Gateway registration process, which makes the registration experience more responsive.-- Improvement in Gateway Restore Process- You can still recover gateway even if you do not have the gateway backup file with this update. This would require you to reset Linked Service credentials in Portal.-- Bug fix.--## 2.4.6151.1 --### What's new --- You can now store data source credentials locally.
The credentials are encrypted. The data source credentials can be recovered and restored using the backup file that can be exported from the existing Gateway, all on-premises.--### Enhancements- --- Improved and more robust Gateway registration experience.-- Support auto detection of QuoteChar configuration for Text format in copy wizard, and improve the overall format detection accuracy.--## 2.3.6100.2 --- Support firstRowAsHeader and SkipLineCount auto detection in copy wizard for text files in on-premises File system and HDFS.-- Enhance the stability of network connection between gateway and Service Bus-- A few bug fixes---## 2.2.6072.1 --* Supports setting HTTP proxy for the gateway using the Gateway Configuration Manager. If configured, Azure Blob, Azure Table, Azure Data Lake, and Document DB are accessed through HTTP proxy. -* Supports header handling for TextFormat when copying data from/to Azure Blob, Azure Data Lake Store, on-premises File System, and on-premises HDFS. -* Supports copying data from Append Blob and Page Blob along with the already supported Block Blob. -* Introduces a new gateway status **Online (Limited)**, which indicates that the main functionality of the gateway works except the interactive operation support for Copy Wizard. -* Enhances the robustness of gateway registration using registration key. --## 2.1.6040. --* DB2 driver is included in the gateway installation package now. You do not need to install it separately. -* DB2 driver now supports z/OS and DB2 for i (AS/400) along with the platforms already supported (Linux, Unix, and Windows). -* Supports using Azure Cosmos DB as a source or destination for on-premises data stores -* Supports copying data from/to cold/hot blob storage along with the already supported general-purpose storage account. -* Allows you to connect to SQL Server via gateway with remote login privileges. --## 2.0.6013.1 --* You can select the language/culture to be used by a gateway during manual installation. --* When gateway does not work as expected, you can choose to send gateway logs of last seven days to Microsoft to facilitate troubleshooting of the issue. If gateway is not connected to the cloud service, you can choose to save and archive gateway logs. --* User interface improvements for gateway configuration -- * Make gateway status more visible on the Home tab. -- * Reorganized and simplified controls. -- * You can copy data from a storage using the [code-free copy tool](data-factory-copy-data-wizard-tutorial.md). See [Staged Copy](data-factory-copy-activity-performance.md#staged-copy) for details about this feature in general. -* You can use Data Management Gateway to ingress data directly from a SQL Server database into Azure Machine Learning. --* Performance improvements -- * Improve performance on viewing Schema/Preview against SQL Server in code-free copy tool. --## 1.12.5953.1 --* Bug fixes --## 1.11.5918.1 --* Maximum size of the gateway event log has been increased from 1 MB to 40 MB. --* A warning dialog is displayed in case a restart is needed during gateway auto-update. You can choose to restart right then or later. --* In case auto-update fails, gateway installer retries auto-updating three times at maximum. --* Performance improvements -- * Improve performance for loading large tables from on-premises server in code-free copy scenario. 
--* Bug fixes --## 1.10.5892.1 --* Performance improvements --* Bug fixes --## 1.9.5865.2 --* Zero touch auto update capability -* New tray icon with gateway status indicators -* Ability to "Update now" from the client -* Ability to set update schedule time -* PowerShell script for toggling auto-update on/off -* Support for JSON format -* Performance improvements -* Bug fixes --## 1.8.5822.1 --* Improve troubleshooting experience -* Performance improvements -* Bug fixes --### 1.7.5795.1 --* Performance improvements -* Bug fixes --### 1.7.5764.1 --* Performance improvements -* Bug fixes --### 1.6.5735.1 --* Support on-premises HDFS Source/Sink -* Performance improvements -* Bug fixes --### 1.6.5696.1 --* Performance improvements -* Bug fixes --### 1.6.5676.1 --* Support diagnostic tools on Configuration Manager -* Support table columns for tabular data sources for Azure Data Factory -* Support Azure Synapse Analytics for Azure Data Factory -* Support Reclusive in BlobSource and FileSource for Azure Data Factory -* Support CopyBehavior - MergeFiles, PreserveHierarchy, and FlattenHierarchy in BlobSink and FileSink with Binary Copy for Azure Data Factory -* Support Copy Activity reporting progress for Azure Data Factory -* Support Data Source Connectivity Validation for Azure Data Factory -* Bug fixes --### 1.6.5672.1 --* Support table name for ODBC data source for Azure Data Factory -* Performance improvements -* Bug fixes --### 1.6.5658.1 --* Support File Sink for Azure Data Factory -* Support preserving hierarchy in binary copy for Azure Data Factory -* Support Copy Activity Idempotency for Azure Data Factory -* Bug fixes --### 1.6.5640.1 --* Support 3 more data sources for Azure Data Factory (ODBC, OData, HDFS) -* Support quote character in csv parser for Azure Data Factory -* Compression support (BZip2) -* Bug fixes --### 1.5.5612.1 --* Support five relational databases for Azure Data Factory (MySQL, PostgreSQL, DB2, Teradata, and Sybase) -* Compression support (Gzip and Deflate) -* Performance improvements -* Bug fixes --### 1.4.5549.1 --* Add Oracle data source support for Azure Data Factory -* Performance improvements -* Bug fixes --### 1.4.5492.1 --* Unified binary that supports both Microsoft Azure Data Factory and Office 365 Power BI services -* Refine the Configuration UI and registration process -* Azure Data Factory - Azure Ingress and Egress support for SQL Server data source --### 1.2.5303.1 --* Fix timeout issue to support more time-consuming data source connections. --### 1.1.5526.8 --* Requires .NET Framework 4.5.1 as a prerequisite during setup. --### 1.0.5144.2 --* No changes that affect Azure Data Factory scenarios. |
data-factory | Data Factory Hadoop Streaming Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-hadoop-streaming-activity.md | - Title: Transform data using Hadoop Streaming Activity - Azure -description: Learn how you can use the Hadoop Streaming Activity in an Azure data factory to transform data by running Hadoop Streaming programs on an on-demand/your own HDInsight cluster. ------ Previously updated : 04/12/2023---# Transform data using Hadoop Streaming Activity in Azure Data Factory -> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using Hadoop streaming activity in Data Factory](../transform-data-using-hadoop-streaming.md). ---You can use the HDInsightStreamingActivity to invoke a Hadoop Streaming job from an Azure Data Factory pipeline. The following JSON snippet shows the syntax for using the HDInsightStreamingActivity in a pipeline JSON file. --The HDInsight Streaming Activity in a Data Factory [pipeline](data-factory-create-pipelines.md) executes Hadoop Streaming programs on [your own](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or [on-demand](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) Windows/Linux-based HDInsight cluster. This article builds on the [data transformation activities](data-factory-data-transformation-activities.md) article, which presents a general overview of data transformation and the supported transformation activities. --> [!NOTE] -> If you are new to Azure Data Factory, read through [Introduction to Azure Data Factory](data-factory-introduction.md) and do the tutorial: [Build your first data pipeline](data-factory-build-your-first-pipeline.md) before reading this article. --## JSON sample -The HDInsight cluster is automatically populated with example programs (wc.exe and cat.exe) and data (davinci.txt). By default, the name of the container that is used by the HDInsight cluster is the name of the cluster itself. For example, if your cluster name is myhdicluster, the name of the associated blob container would be myhdicluster.
--```JSON -{ - "name": "HadoopStreamingPipeline", - "properties": { - "description": "Hadoop Streaming Demo", - "activities": [ - { - "type": "HDInsightStreaming", - "typeProperties": { - "mapper": "cat.exe", - "reducer": "wc.exe", - "input": "wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt", - "output": "wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt", - "filePaths": [ - "<nameofthecluster>/example/apps/wc.exe", - "<nameofthecluster>/example/apps/cat.exe" - ], - "fileLinkedService": "AzureStorageLinkedService", - "getDebugInfo": "Failure" - }, - "outputs": [ - { - "name": "StreamingOutputDataset" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "RunHadoopStreamingJob", - "description": "Run a Hadoop streaming job", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2014-01-04T00:00:00Z", - "end": "2014-01-05T00:00:00Z" - } -} -``` --Note the following points: --1. Set the **linkedServiceName** to the name of the linked service that points to your HDInsight cluster on which the streaming mapreduce job is run. -2. Set the type of the activity to **HDInsightStreaming**. -3. For the **mapper** property, specify the name of mapper executable. In the example, cat.exe is the mapper executable. -4. For the **reducer** property, specify the name of reducer executable. In the example, wc.exe is the reducer executable. -5. For the **input** type property, specify the input file (including the location) for the mapper. In the example: `wasb://adfsample@<account name>.blob.core.windows.net/example/data/gutenberg/davinci.txt`: adfsample is the blob container, example/data/Gutenberg is the folder, and davinci.txt is the blob. -6. For the **output** type property, specify the output file (including the location) for the reducer. The output of the Hadoop Streaming job is written to the location specified for this property. -7. In the **filePaths** section, specify the paths for the mapper and reducer executables. In the example: "adfsample/example/apps/wc.exe", adfsample is the blob container, example/apps is the folder, and wc.exe is the executable. -8. For the **fileLinkedService** property, specify the Azure Storage linked service that represents the Azure storage that contains the files specified in the filePaths section. -9. For the **arguments** property, specify the arguments for the streaming job. -10. The **getDebugInfo** property is an optional element. When it is set to Failure, the logs are downloaded only on failure. When it is set to Always, logs are always downloaded irrespective of the execution status. --> [!NOTE] -> As shown in the example, you specify an output dataset for the Hadoop Streaming Activity for the **outputs** property. This dataset is just a dummy dataset that is required to drive the pipeline schedule. You do not need to specify any input dataset for the activity for the **inputs** property. -> -> --## Example -The pipeline in this walkthrough runs the Word Count streaming Map/Reduce program on your Azure HDInsight cluster. --### Linked services -#### Azure Storage linked service -First, you create a linked service to link the Azure Storage that is used by the Azure HDInsight cluster to the Azure data factory. 
If you copy/paste the following code, do not forget to replace account name and account key with the name and key of your Azure Storage. --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>" - } - } -} -``` --#### Azure HDInsight linked service -Next, you create a linked service to link your Azure HDInsight cluster to the Azure data factory. If you copy/paste the following code, replace HDInsight cluster name with the name of your HDInsight cluster, and change user name and password values. --```JSON -{ - "name": "HDInsightLinkedService", - "properties": { - "type": "HDInsight", - "typeProperties": { - "clusterUri": "https://<HDInsight cluster name>.azurehdinsight.net", - "userName": "admin", - "password": "**********", - "linkedServiceName": "StorageLinkedService" - } - } -} -``` --### Datasets -#### Output dataset -The pipeline in this example does not take any inputs. You specify an output dataset for the HDInsight Streaming Activity. This dataset is just a dummy dataset that is required to drive the pipeline schedule. --```JSON -{ - "name": "StreamingOutputDataset", - "properties": { - "published": false, - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "adftutorial/streamingdata/", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - }, - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --### Pipeline -The pipeline in this example has only one activity that is of type: **HDInsightStreaming**. --The HDInsight cluster is automatically populated with example programs (wc.exe and cat.exe) and data (davinci.txt). By default, name of the container that is used by the HDInsight cluster is the name of the cluster itself. For example, if your cluster name is myhdicluster, name of the blob container associated would be myhdicluster. --```JSON -{ - "name": "HadoopStreamingPipeline", - "properties": { - "description": "Hadoop Streaming Demo", - "activities": [ - { - "type": "HDInsightStreaming", - "typeProperties": { - "mapper": "cat.exe", - "reducer": "wc.exe", - "input": "wasb://<blobcontainer>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt", - "output": "wasb://<blobcontainer>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt", - "filePaths": [ - "<blobcontainer>/example/apps/wc.exe", - "<blobcontainer>/example/apps/cat.exe" - ], - "fileLinkedService": "StorageLinkedService" - }, - "outputs": [ - { - "name": "StreamingOutputDataset" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "RunHadoopStreamingJob", - "description": "Run a Hadoop streaming job", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-01-03T00:00:00Z", - "end": "2017-01-04T00:00:00Z" - } -} -``` -## See Also -* [Hive Activity](data-factory-hive-activity.md) -* [Pig Activity](data-factory-pig-activity.md) -* [MapReduce Activity](data-factory-map-reduce.md) -* [Invoke Spark programs](data-factory-spark.md) -* [Invoke R scripts](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample) - |
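The walkthrough relies on the sample cat.exe (mapper) and wc.exe (reducer) binaries that ship with the cluster. Hadoop Streaming accepts any executable that reads stdin and writes stdout in those roles, so a small Python script is a convenient way to see what the two stages do. The script below is a simplified stand-in for that mapper/reducer pair, not the sample binaries themselves.

```python
# Simplified stand-ins for the cat.exe (mapper) / wc.exe (reducer) pair used in
# the walkthrough. Hadoop Streaming pipes input splits to the mapper's stdin and
# the mapper's (sorted) stdout to the reducer's stdin; both just use stdin/stdout.
import sys

def mapper() -> None:
    """Identity mapper: echo each input line, like cat."""
    for line in sys.stdin:
        sys.stdout.write(line)

def reducer() -> None:
    """Count lines, words, and characters of everything received, like wc."""
    lines = words = chars = 0
    for line in sys.stdin:
        lines += 1
        words += len(line.split())
        chars += len(line)
    print(f"{lines}\t{words}\t{chars}")

if __name__ == "__main__":
    # e.g.  cat davinci.txt | python streaming_wc.py map | python streaming_wc.py reduce
    reducer() if sys.argv[1:] == ["reduce"] else mapper()
```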
data-factory | Data Factory Hdfs Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-hdfs-connector.md | - Title: Move data from on-premises HDFS -description: Learn about how to move data from on-premises HDFS using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from on-premises HDFS using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-hdfs-connector.md) -> * [Version 2 (current version)](../connector-hdfs.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [HDFS connector in V2](../connector-hdfs.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises HDFS. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from HDFS to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from an on-premises HDFS to other data stores, but not for moving data from other data stores to an on-premises HDFS. --> [!NOTE] -> Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline. ---## Enabling connectivity -Data Factory service supports connecting to on-premises HDFS using the Data Management Gateway. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and step-by-step instructions on setting up the gateway. Use the gateway to connect to HDFS even if it is hosted in an Azure IaaS VM. --> [!NOTE] -> Make sure the Data Management Gateway can access to **ALL** the [name node server]:[name node port] and [data node servers]:[data node port] of the Hadoop cluster. Default [name node port] is 50070, and default [data node port] is 50075. --While you can install gateway on the same on-premises machine or the Azure VM as the HDFS, we recommend that you install the gateway on a separate machine/Azure IaaS VM. Having gateway on a separate machine reduces resource contention and improves performance. When you install the gateway on a separate machine, the machine should be able to access the machine with the HDFS. --## Getting started -You can create a pipeline with a copy activity that moves data from a HDFS source by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Azure portal**, **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
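Because the linked service URL used later in this article points at the WebHDFS REST endpoint (for example, `http://<machine>:50070/webhdfs/v1/`), a quick way to confirm that the gateway machine can reach the name node, as the note above requires, is to issue a LISTSTATUS call from that machine. The Python sketch below uses only the standard library and covers the Anonymous-authentication case; the host name, folder path, and user are placeholders.

```python
# Quick reachability check from the gateway machine against the WebHDFS REST
# endpoint used by the linked service. Host and path are placeholders; with
# Anonymous authentication, user.name is passed as a query parameter.
import json
from urllib.request import urlopen

NAME_NODE = "http://namenode.example.com:50070"   # placeholder host, default name node port
PATH = "/DataTransfer/UnitTest"                   # placeholder HDFS folder

def list_status(base_url: str, path: str, user: str = "hadoop") -> list[dict]:
    url = f"{base_url}/webhdfs/v1{path}?op=LISTSTATUS&user.name={user}"
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["FileStatuses"]["FileStatus"]

if __name__ == "__main__":
    for entry in list_status(NAME_NODE, PATH):
        print(entry["type"], entry["pathSuffix"], entry["length"])
```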
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from a HDFS data store, see [JSON example: Copy data from on-premises HDFS to Azure Blob](#json-example-copy-data-from-on-premises-hdfs-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to HDFS: --## Linked service properties -A linked service links a data store to a data factory. You create a linked service of type **Hdfs** to link an on-premises HDFS to your data factory. The following table provides description for JSON elements specific to HDFS linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **Hdfs** |Yes | -| url |URL to the HDFS |Yes | -| authenticationType |Anonymous, or Windows. <br><br> To use **Kerberos authentication** for HDFS connector, refer to [this section](#use-kerberos-authentication-for-hdfs-connector) to set up your on-premises environment accordingly. |Yes | -| userName |Username for Windows authentication. For Kerberos authentication, specify `<username>@<domain>.com`. |Yes (for Windows Authentication) | -| password |Password for Windows authentication. |Yes (for Windows Authentication) | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the HDFS. |Yes | -| encryptedCredential |[New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) output of the access credential. |No | --### Using Anonymous authentication --```JSON -{ - "name": "hdfs", - "properties": - { - "type": "Hdfs", - "typeProperties": - { - "authenticationType": "Anonymous", - "userName": "hadoop", - "url" : "http://<machine>:50070/webhdfs/v1/", - "gatewayName": "mygateway" - } - } -} -``` --### Using Windows authentication --```JSON -{ - "name": "hdfs", - "properties": - { - "type": "Hdfs", - "typeProperties": - { - "authenticationType": "Windows", - "userName": "<username>@<domain>.com (for Kerberos auth)", - "password": "password", - "url" : "http://<machine>:50070/webhdfs/v1/", - "gatewayName": "mygateway" - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **FileShare** (which includes HDFS dataset) has the following properties --| Property | Description | Required | -| | | | -| folderPath |Path to the folder. 
Example: `myfolder`<br/><br/>Use escape character ' \ ' for special characters in the string. For example: for folder\subfolder, specify folder\\\\subfolder and for d:\samplefolder, specify d:\\\\samplefolder.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: <br/><br/>`Data.<Guid>.txt` (for example: : Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt |No | -| partitionedBy |partitionedBy can be used to specify a dynamic folderPath, filename for time series data. Example: folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --> [!NOTE] -> filename and fileFilter cannot be used simultaneously. --### Using partionedBy property -As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data with the **partitionedBy** property, [Data Factory functions, and the system variables](data-factory-functions-variables.md). --To learn more about time series datasets, scheduling, and slices, see [Creating Datasets](data-factory-create-datasets.md), [Scheduling & Execution](data-factory-scheduling-and-execution.md), and [Creating Pipelines](data-factory-create-pipelines.md) articles. --#### Sample 1: --```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` -In this example {Slice} is replaced with the value of Data Factory system variable SliceStart in the format (YYYYMMDDHH) specified. The SliceStart refers to start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104. 
--#### Sample 2: --```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": - [ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` -In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties. --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --In contrast, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --For Copy Activity, when the source is of type **FileSystemSource**, the following properties are available in the typeProperties section: --**FileSystemSource** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. |True, False (default) |No | --## Supported file and compression formats -See the [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md) article for details. --## JSON example: Copy data from on-premises HDFS to Azure Blob -This sample shows how to copy data from an on-premises HDFS to Azure Blob Storage. However, data can be copied **directly** to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --The sample provides JSON definitions for the following Data Factory entities. You can use these definitions to create a pipeline to copy data from HDFS to Azure Blob Storage by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). --1. A linked service of type [OnPremisesHdfs](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [FileShare](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [FileSystemSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from an on-premises HDFS to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, set up the Data Management Gateway by following the instructions in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**HDFS linked service:** -This example uses Windows authentication.
See [HDFS linked service](#linked-service-properties) section for different types of authentication you can use. --```JSON -{ - "name": "HDFSLinkedService", - "properties": - { - "type": "Hdfs", - "typeProperties": - { - "authenticationType": "Windows", - "userName": "Administrator", - "password": "password", - "url" : "http://<machine>:50070/webhdfs/v1/", - "gatewayName": "mygateway" - } - } -} -``` --**Azure Storage linked service:** --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**HDFS input dataset:** -This dataset refers to the HDFS folder DataTransfer/UnitTest/. The pipeline copies all the files in this folder to the destination. --Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "InputDataset", - "properties": { - "type": "FileShare", - "linkedServiceName": "HDFSLinkedService", - "typeProperties": { - "folderPath": "DataTransfer/UnitTest/" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "OutputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/hdfs/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**A copy activity in a pipeline with File System source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **FileSystemSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data in the past hour to copy. 
--```JSON -{ - "name": "pipeline", - "properties": - { - "activities": - [ - { - "name": "HdfsToBlobCopy", - "inputs": [ {"name": "InputDataset"} ], - "outputs": [ {"name": "OutputDataset"} ], - "type": "Copy", - "typeProperties": - { - "source": - { - "type": "FileSystemSource" - }, - "sink": - { - "type": "BlobSink" - } - }, - "policy": - { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } -} -``` --## Use Kerberos authentication for HDFS connector -There are two options to set up the on-premises environment so as to use Kerberos Authentication in HDFS connector. You can choose the one better fits your case. -* Option 1: [Join gateway machine in Kerberos realm](#kerberos-join-realm) -* Option 2: [Enable mutual trust between Windows domain and Kerberos realm](#kerberos-mutual-trust) --### <a name="kerberos-join-realm"></a>Option 1: Join gateway machine in Kerberos realm --#### Requirement: --* The gateway machine needs to join the Kerberos realm and can't join any Windows domain. --#### How to configure: --**On gateway machine:** --1. Run the **Ksetup** utility to configure the Kerberos KDC server and realm. -- The machine must be configured as a member of a workgroup since a Kerberos realm is different from a Windows domain. This can be achieved by setting the Kerberos realm and adding a KDC server as follows. Replace *REALM.COM* with your own respective realm as needed. -- ```cmd - Ksetup /setdomain REALM.COM - Ksetup /addkdc REALM.COM <your_kdc_server_address> - ``` -- **Restart** the machine after executing these 2 commands. --2. Verify the configuration with **Ksetup** command. The output should be like: -- ```cmd - Ksetup - default realm = REALM.COM (external) - REALM.com: - kdc = <your_kdc_server_address> - ``` --**In Azure Data Factory:** --* Configure the HDFS connector using **Windows authentication** together with your Kerberos principal name and password to connect to the HDFS data source. Check [HDFS Linked Service properties](#linked-service-properties) section on configuration details. --### <a name="kerberos-mutual-trust"></a>Option 2: Enable mutual trust between Windows domain and Kerberos realm --#### Requirement: -* The gateway machine must join a Windows domain. -* You need permission to update the domain controller's settings. --#### How to configure: --> [!NOTE] -> Replace REALM.COM and AD.COM in the following tutorial with your own respective realm and domain controller as needed. --**On KDC server:** --1. Edit the KDC configuration in **krb5.conf** file to let KDC trust Windows Domain referring to the following configuration template. By default, the configuration is located at **/etc/krb5.conf**. -- ```config - [logging] - default = FILE:/var/log/krb5libs.log - kdc = FILE:/var/log/krb5kdc.log - admin_server = FILE:/var/log/kadmind.log -- [libdefaults] - default_realm = REALM.COM - dns_lookup_realm = false - dns_lookup_kdc = false - ticket_lifetime = 24h - renew_lifetime = 7d - forwardable = true -- [realms] - REALM.COM = { - kdc = node.REALM.COM - admin_server = node.REALM.COM - } - AD.COM = { - kdc = windc.ad.com - admin_server = windc.ad.com - } -- [domain_realm] - .REALM.COM = REALM.COM - REALM.COM = REALM.COM - .ad.com = AD.COM - ad.com = AD.COM -- [capaths] - AD.COM = { - REALM.COM = . - } - ``` -- **Restart** the KDC service after configuration. --2. 
Prepare a principal named **krbtgt/REALM.COM\@AD.COM** in KDC server with the following command: -- ```cmd - Kadmin> addprinc krbtgt/REALM.COM@AD.COM - ``` --3. In **hadoop.security.auth_to_local** HDFS service configuration file, add `RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//`. --**On domain controller:** --1. Run the following **Ksetup** commands to add a realm entry: -- ```cmd - Ksetup /addkdc REALM.COM <your_kdc_server_address> - ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM - ``` --2. Establish trust from Windows Domain to Kerberos Realm. [password] is the password for the principal **krbtgt/REALM.COM\@AD.COM**. -- ```cmd - netdom trust REALM.COM /Domain: AD.COM /add /realm /passwordt:[password] - ``` --3. Select encryption algorithm used in Kerberos. -- 1. Go to Server Manager > Group Policy Management > Domain > Group Policy Objects > Default or Active Domain Policy, and Edit. -- 2. In the **Group Policy Management Editor** popup window, go to Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies > Security Options, and configure **Network security: Configure Encryption types allowed for Kerberos**. -- 3. Select the encryption algorithm you want to use when connect to KDC. Commonly, you can simply select all the options. -- :::image type="content" source="media/data-factory-hdfs-connector/config-encryption-types-for-kerberos.png" alt-text="Config Encryption Types for Kerberos"::: -- 4. Use **Ksetup** command to specify the encryption algorithm to be used on the specific REALM. -- ```cmd - ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96 - ``` --4. Create the mapping between the domain account and Kerberos principal, in order to use Kerberos principal in Windows Domain. -- 1. Start the Administrative tools > **Active Directory Users and Computers**. -- 2. Configure advanced features by clicking **View** > **Advanced Features**. -- 3. Locate the account to which you want to create mappings, and right-click to view **Name Mappings** > click **Kerberos Names** tab. -- 4. Add a principal from the realm. -- :::image type="content" source="media/data-factory-hdfs-connector/map-security-identity.png" alt-text="Map Security Identity"::: --**On gateway machine:** --* Run the following **Ksetup** commands to add a realm entry. -- ```cmd - Ksetup /addkdc REALM.COM <your_kdc_server_address> - ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM - ``` --**In Azure Data Factory:** --* Configure the HDFS connector using **Windows authentication** together with either your Domain Account or Kerberos Principal to connect to the HDFS data source. Check [HDFS Linked Service properties](#linked-service-properties) section on configuration details. --> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
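To tie the Kerberos options above back to the linked service definition, the following is a minimal sketch of an HDFS linked service that authenticates with a Kerberos principal through **Windows authentication**. The principal name, realm, WebHDFS URL, and gateway name are placeholders for illustration, not values taken from this article's environment.

```json
{
    "name": "HDFSKerberosLinkedService",
    "properties": {
        "type": "Hdfs",
        "typeProperties": {
            "authenticationType": "Windows",
            "userName": "testuser@REALM.COM",
            "password": "<password of the Kerberos principal>",
            "url": "http://<namenode>:50070/webhdfs/v1/",
            "gatewayName": "mygateway"
        }
    }
}
```

The shape mirrors the Windows authentication example earlier in this article; once the gateway machine can reach the KDC, only the user name changes to the fully qualified principal (`user@REALM.COM`).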
data-factory | Data Factory Hive Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-hive-activity.md | - Title: Transform data using Hive Activity - Azure -description: Learn how you can use the Hive Activity in Azure Data Factory v1 to run Hive queries on an on-demand/your own HDInsight cluster. ------ Previously updated : 04/12/2023---# Transform data using Hive Activity in Azure Data Factory -> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using Hive activity in Data Factory](../transform-data-using-hadoop-hive.md). --The HDInsight Hive activity in a Data Factory [pipeline](data-factory-create-pipelines.md) executes Hive queries on [your own](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or [on-demand](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) Windows/Linux-based HDInsight cluster. This article builds on the [data transformation activities](data-factory-data-transformation-activities.md) article, which presents a general overview of data transformation and the supported transformation activities. --> [!NOTE] -> If you are new to Azure Data Factory, read through [Introduction to Azure Data Factory](data-factory-introduction.md) and do the tutorial: [Build your first data pipeline](data-factory-build-your-first-pipeline.md) before reading this article. --## Syntax --```JSON -{ - "name": "Hive Activity", - "description": "description", - "type": "HDInsightHive", - "inputs": [ - { - "name": "input tables" - } - ], - "outputs": [ - { - "name": "output tables" - } - ], - "linkedServiceName": "MyHDInsightLinkedService", - "typeProperties": { - "script": "Hive script", - "scriptPath": "<pathtotheHivescriptfileinAzureblobstorage>", - "defines": { - "param1": "param1Value" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } -} -``` -## Syntax details -| Property | Description | Required | -| | | | -| name |Name of the activity |Yes | -| description |Text describing what the activity is used for |No | -| type |HDinsightHive |Yes | -| inputs |Inputs consumed by the Hive activity |No | -| outputs |Outputs produced by the Hive activity |Yes | -| linkedServiceName |Reference to the HDInsight cluster registered as a linked service in Data Factory |Yes | -| script |Specify the Hive script inline |No | -| scriptPath |Store the Hive script in an Azure blob storage and provide the path to the file. Use 'script' or 'scriptPath' property. Both cannot be used together. The file name is case-sensitive. 
|No | -| defines |Specify parameters as key/value pairs for referencing within the Hive script using 'hiveconf' |No | --## Example -Let's consider an example of game log analytics where you want to identify the time spent by users playing games launched by your company. --The following log is a sample game log, which is comma (`,`) separated and contains the following fields: ProfileID, SessionStart, Duration, SrcIPAddress, and GameType. --``` -1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag -1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill -1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag -1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill -..... -``` --The **Hive script** to process this data: --``` -DROP TABLE IF EXISTS HiveSampleIn; -CREATE EXTERNAL TABLE HiveSampleIn -( - ProfileID string, - SessionStart string, - Duration int, - SrcIPAddress string, - GameType string -) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/samplein/'; --DROP TABLE IF EXISTS HiveSampleOut; -CREATE EXTERNAL TABLE HiveSampleOut -( - ProfileID string, - Duration int -) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION 'wasb://adfwalkthrough@<storageaccount>.blob.core.windows.net/sampleout/'; --INSERT OVERWRITE TABLE HiveSampleOut -Select - ProfileID, - SUM(Duration) -FROM HiveSampleIn Group by ProfileID -``` --To execute this Hive script in a Data Factory pipeline, you need to do the following: --1. Create a linked service to register [your own HDInsight compute cluster](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or configure [on-demand HDInsight compute cluster](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service). Let's call this linked service "HDInsightLinkedService". -2. Create a [linked service](data-factory-azure-blob-connector.md) to configure the connection to Azure Blob storage hosting the data. Let's call this linked service "StorageLinkedService". -3. Create [datasets](data-factory-create-datasets.md) pointing to the input and the output data. Let's call the input dataset "HiveSampleIn" and the output dataset "HiveSampleOut". -4. Copy the Hive query as a file to the Azure Blob Storage configured in step #2. If the storage that hosts the data is different from the one that hosts this query file, create a separate Azure Storage linked service and refer to it in the activity. Use **scriptPath** to specify the path to the Hive query file and **scriptLinkedService** to specify the Azure storage that contains the script file. - - > [!NOTE] - > You can also provide the Hive script inline in the activity definition by using the **script** property. We do not recommend this approach because all special characters in the script would need to be escaped within the JSON document, which can cause debugging issues. The best practice is to follow step #4. - > - > -5. Create a pipeline with the HDInsightHive activity. The activity processes/transforms the data. 
-- ```json - { - "name": "HiveActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "HiveActivitySample", - "type": "HDInsightHive", - "inputs": [ - { - "name": "HiveSampleIn" - } - ], - "outputs": [ - { - "name": "HiveSampleOut" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "typeproperties": { - "scriptPath": "adfwalkthrough\\scripts\\samplehive.hql", - "scriptLinkedService": "StorageLinkedService" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - } - } - ] - } - } - ``` --6. Deploy the pipeline. See [Creating pipelines](data-factory-create-pipelines.md) article for details. -7. Monitor the pipeline using the data factory monitoring and management views. See [Monitoring and manage Data Factory pipelines](data-factory-monitor-manage-pipelines.md) article for details. --## Specifying parameters for a Hive script -In this example, game logs are ingested daily into Azure Blob Storage and are stored in a folder partitioned with date and time. You want to parameterize the Hive script and pass the input folder location dynamically during runtime and also produce the output partitioned with date and time. --To use parameterized Hive script, do the following --* Define the parameters in **defines**. -- ```JSON - { - "name": "HiveActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "HiveActivitySample", - "type": "HDInsightHive", - "inputs": [ - { - "name": "HiveSampleIn" - } - ], - "outputs": [ - { - "name": "HiveSampleOut" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "typeproperties": { - "scriptPath": "adfwalkthrough\\scripts\\samplehive.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Input": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)", - "Output": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - } - } - } - ] - } - } - ``` -* In the Hive Script, refer to the parameter using **${hiveconf:parameterName}**. - - ``` - DROP TABLE IF EXISTS HiveSampleIn; - CREATE EXTERNAL TABLE HiveSampleIn - ( - ProfileID string, - SessionStart string, - Duration int, - SrcIPAddress string, - GameType string - ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION '${hiveconf:Input}'; -- DROP TABLE IF EXISTS HiveSampleOut; - CREATE EXTERNAL TABLE HiveSampleOut - ( - ProfileID string, - Duration int - ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '10' STORED AS TEXTFILE LOCATION '${hiveconf:Output}'; -- INSERT OVERWRITE TABLE HiveSampleOut - Select - ProfileID, - SUM(Duration) - FROM HiveSampleIn Group by ProfileID - ``` - ## See Also -* [Pig Activity](data-factory-pig-activity.md) -* [MapReduce Activity](data-factory-map-reduce.md) -* [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -* [Invoke Spark programs](data-factory-spark.md) -* [Invoke R scripts](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample) - |
data-factory | Data Factory How To Use Resource Manager Templates | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-how-to-use-resource-manager-templates.md | - Title: Use Resource Manager templates in Data Factory -description: Learn how to create and use Azure Resource Manager templates to create Data Factory entities. ------- Previously updated : 04/12/2023---# Use templates to create Azure Data Factory entities -> [!NOTE] -> This article applies to version 1 of Data Factory. --## Overview -While using Azure Data Factory for your data integration needs, you may find yourself reusing the same pattern across different environments or implementing the same task repetitively within the same solution. Templates make these scenarios easier to implement and manage. Templates in Azure Data Factory are ideal for scenarios that involve reusability and repetition. --Consider the situation where an organization has 10 manufacturing plants across the world. The logs from each plant are stored in a separate SQL Server database. The company wants to build a single data warehouse in the cloud for ad hoc analytics. It also wants to have the same logic but different configurations for development, test, and production environments. --In this case, the same task needs to be repeated, but with different values across the 10 data factories (one for each manufacturing plant). In effect, **repetition** is present. Templating abstracts this generic flow (that is, pipelines with the same activities in each data factory) while using a separate parameter file for each manufacturing plant. --Furthermore, because the organization wants to deploy these 10 data factories multiple times across different environments, templates provide this **reusability** through separate parameter files for the development, test, and production environments. --## Templating with Azure Resource Manager -[Azure Resource Manager templates](../../azure-resource-manager/templates/overview.md) are a great way to achieve templating in Azure Data Factory. Resource Manager templates define the infrastructure and configuration of your Azure solution through a JSON file. Because Azure Resource Manager templates work with most Azure services, they can be used to manage nearly all the resources in your Azure environment. See [Authoring Azure Resource Manager templates](../../azure-resource-manager/templates/syntax.md) to learn more about Resource Manager templates in general. 
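As a concrete illustration of the parameter-file approach described above, here is a minimal sketch of a Resource Manager parameter file for one plant's development environment. The parameter names (`dataFactoryName`, `plantSqlServerName`, `environment`) are assumptions for illustration; your template defines the actual parameters it accepts.

```json
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "dataFactoryName": { "value": "plant01-dev-datafactory" },
        "plantSqlServerName": { "value": "plant01-dev-sqlserver" },
        "environment": { "value": "dev" }
    }
}
```

You would keep one such file per plant and per environment (for example, `plant01.dev.parameters.json` and `plant01.prod.parameters.json`) and deploy the same template with each of them.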
--## Tutorials -See the following tutorials for step-by-step instructions to create Data Factory entities by using Resource Manager templates: --* [Tutorial: Create a pipeline to copy data by using Azure Resource Manager template](data-factory-copy-activity-tutorial-using-azure-resource-manager-template.md) -* [Tutorial: Create a pipeline to process data by using Azure Resource Manager template](data-factory-build-your-first-pipeline.md) --## Data Factory templates on GitHub -Check out the following Azure quickstart templates on GitHub: --* [Create a Data factory to copy data from Azure Blob Storage to Azure SQL Database](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-blob-to-sql-copy) -* [Create a Data factory with Hive activity on Azure HDInsight cluster](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-hive-transformation) -* [Create a Data factory to copy data from Salesforce to Azure Blobs](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-salesforce-to-blob-copy) -* [Create a Data factory that chains activities: copies data from an FTP server to Azure Blobs, invokes a hive script on an on-demand HDInsight cluster to transform the data, and copies result into Azure SQL Database](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-ftp-hive-blob) --Feel free to share your Azure Data Factory templates at [Azure quickstart](https://azure.microsoft.com/resources/templates/). Refer to the [contribution guide](https://github.com/Azure/azure-quickstart-templates/tree/master/1-CONTRIBUTION-GUIDE) while developing templates that can be shared via this repository. --The following sections provide details about defining Data Factory resources in a Resource Manager template. --## Defining Data Factory resources in templates -The top-level template for defining a data factory is: --```JSON -"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#", -"contentVersion": "1.0.0.0", -"parameters": { ... -}, -"variables": { ... -}, -"resources": [ -{ - "name": "[parameters('dataFactoryName')]", - "apiVersion": "[variables('apiVersion')]", - "type": "Microsoft.DataFactory/datafactories", - "location": "westus", - "resources": [ - { "type": "linkedservices", - ... - }, - {"type": "datasets", - ... - }, - {"type": "dataPipelines", - ... - } -} -``` --### Define data factory -You define a data factory in the Resource Manager template as shown in the following sample: --```JSON -"resources": [ -{ - "name": "[variables('<mydataFactoryName>')]", - "apiVersion": "2015-10-01", - "type": "Microsoft.DataFactory/datafactories", - "location": "East US" -} -``` -The dataFactoryName is defined in "variables" as: --```JSON -"dataFactoryName": "[concat('<myDataFactoryName>', uniqueString(resourceGroup().id))]", -``` --### Define linked services --```JSON -"type": "linkedservices", -"name": "[variables('<LinkedServiceName>')]", -"apiVersion": "2015-10-01", -"dependsOn": [ "[variables('<dataFactoryName>')]" ], -"properties": { - ... -} -``` --See [Storage Linked Service](data-factory-azure-blob-connector.md#azure-storage-linked-service) or [Compute Linked Services](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) for details about the JSON properties for the specific linked service you wish to deploy. 
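An Azure Storage linked service, for instance, might look like the following hedged sketch. The variable and parameter names (`storageLinkedServiceName`, `dataFactoryName`, `storageAccountName`, `storageAccountKey`) are assumptions that follow the naming used elsewhere in this article, and the "dependsOn" entry references the parent data factory.

```json
{
    "type": "linkedservices",
    "name": "[variables('storageLinkedServiceName')]",
    "apiVersion": "2015-10-01",
    "dependsOn": [ "[variables('dataFactoryName')]" ],
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "[concat('DefaultEndpointsProtocol=https;AccountName=', parameters('storageAccountName'), ';AccountKey=', parameters('storageAccountKey'))]"
        }
    }
}
```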
The "dependsOn" parameter specifies name of the corresponding data factory. An example of defining a linked service for Azure Storage is shown in the following JSON definition: --### Define datasets --```JSON -"type": "datasets", -"name": "[variables('<myDatasetName>')]", -"dependsOn": [ - "[variables('<dataFactoryName>')]", - "[variables('<myDatasetLinkedServiceName>')]" -], -"apiVersion": "2015-10-01", -"properties": { - ... -} -``` -Refer to [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) for details about the JSON properties for the specific dataset type you wish to deploy. Note the "dependsOn" parameter specifies name of the corresponding data factory and storage linked service. An example of defining dataset type of Azure blob storage is shown in the following JSON definition: --```JSON -"type": "datasets", -"name": "[variables('storageDataset')]", -"dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('storageLinkedServiceName')]" -], -"apiVersion": "2015-10-01", -"properties": { -"type": "AzureBlob", -"linkedServiceName": "[variables('storageLinkedServiceName')]", -"typeProperties": { - "folderPath": "[concat(parameters('sourceBlobContainer'), '/')]", - "fileName": "[parameters('sourceBlobName')]", - "format": { - "type": "TextFormat" - } -}, -"availability": { - "frequency": "Hour", - "interval": 1 -} -``` --### Define pipelines --```JSON -"type": "dataPipelines", -"name": "[variables('<mypipelineName>')]", -"dependsOn": [ - "[variables('<dataFactoryName>')]", - "[variables('<inputDatasetLinkedServiceName>')]", - "[variables('<outputDatasetLinkedServiceName>')]", - "[variables('<inputDataset>')]", - "[variables('<outputDataset>')]" -], -"apiVersion": "2015-10-01", -"properties": { - activities: { - ... - } -} -``` --Refer to [defining pipelines](data-factory-create-pipelines.md#pipeline-json) for details about the JSON properties for defining the specific pipeline and activities you wish to deploy. Note the "dependsOn" parameter specifies name of the data factory, and any corresponding linked services or datasets. 
An example of a pipeline that copies data from Azure Blob Storage to Azure SQL Database is shown in the following JSON snippet: --```JSON -"type": "datapipelines", -"name": "[variables('pipelineName')]", -"dependsOn": [ - "[variables('dataFactoryName')]", - "[variables('azureStorageLinkedServiceName')]", - "[variables('azureSqlLinkedServiceName')]", - "[variables('blobInputDatasetName')]", - "[variables('sqlOutputDatasetName')]" -], -"apiVersion": "2015-10-01", -"properties": { - "activities": [ - { - "name": "CopyFromAzureBlobToAzureSQL", - "description": "Copy data frm Azure blob to Azure SQL", - "type": "Copy", - "inputs": [ - { - "name": "[variables('blobInputDatasetName')]" - } - ], - "outputs": [ - { - "name": "[variables('sqlOutputDatasetName')]" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM {0}', 'emp')" - }, - "translator": { - "type": "TabularTranslator", - "columnMappings": "Column0:FirstName,Column1:LastName" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 3, - "timeout": "01:00:00" - } - } - ], - "start": "2016-10-03T00:00:00Z", - "end": "2016-10-04T00:00:00Z" -} -``` -## Parameterizing Data Factory template -For best practices on parameterizing, see [Best practices for creating Azure Resource Manager templates](../../azure-resource-manager/templates/best-practices.md). In general, parameter usage should be minimized, especially if variables can be used instead. Only provide parameters in the following scenarios: --* Settings vary by environment (example: development, test, and production) -* Secrets (such as passwords) --If you need to pull secrets from [Azure Key Vault](../../key-vault/general/overview.md) when deploying Azure Data Factory entities using templates, specify the **key vault** and **secret name** as shown in the following example: --```JSON -"parameters": { - "storageAccountKey": { - "reference": { - "keyVault": { - "id":"/subscriptions/<subscriptionID>/resourceGroups/<resourceGroupName>/providers/Microsoft.KeyVault/vaults/<keyVaultName>", - }, - "secretName": "<secretName>" - }, - }, - ... -} -``` --> [!NOTE] -> While exporting templates for existing data factories is currently not yet supported, it is in the works. -> -> |
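The Key Vault reference above lives in the *parameter file*. Inside the template itself, the matching parameter is declared as a `securestring` so the secret is not exposed in deployment output. A minimal sketch, assuming the same `storageAccountKey` parameter name:

```json
"parameters": {
    "storageAccountKey": {
        "type": "securestring",
        "metadata": {
            "description": "Supplied from Azure Key Vault by the parameter file; referenced in linked services via parameters('storageAccountKey')."
        }
    }
}
```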
data-factory | Data Factory Http Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-http-connector.md | - Title: Move data from an HTTP source - Azure -description: Learn how to move data from an on-premises or cloud HTTP source by using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data from an HTTP source by using Azure Data Factory --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-http-connector.md) -> * [Version 2 (current version)](../connector-http.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you're using the current version of the Azure Data Factory service, see [HTTP connector in V2](../connector-http.md). --This article outlines how to use Copy Activity in Azure Data Factory to move data from an on-premises or cloud HTTP endpoint to a supported sink data store. This article builds on [Move data by using Copy Activity](data-factory-data-movement-activities.md), which presents a general overview of data movement by using Copy Activity. The article also lists the data stores that Copy Activity supports as sources and sinks. --Data Factory currently supports only moving data from an HTTP source to other data stores. It doesn't support moving data from other data stores to an HTTP destination. --## Supported scenarios and authentication types --You can use this HTTP connector to retrieve data from *both a cloud and an on-premises HTTP/S endpoint* by using the HTTP **GET** or **POST** methods. The following authentication types are supported: **Anonymous**, **Basic**, **Digest**, **Windows**, and **ClientCertificate**. Note the difference between this connector and the [Web table connector](data-factory-web-table-connector.md). The Web table connector extracts table content from an HTML webpage. --When you copy data from an on-premises HTTP endpoint, you must install Data Management Gateway in the on-premises environment or in an Azure VM. To learn about Data Management Gateway and for step-by-step instructions on how to set up the gateway, see [Moving data between on-premises locations and the cloud](data-factory-move-data-between-onprem-and-cloud.md). --## Get started --You can create a pipeline that has a copy activity to move data from an HTTP source by using different tools or APIs: --- The easiest way to create a pipeline is to use the Copy Data wizard. For a quick walkthrough of creating a pipeline by using the Copy Data wizard, see [Tutorial: Create a pipeline by using the Copy wizard](data-factory-copy-data-wizard-tutorial.md).--- You can also use the following tools to create a pipeline: the **Visual Studio**, **Azure PowerShell**, an **Azure Resource Manager template**, the **.NET API**, or the **REST API**. For step-by-step instructions on how to create a pipeline that has a copy activity, see the [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). For JSON samples that copy data from an HTTP source to Azure Blob storage, see [JSON examples](#json-examples).--## Linked service properties --The following table describes JSON elements that are specific to the HTTP linked service: --| Property | Description | Required | -| | | | -| type | The **type** property must be set to **Http**. | Yes | -| url | The base URL to the web server. | Yes | -| authenticationType | Specifies the authentication type. 
Allowed values are **Anonymous**, **Basic**, **Digest**, **Windows**, and **ClientCertificate**. <br><br> Refer to later sections in this article for more properties and JSON samples for these authentication types. | Yes | -| enableServerCertificateValidation | Specifies whether to enable server TLS/SSL certificate validation if the source is an HTTPS web server. When your HTTPS server uses a self-signed certificate, set this to **false**. | No<br /> (the default is **true**) | -| gatewayName | The name of the Data Management Gateway instance to use to connect to an on-premises HTTP source. | Yes, if you are copying data from an on-premises HTTP source | -| encryptedCredential | The encrypted credential for accessing the HTTP endpoint. The value is autogenerated when you configure the authentication information in the Copy wizard or by using the **ClickOnce** dialog box. | No<br /> (apply only when you copy data from an on-premises HTTP server) | --For details about setting credentials for an on-premises HTTP connector data source, see [Move data between on-premises sources and the cloud by using Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md). --### Using Basic, Digest, or Windows authentication --Set **authenticationType** to **Basic**, **Digest**, or **Windows**. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties: --| Property | Description | Required | -| | | | -| userName | The user name to use to access the HTTP endpoint. | Yes | -| password | The password for the user (**username**). | Yes | --**Example: Using Basic, Digest, or Windows authentication** --```json -{ - "name": "HttpLinkedService", - "properties": - { - "type": "Http", - "typeProperties": - { - "authenticationType": "basic", - "url" : "https://en.wikipedia.org/wiki/", - "userName": "user name", - "password": "password" - } - } -} -``` --### Using ClientCertificate authentication --To use basic authentication, set **authenticationType** to **ClientCertificate**. In addition to the generic HTTP connector properties described in the preceding sections, set the following properties: --| Property | Description | Required | -| | | | -| embeddedCertData | The Base64-encoded contents of binary data of the PFX file. | Specify either **embeddedCertData** or **certThumbprint** | -| certThumbprint | The thumbprint of the certificate that was installed on your gateway machine's cert store. Apply only when you copy data from an on-premises HTTP source. | Specify either **embeddedCertData** or **certThumbprint** | -| password | The password that's associated with the certificate. | No | --If you use **certThumbprint** for authentication and the certificate is installed in the personal store of the local computer, grant read permissions to the gateway service: --1. Open the Microsoft Management Console (MMC). Add the **Certificates** snap-in that targets **Local Computer**. -2. Expand **Certificates** > **Personal**, and then select **Certificates**. -3. Right-click the certificate from the personal store, and then select **All Tasks** >**Manage Private Keys**. -3. On the **Security** tab, add the user account under which the Data Management Gateway Host Service is running, with read access to the certificate. --**Example: Using a client certificate** --This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is installed on the machine that has Data Management Gateway installed. 
--```json -{ - "name": "HttpLinkedService", - "properties": - { - "type": "Http", - "typeProperties": - { - "authenticationType": "ClientCertificate", - "url": "https://en.wikipedia.org/wiki/", - "certThumbprint": "thumbprint of certificate", - "gatewayName": "gateway name" - } - } -} -``` --**Example: Using a client certificate in a file** --This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate file on the machine that has Data Management Gateway installed. --```json -{ - "name": "HttpLinkedService", - "properties": - { - "type": "Http", - "typeProperties": - { - "authenticationType": "ClientCertificate", - "url": "https://en.wikipedia.org/wiki/", - "embeddedCertData": "Base64-encoded cert data", - "password": "password of cert" - } - } -} -``` --## Dataset properties --Some sections of a dataset JSON file, such as structure, availability, and policy, are similar for all dataset types (Azure SQL Database, Azure Blob storage, Azure Table storage). --For a full list of sections and properties that are available for defining datasets, see [Creating datasets](data-factory-create-datasets.md). --The **typeProperties** section is different for each type of dataset. The **typeProperties** section provides information about the location of the data in the data store. The **typeProperties** section for a dataset of the **Http** type has the following properties: --| Property | Description | Required | -|: |: |: | -| type | The **type** of the dataset must be set to **Http**. | Yes | -| relativeUrl | A relative URL to the resource that contains the data. When the path isn't specified, only the URL that's specified in the linked service definition is used. <br><br> To construct a dynamic URL, you can use [Data Factory functions and system variables](data-factory-functions-variables.md). Example: **relativeUrl**: **$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)**. | No | -| requestMethod | The HTTP method. Allowed values are **GET** and **POST**. | No <br />(default is **GET**) | -| additionalHeaders | Additional HTTP request headers. | No | -| requestBody | The body for the HTTP request. | No | -| format | If you want to *retrieve the data from an HTTP endpoint as-is* without parsing it, skip the **format** setting. <br><br> If you want to parse the HTTP response content during copy, the following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, and **ParquetFormat**. For more information, see [Text format](data-factory-supported-file-and-compression-formats.md#text-format), [JSON format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet format](data-factory-supported-file-and-compression-formats.md#parquet-format). |No | -| compression | Specify the type and level of compression for the data. Supported types: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). 
|No | --**Example: Using the GET (default) method** --```json -{ - "name": "HttpSourceDataInput", - "properties": { - "type": "Http", - "linkedServiceName": "HttpLinkedService", - "typeProperties": { - "relativeUrl": "XXX/test.xml", - "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Example: Using the POST method** --```json -{ - "name": "HttpSourceDataInput", - "properties": { - "type": "Http", - "linkedServiceName": "HttpLinkedService", - "typeProperties": { - "relativeUrl": "/XXX/test.xml", - "requestMethod": "Post", - "requestBody": "body for POST HTTP request" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --## Copy Activity properties --Properties like name, description, input and output tables, and policy are available for all types of activities. --For a full list of sections and properties that are available for defining activities, see [Creating pipelines](data-factory-create-pipelines.md). --Properties that are available in the **typeProperties** section of the activity vary with each activity type. For a copy activity, properties vary depending on the types of sources and sinks. --Currently, when the source in Copy Activity is of the **HttpSource** type, the following properties are supported: --| Property | Description | Required | -| -- | -- | -- | -| httpRequestTimeout | The timeout (the **TimeSpan** value) for the HTTP request to get a response. It's the timeout to get a response, not the timeout to read response data. | No<br />(default value: **00:01:40**) | --## Supported file and compression formats --See [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md) for more. --## JSON examples --The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). The examples show how to copy data from an HTTP source to Azure Blob storage. However, data can be copied *directly* from any of the sources to any of the sinks [that are supported](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Copy Activity in Azure Data Factory. --**Example: Copy data from an HTTP source to Azure Blob storage** --The Data Factory solution for this sample contains the following Data Factory entities: --* A linked service of type [HTTP](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [Http](#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) that has a copy activity that uses [HttpSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from an HTTP source to an Azure blob every hour. The JSON properties used in these samples are described in sections that follow the samples. --### HTTP linked service --This example uses the HTTP linked service with anonymous authentication. 
See [HTTP linked service](#linked-service-properties) for different types of authentication you can use. --```json -{ - "name": "HttpLinkedService", - "properties": - { - "type": "Http", - "typeProperties": - { - "authenticationType": "Anonymous", - "url" : "https://en.wikipedia.org/wiki/" - } - } -} -``` --### Azure storage linked service --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>" - } - } -} -``` --### HTTP input dataset --Setting **external** to **true** informs the Data Factory service that the dataset is external to the data factory and isn't produced by an activity in the data factory. --```json -{ - "name": "HttpSourceDataInput", - "properties": { - "type": "Http", - "linkedServiceName": "HttpLinkedService", - "typeProperties": { - "relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)", - "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} --``` --### Azure blob output dataset --Data is written to a new blob every hour (**frequency**: **hour**, **interval**: **1**). --```json -{ - "name": "AzureBlobOutput", - "properties": - { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": - { - "folderPath": "adfgetstarted/Movies" - }, - "availability": - { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --### Pipeline that uses a copy activity --The pipeline contains a copy activity that is configured to use the input and output datasets. The copy activity is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **HttpSource** and the **sink** type is set to **BlobSink**. --For the list of properties that **HttpSource** supports, see [HttpSource](#copy-activity-properties). --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with a copy activity", - "activities":[ - { - "name": "HttpSourceToAzureBlob", - "description": "Copy from an HTTP source to an Azure blob", - "type": "Copy", - "inputs": [ - { - "name": "HttpSourceDataInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "HttpSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --> [!NOTE] -> To map columns from a source dataset to columns from a sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and tuning --To learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md). |
data-factory | Data Factory Introduction | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-introduction.md | - Title: Introduction to Data Factory, a data integration service -description: 'Learn what Azure Data Factory is: A cloud data integration service that orchestrates and automates movement and transformation of data.' ------ Previously updated : 04/12/2023---# Introduction to Azure Data Factory V1 -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-introduction.md) -> * [Version 2 (current version)](../introduction.md) --> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see [Introduction to Data Factory V2](../introduction.md). ---## What is Azure Data Factory? -In the world of big data, how is existing data leveraged in business? Is it possible to enrich data that's generated in the cloud by using reference data from on-premises data sources or other disparate data sources? --For example, a gaming company collects logs that are produced by games in the cloud. It wants to analyze these logs to gain insights into customer preferences, demographics, usage behavior, and so on. The company also wants to identify up-sell and cross-sell opportunities, develop compelling new features to drive business growth, and provide a better experience to customers. --To analyze these logs, the company needs to use the reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. Therefore, the company wants to ingest log data from the cloud data store and reference data from the on-premises data store. --Next they want to process the data by using Hadoop in the cloud (Azure HDInsight). They want to publish the result data into a cloud data warehouse such as Azure Synapse Analytics or an on-premises data store such as SQL Server. The company wants this workflow to run once a week. --The company needs a platform where they can create a workflow that can ingest data from both on-premises and cloud data stores. The company also needs to be able to transform or process data by using existing compute services such as Hadoop, and publish the results to an on-premises or cloud data store for BI applications to consume. ---Azure Data Factory is the platform for these kinds of scenarios. It is a *cloud-based data integration service that allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data transformation*. Using Azure Data Factory, you can do the following tasks: --- Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.--- Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.--- Publish output data to data stores such as Azure Synapse Analytics for business intelligence (BI) applications to consume. --It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform rather than a traditional Extract-Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by adding derived columns, counting the number of rows, sorting data, and so on. 
--Currently, in Azure Data Factory, the data that workflows consume and produce is *time-sliced data* (hourly, daily, weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a day. You can also run a workflow just one time. - --## How does it work? -The pipelines (data-driven workflows) in Azure Data Factory typically perform the following three steps: ---### Connect and collect -Enterprises have data of various types that are located in disparate sources. The first step in building an information production system is to connect to all the required sources of data and processing. These sources include SaaS services, file shares, FTP, and web services. Then move the data as needed to a centralized location for subsequent processing. --Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. It is expensive and hard to integrate and maintain such systems. These systems also often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer. --With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. --For example, you can collect data in Azure Data Lake Store and transform the data later by using an Azure Data Lake Analytics compute service. Or, collect data in Azure blob storage and transform it later by using an Azure HDInsight Hadoop cluster. --### Transform and enrich -After data is present in a centralized data store in the cloud, process or transform it by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, or Machine Learning. You want to reliably produce transformed data on a maintainable and controlled schedule to feed production environments with trusted data. --### Publish -Deliver transformed data from the cloud to on-premises sources such as SQL Server. Alternatively, keep it in your cloud storage sources for consumption by BI and analytics tools and other applications. --## Key components -An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of four key components. These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data. --### Pipeline -A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a pipeline perform a task. --For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data. The benefit is that the pipeline lets you manage the activities as a set instead of managing each one individually. For example, you can deploy and schedule the pipeline, instead of scheduling independent activities. --### Activity -A pipeline can have one or more activities. Activities define the actions to perform on your data. For example, you can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity. A Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze your data. Data Factory supports two types of activities: data movement activities and data transformation activities. 
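To make the pipeline and activity concepts concrete, here is a minimal sketch of a version 1 pipeline with the two activities described above: a copy activity that ingests blob data, followed by a Hive activity that partitions it. The dataset and linked service names (`RawLogsBlobDataset`, `StagedLogsBlobDataset`, `PartitionedLogsDataset`, `HDInsightLinkedService`, `StorageLinkedService`) are placeholders, not entities defined in this article.

```json
{
    "name": "GameLogsPipeline",
    "properties": {
        "description": "Ingest raw logs from Azure Blob storage, then partition them with a Hive query",
        "activities": [
            {
                "name": "IngestLogsFromBlob",
                "type": "Copy",
                "inputs": [ { "name": "RawLogsBlobDataset" } ],
                "outputs": [ { "name": "StagedLogsBlobDataset" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Day", "interval": 1 }
            },
            {
                "name": "PartitionLogsWithHive",
                "type": "HDInsightHive",
                "inputs": [ { "name": "StagedLogsBlobDataset" } ],
                "outputs": [ { "name": "PartitionedLogsDataset" } ],
                "linkedServiceName": "HDInsightLinkedService",
                "typeProperties": {
                    "scriptPath": "scripts/partitionlogs.hql",
                    "scriptLinkedService": "StorageLinkedService"
                },
                "scheduler": { "frequency": "Day", "interval": 1 }
            }
        ],
        "start": "2016-07-12T00:00:00Z",
        "end": "2016-07-13T00:00:00Z"
    }
}
```

Deploying and scheduling this single pipeline schedules both activities together, which is the management benefit described above.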
--### Data movement activities -Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory supports the following data stores: ---For more information, see [Move data by using Copy Activity](data-factory-data-movement-activities.md). --### Data transformation activities --For more information, see [Move data by using Copy Activity](data-factory-data-transformation-activities.md). --### Custom .NET activities -Create a custom .NET activity if you need to move data to or from a data store that Copy Activity doesn't support or if you need to transform data by using your own logic. For details about how to create and use a custom activity, see [Use custom activities in an Azure Data Factory pipeline](data-factory-use-custom-activities.md). --### Datasets -An activity takes zero or more datasets as inputs and one or more datasets as outputs. Datasets represent data structures within the data stores. These structures point to or reference the data you want to use in your activities (such as inputs or outputs). --For example, an Azure blob dataset specifies the blob container and folder in the Azure blob storage from which the pipeline should read the data. Or an Azure SQL table dataset specifies the table to which the output data is written by the activity. --### Linked services -Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source and a dataset represents the structure of the data. --For example, an Azure Storage-linked service specifies a connection string with which to connect to the Azure Storage account. An Azure blob dataset specifies the blob container and the folder that contains the data. --Linked services are used for two reasons in Data Factory: --* To represent a *data store* that includes, but isn't limited to, a SQL Server database, Oracle database, file share, or Azure blob storage account. See the [Data movement activities](#data-movement-activities) section for a list of supported data stores. --* To represent a *compute resource* that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. See the [Data transformation activities](#data-transformation-activities) section for a list of supported compute environments. --### Relationship between Data Factory entities ---## Supported regions -Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or process data by using compute services. --Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between [supported data stores](#data-movement-activities). It also lets you process data by using [compute services](#data-transformation-activities) in other regions or in an on-premises environment. It also allows you to [monitor and manage workflows](data-factory-monitor-manage-pipelines.md) by using both programmatic and UI mechanisms. --Data Factory is available in only West US, East US, and North Europe regions. 
However, the service that powers the data movement in Data Factory is available [globally](data-factory-data-movement-activities.md#global) in several regions. If a data store is behind a firewall, then a [Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md) that's installed in your on-premises environment moves the data instead. --For example, suppose your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are located in the West Europe region. You can create and use an Azure Data Factory instance in North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time it takes to run the job on your compute environment does not change. --## Get started with creating a pipeline -You can use one of these tools or APIs to create data pipelines in Azure Data Factory: --- Visual Studio-- PowerShell-- .NET API-- REST API-- Azure Resource Manager template--To learn how to build data factories with data pipelines, follow the step-by-step instructions in the following tutorials: --| Tutorial | Description | -| | | -| [Move data between two cloud data stores](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) |Create a data factory with a pipeline that moves data from blob storage to SQL Database. | -| [Transform data by using Hadoop cluster](data-factory-build-your-first-pipeline.md) |Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster. | -| [Move data between an on-premises data store and a cloud data store by using Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md) |Build a data factory with a pipeline that moves data from a SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. | |
data-factory | Data Factory Invoke Stored Procedure From Copy Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-invoke-stored-procedure-from-copy-activity.md | - Title: Invoke stored procedure from Azure Data Factory Copy Activity -description: 'Learn how to invoke a stored procedure in Azure SQL Database, or SQL Server from an Azure Data Factory copy activity.' ---- Previously updated : 04/12/2023-----# Invoke stored procedure from copy activity in Azure Data Factory -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using stored procedure activity in Data Factory](../transform-data-using-stored-procedure.md). ---When copying data into [SQL Server](data-factory-sqlserver-connector.md) or [Azure SQL Database](data-factory-azure-sql-connector.md), you can configure the **SqlSink** in copy activity to invoke a stored procedure. You may want to use the stored procedure to perform any additional processing (merging columns, looking up values, insertion into multiple tables, etc.) is required before inserting data in to the destination table. This feature takes advantage of [Table-Valued Parameters](/dotnet/framework/data/adonet/sql/table-valued-parameters). --The following sample shows how to invoke a stored procedure in a SQL Server database from a Data Factory pipeline (copy activity): --## Output dataset JSON -In the output dataset JSON, set the **type** to: **SqlServerTable**. Set it to **AzureSqlTable** to use with Azure SQL Database. The value for **tableName** property must match the name of first parameter of the stored procedure. --```json -{ - "name": "SqlOutput", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlLinkedService", - "typeProperties": { - "tableName": "Marketing" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --## SqlSink section in copy activity JSON -Define the **SqlSink** section in the copy activity JSON as follows. To invoke a stored procedure while inserting data into the sink/destination database, specify values for both **SqlWriterStoredProcedureName** and **SqlWriterTableType** properties. For descriptions of these properties, see [SqlSink section in the SQL Server connector article](data-factory-sqlserver-connector.md#sqlsink). --```json -"sink": -{ - "type": "SqlSink", - "SqlWriterTableType": "MarketingType", - "SqlWriterStoredProcedureName": "spOverwriteMarketing", - "storedProcedureParameters": - { - "stringData": - { - "value": "str1" - } - } -} -``` --## Stored procedure definition -In your database, define the stored procedure with the same name as **SqlWriterStoredProcedureName**. The stored procedure handles input data from the source data store, and inserts data into a table in the destination database. The name of the first parameter of stored procedure must match the tableName defined in the dataset JSON (Marketing). --```sql -CREATE PROCEDURE spOverwriteMarketing @Marketing [dbo].[MarketingType] READONLY, @stringData varchar(256) -AS -BEGIN - DELETE FROM [dbo].[Marketing] where ProfileID = @stringData - INSERT [dbo].[Marketing](ProfileID, State) - SELECT * FROM @Marketing -END -``` --## Table type definition -In your database, define the table type with the same name as **SqlWriterTableType**. The schema of the table type must match the schema of the input dataset. 
--```sql -CREATE TYPE [dbo].[MarketingType] AS TABLE( - [ProfileID] [varchar](256) NOT NULL, - [State] [varchar](256) NOT NULL -) -``` --## Next steps -Review the following connector articles for complete JSON examples: --- [Azure SQL Database](data-factory-azure-sql-connector.md)-- [SQL Server](data-factory-sqlserver-connector.md) |
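To put the pieces together, here is a minimal sketch of how the **SqlSink** section above might sit inside a complete copy activity. The pipeline name, activity name, and input dataset name (`BlobInput`) are hypothetical placeholders; only the sink settings and the `SqlOutput` dataset come from this article, and the surrounding structure follows the general copy activity layout in the JSON scripting reference.

```json
{
    "name": "InvokeSprocSamplePipeline",
    "properties": {
        "description": "Copy data and invoke spOverwriteMarketing on the sink",
        "activities": [
            {
                "name": "CopyWithStoredProcedure",
                "type": "Copy",
                "inputs": [ { "name": "BlobInput" } ],
                "outputs": [ { "name": "SqlOutput" } ],
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "SqlWriterTableType": "MarketingType",
                        "SqlWriterStoredProcedureName": "spOverwriteMarketing",
                        "storedProcedureParameters": {
                            "stringData": { "value": "str1" }
                        }
                    }
                }
            }
        ],
        "start": "2016-07-12T00:00:00",
        "end": "2016-07-13T00:00:00"
    }
}
```

Because the `SqlOutput` dataset's `tableName` is `Marketing`, the copied rows are passed to the first parameter of `spOverwriteMarketing` as a table-valued parameter of type `MarketingType`.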
data-factory | Data Factory Json Scripting Reference | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-json-scripting-reference.md | - Title: Azure Data Factory - JSON Scripting Reference -description: Provides JSON schemas for Data Factory entities. ------ Previously updated : 04/12/2023----# Azure Data Factory - JSON Scripting Reference -> [!NOTE] -> This article applies to version 1 of Data Factory. ---This article provides JSON schemas and examples for defining Azure Data Factory entities (pipeline, activity, dataset, and linked service). ---## Pipeline -The high-level structure for a pipeline definition is as follows: --```json -{ - "name": "SamplePipeline", - "properties": { - "description": "Describe what pipeline does", - "activities": [ - ], - "start": "2016-07-12T00:00:00", - "end": "2016-07-13T00:00:00" - } -} -``` --The following table describes the properties within the pipeline JSON definition: --| Property | Description | Required | -| | | | -| name | Name of the pipeline. Specify a name that represents the action that the activity or pipeline is configured to do<br/><ul><li>Maximum number of characters: 260</li><li>Must start with a letter, a number, or an underscore (\_)</li><li>The following characters are not allowed: “.”, “+”, “?”, “/”, “<”,”>”,”*”,”%”,”&”,”:”,”\\”</li></ul> |Yes | -| description |Text describing what the activity or pipeline is used for | No | -| activities | Contains a list of activities. | Yes | -| start |Start date-time for the pipeline. Must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2014-10-14T16:32:41. <br/><br/>It is possible to specify a local time, for example an EST time. Here is an example: `2016-02-27T06:00:00-05:00`, which is 6 AM EST.<br/><br/>The start and end properties together specify the active period for the pipeline. Output slices are only produced within this active period. |No<br/><br/>If you specify a value for the end property, you must specify a value for the start property.<br/><br/>The start and end times can both be empty to create a pipeline. You must specify both values to set an active period for the pipeline to run. If you do not specify start and end times when creating a pipeline, you can set them using the Set-AzDataFactoryPipelineActivePeriod cmdlet later. | -| end |End date-time for the pipeline. If specified, it must be in ISO format. For example: 2014-10-14T17:32:41 <br/><br/>It is possible to specify a local time, for example an EST time. Here is an example: `2016-02-27T06:00:00-05:00`, which is 6 AM EST.<br/><br/>To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property. |No <br/><br/>If you specify a value for the start property, you must specify a value for the end property.<br/><br/>See notes for the **start** property. | -| isPaused |If set to true, the pipeline does not run. Default value = false. You can use this property to enable or disable the pipeline. |No | -| pipelineMode |The method for scheduling runs for the pipeline. Allowed values are: scheduled (default), onetime.<br/><br/>‘Scheduled’ indicates that the pipeline runs at a specified time interval according to its active period (start and end time). ‘Onetime’ indicates that the pipeline runs only once. Once created, onetime pipelines currently cannot be modified or updated. See [Onetime pipeline](data-factory-create-pipelines.md#onetime-pipeline) for details about the onetime setting. 
|No | -| expirationTime |Duration of time after creation for which the pipeline is valid and should remain provisioned. If it does not have any active, failed, or pending runs, the pipeline is deleted automatically once it reaches the expiration time. |No | ---## Activity -The high-level structure for an activity within a pipeline definition (activities element) is as follows: --```json -{ - "name": "ActivityName", - "description": "description", - "type": "<ActivityType>", - "inputs": "[]", - "outputs": "[]", - "linkedServiceName": "MyLinkedService", - "typeProperties": - { -- }, - "policy": - { - }, - "scheduler": - { - } -} -``` --Following table describe the properties within the activity JSON definition: --| Tag | Description | Required | -| | | | -| name |Name of the activity. Specify a name that represents the action that the activity is configured to do<br/><ul><li>Maximum number of characters: 260</li><li>Must start with a letter number, or an underscore (\_)</li><li>Following characters are not allowed: “.”, “+”, “?”, “/”, “<”,”>”,”*”,”%”,”&”,”:”,”\\”</li></ul> |Yes | -| description |Text describing what the activity is used for. |No | -| type |Specifies the type of the activity. See the [DATA STORES](#data-stores) and [DATA TRANSFORMATION ACTIVITIES](#data-transformation-activities) sections for different types of activities. |Yes | -| inputs |Input tables used by the activity<br/><br/>`// one input table`<br/>`"inputs": [ { "name": "inputtable1" } ],`<br/><br/>`// two input tables` <br/>`"inputs": [ { "name": "inputtable1" }, { "name": "inputtable2" } ],` |No for HDInsightStreaming and SqlServerStoredProcedure activities <br/> <br/> Yes for all others | -| outputs |Output tables used by the activity.<br/><br/>`// one output table`<br/>`"outputs": [ { "name": “outputtable1” } ],`<br/><br/>`//two output tables`<br/>`"outputs": [ { "name": “outputtable1” }, { "name": “outputtable2” } ],` |Yes | -| linkedServiceName |Name of the linked service used by the activity. <br/><br/>An activity may require that you specify the linked service that links to the required compute environment. |Yes for HDInsight activities, ML Studio (classic) activities, and Stored Procedure Activity. <br/><br/>No for all others | -| typeProperties |Properties in the typeProperties section depend on type of the activity. |No | -| policy |Policies that affect the run-time behavior of the activity. If it is not specified, default policies are used. |No | -| scheduler |“scheduler” property is used to define desired scheduling for the activity. Its subproperties are the same as the ones in the [availability property in a dataset](data-factory-create-datasets.md#dataset-availability). |No | --### Policies -Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following table provides the details. --| Property | Permitted values | Default Value | Description | -| | | | | -| concurrency |Integer <br/><br/>Max value: 10 |1 |Number of concurrent executions of the activity.<br/><br/>It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. 
| -| executionPriorityOrder |NewestFirst<br/><br/>OldestFirst |OldestFirst |Determines the ordering of data slices that are being processed.<br/><br/>For example, if you have 2 slices (one happening at 4pm, and another one at 5pm), and both are pending execution. If you set the executionPriorityOrder to be NewestFirst, the slice at 5 PM is processed first. Similarly if you set the executionPriorityORder to be OldestFIrst, then the slice at 4 PM is processed. | -| retry |Integer<br/><br/>Max value can be 10 |0 |Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure. | -| timeout |TimeSpan |00:00:00 |Timeout for the activity. Example: 00:10:00 (implies timeout 10 mins)<br/><br/>If a value is not specified or is 0, the timeout is infinite.<br/><br/>If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut. | -| delay |TimeSpan |00:00:00 |Specify the delay before data processing of the slice starts.<br/><br/>The execution of activity for a data slice is started after the Delay is past the expected execution time.<br/><br/>Example: 00:10:00 (implies delay of 10 mins) | -| longRetry |Integer<br/><br/>Max value: 10 |1 |The number of long retry attempts before the slice execution is failed.<br/><br/>longRetry attempts are spaced by longRetryInterval. So if you need to specify a time between retry attempts, use longRetry. If both Retry and longRetry are specified, each longRetry attempt includes Retry attempts and the max number of attempts is Retry * longRetry.<br/><br/>For example, if we have the following settings in the activity policy:<br/>Retry: 3<br/>longRetry: 2<br/>longRetryInterval: 01:00:00<br/><br/>Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts. After each attempt, the slice status would be Retry. After first 3 attempts are over, the slice status would be LongRetry.<br/><br/>After an hour (that is, longRetryInteval’s value), there would be another set of 3 consecutive execution attempts. After that, the slice status would be Failed and no more retries would be attempted. Hence overall 6 attempts were made.<br/><br/>If any execution succeeds, the slice status would be Ready and no more retries are attempted.<br/><br/>longRetry may be used in situations where dependent data arrives at non-deterministic times or the overall environment is flaky under which data processing occurs. In such cases, doing retries one after another may not help and doing so after an interval of time results in the desired output.<br/><br/>Word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues. | -| longRetryInterval |TimeSpan |00:00:00 |The delay between long retry attempts | --### typeProperties section -The typeProperties section is different for each activity. Transformation activities have just the type properties. See [DATA TRANSFORMATION ACTIVITIES](#data-transformation-activities) section in this article for JSON samples that define transformation activities in a pipeline. --**Copy activity** has two subsections in the typeProperties section: **source** and **sink**. 
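For example, the typeProperties section of a Copy activity has the following general shape. This is only a minimal sketch; **BlobSource** and **SqlSink** are just one possible source/sink pairing (the same one used in the sample copy pipeline below), and each type has its own additional properties.

```json
"typeProperties": {
    "source": {
        "type": "BlobSource"
    },
    "sink": {
        "type": "SqlSink"
    }
}
```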
See [DATA STORES](#data-stores) section in this article for JSON samples that show how to use a data store as a source and/or sink. --### Sample copy pipeline -In the following sample pipeline, there is one activity of type **Copy** in the **activities** section. In this sample, the [Copy activity](data-factory-data-movement-activities.md) copies data from an Azure Blob storage to Azure SQL Database. --```json -{ - "name": "CopyPipeline", - "properties": { - "description": "Copy data from a blob to Azure SQL table", - "activities": [ - { - "name": "CopyFromBlobToSQL", - "type": "Copy", - "inputs": [ - { - "name": "InputDataset" - } - ], - "outputs": [ - { - "name": "OutputDataset" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink", - "writeBatchSize": 10000, - "writeBatchTimeout": "60:00:00" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2016-07-12T00:00:00", - "end": "2016-07-13T00:00:00" - } -} -``` --Note the following points: --* In the activities section, there is only one activity whose **type** is set to **Copy**. -* Input for the activity is set to **InputDataset** and output for the activity is set to **OutputDataset**. -* In the **typeProperties** section, **BlobSource** is specified as the source type and **SqlSink** is specified as the sink type. --See [DATA STORES](#data-stores) section in this article for JSON samples that show how to use a data store as a source and/or sink. --For a complete walkthrough of creating this pipeline, see [Tutorial: Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --### Sample transformation pipeline -In the following sample pipeline, there is one activity of type **HDInsightHive** in the **activities** section. In this sample, the [HDInsight Hive activity](data-factory-hive-activity.md) transforms data from an Azure Blob storage by running a Hive script file on an Azure HDInsight Hadoop cluster. --```json -{ - "name": "TransformPipeline", - "properties": { - "description": "My first Azure Data Factory pipeline", - "activities": [ - { - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adfgetstarted/script/partitionweblogs.hql", - "scriptLinkedService": "AzureStorageLinkedService", - "defines": { - "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", - "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" - } - }, - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "policy": { - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Month", - "interval": 1 - }, - "name": "RunSampleHiveActivity", - "linkedServiceName": "HDInsightOnDemandLinkedService" - } - ], - "start": "2016-04-01T00:00:00", - "end": "2016-04-02T00:00:00", - "isPaused": false - } -} -``` --Note the following points: --* In the activities section, there is only one activity whose **type** is set to **HDInsightHive**. -* The Hive script file, **partitionweblogs.hql**, is stored in the Azure storage account (specified by the scriptLinkedService, called **AzureStorageLinkedService**), and in **script** folder in the container **adfgetstarted**. 
-* The **defines** section is used to specify the runtime settings that are passed to the hive script as Hive configuration values (e.g `${hiveconf:inputtable}`, `${hiveconf:partitionedtable}`). --See [DATA TRANSFORMATION ACTIVITIES](#data-transformation-activities) section in this article for JSON samples that define transformation activities in a pipeline. --For a complete walkthrough of creating this pipeline, see [Tutorial: Build your first pipeline to process data using Hadoop cluster](data-factory-build-your-first-pipeline.md). --## Linked service -The high-level structure for a linked service definition is as follows: --```json -{ - "name": "<name of the linked service>", - "properties": { - "type": "<type of the linked service>", - "typeProperties": { - } - } -} -``` --Following table describe the properties within the activity JSON definition: --| Property | Description | Required | -| -- | -- | -- | -| name | Name of the linked service. | Yes | -| properties - type | Type of the linked service. For example: Azure Storage, Azure SQL Database. | -| typeProperties | The typeProperties section has elements that are different for each data store or compute environment. See data stores section for all the data store linked services and [compute environments](#compute-environments) for all the compute linked services | --## Dataset -A dataset in Azure Data Factory is defined as follows: --```json -{ - "name": "<name of dataset>", - "properties": { - "type": "<type of dataset: AzureBlob, AzureSql etc...>", - "external": <boolean flag to indicate external data. only for input datasets>, - "linkedServiceName": "<Name of the linked service that refers to a data store.>", - "structure": [ - { - "name": "<Name of the column>", - "type": "<Name of the type>" - } - ], - "typeProperties": { - "<type specific property>": "<value>", - "<type specific property 2>": "<value 2>", - }, - "availability": { - "frequency": "<Specifies the time unit for data slice production. Supported frequency: Minute, Hour, Day, Week, Month>", - "interval": "<Specifies the interval within the defined frequency. For example, frequency set to 'Hour' and interval set to 1 indicates that new data slices should be produced hourly>" - }, - "policy": - { - } - } -} -``` --The following table describes properties in the above JSON: --| Property | Description | Required | Default | -| | | | | -| name | Name of the dataset. See [Azure Data Factory - Naming rules](data-factory-naming-rules.md) for naming rules. |Yes |NA | -| type | Type of the dataset. Specify one of the types supported by Azure Data Factory (for example: AzureBlob, AzureSqlTable). See [DATA STORES](#data-stores) section for all the data stores and dataset types supported by Data Factory. | -| structure | Schema of the dataset. It contains columns, their types, etc. | No |NA | -| typeProperties | Properties corresponding to the selected type. See [DATA STORES](#data-stores) section for supported types and their properties. |Yes |NA | -| external | Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. |No |false | -| availability | Defines the processing window or the slicing model for the dataset production. For details on the dataset slicing model, see [Scheduling and Execution](data-factory-scheduling-and-execution.md) article. |Yes |NA | -| policy |Defines the criteria or the condition that the dataset slices must fulfill. <br/><br/>For details, see Dataset Policy section. 
|No |NA | --Each column in the **structure** section contains the following properties: --| Property | Description | Required | -| | | | -| name |Name of the column. |Yes | -| type |Data type of the column. |No | -| culture |.NET based culture to be used when type is specified and is .NET type `Datetime` or `Datetimeoffset`. Default is `en-us`. |No | -| format |Format string to be used when type is specified and is .NET type `Datetime` or `Datetimeoffset`. |No | --In the following example, the dataset has three columns `slicetimestamp`, `projectname`, and `pageviews` and they are of type: String, String, and Decimal respectively. --```json -structure: -[ - { "name": "slicetimestamp", "type": "String"}, - { "name": "projectname", "type": "String"}, - { "name": "pageviews", "type": "Decimal"} -] -``` --The following table describes properties you can use in the **availability** section: --| Property | Description | Required | Default | -| | | | | -| frequency |Specifies the time unit for dataset slice production.<br/><br/><b>Supported frequency</b>: Minute, Hour, Day, Week, Month |Yes |NA | -| interval |Specifies a multiplier for frequency<br/><br/>”Frequency x interval” determines how often the slice is produced.<br/><br/>If you need the dataset to be sliced on an hourly basis, you set <b>Frequency</b> to <b>Hour</b>, and <b>interval</b> to <b>1</b>.<br/><br/><b>Note</b>: If you specify Frequency as Minute, we recommend that you set the interval to no less than 15 |Yes |NA | -| style |Specifies whether the slice should be produced at the start/end of the interval.<ul><li>StartOfInterval</li><li>EndOfInterval</li></ul><br/><br/>If Frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of month. If the style is set to StartOfInterval, the slice is produced on the first day of month.<br/><br/>If Frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day.<br/><br/>If Frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for 1 PM - 2 PM period, the slice is produced at 2 PM. |No |EndOfInterval | -| anchorDateTime |Defines the absolute position in time used by scheduler to compute dataset slice boundaries. <br/><br/><b>Note</b>: If the AnchorDateTime has date parts that are more granular than the frequency then the more granular parts are ignored. <br/><br/>For example, if the <b>interval</b> is <b>hourly</b> (frequency: hour and interval: 1) and the <b>AnchorDateTime</b> contains <b>minutes and seconds</b> then the <b>minutes and seconds</b> parts of the AnchorDateTime are ignored. |No |01/01/0001 | -| offset |Timespan by which the start and end of all dataset slices are shifted. <br/><br/><b>Note</b>: If both anchorDateTime and offset are specified, the result is the combined shift. |No |NA | --The following availability section specifies that the output dataset is either produced hourly (or) input dataset is available hourly: --```json -"availability": -{ - "frequency": "Hour", - "interval": 1 -} -``` --The **policy** section in dataset definition defines the criteria or the condition that the dataset slices must fulfill. --| Policy Name | Description | Applied To | Required | Default | -| | | | | | -| minimumSizeMB |Validates that the data in an **Azure blob** meets the minimum size requirements (in megabytes). 
|Azure Blob |No |NA | -| minimumRows |Validates that the data in **Azure SQL Database** or an **Azure table** contains the minimum number of rows. |<ul><li>Azure SQL Database</li><li>Azure Table</li></ul> |No |NA | --**Example:** --```json -"policy": --{ - "validation": - { - "minimumSizeMB": 10.0 - } -} -``` --Unless a dataset is being produced by Azure Data Factory, it should be marked as **external**. This setting generally applies to the inputs of the first activity in a pipeline, unless activity or pipeline chaining is being used. --| Name | Description | Required | Default Value | -| | | | | -| dataDelay |Time to delay the check on the availability of the external data for the given slice. For example, if the data is available hourly, the check to see whether the external data is available and the corresponding slice is Ready can be delayed by using dataDelay.<br/><br/>Only applies to the present time. For example, if it is 1:00 PM right now and this value is 10 minutes, the validation starts at 1:10 PM.<br/><br/>This setting does not affect slices in the past; slices with Slice End Time + dataDelay < Now are processed without any delay.<br/><br/>Times greater than 23:59 hours need to be specified using the `day.hours:minutes:seconds` format. For example, to specify 24 hours, don't use 24:00:00; instead, use 1.00:00:00. If you use 24:00:00, it is treated as 24 days (24.00:00:00). For 1 day and 4 hours, specify 1.04:00:00. |No |0 | -| retryInterval |The wait time between a failure and the next retry attempt. If a try fails, the next try is after retryInterval. <br/><br/>If it is 1:00 PM right now, we begin the first try. If the duration to complete the first validation check is 1 minute and the operation failed, the next retry is at 1:00 + 1 min (duration) + 1 min (retry interval) = 1:02 PM. <br/><br/>For slices in the past, there is no delay. The retry happens immediately. |No |00:01:00 (1 minute) | -| retryTimeout |The timeout for each retry attempt.<br/><br/>If this property is set to 10 minutes, the validation needs to be completed within 10 minutes. If it takes longer than 10 minutes to perform the validation, the retry times out.<br/><br/>If all attempts for the validation time out, the slice is marked as TimedOut. |No |00:10:00 (10 minutes) | -| maximumRetry |Number of times to check for the availability of the external data. The allowed maximum value is 10. |No |3 | ---## DATA STORES -The [Linked service](#linked-service) section provided descriptions for JSON elements that are common to all types of linked services. This section provides details about JSON elements that are specific to each data store. --The [Dataset](#dataset) section provided descriptions for JSON elements that are common to all types of datasets. This section provides details about JSON elements that are specific to each data store. --The [Activity](#activity) section provided descriptions for JSON elements that are common to all types of activities. This section provides details about JSON elements that are specific to each data store when it is used as a source/sink in a copy activity. --Click the link for the store you are interested in to see the JSON schemas for linked service, dataset, and the source/sink for the copy activity. 
--| Category | Data store -|: |: | -| **Azure** |[Azure Blob storage](#azure-blob-storage) | -| |Azure Data Lake Store | -| |[Azure Cosmos DB](#azure-cosmos-db) | -| |[Azure SQL Database](#azure-sql-database) | -| |[Azure Synapse Analytics](#azure-synapse-analytics) | -| |[Azure Cognitive Search](#azure-cognitive-search) | -| |[Azure Table storage](#azure-table-storage) | -| **Databases** |[Amazon Redshift](#amazon-redshift) | -| |[IBM DB2](#ibm-db2) | -| |[MySQL](#mysql) | -| |[Oracle](#oracle) | -| |[PostgreSQL](#postgresql) | -| |[SAP Business Warehouse](#sap-business-warehouse) | -| |[SAP HANA](#sap-hana) | -| |[SQL Server](#sql-server) | -| |[Sybase](#sybase) | -| |[Teradata](#teradata) | -| **NoSQL** |[Cassandra](#cassandra) | -| |[MongoDB](#mongodb) | -| **File** |[Amazon S3](#amazon-s3) | -| |[File System](#file-system) | -| |[FTP](#ftp) | -| |[HDFS](#hdfs) | -| |[SFTP](#sftp) | -| **Others** |[HTTP](#http) | -| |[OData](#odata) | -| |[ODBC](#odbc) | -| |[Salesforce](#salesforce) | -| |Web Table | --## Azure Blob Storage --### Linked service -There are two types of linked --#### Azure Storage Linked Service -To link your Azure storage account to a data factory by using the **account key**, create an Azure Storage linked service. To define an Azure Storage linked service, set the **type** of the linked service to **AzureStorage**. Then, you can specify following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| connectionString |Specify information needed to connect to Azure storage for the connectionString property. |Yes | --##### Example --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --#### Azure Storage SAS Linked Service -The Azure Storage SAS linked service allows you to link an Azure Storage Account to an Azure data factory by using a Shared Access Signature (SAS). It provides the data factory with restricted/time-bound access to all/specific resources (blob/container) in the storage. To link your Azure storage account to a data factory by using Shared Access Signature, create an Azure Storage SAS linked service. To define an Azure Storage SAS linked service, set the **type** of the linked service to **AzureStorageSas**. Then, you can specify following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| sasUri |Specify Shared Access Signature URI to the Azure Storage resources such as blob, container, or table. |Yes | --##### Example --```json -{ - "name": "StorageSasLinkedService", - "properties": { - "type": "AzureStorageSas", - "typeProperties": { - "sasUri": "<storageUri>?<sasToken>" - } - } -} -``` --For more information about these linked services, see [Azure Blob Storage connector](data-factory-azure-blob-connector.md#linked-service-properties) article. --### Dataset -To define an Azure Blob dataset, set the **type** of the dataset to **AzureBlob**. Then, specify the following Azure Blob specific properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| folderPath |Path to the container and folder in the blob storage. Example: myblobcontainer\myblobfolder\ |Yes | -| fileName |Name of the blob. 
fileName is optional and case-sensitive.<br/><br/>If you specify a filename, the activity (including Copy) works on the specific Blob.<br/><br/>When fileName is not specified, Copy includes all Blobs in the folderPath for input dataset.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: `Data.<Guid>.txt` (for example: : Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt |No | -| partitionedBy |partitionedBy is an optional property. You can use it to specify a dynamic folderPath and filename for time series data. For example, folderPath can be parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --#### Example --```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "input.log", - "folderPath": "adfgetstarted/inputdata", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Month", - "interval": 1 - }, - "external": true, - "policy": {} - } -} - ``` ---For more information, see [Azure Blob connector](data-factory-azure-blob-connector.md#dataset-properties) article. --### BlobSource in Copy Activity -If you are copying data from an Azure Blob Storage, set the **source type** of the copy activity to **BlobSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. 
|True (default value), False |No | --#### Example: **BlobSource** -```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureSqlOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "SqlSink" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` -### BlobSink in Copy Activity -If you are copying data to an Azure Blob Storage, set the **sink type** of the copy activity to **BlobSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| copyBehavior |Defines the copy behavior when the source is BlobSource or FileSystem. |<b>PreserveHierarchy</b>: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><br/><b>FlattenHierarchy</b>: all files from the source folder are in the first level of target folder. The target files have auto generated name. <br/><br/><b>MergeFiles (default):</b> merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name would be the specified name; otherwise, would be auto-generated file name. |No | --#### Example: BlobSink --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "AzureSQLtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "AzureSQLInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure Blob connector](data-factory-azure-blob-connector.md#copy-activity-properties) article. --## Azure Data Lake Store --### Linked service -To define an Azure Data Lake Store linked service, set the type of the linked service to **AzureDataLakeStore**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| type | The type property must be set to: **AzureDataLakeStore** | Yes | -| dataLakeStoreUri | Specify information about the Azure Data Lake Store account. It is in the following format: `https://[accountname].azuredatalakestore.net/webhdfs/v1` or `adl://[accountname].azuredatalakestore.net/`. | Yes | -| subscriptionId | Azure subscription ID to which Data Lake Store belongs. | Required for sink | -| resourceGroupName | Azure resource group name to which Data Lake Store belongs. | Required for sink | -| servicePrincipalId | Specify the application's client ID. | Yes (for service principal authentication) | -| servicePrincipalKey | Specify the application's key. 
| Yes (for service principal authentication) | -| tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the top-right corner of the Azure portal. | Yes (for service principal authentication) | -| authorization | Click **Authorize** button in the **Data Factory Editor** and enter your credential that assigns the auto-generated authorization URL to this property. | Yes (for user credential authentication)| -| sessionId | OAuth session ID from the OAuth authorization session. Each session ID is unique and may only be used once. This setting is automatically generated when you use Data Factory Editor. | Yes (for user credential authentication) | --#### Example: using service principal authentication -```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info. Example: microsoft.onmicrosoft.com>" - } - } -} -``` --#### Example: using user credential authentication -```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "sessionId": "<session ID>", - "authorization": "<authorization URL>", - "subscriptionId": "<subscription of ADLS>", - "resourceGroupName": "<resource group of ADLS>" - } - } -} -``` --For more information, see [Azure Data Lake Store connector](data-factory-azure-datalake-connector.md#linked-service-properties) article. --### Dataset -To define an Azure Data Lake Store dataset, set the **type** of the dataset to **AzureDataLakeStore**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| folderPath |Path to the container and folder in the Azure Data Lake store. |Yes | -| fileName |Name of the file in the Azure Data Lake store. fileName is optional and case-sensitive. <br/><br/>If you specify a filename, the activity (including Copy) works on the specific file.<br/><br/>When fileName is not specified, Copy includes all files in the folderPath for input dataset.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: `Data.<Guid>.txt` (for example: : Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt |No | -| partitionedBy |partitionedBy is an optional property. You can use it to specify a dynamic folderPath and filename for time series data. For example, folderPath can be parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. 
<br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --#### Example -```json -{ - "name": "AzureDataLakeStoreInput", - "properties": { - "type": "AzureDataLakeStore", - "linkedServiceName": "AzureDataLakeStoreLinkedService", - "typeProperties": { - "folderPath": "datalake/input/", - "fileName": "SearchLog.tsv", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Azure Data Lake Store connector](data-factory-azure-datalake-connector.md#dataset-properties) article. --### Azure Data Lake Store Source in Copy Activity -If you are copying data from an Azure Data Lake Store, set the **source type** of the copy activity to **AzureDataLakeStoreSource**, and specify following properties in the **source** section: --**AzureDataLakeStoreSource** supports the following properties **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. |True (default value), False |No | --#### Example: AzureDataLakeStoreSource --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "AzureDakeLaketoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "AzureDataLakeStoreInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "AzureDataLakeStoreSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure Data Lake Store connector](data-factory-azure-datalake-connector.md#copy-activity-properties) article. --### Azure Data Lake Store Sink in Copy Activity -If you are copying data to an Azure Data Lake Store, set the **sink type** of the copy activity to **AzureDataLakeStoreSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| copyBehavior |Specifies the copy behavior. |<b>PreserveHierarchy</b>: preserves the file hierarchy in the target folder. The relative path of source file to source folder is identical to the relative path of target file to target folder.<br/><br/><b>FlattenHierarchy</b>: all files from the source folder are created in the first level of target folder. The target files are created with auto generated name.<br/><br/><b>MergeFiles</b>: merges all files from the source folder to one file. If the File/Blob Name is specified, the merged file name would be the specified name; otherwise, would be auto-generated file name. 
|No | --#### Example: AzureDataLakeStoreSink -```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoDataLake", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureDataLakeStoreOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureDataLakeStoreSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure Data Lake Store connector](data-factory-azure-datalake-connector.md#copy-activity-properties) article. --## Azure Cosmos DB --### Linked service -To define an Azure Cosmos DB linked service, set the **type** of the linked service to **DocumentDb**, and specify following properties in the **typeProperties** section: --| **Property** | **Description** | **Required** | -| | | | -| connectionString |Specify information needed to connect to Azure Cosmos DB database. |Yes | --#### Example --```json -{ - "name": "CosmosDBLinkedService", - "properties": { - "type": "DocumentDb", - "typeProperties": { - "connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>" - } - } -} -``` -For more information, see [Azure Cosmos DB connector](data-factory-azure-documentdb-connector.md#linked-service-properties) article. --### Dataset -To define an Azure Cosmos DB dataset, set the **type** of the dataset to **DocumentDbCollection**, and specify the following properties in the **typeProperties** section: --| **Property** | **Description** | **Required** | -| | | | -| collectionName |Name of the Azure Cosmos DB collection. |Yes | --#### Example --```json -{ - "name": "PersonCosmosDBTable", - "properties": { - "type": "DocumentDbCollection", - "linkedServiceName": "CosmosDBLinkedService", - "typeProperties": { - "collectionName": "Person" - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` -For more information, see [Azure Cosmos DB connector](data-factory-azure-documentdb-connector.md#dataset-properties) article. --### Azure Cosmos DB Collection Source in Copy Activity -If you are copying data from an Azure Cosmos DB instance, set the **source type** of the copy activity to **DocumentDbCollectionSource**, and specify following properties in the **source** section: ---| **Property** | **Description** | **Allowed values** | **Required** | -| | | | | -| query |Specify the query to read data. |Query string supported by Azure Cosmos DB. <br/><br/>Example: `SELECT c.BusinessEntityID, c.PersonType, c.NameStyle, c.Title, c.Name.First AS FirstName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"` |No <br/><br/>If not specified, the SQL statement that is executed: `select <columns defined in structure> from mycollection` | -| nestingSeparator |Special character to indicate that the document is nested |Any character. <br/><br/>Azure Cosmos DB is a NoSQL store for JSON documents, where nested structures are allowed. Azure Data Factory enables user to denote hierarchy via nestingSeparator, which is “.” in the above examples. 
With the separator, the copy activity will generate the “Name” object with three children elements First, Middle and Last, according to “Name.First”, “Name.Middle” and “Name.Last” in the table definition. |No | --#### Example --```json -{ - "name": "DocDbToBlobPipeline", - "properties": { - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "DocumentDbCollectionSource", - "query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person", - "nestingSeparator": "." - }, - "sink": { - "type": "BlobSink", - "blobWriterAddHeader": true, - "writeBatchSize": 1000, - "writeBatchTimeout": "00:00:59" - } - }, - "inputs": [{ - "name": "PersonCosmosDBTable" - }], - "outputs": [{ - "name": "PersonBlobTableOut" - }], - "policy": { - "concurrency": 1 - }, - "name": "CopyFromCosmosDbToBlob" - }], - "start": "2016-04-01T00:00:00", - "end": "2016-04-02T00:00:00" - } -} -``` --### Azure Cosmos DB Collection Sink in Copy Activity -If you are copying data to Azure Cosmos DB, set the **sink type** of the copy activity to **DocumentDbCollectionSink**, and specify following properties in the **sink** section: --| **Property** | **Description** | **Allowed values** | **Required** | -| | | | | -| nestingSeparator |A special character in the source column name to indicate that nested document is needed. <br/><br/>For example above: `Name.First` in the output table produces the following JSON structure in the Azure Cosmos DB document:<br/><br/>"Name": {<br/> "First": "John"<br/>}, |Character that is used to separate nesting levels.<br/><br/>Default value is `.` (dot). |Character that is used to separate nesting levels. <br/><br/>Default value is `.` (dot). | -| writeBatchSize |Number of parallel requests to Azure Cosmos DB service to create documents.<br/><br/>You can fine-tune the performance when copying data to/from Azure Cosmos DB by using this property. You can expect a better performance when you increase writeBatchSize because more parallel requests to Azure Cosmos DB are sent. However you’ll need to avoid throttling that can throw the error message: "Request rate is large".<br/><br/>Throttling is decided by a number of factors, including size of documents, number of terms in documents, indexing policy of target collection, etc. For copy operations, you can use a better collection (for example, S3) to have the most throughput available (2,500 request units/second). |Integer |No (default: 5) | -| writeBatchTimeout |Wait time for the operation to complete before it times out. |timespan<br/><br/> Example: “00:30:00” (30 minutes). 
|No | --#### Example --```json -{ - "name": "BlobToDocDbPipeline", - "properties": { - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "DocumentDbCollectionSink", - "nestingSeparator": ".", - "writeBatchSize": 2, - "writeBatchTimeout": "00:00:00" - }, - "translator": { - "type": "TabularTranslator", - "ColumnMappings": "FirstName: Name.First, MiddleName: Name.Middle, LastName: Name.Last, BusinessEntityID: BusinessEntityID, PersonType: PersonType, NameStyle: NameStyle, Title: Title, Suffix: Suffix" - } - }, - "inputs": [{ - "name": "PersonBlobTableIn" - }], - "outputs": [{ - "name": "PersonCosmosDbTableOut" - }], - "policy": { - "concurrency": 1 - }, - "name": "CopyFromBlobToCosmosDb" - }], - "start": "2016-04-14T00:00:00", - "end": "2016-04-15T00:00:00" - } -} -``` --For more information, see [Azure Cosmos DB connector](data-factory-azure-documentdb-connector.md#copy-activity-properties) article. --## Azure SQL Database --### Linked service -To define an Azure SQL Database linked service, set the **type** of the linked service to **AzureSqlDatabase**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| connectionString |Specify information needed to connect to the Azure SQL Database instance for the connectionString property. |Yes | --#### Example -```json -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` --For more information, see [Azure SQL connector](data-factory-azure-sql-connector.md#linked-service-properties) article. --### Dataset -To define an Azure SQL Database dataset, set the **type** of the dataset to **AzureSqlTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the Azure SQL Database instance that linked service refers to. |Yes | --#### Example --```json -{ - "name": "AzureSqlInput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -For more information, see [Azure SQL connector](data-factory-azure-sql-connector.md#dataset-properties) article. --### SQL Source in Copy Activity -If you are copying data from an Azure SQL Database, set the **source type** of the copy activity to **SqlSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. Example: `select * from MyTable`. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. 
|No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "AzureSQLtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "AzureSQLInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` -For more information, see [Azure SQL connector](data-factory-azure-sql-connector.md#copy-activity-properties) article. --### SQL Sink in Copy Activity -If you are copying data to Azure SQL Database, set the **sink type** of the copy activity to **SqlSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: “00:30:00” (30 minutes). |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 10000) | -| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. |A query statement. |No | -| sliceIdentifierColumnName |Specify a column name for Copy Activity to fill with auto generated slice identifier, which is used to clean up data of a specific slice when rerun. |Column name of a column with data type of binary(32). |No | -| sqlWriterStoredProcedureName |Name of the stored procedure that upserts (updates/inserts) data into the target table. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | -| sqlWriterTableType |Specify a table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. |A table type name. |No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureSqlOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure SQL connector](data-factory-azure-sql-connector.md#copy-activity-properties) article. 
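The SqlSink example above uses the default insert behavior. When slices may be rerun, the cleanup-related properties in the table are often combined with it. The following is only a minimal sketch: `MyTable` and `timestampcolumn` are illustrative names, and the `$$Text.Format` expression mirrors the pattern used by the SqlReaderQuery examples earlier in this article.

```json
"sink": {
    "type": "SqlSink",
    "sqlWriterCleanupScript": "$$Text.Format('delete from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:30:00"
}
```

With a cleanup script scoped to the slice window, rerunning a failed slice removes that window's rows before reinserting them, so reruns do not duplicate data; sliceIdentifierColumnName is an alternative when a delete-by-time-window query is not practical.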
--## Azure Synapse Analytics --### Linked service -To define an Azure Synapse Analytics linked service, set the **type** of the linked service to **AzureSqlDW**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| connectionString |Specify information needed to connect to the Azure Synapse Analytics instance for the connectionString property. |Yes | ----#### Example --```json -{ - "name": "AzureSqlDWLinkedService", - "properties": { - "type": "AzureSqlDW", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` --For more information, see [Azure Synapse Analytics connector](data-factory-azure-sql-data-warehouse-connector.md#linked-service-properties) article. --### Dataset -To define an Azure Synapse Analytics dataset, set the **type** of the dataset to **AzureSqlDWTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the Azure Synapse Analytics database that the linked service refers to. |Yes | --#### Example --```json -{ - "name": "AzureSqlDWInput", - "properties": { - "type": "AzureSqlDWTable", - "linkedServiceName": "AzureSqlDWLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Azure Synapse Analytics connector](data-factory-azure-sql-data-warehouse-connector.md#dataset-properties) article. --### Azure Synapse Analytics Source in Copy Activity -If you are copying data from Azure Synapse Analytics, set the **source type** of the copy activity to **SqlDWSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. 
|No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "AzureSQLDWtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "AzureSqlDWInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "SqlDWSource", - "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure Synapse Analytics connector](data-factory-azure-sql-data-warehouse-connector.md#copy-activity-properties) article. --### Azure Synapse Analytics Sink in Copy Activity -If you are copying data to Azure Synapse Analytics, set the **sink type** of the copy activity to **SqlDWSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. |A query statement. |No | -| allowPolyBase |Indicates whether to use PolyBase (when applicable) instead of BULKINSERT mechanism. <br/><br/> **Using PolyBase is the recommended way to load data into Azure Synapse Analytics.** |True <br/>False (default) |No | -| polyBaseSettings |A group of properties that can be specified when the **allowPolybase** property is set to **true**. | |No | -| rejectValue |Specifies the number or percentage of rows that can be rejected before the query fails. <br/><br/>Learn more about the PolyBase’s reject options in the **Arguments** section of [CREATE EXTERNAL TABLE (Transact-SQL)](/sql/t-sql/statements/create-external-table-transact-sql) topic. |0 (default), 1, 2, … |No | -| rejectType |Specifies whether the rejectValue option is specified as a literal value or a percentage. |Value (default), Percentage |No | -| rejectSampleValue |Determines the number of rows to retrieve before the PolyBase recalculates the percentage of rejected rows. |1, 2, … |Yes, if **rejectType** is **percentage** | -| useTypeDefault |Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.<br/><br/>Learn more about this property from the Arguments section in [CREATE EXTERNAL FILE FORMAT (Transact-SQL)](/sql/t-sql/statements/create-external-file-format-transact-sql). |True, False (default) |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize |Integer (number of rows) |No (default: 10000) | -| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: “00:30:00” (30 minutes). 
|No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoSQLDW", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureSqlDWOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlDWSink", - "allowPolyBase": true - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Azure Synapse Analytics connector](data-factory-azure-sql-data-warehouse-connector.md#copy-activity-properties) article. --## Azure Cognitive Search --### Linked service -To define an Azure Cognitive Search linked service, set the **type** of the linked service to **AzureSearch**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| -- | -- | -- | -| url | URL for the search service. | Yes | -| key | Admin key for the search service. | Yes | --#### Example --```json -{ - "name": "AzureSearchLinkedService", - "properties": { - "type": "AzureSearch", - "typeProperties": { - "url": "https://<service>.search.windows.net", - "key": "<AdminKey>" - } - } -} -``` --For more information, see [Azure Cognitive Search connector](data-factory-azure-search-connector.md#linked-service-properties) article. --### Dataset -To define an Azure Cognitive Search dataset, set the **type** of the dataset to **AzureSearchIndex**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| -- | -- | -- | -| type | The type property must be set to **AzureSearchIndex**.| Yes | -| indexName | Name of the search index. Data Factory does not create the index. The index must exist in Azure Cognitive Search. | Yes | --#### Example --```json -{ - "name": "AzureSearchIndexDataset", - "properties": { - "type": "AzureSearchIndex", - "linkedServiceName": "AzureSearchLinkedService", - "typeProperties": { - "indexName": "products" - }, - "availability": { - "frequency": "Minute", - "interval": 15 - } - } -} -``` --For more information, see [Azure Cognitive Search connector](data-factory-azure-search-connector.md#dataset-properties) article. --### Azure Cognitive Search Index Sink in Copy Activity -If you are copying data to a search index, set the **sink type** of the copy activity to **AzureSearchIndexSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| -- | -- | -- | -- | -| WriteBehavior | Specifies whether to merge or replace when a document already exists in the index. | Merge (default)<br/>Upload| No | -| WriteBatchSize | Uploads data into the search index when the buffer size reaches writeBatchSize. | 1 to 1,000. Default value is 1000. 
| No |

#### Example

```json
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2016-06-01T18:00:00",
        "end": "2016-06-01T19:00:00",
        "description": "pipeline for copy activity",
        "activities": [{
            "name": "SqlServertoAzureSearchIndex",
            "description": "copy activity",
            "type": "Copy",
            "inputs": [{
                "name": "SqlServerInput"
            }],
            "outputs": [{
                "name": "AzureSearchIndexDataset"
            }],
            "typeProperties": {
                "source": {
                    "type": "SqlSource",
                    "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
                },
                "sink": {
                    "type": "AzureSearchIndexSink"
                }
            },
            "scheduler": {
                "frequency": "Hour",
                "interval": 1
            },
            "policy": {
                "concurrency": 1,
                "executionPriorityOrder": "OldestFirst",
                "retry": 0,
                "timeout": "01:00:00"
            }
        }]
    }
}
```

For more information, see the [Azure Cognitive Search connector](data-factory-azure-search-connector.md#copy-activity-properties) article.

## Azure Table Storage

### Linked service
There are two types of linked services you can use to link Azure Table storage to a data factory: the Azure Storage linked service and the Azure Storage SAS linked service.

#### Azure Storage Linked Service
To link your Azure storage account to a data factory by using the **account key**, create an Azure Storage linked service. To define an Azure Storage linked service, set the **type** of the linked service to **AzureStorage**. Then, you can specify the following properties in the **typeProperties** section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type |The type property must be set to: **AzureStorage** |Yes |
| connectionString |Specify information needed to connect to Azure storage for the connectionString property. |Yes |

**Example:**

```json
{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
```

#### Azure Storage SAS Linked Service
The Azure Storage SAS linked service allows you to link an Azure Storage account to a data factory by using a Shared Access Signature (SAS). It provides the data factory with restricted, time-bound access to all or specific resources (blob/container) in the storage. To link your Azure storage account to a data factory by using a Shared Access Signature, create an Azure Storage SAS linked service. To define an Azure Storage SAS linked service, set the **type** of the linked service to **AzureStorageSas**. Then, you can specify the following properties in the **typeProperties** section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type |The type property must be set to: **AzureStorageSas** |Yes |
| sasUri |Specify the Shared Access Signature URI to the Azure Storage resources such as blob, container, or table. |Yes |

**Example:**

```json
{
    "name": "StorageSasLinkedService",
    "properties": {
        "type": "AzureStorageSas",
        "typeProperties": {
            "sasUri": "<storageUri>?<sasToken>"
        }
    }
}
```

For more information about these linked services, see the [Azure Table Storage connector](data-factory-azure-table-connector.md#linked-service-properties) article.

### Dataset
To define an Azure Table dataset, set the **type** of the dataset to **AzureTable**, and specify the following properties in the **typeProperties** section:

| Property | Description | Required |
| --- | --- | --- |
| tableName |Name of the table in Azure Table storage that the linked service refers to. |Yes.
When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfies the query are copied to the destination. | --#### Example --```json -{ - "name": "AzureTableInput", - "properties": { - "type": "AzureTable", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information about these linked services, see [Azure Table Storage connector](data-factory-azure-table-connector.md#dataset-properties) article. --### Azure Table Source in Copy Activity -If you are copying data from Azure Table Storage, set the **source type** of the copy activity to **AzureTableSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| azureTableSourceQuery |Use the custom query to read data. |Azure table query string. See examples in the next section. |No. When a tableName is specified without an azureTableSourceQuery, all records from the table are copied to the destination. If an azureTableSourceQuery is also specified, records from the table that satisfies the query are copied to the destination. | -| azureTableSourceIgnoreTableNotFound |Indicate whether swallow the exception of table not exist. |TRUE<br/>FALSE |No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "AzureTabletoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "AzureTableInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "AzureTableSource", - "AzureTableSourceQuery": "PartitionKey eq 'DefaultPartitionKey'" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information about these linked services, see [Azure Table Storage connector](data-factory-azure-table-connector.md#copy-activity-properties) article. --### Azure Table Sink in Copy Activity -If you are copying data to Azure Table Storage, set the **sink type** of the copy activity to **AzureTableSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| azureTableDefaultPartitionKeyValue |Default partition key value that can be used by the sink. |A string value. |No | -| azureTablePartitionKeyName |Specify name of the column whose values are used as partition keys. If not specified, AzureTableDefaultPartitionKeyValue is used as the partition key. |A column name. |No | -| azureTableRowKeyName |Specify name of the column whose column values are used as row key. If not specified, use a GUID for each row. |A column name. |No | -| azureTableInsertType |The mode to insert data into Azure table.<br/><br/>This property controls whether existing rows in the output table with matching partition and row keys have their values replaced or merged. 
<br/><br/>To learn about how these settings (merge and replace) work, see [Insert or Merge Entity](/rest/api/storageservices/Insert-Or-Merge-Entity) and [Insert or Replace Entity](/rest/api/storageservices/Insert-Or-Replace-Entity) topics. <br/><br> This setting applies at the row level, not the table level, and neither option deletes rows in the output table that do not exist in the input. |merge (default)<br/>replace |No | -| writeBatchSize |Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit. |Integer (number of rows) |No (default: 10000) | -| writeBatchTimeout |Inserts data into the Azure table when the writeBatchSize or writeBatchTimeout is hit |timespan<br/><br/>Example: “00:20:00” (20 minutes) |No (Default to storage client default timeout value 90 sec) | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoTable", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "AzureTableOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "AzureTableSink", - "writeBatchSize": 100, - "writeBatchTimeout": "01:00:00" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` -For more information about these linked services, see [Azure Table Storage connector](data-factory-azure-table-connector.md#copy-activity-properties) article. --## Amazon RedShift --### Linked service -To define an Amazon Redshift linked service, set the **type** of the linked service to **AmazonRedshift**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |IP address or host name of the Amazon Redshift server. |Yes | -| port |The number of the TCP port that the Amazon Redshift server uses to listen for client connections. |No, default value: 5439 | -| database |Name of the Amazon Redshift database. |Yes | -| username |Name of user who has access to the database. |Yes | -| password |Password for the user account. |Yes | --#### Example --```json -{ - "name": "AmazonRedshiftLinkedService", - "properties": { - "type": "AmazonRedshift", - "typeProperties": { - "server": "<Amazon Redshift host name or IP address>", - "port": 5439, - "database": "<database name>", - "username": "user", - "password": "password" - } - } -} -``` --For more information, see Amazon Redshift connector article. --### Dataset -To define an Amazon Redshift dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the Amazon Redshift database that linked service refers to. |No (if **query** of **RelationalSource** is specified) | ---#### Example --```json -{ - "name": "AmazonRedshiftInputDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "AmazonRedshiftLinkedService", - "typeProperties": { - "tableName": "<Table name>" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` -For more information, see Amazon Redshift connector article. 
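The example above names the table explicitly. If you instead plan to push the read down to a query in the copy activity source (described in the next section), the Required column above implies the dataset can omit **tableName**. The following is a minimal sketch of that variation, assuming an empty **typeProperties** section is acceptable here, as it is for the other relational datasets in this article; the dataset name is illustrative.

```json
{
    "name": "AmazonRedshiftQueryDataset",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "AmazonRedshiftLinkedService",
        "typeProperties": {},
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true
    }
}
```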
--### Relational Source in Copy Activity -If you are copying data from Amazon Redshift, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. |No (if **tableName** of **dataset** is specified) | --#### Example --```json -{ - "name": "CopyAmazonRedshiftToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "AmazonRedshiftInputDataset" - }], - "outputs": [{ - "name": "AzureBlobOutputDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "AmazonRedshiftToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` -For more information, see Amazon Redshift connector article. --## IBM DB2 --### Linked service -To define an IBM DB2 linked service, set the **type** of the linked service to **OnPremisesDB2**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |Name of the DB2 server. |Yes | -| database |Name of the DB2 database. |Yes | -| schema |Name of the schema in the database. The schema name is case-sensitive. |No | -| authenticationType |Type of authentication used to connect to the DB2 database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises DB2 database. |Yes | --#### Example -```json -{ - "name": "OnPremDb2LinkedService", - "properties": { - "type": "OnPremisesDb2", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` -For more information, see IBM DB2 connector article. --### Dataset -To define a DB2 dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the DB2 Database instance that linked service refers to. The tableName is case-sensitive. |No (if **query** of **RelationalSource** is specified) --#### Example -```json -{ - "name": "Db2DataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremDb2LinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see IBM DB2 connector article. 
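The Db2DataSet example above leaves **typeProperties** empty and relies on a query in the copy activity source (see the next section). If you would rather have the copy activity read the whole table, the dataset can name the table instead. A minimal sketch follows; the table name is illustrative, and remember that **tableName** is case-sensitive for DB2.

```json
{
    "name": "Db2TableDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremDb2LinkedService",
        "typeProperties": {
            "tableName": "MYTABLE"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true
    }
}
```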
--### Relational Source in Copy Activity -If you are copying data from IBM DB2, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `"query": "select * from "MySchema"."MyTable""`. |No (if **tableName** of **dataset** is specified) | --#### Example -```json -{ - "name": "CopyDb2ToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from \"Orders\"" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [{ - "name": "Db2DataSet" - }], - "outputs": [{ - "name": "AzureBlobDb2DataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "Db2ToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` -For more information, see IBM DB2 connector article. --## MySQL --### Linked service -To define a MySQL linked service, set the **type** of the linked service to **OnPremisesMySql**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |Name of the MySQL server. |Yes | -| database |Name of the MySQL database. |Yes | -| schema |Name of the schema in the database. |No | -| authenticationType |Type of authentication used to connect to the MySQL database. Possible values are: `Basic`. |Yes | -| userName |Specify user name to connect to the MySQL database. |Yes | -| password |Specify password for the user account you specified. |Yes | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises MySQL database. |Yes | --#### Example --```json -{ - "name": "OnPremMySqlLinkedService", - "properties": { - "type": "OnPremisesMySql", - "typeProperties": { - "server": "<server name>", - "database": "<database name>", - "schema": "<schema name>", - "authenticationType": "<authentication type>", - "userName": "<user name>", - "password": "<password>", - "gatewayName": "<gateway>" - } - } -} -``` --For more information, see [MySQL connector](data-factory-onprem-mysql-connector.md#linked-service-properties) article. --### Dataset -To define a MySQL dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the MySQL Database instance that linked service refers to. |No (if **query** of **RelationalSource** is specified) | --#### Example --```json -{ - "name": "MySqlDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremMySqlLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -For more information, see [MySQL connector](data-factory-onprem-mysql-connector.md#dataset-properties) article. 
--### Relational Source in Copy Activity -If you are copying data from a MySQL database, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. |No (if **tableName** of **dataset** is specified) | ---#### Example -```json -{ - "name": "CopyMySqlToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "MySqlDataSet" - }], - "outputs": [{ - "name": "AzureBlobMySqlDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MySqlToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see [MySQL connector](data-factory-onprem-mysql-connector.md#copy-activity-properties) article. --## Oracle --### Linked service -To define an Oracle linked service, set the **type** of the linked service to **OnPremisesOracle**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| driverType | Specify which driver to use to copy data from/to Oracle Database. Allowed values are **Microsoft** or **ODP** (default). See Supported version and installation section on driver details. | No | -| connectionString | Specify information needed to connect to the Oracle Database instance for the connectionString property. | Yes | -| gatewayName | Name of the gateway that is used to connect to the on-premises Oracle server |Yes | --#### Example -```json -{ - "name": "OnPremisesOracleLinkedService", - "properties": { - "type": "OnPremisesOracle", - "typeProperties": { - "driverType": "Microsoft", - "connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` --For more information, see [Oracle connector](data-factory-onprem-oracle-connector.md#linked-service-properties) article. --### Dataset -To define an Oracle dataset, set the **type** of the dataset to **OracleTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the Oracle Database that the linked service refers to. |No (if **oracleReaderQuery** of **OracleSource** is specified) | --#### Example --```json -{ - "name": "OracleInput", - "properties": { - "type": "OracleTable", - "linkedServiceName": "OnPremisesOracleLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "offset": "01:00:00", - "interval": "1", - "anchorDateTime": "2016-02-27T12:00:00", - "frequency": "Hour" - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -For more information, see [Oracle connector](data-factory-onprem-oracle-connector.md#dataset-properties) article. 
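Looking back at the Oracle linked service for a moment: because **driverType** defaults to **ODP**, the property can simply be omitted when you use the Oracle-provided driver. A minimal sketch of that variation, with an illustrative linked service name and the same placeholder connection string as the example above:

```json
{
    "name": "OnPremisesOracleOdpLinkedService",
    "properties": {
        "type": "OnPremisesOracle",
        "typeProperties": {
            "connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;",
            "gatewayName": "<gateway name>"
        }
    }
}
```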
--### Oracle Source in Copy Activity -If you are copying data from an Oracle database, set the **source type** of the copy activity to **OracleSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| oracleReaderQuery |Use the custom query to read data. |SQL query string. For example: `select * from MyTable` <br/><br/>If not specified, the SQL statement that is executed: `select * from MyTable` |No (if **tableName** of **dataset** is specified) | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "OracletoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": " OracleInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "OracleSource", - "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Oracle connector](data-factory-onprem-oracle-connector.md#copy-activity-properties) article. --### Oracle Sink in Copy Activity -If you are copying data to am Oracle database, set the **sink type** of the copy activity to **OracleSink**, and specify following properties in the **sink** section: --| Property | Description | Allowed values | Required | -| | | | | -| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: 00:30:00 (30 minutes). |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 100) | -| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. |A query statement. |No | -| sliceIdentifierColumnName |Specify column name for Copy Activity to fill with auto generated slice identifier, which is used to clean up data of a specific slice when rerun. |Column name of a column with data type of binary(32). |No | --#### Example -```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-05T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoOracle", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": "OracleOutput" - }], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "OracleSink" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` -For more information, see [Oracle connector](data-factory-onprem-oracle-connector.md#copy-activity-properties) article. 
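The sink in the example above relies on the defaults. If you need to tune batching, the **writeBatchSize** and **writeBatchTimeout** properties from the table can be set directly on the sink. This is a sketch of the **typeProperties** fragment only; the values shown are illustrative, not recommendations.

```json
"typeProperties": {
    "source": {
        "type": "BlobSource"
    },
    "sink": {
        "type": "OracleSink",
        "writeBatchSize": 1000,
        "writeBatchTimeout": "00:30:00"
    }
}
```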
--## PostgreSQL --### Linked service -To define a PostgreSQL linked service, set the **type** of the linked service to **OnPremisesPostgreSql**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |Name of the PostgreSQL server. |Yes | -| database |Name of the PostgreSQL database. |Yes | -| schema |Name of the schema in the database. The schema name is case-sensitive. |No | -| authenticationType |Type of authentication used to connect to the PostgreSQL database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises PostgreSQL database. |Yes | --#### Example --```json -{ - "name": "OnPremPostgreSqlLinkedService", - "properties": { - "type": "OnPremisesPostgreSql", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` -For more information, see [PostgreSQL connector](data-factory-onprem-postgresql-connector.md#linked-service-properties) article. --### Dataset -To define a PostgreSQL dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the PostgreSQL Database instance that linked service refers to. The tableName is case-sensitive. |No (if **query** of **RelationalSource** is specified) | --#### Example -```json -{ - "name": "PostgreSqlDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremPostgreSqlLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -For more information, see [PostgreSQL connector](data-factory-onprem-postgresql-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from a PostgreSQL database, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: "query": "select * from \"MySchema\".\"MyTable\"". 
|No (if **tableName** of **dataset** is specified) | --#### Example --```json -{ - "name": "CopyPostgreSqlToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from \"public\".\"usstates\"" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [{ - "name": "PostgreSqlDataSet" - }], - "outputs": [{ - "name": "AzureBlobPostgreSqlDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "PostgreSqlToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see [PostgreSQL connector](data-factory-onprem-postgresql-connector.md#copy-activity-properties) article. --## SAP Business Warehouse ---### Linked service -To define a SAP Business Warehouse (BW) linked service, set the **type** of the linked service to **SapBw**, and specify following properties in the **typeProperties** section: --Property | Description | Allowed values | Required | -- | -- | ---server | Name of the server on which the SAP BW instance resides. | string | Yes -systemNumber | System number of the SAP BW system. | Two-digit decimal number represented as a string. | Yes -clientId | Client ID of the client in the SAP W system. | Three-digit decimal number represented as a string. | Yes -username | Name of the user who has access to the SAP server | string | Yes -password | Password for the user. | string | Yes -gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises SAP BW instance. | string | Yes -encryptedCredential | The encrypted credential string. | string | No --#### Example --```json -{ - "name": "SapBwLinkedService", - "properties": { - "type": "SapBw", - "typeProperties": { - "server": "<server name>", - "systemNumber": "<system number>", - "clientId": "<client id>", - "username": "<SAP user>", - "password": "<Password for SAP user>", - "gatewayName": "<gateway name>" - } - } -} -``` --For more information, see [SAP Business Warehouse connector](data-factory-sap-business-warehouse-connector.md#linked-service-properties) article. --### Dataset -To define a SAP BW dataset, set the **type** of the dataset to **RelationalTable**. There are no type-specific properties supported for the SAP BW dataset of type **RelationalTable**. --#### Example --```json -{ - "name": "SapBwDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "SapBwLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` -For more information, see [SAP Business Warehouse connector](data-factory-sap-business-warehouse-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from SAP Business Warehouse, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| query | Specifies the MDX query to read data from the SAP BW instance. | MDX query. 
| Yes | --#### Example --```json -{ - "name": "CopySapBwToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "<MDX query for SAP BW>" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "SapBwDataset" - }], - "outputs": [{ - "name": "AzureBlobDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SapBwToBlob" - }], - "start": "2017-03-01T18:00:00", - "end": "2017-03-01T19:00:00" - } -} -``` --For more information, see [SAP Business Warehouse connector](data-factory-sap-business-warehouse-connector.md#copy-activity-properties) article. --## SAP HANA --### Linked service -To define a SAP HANA linked service, set the **type** of the linked service to **SapHana**, and specify following properties in the **typeProperties** section: --Property | Description | Allowed values | Required | -- | -- | ---server | Name of the server on which the SAP HANA instance resides. If your server is using a customized port, specify `server:port`. | string | Yes -authenticationType | Type of authentication. | string. "Basic" or "Windows" | Yes -username | Name of the user who has access to the SAP server | string | Yes -password | Password for the user. | string | Yes -gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises SAP HANA instance. | string | Yes -encryptedCredential | The encrypted credential string. | string | No --#### Example --```json -{ - "name": "SapHanaLinkedService", - "properties": { - "type": "SapHana", - "typeProperties": { - "server": "<server name>", - "authenticationType": "<Basic, or Windows>", - "username": "<SAP user>", - "password": "<Password for SAP user>", - "gatewayName": "<gateway name>" - } - } -} --``` -For more information, see [SAP HANA connector](data-factory-sap-hana-connector.md#linked-service-properties) article. --### Dataset -To define a SAP HANA dataset, set the **type** of the dataset to **RelationalTable**. There are no type-specific properties supported for the SAP HANA dataset of type **RelationalTable**. --#### Example --```json -{ - "name": "SapHanaDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "SapHanaLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` -For more information, see [SAP HANA connector](data-factory-sap-hana-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from a SAP HANA data store, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query | Specifies the SQL query to read data from the SAP HANA instance. | SQL query. 
| Yes |

#### Example

```json
{
    "name": "CopySapHanaToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [{
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "RelationalSource",
                    "query": "<SQL Query for HANA>"
                },
                "sink": {
                    "type": "BlobSink",
                    "writeBatchSize": 0,
                    "writeBatchTimeout": "00:00:00"
                }
            },
            "inputs": [{
                "name": "SapHanaDataset"
            }],
            "outputs": [{
                "name": "AzureBlobDataSet"
            }],
            "policy": {
                "timeout": "01:00:00",
                "concurrency": 1
            },
            "scheduler": {
                "frequency": "Hour",
                "interval": 1
            },
            "name": "SapHanaToBlob"
        }],
        "start": "2017-03-01T18:00:00",
        "end": "2017-03-01T19:00:00"
    }
}
```

For more information, see the [SAP HANA connector](data-factory-sap-hana-connector.md#copy-activity-properties) article.

## SQL Server

### Linked service
You create a linked service of type **OnPremisesSqlServer** to link a SQL Server database to a data factory. The following table describes the JSON elements specific to the SQL Server linked service.

| Property | Description | Required |
| --- | --- | --- |
| type |The type property should be set to: **OnPremisesSqlServer**. |Yes |
| connectionString |Specify connectionString information needed to connect to the SQL Server database using either SQL authentication or Windows authentication. |Yes |
| gatewayName |Name of the gateway that the Data Factory service should use to connect to the SQL Server database. |Yes |
| username |Specify user name if you are using Windows Authentication. Example: **domainname\\username**. |No |
| password |Specify password for the user account you specified for the username. |No |

You can encrypt credentials using the **New-AzDataFactoryEncryptValue** cmdlet and use them in the connection string as shown in the following example (**EncryptedCredential** property):

```json
"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>",
```

#### Example: JSON for using SQL Authentication

```json
{
    "name": "MyOnPremisesSQLDB",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;User ID=<username>;Password=<password>;",
            "gatewayName": "<gateway name>"
        }
    }
}
```

#### Example: JSON for using Windows Authentication

If username and password are specified, the gateway uses them to impersonate the specified user account when it connects to the SQL Server database. Otherwise, the gateway connects to SQL Server directly with its own security context (its startup account).

```json
{
    "Name": "MyOnPremisesSQLDB",
    "Properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "ConnectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=True;",
            "username": "<domain\\username>",
            "password": "<password>",
            "gatewayName": "<gateway name>"
        }
    }
}
```

For more information, see the [SQL Server connector](data-factory-sqlserver-connector.md#linked-service-properties) article.
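If you prefer not to keep the password in clear text, the **EncryptedCredential** snippet shown earlier can be combined with the gateway setting into a complete linked service definition. The following is a sketch that assumes the credential was produced with **New-AzDataFactoryEncryptValue**, as described above.

```json
{
    "name": "MyOnPremisesSQLDB",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>",
            "gatewayName": "<gateway name>"
        }
    }
}
```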
--### Dataset -To define a SQL Server dataset, set the **type** of the dataset to **SqlServerTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the SQL Server Database instance that linked service refers to. |Yes | --#### Example -```json -{ - "name": "SqlServerInput", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlServerLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [SQL Server connector](data-factory-sqlserver-connector.md#dataset-properties) article. --### Sql Source in Copy Activity -If you are copying data from a SQL Server database, set the **source type** of the copy activity to **SqlSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. May reference multiple tables from the database referenced by the input dataset. If not specified, the SQL statement that is executed: select from MyTable. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | --If the **sqlReaderQuery** is specified for the SqlSource, the Copy Activity runs this query against the SQL Server Database source to get the data. --Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). --If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server Database. If the dataset definition does not have the structure, all columns are selected from the table. --> [!NOTE] -> When you use **sqlReaderStoredProcedureName**, you still need to specify a value for the **tableName** property in the dataset JSON. There are no validations performed against this table though. ---#### Example -```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline for copy activity", - "activities": [{ - "name": "SqlServertoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": " SqlServerInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --In this example, **sqlReaderQuery** is specified for the SqlSource. 
The Copy Activity runs this query against the SQL Server database source to get the data. Alternatively, you can specify a stored procedure by specifying **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). The sqlReaderQuery can reference multiple tables within the database referenced by the input dataset; it is not limited to the table set as the dataset's tableName typeProperty.

If you do not specify sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server database. If the dataset definition does not have the structure, all columns are selected from the table.

For more information, see the [SQL Server connector](data-factory-sqlserver-connector.md#copy-activity-properties) article.

### Sql Sink in Copy Activity
If you are copying data to a SQL Server database, set the **sink type** of the copy activity to **SqlSink**, and specify the following properties in the **sink** section:

| Property | Description | Allowed values | Required |
| --- | --- | --- | --- |
| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: "00:30:00" (30 minutes). |No |
| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 10000) |
| sqlWriterCleanupScript |Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For more information, see the repeatability section. |A query statement. |No |
| sliceIdentifierColumnName |Specify a column name for Copy Activity to fill with an auto-generated slice identifier, which is used to clean up data of a specific slice when rerun. For more information, see the repeatability section. |Column name of a column with data type of binary(32). |No |
| sqlWriterStoredProcedureName |Name of the stored procedure that upserts (updates/inserts) data into the target table. |Name of the stored procedure. |No |
| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No |
| sqlWriterTableType |Specify a table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. |A table type name. |No |

#### Example
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and the **sink** type is set to **SqlSink**.
--```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [{ - "name": "AzureBlobInput" - }], - "outputs": [{ - "name": " SqlServerOutput " - }], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [SQL Server connector](data-factory-sqlserver-connector.md#copy-activity-properties) article. --## Sybase --### Linked service -To define a Sybase linked service, set the **type** of the linked service to **OnPremisesSybase**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |Name of the Sybase server. |Yes | -| database |Name of the Sybase database. |Yes | -| schema |Name of the schema in the database. |No | -| authenticationType |Type of authentication used to connect to the Sybase database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises Sybase database. |Yes | --#### Example -```json -{ - "name": "OnPremSybaseLinkedService", - "properties": { - "type": "OnPremisesSybase", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` --For more information, see [Sybase connector](data-factory-onprem-sybase-connector.md#linked-service-properties) article. --### Dataset -To define a Sybase dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the Sybase Database instance that linked service refers to. |No (if **query** of **RelationalSource** is specified) | --#### Example --```json -{ - "name": "SybaseDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremSybaseLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Sybase connector](data-factory-onprem-sybase-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from a Sybase database, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. 
|No (if **tableName** of **dataset** is specified) | --#### Example --```json -{ - "name": "CopySybaseToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from DBA.Orders" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [{ - "name": "SybaseDataSet" - }], - "outputs": [{ - "name": "AzureBlobSybaseDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SybaseToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see [Sybase connector](data-factory-onprem-sybase-connector.md#copy-activity-properties) article. --## Teradata --### Linked service -To define a Teradata linked service, set the **type** of the linked service to **OnPremisesTeradata**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |Name of the Teradata server. |Yes | -| authenticationType |Type of authentication used to connect to the Teradata database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises Teradata database. |Yes | --#### Example -```json -{ - "name": "OnPremTeradataLinkedService", - "properties": { - "type": "OnPremisesTeradata", - "typeProperties": { - "server": "<server>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` --For more information, see [Teradata connector](data-factory-onprem-teradata-connector.md#linked-service-properties) article. --### Dataset -To define a Teradata Blob dataset, set the **type** of the dataset to **RelationalTable**. Currently, there are no type properties supported for the Teradata dataset. --#### Example -```json -{ - "name": "TeradataDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremTeradataLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Teradata connector](data-factory-onprem-teradata-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from a Teradata database, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. 
|Yes | --#### Example --```json -{ - "name": "CopyTeradataToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart, SliceEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "TeradataDataSet" - }], - "outputs": [{ - "name": "AzureBlobTeradataDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "TeradataToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "isPaused": false - } -} -``` --For more information, see [Teradata connector](data-factory-onprem-teradata-connector.md#copy-activity-properties) article. --## Cassandra ---### Linked service -To define a Cassandra linked service, set the **type** of the linked service to **OnPremisesCassandra**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| host |One or more IP addresses or host names of Cassandra servers.<br/><br/>Specify a comma-separated list of IP addresses or host names to connect to all servers concurrently. |Yes | -| port |The TCP port that the Cassandra server uses to listen for client connections. |No, default value: 9042 | -| authenticationType |Basic, or Anonymous |Yes | -| username |Specify user name for the user account. |Yes, if authenticationType is set to Basic. | -| password |Specify password for the user account. |Yes, if authenticationType is set to Basic. | -| gatewayName |The name of the gateway that is used to connect to the on-premises Cassandra database. |Yes | -| encryptedCredential |Credential encrypted by the gateway. |No | --#### Example --```json -{ - "name": "CassandraLinkedService", - "properties": { - "type": "OnPremisesCassandra", - "typeProperties": { - "authenticationType": "Basic", - "host": "<cassandra server name or IP address>", - "port": 9042, - "username": "user", - "password": "password", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see [Cassandra connector](data-factory-onprem-cassandra-connector.md#linked-service-properties) article. --### Dataset -To define a Cassandra dataset, set the **type** of the dataset to **CassandraTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| keyspace |Name of the keyspace or schema in Cassandra database. |Yes (If **query** for **CassandraSource** is not defined). | -| tableName |Name of the table in Cassandra database. |Yes (If **query** for **CassandraSource** is not defined). | --#### Example --```json -{ - "name": "CassandraInput", - "properties": { - "linkedServiceName": "CassandraLinkedService", - "type": "CassandraTable", - "typeProperties": { - "tableName": "mytable", - "keySpace": "<key space>" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Cassandra connector](data-factory-onprem-cassandra-connector.md#dataset-properties) article. 
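Per the table above, **keyspace** and **tableName** are required only when the copy activity source does not define a query. If you plan to supply a query (see the next section), a dataset along the following lines should be enough; treat the empty **typeProperties** section as an assumption modeled on the relational datasets earlier in this article, and the dataset name as illustrative.

```json
{
    "name": "CassandraQueryInput",
    "properties": {
        "linkedServiceName": "CassandraLinkedService",
        "type": "CassandraTable",
        "typeProperties": {},
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true
    }
}
```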
--### Cassandra Source in Copy Activity -If you are copying data from Cassandra, set the **source type** of the copy activity to **CassandraSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL-92 query or CQL query. See [CQL reference](https://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlReferenceTOC.html). <br/><br/>When using SQL query, specify **keyspace name.table name** to represent the table you want to query. |No (if tableName and keyspace on dataset are defined). | -| consistencyLevel |The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. |ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. See [Configuring data consistency](https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html) for details. |No. Default value is ONE. | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "CassandraToAzureBlob", - "description": "Copy from Cassandra to an Azure blob", - "type": "Copy", - "inputs": [{ - "name": "CassandraInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "CassandraSource", - "query": "select id, firstname, lastname from mykeyspace.mytable" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Cassandra connector](data-factory-onprem-cassandra-connector.md#copy-activity-properties) article. --## MongoDB --### Linked service -To define a MongoDB linked service, set the **type** of the linked service to **OnPremisesMongoDB**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| server |IP address or host name of the MongoDB server. |Yes | -| port |TCP port that the MongoDB server uses to listen for client connections. |Optional, default value: 27017 | -| authenticationType |Basic, or Anonymous. |Yes | -| username |User account to access MongoDB. |Yes (if basic authentication is used). | -| password |Password for the user. |Yes (if basic authentication is used). | -| authSource |Name of the MongoDB database that you want to use to check your credentials for authentication. |Optional (if basic authentication is used). default: uses the admin account and the database specified using databaseName property. | -| databaseName |Name of the MongoDB database that you want to access. |Yes | -| gatewayName |Name of the gateway that accesses the data store. |Yes | -| encryptedCredential |Credential encrypted by gateway. 
|Optional | --#### Example --```json -{ - "name": "OnPremisesMongoDbLinkedService", - "properties": { - "type": "OnPremisesMongoDb", - "typeProperties": { - "authenticationType": "<Basic or Anonymous>", - "server": "< The IP address or host name of the MongoDB server >", - "port": "<The number of the TCP port that the MongoDB server uses to listen for client connections.>", - "username": "<username>", - "password": "<password>", - "authSource": "< The database that you want to use to check your credentials for authentication. >", - "databaseName": "<database name>", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see [MongoDB connector article](data-factory-on-premises-mongodb-connector.md#linked-service-properties) --### Dataset -To define a MongoDB dataset, set the **type** of the dataset to **MongoDbCollection**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| collectionName |Name of the collection in MongoDB database. |Yes | --#### Example --```json -{ - "name": "MongoDbInputDataset", - "properties": { - "type": "MongoDbCollection", - "linkedServiceName": "OnPremisesMongoDbLinkedService", - "typeProperties": { - "collectionName": "<Collection name>" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --For more information, see [MongoDB connector article](data-factory-on-premises-mongodb-connector.md#dataset-properties) --#### MongoDB Source in Copy Activity -If you are copying data from MongoDB, set the **source type** of the copy activity to **MongoDbSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL-92 query string. For example: `select * from MyTable`. |No (if **collectionName** of **dataset** is specified) | --#### Example --```json -{ - "name": "CopyMongoDBToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "MongoDbSource", - "query": "select * from MyTable" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "MongoDbInputDataset" - }], - "outputs": [{ - "name": "AzureBlobOutputDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MongoDBToAzureBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see [MongoDB connector article](data-factory-on-premises-mongodb-connector.md#copy-activity-properties) --## Amazon S3 ---### Linked service -To define an Amazon S3 linked service, set the **type** of the linked service to **AwsAccessKey**, and specify following properties in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| accessKeyID |ID of the secret access key. |string |Yes | -| secretAccessKey |The secret access key itself. |Encrypted secret string |Yes | --#### Example -```json -{ - "name": "AmazonS3LinkedService", - "properties": { - "type": "AwsAccessKey", - "typeProperties": { - "accessKeyId": "<access key id>", - "secretAccessKey": "<secret access key>" - } - } -} -``` --For more information, see [Amazon S3 connector article](data-factory-amazon-simple-storage-service-connector.md#linked-service-properties). 
--### Dataset -To define an Amazon S3 dataset, set the **type** of the dataset to **AmazonS3**, and specify the following properties in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| bucketName |The S3 bucket name. |String |Yes | -| key |The S3 object key. |String |No | -| prefix |Prefix for the S3 object key. Objects whose keys start with this prefix are selected. Applies only when key is empty. |String |No | -| version |The version of S3 object if S3 versioning is enabled. |String |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. The supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | | ---> [!NOTE] -> bucketName + key specifies the location of the S3 object where bucket is the root container for S3 objects and key is the full path to S3 object. --#### Example: Sample dataset with prefix --```json -{ - "name": "dataset-s3", - "properties": { - "type": "AmazonS3", - "linkedServiceName": "link- testS3", - "typeProperties": { - "prefix": "testFolder/test", - "bucketName": "<S3 bucket name>", - "format": { - "type": "OrcFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` -#### Example: Sample data set (with version) --```json -{ - "name": "dataset-s3", - "properties": { - "type": "AmazonS3", - "linkedServiceName": "link- testS3", - "typeProperties": { - "key": "testFolder/test.orc", - "bucketName": "<S3 bucket name>", - "version": "XXXXXXXXXczm0CJajYkHf0_k6LhBmkcL", - "format": { - "type": "OrcFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --#### Example: Dynamic paths for S3 -In the sample, we use fixed values for key and bucketName properties in the Amazon S3 dataset. --```json -"key": "testFolder/test.orc", -"bucketName": "<S3 bucket name>", -``` --You can have Data Factory calculate the key and bucketName dynamically at runtime by using system variables such as SliceStart. --```json -"key": "$$Text.Format('{0:MM}/{0:dd}/test.orc', SliceStart)" -"bucketName": "$$Text.Format('{0:yyyy}', SliceStart)" -``` --You can do the same for the prefix property of an Amazon S3 dataset. See [Data Factory functions and system variables](data-factory-functions-variables.md) for a list of supported functions and variables. --For more information, see [Amazon S3 connector article](data-factory-amazon-simple-storage-service-connector.md#dataset-properties). 
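
As noted above, the **prefix** property can also be computed at runtime with the same system variables. A minimal sketch, assuming an illustrative `logs/yyyy/MM/dd/` folder layout that is not part of the original example:

```json
"prefix": "$$Text.Format('logs/{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)",
"bucketName": "<S3 bucket name>"
```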
--### File System Source in Copy Activity -If you are copying data from Amazon S3, set the **source type** of the copy activity to **FileSystemSource**, and specify following properties in the **source** section: ---| Property | Description | Allowed values | Required | -| | | | | -| recursive |Specifies whether to recursively list S3 objects under the directory. |true/false |No | ---#### Example ---```json -{ - "name": "CopyAmazonS3ToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource", - "recursive": true - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "AmazonS3InputDataset" - }], - "outputs": [{ - "name": "AzureBlobOutputDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "AmazonS3ToBlob" - }], - "start": "2016-08-08T18:00:00", - "end": "2016-08-08T19:00:00" - } -} -``` --For more information, see [Amazon S3 connector article](data-factory-amazon-simple-storage-service-connector.md#copy-activity-properties). --## File System ---### Linked service -You can link an on-premises file system to an Azure data factory with the **On-premises File Server** linked service. The following table provides descriptions for JSON elements that are specific to the On-premises File Server linked service. --| Property | Description | Required | -| | | | -| type |Ensure that the type property is set to **OnPremisesFileServer**. |Yes | -| host |Specifies the root path of the folder that you want to copy. Use the escape character ‘ \ ’ for special characters in the string. See Sample linked service and dataset definitions for examples. |Yes | -| userid |Specify the ID of the user who has access to the server. |No (if you choose encryptedCredential) | -| password |Specify the password for the user (userid). |No (if you choose encryptedCredential | -| encryptedCredential |Specify the encrypted credentials that you can get by running the New-AzDataFactoryEncryptValue cmdlet. |No (if you choose to specify userid and password in plain text) | -| gatewayName |Specifies the name of the gateway that Data Factory should use to connect to the on-premises file server. 
|Yes | --#### Sample folder path definitions --| Scenario | Host in linked service definition | folderPath in dataset definition | -| | | | -| Local folder on Data Management Gateway machine: <br/><br/>Examples: D:\\\* or D:\folder\subfolder\\* |D:\\\\ (for Data Management Gateway 2.0 and later versions) <br/><br/> localhost (for earlier versions than Data Management Gateway 2.0) |.\\\\ or folder\\\\subfolder (for Data Management Gateway 2.0 and later versions) <br/><br/>D:\\\\ or D:\\\\folder\\\\subfolder (for gateway version below 2.0) | -| Remote shared folder: <br/><br/>Examples: \\\\myserver\\share\\\* or \\\\myserver\\share\\folder\\subfolder\\* |\\\\\\\\myserver\\\\share |.\\\\ or folder\\\\subfolder | ---#### Example: Using username and password in plain text --```json -{ - "Name": "OnPremisesFileServerLinkedService", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "\\\\Contosogame-Asia", - "userid": "Admin", - "password": "123456", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example: Using encryptedcredential --```json -{ - "Name": " OnPremisesFileServerLinkedService ", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "D:\\", - "encryptedCredential": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5xxxxxxxxxxxxxxxxx", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see [File System connector article](data-factory-onprem-file-system-connector.md#linked-service-properties). --### Dataset -To define a File System dataset, set the **type** of the dataset to **FileShare**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| folderPath |Specifies the subpath to the folder. Use the escape character ‘\’ for special characters in the string. See Sample linked service and dataset definitions for examples.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file is in the following format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the folderPath rather than all files. <br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Example 1: "fileFilter": "*.log"<br/>Example 2: "fileFilter": 2016-1-?.txt"<br/><br/>Note that fileFilter is applicable for an input FileShare dataset. |No | -| partitionedBy |You can use partitionedBy to specify a dynamic folderPath/fileName for time-series data. An example is folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. 
For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**; and supported levels are: **Optimal** and **Fastest**. see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --> [!NOTE] -> You cannot use fileName and fileFilter simultaneously. --#### Example --```json -{ - "name": "OnpremisesFileSystemInput", - "properties": { - "type": " FileShare", - "linkedServiceName": " OnPremisesFileServerLinkedService ", - "typeProperties": { - "folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [{ - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - }] - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [File System connector article](data-factory-onprem-file-system-connector.md#dataset-properties). --### File System Source in Copy Activity -If you are copying data from File System, set the **source type** of the copy activity to **FileSystemSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the subfolders or only from the specified folder. |True, False (default) |No | --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2015-06-01T18:00:00", - "end": "2015-06-01T19:00:00", - "description": "Pipeline for copy activity", - "activities": [{ - "name": "OnpremisesFileSystemtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [{ - "name": "OnpremisesFileSystemInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` -For more information, see [File System connector article](data-factory-onprem-file-system-connector.md#copy-activity-properties). 
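
The **fileFilter** property described in the dataset section above can be used instead of a fixed **fileName** when only a subset of files should be read (recall that the two properties cannot be combined). The following is a minimal input dataset sketch, assuming the on-premises File Server linked service defined earlier; the dataset name and folder path placeholder are illustrative:

```json
{
    "name": "OnpremisesFileSystemLogInput",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": "OnPremisesFileServerLinkedService",
        "typeProperties": {
            "folderPath": "<path to shared folder>",
            "fileFilter": "*.log"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```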

### File System Sink in Copy Activity
If you are copying data to File System, set the **sink type** of the copy activity to **FileSystemSink**, and specify following properties in the **sink** section:

| Property | Description | Allowed values | Required |
| --- | --- | --- | --- |
| copyBehavior |Defines the copy behavior when the source is BlobSource or FileSystem. |**PreserveHierarchy:** Preserves the file hierarchy in the target folder. That is, the relative path of the source file to the source folder is the same as the relative path of the target file to the target folder.<br/><br/>**FlattenHierarchy:** All files from the source folder are created in the first level of target folder. The target files are created with an autogenerated name.<br/><br/>**MergeFiles:** Merges all files from the source folder to one file. If the file name/blob name is specified, the merged file name is the specified name. Otherwise, it is an auto-generated file name. |No |

#### Example

```json
{
    "name": "SamplePipeline",
    "properties": {
        "start": "2015-06-01T18:00:00",
        "end": "2015-06-01T20:00:00",
        "description": "pipeline for copy activity",
        "activities": [{
            "name": "AzureSQLtoOnPremisesFile",
            "description": "copy activity",
            "type": "Copy",
            "inputs": [{
                "name": "AzureSQLInput"
            }],
            "outputs": [{
                "name": "OnpremisesFileSystemOutput"
            }],
            "typeProperties": {
                "source": {
                    "type": "SqlSource",
                    "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd}\\'', WindowStart, WindowEnd)"
                },
                "sink": {
                    "type": "FileSystemSink"
                }
            },
            "scheduler": {
                "frequency": "Hour",
                "interval": 1
            },
            "policy": {
                "concurrency": 1,
                "executionPriorityOrder": "OldestFirst",
                "retry": 3,
                "timeout": "01:00:00"
            }
        }]
    }
}
```

For more information, see [File System connector article](data-factory-onprem-file-system-connector.md#copy-activity-properties).
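
To consolidate or flatten the copied files rather than mirror the source folder structure, add the **copyBehavior** property described above to the sink. A minimal sketch of the sink section only; the chosen value is just one of the allowed options:

```json
"sink": {
    "type": "FileSystemSink",
    "copyBehavior": "MergeFiles"
}
```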
--## FTP --### Linked service -To define an FTP linked service, set the **type** of the linked service to **FtpServer**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | Default | -| | | | | -| host |Name or IP address of the FTP Server |Yes | | -| authenticationType |Specify authentication type |Yes |Basic, Anonymous | -| username |User who has access to the FTP server |No | | -| password |Password for the user (username) |No | | -| encryptedCredential |Encrypted credential to access the FTP server |No | | -| gatewayName |Name of the Data Management Gateway to connect to an on-premises FTP server |No | | -| port |Port on which the FTP server is listening |No |21 | -| enableSsl |Specify whether to use FTP over SSL/TLS channel |No |true | -| enableServerCertificateValidation |Specify whether to enable server TLS/SSL certificate validation when using FTP over SSL/TLS channel |No |true | --#### Example: Using Anonymous authentication --```json -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "authenticationType": "Anonymous", - "host": "myftpserver.com" - } - } -} -``` --#### Example: Using username and password in plain text for basic authentication --```json -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "username": "Admin", - "password": "123456" - } - } -} -``` --#### Example: Using port, enableSsl, enableServerCertificateValidation --```json -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "username": "Admin", - "password": "123456", - "port": "21", - "enableSsl": true, - "enableServerCertificateValidation": true - } - } -} -``` --#### Example: Using encryptedCredential for authentication and gateway --```json -{ - "name": "FTPLinkedService", - "properties": { - "type": "FtpServer", - "typeProperties": { - "host": "myftpserver.com", - "authenticationType": "Basic", - "encryptedCredential": "xxxxxxxxxxxxxxxxx", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see [FTP connector](data-factory-ftp-connector.md#linked-service-properties) article. --### Dataset -To define an FTP dataset, set the **type** of the dataset to **FileShare**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| folderPath |Sub path to the folder. Use escape character ‘ \ ’ for special characters in the string. See Sample linked service and dataset definitions for examples.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. 
If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the folderPath rather than all files.<br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Examples 1: `"fileFilter": "*.log"`<br/>Example 2: `"fileFilter": 2016-1-?.txt"`<br/><br/> fileFilter is applicable for an input FileShare dataset. This property is not supported with HDFS. |No | -| partitionedBy |partitionedBy can be used to specify a dynamic folderPath, filename for time series data. For example, folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**; and supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | -| useBinaryTransfer |Specify whether use Binary transfer mode. True for binary mode and false ASCII. Default value: True. This property can only be used when associated linked service type is of type: FtpServer. |No | --> [!NOTE] -> filename and fileFilter cannot be used simultaneously. --#### Example --```json -{ - "name": "FTPFileInput", - "properties": { - "type": "FileShare", - "linkedServiceName": "FTPLinkedService", - "typeProperties": { - "folderPath": "<path to shared folder>", - "fileName": "test.csv", - "useBinaryTransfer": true - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --For more information, see [FTP connector](data-factory-ftp-connector.md#dataset-properties) article. --### File System Source in Copy Activity -If you are copying data from an FTP server, set the **source type** of the copy activity to **FileSystemSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. 
|True, False (default) |No | --#### Example --```json -{ - "name": "pipeline", - "properties": { - "activities": [{ - "name": "FTPToBlobCopy", - "inputs": [{ - "name": "FtpFileInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - }], - "start": "2016-08-24T18:00:00", - "end": "2016-08-24T19:00:00" - } -} -``` --For more information, see [FTP connector](data-factory-ftp-connector.md#copy-activity-properties) article. ---## HDFS --### Linked service -To define a HDFS linked service, set the **type** of the linked service to **Hdfs**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| type |The type property must be set to: **Hdfs** |Yes | -| Url |URL to the HDFS |Yes | -| authenticationType |Anonymous, or Windows. <br><br> To use **Kerberos authentication** for HDFS connector, refer to this section to set up your on-premises environment accordingly. |Yes | -| userName |Username for Windows authentication. |Yes (for Windows Authentication) | -| password |Password for Windows authentication. |Yes (for Windows Authentication) | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the HDFS. |Yes | -| encryptedCredential |[New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) output of the access credential. |No | --#### Example: Using Anonymous authentication --```json -{ - "name": "HDFSLinkedService", - "properties": { - "type": "Hdfs", - "typeProperties": { - "authenticationType": "Anonymous", - "userName": "hadoop", - "url": "http://<machine>:50070/webhdfs/v1/", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example: Using Windows authentication --```json -{ - "name": "HDFSLinkedService", - "properties": { - "type": "Hdfs", - "typeProperties": { - "authenticationType": "Windows", - "userName": "Administrator", - "password": "password", - "url": "http://<machine>:50070/webhdfs/v1/", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see HDFS connector article. --### Dataset -To define a HDFS dataset, set the **type** of the dataset to **FileShare**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| folderPath |Path to the folder. Example: `myfolder`<br/><br/>Use escape character ‘ \ ’ for special characters in the string. For example: for folder\subfolder, specify folder\\\\subfolder and for d:\samplefolder, specify d:\\\\samplefolder.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: <br/><br/>`Data.<Guid>.txt` (for example: : Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt |No | -| partitionedBy |partitionedBy can be used to specify a dynamic folderPath, filename for time series data. 
Example: folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --> [!NOTE] -> filename and fileFilter cannot be used simultaneously. --#### Example --```json -{ - "name": "InputDataset", - "properties": { - "type": "FileShare", - "linkedServiceName": "HDFSLinkedService", - "typeProperties": { - "folderPath": "DataTransfer/UnitTest/" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --For more information, see HDFS connector article. --### File System Source in Copy Activity -If you are copying data from HDFS, set the **source type** of the copy activity to **FileSystemSource**, and specify following properties in the **source** section: --**FileSystemSource** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. |True, False (default) |No | --#### Example --```json -{ - "name": "pipeline", - "properties": { - "activities": [{ - "name": "HdfsToBlobCopy", - "inputs": [{ - "name": "InputDataset" - }], - "outputs": [{ - "name": "OutputDataset" - }], - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see HDFS connector article. --## SFTP ---### Linked service -To define an SFTP linked service, set the **type** of the linked service to **Sftp**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| host | Name or IP address of the SFTP server. |Yes | -| port |Port on which the SFTP server is listening. The default value is: 21 |No | -| authenticationType |Specify authentication type. Allowed values: **Basic**, **SshPublicKey**. <br><br> Refer to Using basic authentication and [Using SSH public key authentication](#using-ssh-public-key-authentication) sections on more properties and JSON samples respectively. |Yes | -| skipHostKeyValidation | Specify whether to skip host key validation. | No. 
The default value: false | -| hostKeyFingerprint | Specify the finger print of the host key. | Yes if the `skipHostKeyValidation` is set to false. | -| gatewayName |Name of the Data Management Gateway to connect to an on-premises SFTP server. | Yes if copying data from an on-premises SFTP server. | -| encryptedCredential | Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in copy wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises SFTP server. | --#### Example: Using basic authentication --To use basic authentication, set `authenticationType` as `Basic`, and specify the following properties besides the SFTP connector generic ones introduced in the last section: --| Property | Description | Required | -| | | | -| username | User who has access to the SFTP server. |Yes | -| password | Password for the user (username). | Yes | --```json -{ - "name": "SftpLinkedService", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "<SFTP server name or IP address>", - "port": 22, - "authenticationType": "Basic", - "username": "xxx", - "password": "xxx", - "skipHostKeyValidation": false, - "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example: **Basic authentication with encrypted credential** --```json -{ - "name": "SftpLinkedService", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "<FTP server name or IP address>", - "port": 22, - "authenticationType": "Basic", - "username": "xxx", - "encryptedCredential": "xxxxxxxxxxxxxxxxx", - "skipHostKeyValidation": false, - "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### **Using SSH public key authentication:** --To use basic authentication, set `authenticationType` as `SshPublicKey`, and specify the following properties besides the SFTP connector generic ones introduced in the last section: --| Property | Description | Required | -| | | | -| username |User who has access to the SFTP server |Yes | -| privateKeyPath | Specify absolute path to the private key file that gateway can access. | Specify either the `privateKeyPath` or `privateKeyContent`. <br><br> Apply only when copying data from an on-premises SFTP server. | -| privateKeyContent | A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. | Specify either the `privateKeyPath` or `privateKeyContent`. | -| passPhrase | Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. | Yes if the private key file is protected by a pass phrase. 
| --```json -{ - "name": "SftpLinkedServiceWithPrivateKeyPath", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "<FTP server name or IP address>", - "port": 22, - "authenticationType": "SshPublicKey", - "username": "xxx", - "privateKeyPath": "D:\\privatekey_openssh", - "passPhrase": "xxx", - "skipHostKeyValidation": true, - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example: **SshPublicKey authentication using private key content** --```json -{ - "name": "SftpLinkedServiceWithPrivateKeyContent", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver.westus.cloudapp.azure.com", - "port": 22, - "authenticationType": "SshPublicKey", - "username": "xxx", - "privateKeyContent": "<base64 string of the private key content>", - "passPhrase": "xxx", - "skipHostKeyValidation": true - } - } -} -``` --For more information, see [SFTP connector](data-factory-sftp-connector.md#linked-service-properties) article. --### Dataset -To define an SFTP dataset, set the **type** of the dataset to **FileShare**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| folderPath |Sub path to the folder. Use escape character ‘ \ ’ for special characters in the string. See Sample linked service and dataset definitions for examples.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following this format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the folderPath rather than all files.<br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Examples 1: `"fileFilter": "*.log"`<br/>Example 2: `"fileFilter": 2016-1-?.txt"`<br/><br/> fileFilter is applicable for an input FileShare dataset. This property is not supported with HDFS. |No | -| partitionedBy |partitionedBy can be used to specify a dynamic folderPath, filename for time series data. For example, folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. 
For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | -| useBinaryTransfer |Specify whether use Binary transfer mode. True for binary mode and false ASCII. Default value: True. This property can only be used when associated linked service type is of type: FtpServer. |No | --> [!NOTE] -> filename and fileFilter cannot be used simultaneously. --#### Example --```json -{ - "name": "SFTPFileInput", - "properties": { - "type": "FileShare", - "linkedServiceName": "SftpLinkedService", - "typeProperties": { - "folderPath": "<path to shared folder>", - "fileName": "test.csv" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --For more information, see [SFTP connector](data-factory-sftp-connector.md#dataset-properties) article. --### File System Source in Copy Activity -If you are copying data from an SFTP source, set the **source type** of the copy activity to **FileSystemSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the sub folders or only from the specified folder. |True, False (default) |No | ----#### Example --```json -{ - "name": "pipeline", - "properties": { - "activities": [{ - "name": "SFTPToBlobCopy", - "inputs": [{ - "name": "SFTPFileInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - }], - "start": "2017-02-20T18:00:00", - "end": "2017-02-20T19:00:00" - } -} -``` --For more information, see [SFTP connector](data-factory-sftp-connector.md#copy-activity-properties) article. ---## HTTP --### Linked service -To define an HTTP linked service, set the **type** of the linked service to **Http**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| url | Base URL to the Web Server | Yes | -| authenticationType | Specifies the authentication type. Allowed values are: **Anonymous**, **Basic**, **Digest**, **Windows**, **ClientCertificate**. <br><br> Refer to sections below this table on more properties and JSON samples for those authentication types respectively. | Yes | -| enableServerCertificateValidation | Specify whether to enable server TLS/SSL certificate validation if source is HTTPS Web Server | No, default is true | -| gatewayName | Name of the Data Management Gateway to connect to an on-premises HTTP source. | Yes if copying data from an on-premises HTTP source. | -| encryptedCredential | Encrypted credential to access the HTTP endpoint. Auto-generated when you configure the authentication information in copy wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises HTTP server. | --#### Example: Using Basic, Digest, or Windows authentication -Set `authenticationType` as `Basic`, `Digest`, or `Windows`, and specify the following properties besides the HTTP connector generic ones introduced above: --| Property | Description | Required | -| | | | -| username | Username to access the HTTP endpoint. | Yes | -| password | Password for the user (username). 
| Yes | --```json -{ - "name": "HttpLinkedService", - "properties": { - "type": "Http", - "typeProperties": { - "authenticationType": "basic", - "url": "https://en.wikipedia.org/wiki/", - "userName": "user name", - "password": "password" - } - } -} -``` --#### Example: Using ClientCertificate authentication --To use basic authentication, set `authenticationType` as `ClientCertificate`, and specify the following properties besides the HTTP connector generic ones introduced above: --| Property | Description | Required | -| | | | -| embeddedCertData | The Base64-encoded contents of binary data of the Personal Information Exchange (PFX) file. | Specify either the `embeddedCertData` or `certThumbprint`. | -| certThumbprint | The thumbprint of the certificate that was installed on your gateway machine’s cert store. Apply only when copying data from an on-premises HTTP source. | Specify either the `embeddedCertData` or `certThumbprint`. | -| password | Password associated with the certificate. | No | --If you use `certThumbprint` for authentication and the certificate is installed in the personal store of the local computer, you need to grant the read permission to the gateway service: --1. Launch Microsoft Management Console (MMC). Add the **Certificates** snap-in that targets the **Local Computer**. -2. Expand **Certificates**, **Personal**, and click **Certificates**. -3. Right-click the certificate from the personal store, and select **All Tasks**->**Manage Private Keys...** -3. On the **Security** tab, add the user account under which Data Management Gateway Host Service is running with the read access to the certificate. --**Example: using client certificate:** -This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate that is installed on the machine with Data Management Gateway installed. --```json -{ - "name": "HttpLinkedService", - "properties": { - "type": "Http", - "typeProperties": { - "authenticationType": "ClientCertificate", - "url": "https://en.wikipedia.org/wiki/", - "certThumbprint": "thumbprint of certificate", - "gatewayName": "gateway name" - } - } -} -``` --#### Example: using client certificate in a file -This linked service links your data factory to an on-premises HTTP web server. It uses a client certificate file on the machine with Data Management Gateway installed. --```json -{ - "name": "HttpLinkedService", - "properties": { - "type": "Http", - "typeProperties": { - "authenticationType": "ClientCertificate", - "url": "https://en.wikipedia.org/wiki/", - "embeddedCertData": "base64 encoded cert data", - "password": "password of cert" - } - } -} -``` --For more information, see [HTTP connector](data-factory-http-connector.md#linked-service-properties) article. --### Dataset -To define an HTTP dataset, set the **type** of the dataset to **Http**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| relativeUrl | A relative URL to the resource that contains the data. When path is not specified, only the URL specified in the linked service definition is used. <br><br> To construct dynamic URL, you can use [Data Factory functions and system variables](data-factory-functions-variables.md), Example: `"relativeUrl": "$$Text.Format('/my/report?month={0:yyyy}-{0:MM}&fmt=csv', SliceStart)"`. | No | -| requestMethod | Http method. Allowed values are **GET** or **POST**. | No. Default is `GET`. | -| additionalHeaders | Additional HTTP request headers. 
| No | -| requestBody | Body for HTTP request. | No | -| format | If you want to simply **retrieve the data from HTTP endpoint as-is** without parsing it, skip this format settings. <br><br> If you want to parse the HTTP response content during copy, the following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --#### Example: using the GET (default) method --```json -{ - "name": "HttpSourceDataInput", - "properties": { - "type": "Http", - "linkedServiceName": "HttpLinkedService", - "typeProperties": { - "relativeUrl": "XXX/test.xml", - "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --#### Example: using the POST method --```json -{ - "name": "HttpSourceDataInput", - "properties": { - "type": "Http", - "linkedServiceName": "HttpLinkedService", - "typeProperties": { - "relativeUrl": "/XXX/test.xml", - "requestMethod": "Post", - "requestBody": "body for POST HTTP request" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -For more information, see [HTTP connector](data-factory-http-connector.md#dataset-properties) article. --### HTTP Source in Copy Activity -If you are copying data from an HTTP source, set the **source type** of the copy activity to **HttpSource**, and specify following properties in the **source** section: --| Property | Description | Required | -| -- | -- | -- | -| httpRequestTimeout | The timeout (TimeSpan) for the HTTP request to get a response. It is the timeout to get a response, not the timeout to read response data. | No. Default value: 00:01:40 | ---#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "HttpSourceToAzureBlob", - "description": "Copy from an HTTP source to an Azure blob", - "type": "Copy", - "inputs": [{ - "name": "HttpSourceDataInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "HttpSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [HTTP connector](data-factory-http-connector.md#copy-activity-properties) article. 
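
If the endpoint needs longer than the default 00:01:40 to return a response, the **httpRequestTimeout** property described above can be set explicitly on the source. A minimal sketch of the source section only; the five-minute value is illustrative:

```json
"source": {
    "type": "HttpSource",
    "httpRequestTimeout": "00:05:00"
}
```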
--## OData --### Linked service -To define an OData linked service, set the **type** of the linked service to **OData**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| url |Url of the OData service. |Yes | -| authenticationType |Type of authentication used to connect to the OData source. <br/><br/> For cloud OData, possible values are Anonymous, Basic, and OAuth (note Azure Data Factory currently only support Microsoft Entra ID based OAuth). <br/><br/> For on-premises OData, possible values are Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic authentication. |Yes (only if you are using Basic authentication) | -| password |Specify password for the user account you specified for the username. |Yes (only if you are using Basic authentication) | -| authorizedCredential |If you are using OAuth, click **Authorize** button in the Data Factory Copy Wizard or Editor and enter your credential, then the value of this property will be auto-generated. |Yes (only if you are using OAuth authentication) | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises OData service. Specify only if you are copying data from on premises OData source. |No | --#### Example - Using Basic authentication -```json -{ - "name": "inputLinkedService", - "properties": { - "type": "OData", - "typeProperties": { - "url": "https://services.odata.org/OData/OData.svc", - "authenticationType": "Basic", - "username": "username", - "password": "password" - } - } -} -``` --#### Example - Using Anonymous authentication --```json -{ - "name": "ODataLinkedService", - "properties": { - "type": "OData", - "typeProperties": { - "url": "https://services.odata.org/OData/OData.svc", - "authenticationType": "Anonymous" - } - } -} -``` --#### Example - Using Windows authentication accessing on-premises OData source --```json -{ - "name": "inputLinkedService", - "properties": { - "type": "OData", - "typeProperties": { - "url": "<endpoint of on-premises OData source, for example, Dynamics CRM>", - "authenticationType": "Windows", - "username": "domain\\user", - "password": "password", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example - Using OAuth authentication accessing cloud OData source -```json -{ - "name": "inputLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "<endpoint of cloud OData source, for example, https://<tenant>.crm.dynamics.com/XRMServices/2011/OrganizationData.svc>", - "authenticationType": "OAuth", - "authorizedCredential": "<auto generated by clicking the Authorize button on UI>" - } - } -} -``` --For more information, see [OData connector](data-factory-odata-connector.md#linked-service-properties) article. 
--### Dataset -To define an OData dataset, set the **type** of the dataset to **ODataResource**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| path |Path to the OData resource |No | --#### Example --```json -{ - "name": "ODataDataset", - "properties": { - "type": "ODataResource", - "typeProperties": { - "path": "Products" - }, - "linkedServiceName": "ODataLinkedService", - "structure": [], - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } -} -``` --For more information, see [OData connector](data-factory-odata-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from an OData source, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Example | Required | -| | | | | -| query |Use the custom query to read data. |"?$select=Name, Description&$top=5" |No | --#### Example --```json -{ - "name": "CopyODataToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "?$select=Name, Description&$top=5" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "ODataDataSet" - }], - "outputs": [{ - "name": "AzureBlobODataDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "ODataToBlob" - }], - "start": "2017-02-01T18:00:00", - "end": "2017-02-03T19:00:00" - } -} -``` --For more information, see [OData connector](data-factory-odata-connector.md#copy-activity-properties) article. ---## ODBC ---### Linked service -To define an ODBC linked service, set the **type** of the linked service to **OnPremisesOdbc**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| connectionString |The non-access credential portion of the connection string and an optional encrypted credential. See examples in the following sections. |Yes | -| credential |The access credential portion of the connection string specified in driver-specific property-value format. Example: `“Uid=<user ID>;Pwd=<password>;RefreshToken=<secret refresh token>;”.` |No | -| authenticationType |Type of authentication used to connect to the ODBC data store. Possible values are: Anonymous and Basic. |Yes | -| username |Specify user name if you are using Basic authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the ODBC data store. 
|Yes | --#### Example - Using Basic authentication --```json -{ - "name": "ODBCLinkedService", - "properties": { - "type": "OnPremisesOdbc", - "typeProperties": { - "authenticationType": "Basic", - "connectionString": "Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;", - "userName": "username", - "password": "password", - "gatewayName": "<onpremgateway>" - } - } -} -``` -#### Example - Using Basic authentication with encrypted credentials -You can encrypt the credentials using the [New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) cmdlet. --```json -{ - "name": "ODBCLinkedService", - "properties": { - "type": "OnPremisesOdbc", - "typeProperties": { - "authenticationType": "Basic", - "connectionString": "Driver={SQL Server};Server=myserver.database.windows.net; Database=TestDatabase;;EncryptedCredential=eyJDb25uZWN0...........................", - "gatewayName": "<onpremgateway>" - } - } -} -``` --#### Example: Using Anonymous authentication --```json -{ - "name": "ODBCLinkedService", - "properties": { - "type": "OnPremisesOdbc", - "typeProperties": { - "authenticationType": "Anonymous", - "connectionString": "Driver={SQL Server};Server={servername}.database.windows.net; Database=TestDatabase;", - "credential": "UID={uid};PWD={pwd}", - "gatewayName": "<onpremgateway>" - } - } -} -``` --For more information, see [ODBC connector](data-factory-odbc-connector.md#linked-service-properties) article. --### Dataset -To define an ODBC dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the ODBC data store. |Yes | ---#### Example --```json -{ - "name": "ODBCDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "ODBCLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [ODBC connector](data-factory-odbc-connector.md#dataset-properties) article. --### Relational Source in Copy Activity -If you are copying data from an ODBC data store, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: `select * from MyTable`. 
|Yes | --#### Example --```json -{ - "name": "CopyODBCToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [{ - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [{ - "name": "OdbcDataSet" - }], - "outputs": [{ - "name": "AzureBlobOdbcDataSet" - }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "OdbcToBlob" - }], - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00" - } -} -``` --For more information, see [ODBC connector](data-factory-odbc-connector.md#copy-activity-properties) article. --## Salesforce ---### Linked service -To define a Salesforce linked service, set the **type** of the linked service to **Salesforce**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| environmentUrl | Specify the URL of Salesforce instance. <br><br> - Default is "https:\//login.salesforce.com". <br> - To copy data from sandbox, specify "https://test.salesforce.com". <br> - To copy data from custom domain, specify, for example, "https://[domain].my.salesforce.com". |No | -| username |Specify a user name for the user account. |Yes | -| password |Specify a password for the user account. |Yes | -| securityToken |Specify a security token for the user account. See [Get security token](https://help.salesforce.com/apex/HTViewHelpDoc?id=user_security_token.htm) for instructions on how to reset/get a security token. To learn about security tokens in general, see [Security and the API](https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_concepts_security.htm). |Yes | --#### Example --```json -{ - "name": "SalesforceLinkedService", - "properties": { - "type": "Salesforce", - "typeProperties": { - "username": "<user name>", - "password": "<password>", - "securityToken": "<security token>" - } - } -} -``` --For more information, see [Salesforce connector](data-factory-salesforce-connector.md#linked-service-properties) article. --### Dataset -To define a Salesforce dataset, set the **type** of the dataset to **RelationalTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| tableName |Name of the table in Salesforce. |No (if a **query** of **RelationalSource** is specified) | --#### Example --```json -{ - "name": "SalesforceInput", - "properties": { - "linkedServiceName": "SalesforceLinkedService", - "type": "RelationalTable", - "typeProperties": { - "tableName": "AllDataType__c" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --For more information, see [Salesforce connector](data-factory-salesforce-connector.md#dataset-properties) article. 
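
To copy from a Salesforce sandbox instead of the default production instance, add the **environmentUrl** property to the linked service shown earlier. A minimal sketch, assuming sandbox credentials and a hypothetical linked service name:

```json
{
    "name": "SalesforceSandboxLinkedService",
    "properties": {
        "type": "Salesforce",
        "typeProperties": {
            "environmentUrl": "https://test.salesforce.com",
            "username": "<user name>",
            "password": "<password>",
            "securityToken": "<security token>"
        }
    }
}
```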
--### Relational Source in Copy Activity -If you are copying data from Salesforce, set the **source type** of the copy activity to **RelationalSource**, and specify following properties in the **source** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |A SQL-92 query or [Salesforce Object Query Language (SOQL)](https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql.htm) query. For example: `select * from MyTable__c`. |No (if the **tableName** of the **dataset** is specified) | --#### Example ----```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "SalesforceToAzureBlob", - "description": "Copy from Salesforce to an Azure blob", - "type": "Copy", - "inputs": [{ - "name": "SalesforceInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "SELECT Id, Col_AutoNumber__c, Col_Checkbox__c, Col_Currency__c, Col_Date__c, Col_DateTime__c, Col_Email__c, Col_Number__c, Col_Percent__c, Col_Phone__c, Col_Picklist__c, Col_Picklist_MultiSelect__c, Col_Text__c, Col_Text_Area__c, Col_Text_AreaLong__c, Col_Text_AreaRich__c, Col_URL__c, Col_Text_Encrypt__c, Col_Lookup__c FROM AllDataType__c" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --> [!IMPORTANT] -> The "__c" part of the API Name is needed for any custom object. --For more information, see [Salesforce connector](data-factory-salesforce-connector.md#copy-activity-properties) article. --## Web Data --### Linked service -To define a Web linked service, set the **type** of the linked service to **Web**, and specify following properties in the **typeProperties** section: --| Property | Description | Required | -| | | | -| Url |URL to the Web source |Yes | -| authenticationType |Anonymous. |Yes | ---#### Example ---```json -{ - "name": "web", - "properties": { - "type": "Web", - "typeProperties": { - "authenticationType": "Anonymous", - "url": "https://en.wikipedia.org/wiki/" - } - } -} -``` --For more information, see [Web Table connector](data-factory-web-table-connector.md#linked-service-properties) article. --### Dataset -To define a Web dataset, set the **type** of the dataset to **WebTable**, and specify the following properties in the **typeProperties** section: --| Property | Description | Required | -|: |: |: | -| type |type of the dataset. must be set to **WebTable** |Yes | -| path |A relative URL to the resource that contains the table. |No. When path is not specified, only the URL specified in the linked service definition is used. | -| index |The index of the table in the resource. See Get index of a table in an HTML page section for steps to getting index of a table in an HTML page. 
|Yes | --#### Example --```json -{ - "name": "WebTableInput", - "properties": { - "type": "WebTable", - "linkedServiceName": "WebLinkedService", - "typeProperties": { - "index": 1, - "path": "AFI's_100_Years...100_Movies" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --For more information, see [Web Table connector](data-factory-web-table-connector.md#dataset-properties) article. --### Web Source in Copy Activity -If you are copying data from a web table, set the **source type** of the copy activity to **WebSource**. Currently, when the source in copy activity is of type **WebSource**, no additional properties are supported. --#### Example --```json -{ - "name": "SamplePipeline", - "properties": { - "start": "2016-06-01T18:00:00", - "end": "2016-06-01T19:00:00", - "description": "pipeline with copy activity", - "activities": [{ - "name": "WebTableToAzureBlob", - "description": "Copy from a Web table to an Azure blob", - "type": "Copy", - "inputs": [{ - "name": "WebTableInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "typeProperties": { - "source": { - "type": "WebSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - }] - } -} -``` --For more information, see [Web Table connector](data-factory-web-table-connector.md#copy-activity-properties) article. --## COMPUTE ENVIRONMENTS -The following table lists the compute environments supported by Data Factory and the transformation activities that can run on them. Click the link for the compute you are interested in to see the JSON schemas for linked service to link it to a data factory. --| Compute environment | Activities | -| | | -| [On-demand HDInsight cluster](#on-demand-azure-hdinsight-cluster) or [your own HDInsight cluster](#existing-azure-hdinsight-cluster) |[.NET custom activity](#net-custom-activity), [Hive activity](#hdinsight-hive-activity), [Pig activity](#hdinsight-pig-activity), [MapReduce activity](#hdinsight-mapreduce-activity), Hadoop streaming activity, [Spark activity](#hdinsight-spark-activity) | -| [Azure Batch](#azure-batch) |[.NET custom activity](#net-custom-activity) | -| [Machine Learning Studio (classic)](#ml-studio-classic) | [ML Studio (classic) Batch Execution Activity](#ml-studio-classic-batch-execution-activity), [ML Studio (classic) Update Resource Activity](#ml-studio-classic-update-resource-activity) | -| [Azure Data Lake Analytics](#azure-data-lake-analytics) |[Data Lake Analytics U-SQL](#data-lake-analytics-u-sql-activity) | -| [Azure SQL Database](#azure-sql-database), [Azure Synapse Analytics](#azure-synapse-analytics), [SQL Server](#sql-server-stored-procedure) |[Stored Procedure](#stored-procedure-activity) | --## On-demand Azure HDInsight cluster -The Azure Data Factory service can automatically create a Windows/Linux-based on-demand HDInsight cluster to process data. The cluster is created in the same region as the storage account (linkedServiceName property in the JSON) associated with the cluster. You can run the following transformation activities on this linked service: [.NET custom activity](#net-custom-activity), [Hive activity](#hdinsight-hive-activity), [Pig activity](#hdinsight-pig-activity), [MapReduce activity](#hdinsight-mapreduce-activity), Hadoop streaming activity, [Spark activity](#hdinsight-spark-activity). 
--### Linked service -The following table provides descriptions for the properties used in the Azure JSON definition of an on-demand HDInsight linked service. --| Property | Description | Required | -| | | | -| type |The type property should be set to **HDInsightOnDemand**. |Yes | -| clusterSize |Number of worker/data nodes in the cluster. The HDInsight cluster is created with 2 head nodes along with the number of worker nodes you specify for this property. The nodes are of size Standard_D3, which has 4 cores, so a 4 worker node cluster takes 24 cores (4\*4 = 16 cores for worker nodes, plus 2\*4 = 8 cores for head nodes). See [Create Linux-based Hadoop clusters in HDInsight](../../hdinsight/hdinsight-hadoop-provision-linux-clusters.md) for details about the Standard_D3 tier. |Yes | -| timetolive |The allowed idle time for the on-demand HDInsight cluster. Specifies how long the on-demand HDInsight cluster stays alive after completion of an activity run if there are no other active jobs in the cluster.<br/><br/>For example, if an activity run takes 6 minutes and timetolive is set to 5 minutes, the cluster stays alive for 5 minutes after the 6 minutes of processing the activity run. If another activity run is executed within this window, it is processed by the same cluster.<br/><br/>Creating an on-demand HDInsight cluster is an expensive operation (it can take a while), so use this setting as needed to improve the performance of a data factory by reusing an on-demand HDInsight cluster.<br/><br/>If you set the timetolive value to 0, the cluster is deleted as soon as the activity run is processed. On the other hand, if you set a high value, the cluster may stay idle unnecessarily, resulting in high costs. Therefore, it is important that you set the appropriate value based on your needs.<br/><br/>Multiple pipelines can share the same instance of the on-demand HDInsight cluster if the timetolive property value is appropriately set. |Yes | -| version |Version of the HDInsight cluster. For details, see [supported HDInsight versions in Azure Data Factory](data-factory-compute-linked-services.md#supported-hdinsight-versions-in-azure-data-factory). |No | -| linkedServiceName |Azure Storage linked service to be used by the on-demand cluster for storing and processing data. <p>Currently, you cannot create an on-demand HDInsight cluster that uses an Azure Data Lake Store as the storage. If you want to store the result data from HDInsight processing in an Azure Data Lake Store, use a Copy Activity to copy the data from the Azure Blob Storage to the Azure Data Lake Store.</p> | Yes | -| additionalLinkedServiceNames |Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf. |No | -| osType |Type of operating system. Allowed values are: Windows (default) and Linux. |No | -| hcatalogLinkedServiceName |The name of the Azure SQL linked service that points to the HCatalog database. The on-demand HDInsight cluster is created by using Azure SQL Database as the metastore. |No | --### JSON example -The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service automatically creates a **Linux-based** HDInsight cluster when processing a data slice. 
--```json -{ - "name": "HDInsightOnDemandLinkedService", - "properties": { - "type": "HDInsightOnDemand", - "typeProperties": { - "version": "3.5", - "clusterSize": 1, - "timeToLive": "00:05:00", - "osType": "Linux", - "linkedServiceName": "StorageLinkedService" - } - } -} -``` --For more information, see [Compute linked services](data-factory-compute-linked-services.md) article. --## Existing Azure HDInsight cluster -You can create an Azure HDInsight linked service to register your own HDInsight cluster with Data Factory. You can run the following data transformation activities on this linked service: [.NET custom activity](#net-custom-activity), [Hive activity](#hdinsight-hive-activity), [Pig activity](#hdinsight-pig-activity), [MapReduce activity](#hdinsight-mapreduce-activity), Hadoop streaming activity, [Spark activity](#hdinsight-spark-activity). --### Linked service -The following table provides descriptions for the properties used in the Azure JSON definition of an Azure HDInsight linked service. --| Property | Description | Required | -| | | | -| type |The type property should be set to **HDInsight**. |Yes | -| clusterUri |The URI of the HDInsight cluster. |Yes | -| username |Specify the name of the user to be used to connect to an existing HDInsight cluster. |Yes | -| password |Specify password for the user account. |Yes | -| linkedServiceName | Name of the Azure Storage linked service that refers to the Azure blob storage used by the HDInsight cluster. <p>Currently, you cannot specify an Azure Data Lake Store linked service for this property. You may access data in the Azure Data Lake Store from Hive/Pig scripts if the HDInsight cluster has access to the Data Lake Store. </p> |Yes | --For versions of HDInsight clusters supported, see [supported HDInsight versions](data-factory-compute-linked-services.md#supported-hdinsight-versions-in-azure-data-factory). --#### JSON example --```json -{ - "name": "HDInsightLinkedService", - "properties": { - "type": "HDInsight", - "typeProperties": { - "clusterUri": " https://<hdinsightclustername>.azurehdinsight.net/", - "userName": "admin", - "password": "<password>", - "linkedServiceName": "MyHDInsightStoragelinkedService" - } - } -} -``` --## Azure Batch -You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) with a data factory. You can run .NET custom activities using either Azure Batch or Azure HDInsight. You can run a [.NET custom activity](#net-custom-activity) on this linked service. --### Linked service -The following table provides descriptions for the properties used in the Azure JSON definition of an Azure Batch linked service. --| Property | Description | Required | -| | | | -| type |The type property should be set to **AzureBatch**. |Yes | -| accountName |Name of the Azure Batch account. |Yes | -| accessKey |Access key for the Azure Batch account. |Yes | -| poolName |Name of the pool of virtual machines. |Yes | -| linkedServiceName |Name of the Azure Storage linked service associated with this Azure Batch linked service. This linked service is used for staging files required to run the activity and storing the activity execution logs. 
|Yes | ---#### JSON example --```json -{ - "name": "AzureBatchLinkedService", - "properties": { - "type": "AzureBatch", - "typeProperties": { - "accountName": "<Azure Batch account name>", - "accessKey": "<Azure Batch account key>", - "poolName": "<Azure Batch pool name>", - "linkedServiceName": "<Specify associated storage linked service reference here>" - } - } -} -``` --## ML Studio (classic) ---You create an ML Studio (classic) linked service to register a Studio (classic) batch scoring endpoint with a data factory. Two data transformation activities that can run on this linked service: [ML Studio (classic) Batch Execution Activity](#ml-studio-classic-batch-execution-activity), [ML Studio (classic) Update Resource Activity](#ml-studio-classic-update-resource-activity). --### Linked service -The following table provides descriptions for the properties used in the Azure JSON definition of a Studio (classic) linked service. --| Property | Description | Required | -| | | | -| Type |The type property should be set to: **AzureML**. |Yes | -| mlEndpoint |The batch scoring URL. |Yes | -| apiKey |The published workspace model’s API. |Yes | --#### JSON example --```json -{ - "name": "AzureMLLinkedService", - "properties": { - "type": "AzureML", - "typeProperties": { - "mlEndpoint": "https://[batch scoring endpoint]/jobs", - "apiKey": "<apikey>" - } - } -} -``` --## Azure Data Lake Analytics -You create an **Azure Data Lake Analytics** linked service to link an Azure Data Lake Analytics compute service to an Azure data factory before using the [Data Lake Analytics U-SQL activity](data-factory-usql-activity.md) in a pipeline. --### Linked service --The following table provides descriptions for the properties used in the JSON definition of an Azure Data Lake Analytics linked service. --| Property | Description | Required | -| | | | -| Type |The type property should be set to: **AzureDataLakeAnalytics**. |Yes | -| accountName |Azure Data Lake Analytics Account Name. |Yes | -| dataLakeAnalyticsUri |Azure Data Lake Analytics URI. |No | -| authorization |Authorization code is automatically retrieved after clicking **Authorize** button in the Data Factory Editor and completing the OAuth login. |Yes | -| subscriptionId |Azure subscription ID |No (If not specified, subscription of the data factory is used). | -| resourceGroupName |Azure resource group name |No (If not specified, resource group of the data factory is used). | -| sessionId |session ID from the OAuth authorization session. Each session ID is unique and may only be used once. When you use the Data Factory Editor, this ID is auto-generated. |Yes | ---#### JSON example -The following example provides JSON definition for an Azure Data Lake Analytics linked service. --```json -{ - "name": "AzureDataLakeAnalyticsLinkedService", - "properties": { - "type": "AzureDataLakeAnalytics", - "typeProperties": { - "accountName": "<account name>", - "dataLakeAnalyticsUri": "datalakeanalyticscompute.net", - "authorization": "<authcode>", - "sessionId": "<session ID>", - "subscriptionId": "<subscription id>", - "resourceGroupName": "<resource group name>" - } - } -} -``` --## SQL Server Stored Procedure --You create a SQL Server linked service and use it with the [Stored Procedure Activity](data-factory-stored-proc-activity.md) to invoke a stored procedure from a Data Factory pipeline. --### Linked service -You create a linked service of type **OnPremisesSqlServer** to link a SQL Server database to a data factory. 
The following table provides descriptions of the JSON elements specific to the SQL Server linked service. --| Property | Description | Required | -| | | | -| type |The type property should be set to: **OnPremisesSqlServer**. |Yes | -| connectionString |Specify connectionString information needed to connect to the SQL Server database using either SQL authentication or Windows authentication. |Yes | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the SQL Server database. |Yes | -| username |Specify the user name if you are using Windows Authentication. Example: **domainname\\username**. |No | -| password |Specify the password for the user account you specified for the username. |No | --You can encrypt credentials using the **New-AzDataFactoryEncryptValue** cmdlet and use them in the connection string as shown in the following example (**EncryptedCredential** property): --```JSON -"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>", -``` ---#### Example: JSON for using SQL Authentication --```json -{ - "name": "MyOnPremisesSQLDB", - "properties": { - "type": "OnPremisesSqlServer", - "typeProperties": { - "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;User ID=<username>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` -#### Example: JSON for using Windows Authentication --If a user name and password are specified, the gateway uses them to impersonate the specified user account when connecting to the SQL Server database. Otherwise, the gateway connects to SQL Server directly by using the security context of the gateway (its startup account). --```json -{ - "Name": "MyOnPremisesSQLDB", - "Properties": { - "type": "OnPremisesSqlServer", - "typeProperties": { - "ConnectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=True;", - "username": "<domain\\username>", - "password": "<password>", - "gatewayName": "<gateway name>" - } - } -} -``` --For more information, see the [SQL Server connector](data-factory-sqlserver-connector.md#linked-service-properties) article. --## DATA TRANSFORMATION ACTIVITIES --Activity | Description | ---[HDInsight Hive activity](#hdinsight-hive-activity) | The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. -[HDInsight Pig activity](#hdinsight-pig-activity) | The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. -[HDInsight MapReduce Activity](#hdinsight-mapreduce-activity) | The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. -[HDInsight Streaming Activity](#hdinsight-streaming-activity) | The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. -[HDInsight Spark Activity](#hdinsight-spark-activity) | The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. 
-[ML Studio (classic) Batch Execution Activity](#ml-studio-classic-batch-execution-activity) | Azure Data Factory enables you to easily create pipelines that use a published Studio (classic) web service for predictive analytics. Using the Batch Execution Activity in an Azure Data Factory pipeline, you can invoke a Studio (classic) web service to make predictions on the data in batch. -[ML Studio (classic) Update Resource Activity](#ml-studio-classic-update-resource-activity) | Over time, the predictive models in the ML Studio (classic) scoring experiments need to be retrained using new input datasets. After you are done with retraining, you want to update the scoring web service with the retrained machine learning model. You can use the Update Resource Activity to update the web service with the newly trained model. -[Stored Procedure Activity](#stored-procedure-activity) | You can use the Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server Database in your enterprise or an Azure VM. -[Data Lake Analytics U-SQL activity](#data-lake-analytics-u-sql-activity) | Data Lake Analytics U-SQL Activity runs a U-SQL script on an Azure Data Lake Analytics cluster. -[.NET custom activity](#net-custom-activity) | If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. ---## HDInsight Hive Activity -You can specify the following properties in a Hive Activity JSON definition. The type property for the activity must be: **HDInsightHive**. You must create a HDInsight linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to HDInsightHive: --| Property | Description | Required | -| | | | -| script |Specify the Hive script inline |No | -| script path |Store the Hive script in an Azure blob storage and provide the path to the file. Use 'script' or 'scriptPath' property. Both cannot be used together. The file name is case-sensitive. |No | -| defines |Specify parameters as key/value pairs for referencing within the Hive script using 'hiveconf' |No | --These type properties are specific to the Hive Activity. Other properties (outside the typeProperties section) are supported for all activities. --### JSON example -The following JSON defines a HDInsight Hive activity in a pipeline. --```json -{ - "name": "Hive Activity", - "description": "description", - "type": "HDInsightHive", - "inputs": [ - { - "name": "input tables" - } - ], - "outputs": [ - { - "name": "output tables" - } - ], - "linkedServiceName": "MyHDInsightLinkedService", - "typeProperties": { - "script": "Hive script", - "scriptPath": "<pathtotheHivescriptfileinAzureblobstorage>", - "defines": { - "param1": "param1Value" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } -} -``` --For more information, see [Hive Activity](data-factory-hive-activity.md) article. --## HDInsight Pig Activity -You can specify the following properties in a Pig Activity JSON definition. The type property for the activity must be: **HDInsightPig**. 
You must create a HDInsight linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to HDInsightPig: --| Property | Description | Required | -| | | | -| script |Specify the Pig script inline |No | -| script path |Store the Pig script in an Azure blob storage and provide the path to the file. Use 'script' or 'scriptPath' property. Both cannot be used together. The file name is case-sensitive. |No | -| defines |Specify parameters as key/value pairs for referencing within the Pig script |No | --These type properties are specific to the Pig Activity. Other properties (outside the typeProperties section) are supported for all activities. --### JSON example --```json -{ - "name": "HiveActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "Pig Activity", - "description": "description", - "type": "HDInsightPig", - "inputs": [ - { - "name": "input tables" - } - ], - "outputs": [ - { - "name": "output tables" - } - ], - "linkedServiceName": "MyHDInsightLinkedService", - "typeProperties": { - "script": "Pig script", - "scriptPath": "<pathtothePigscriptfileinAzureblobstorage>", - "defines": { - "param1": "param1Value" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } - } - ] - } -} -``` --For more information, see Pig Activity article. --## HDInsight MapReduce Activity -You can specify the following properties in a MapReduce Activity JSON definition. The type property for the activity must be: **HDInsightMapReduce**. You must create a HDInsight linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to HDInsightMapReduce: --| Property | Description | Required | -| | | | -| jarLinkedService | Name of the linked service for the Azure Storage that contains the JAR file. | Yes | -| jarFilePath | Path to the JAR file in the Azure Storage. | Yes | -| className | Name of the main class in the JAR file. | Yes | -| arguments | A list of comma-separated arguments for the MapReduce program. At runtime, you see a few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments with the MapReduce arguments, consider using both option and value as arguments as shown in the following example (-s, --input, --output etc., are options immediately followed by their values) | No | --### JSON example --```json -{ - "name": "MahoutMapReduceSamplePipeline", - "properties": { - "description": "Sample Pipeline to Run a Mahout Custom Map Reduce Jar. 
This job calculates an Item Similarity Matrix to determine the similarity between two items", - "activities": [ - { - "type": "HDInsightMapReduce", - "typeProperties": { - "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob", - "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar", - "jarLinkedService": "StorageLinkedService", - "arguments": ["-s", "SIMILARITY_LOGLIKELIHOOD", "--input", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input", "--output", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/", "--maxSimilaritiesPerItem", "500", "--tempDir", "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout"] - }, - "inputs": [ - { - "name": "MahoutInput" - } - ], - "outputs": [ - { - "name": "MahoutOutput" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MahoutActivity", - "description": "Custom Map Reduce to generate Mahout result", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-01-03T00:00:00", - "end": "2017-01-04T00:00:00" - } -} -``` --For more information, see [MapReduce Activity](data-factory-map-reduce.md) article. --## HDInsight Streaming Activity -You can specify the following properties in a Hadoop Streaming Activity JSON definition. The type property for the activity must be: **HDInsightStreaming**. You must create a HDInsight linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to HDInsightStreaming: --| Property | Description | -| | | -| mapper | Name of the mapper executable. In the example, cat.exe is the mapper executable.| -| reducer | Name of the reducer executable. In the example, wc.exe is the reducer executable. | -| input | Input file (including location) for the mapper. In the example: `"wasb://adfsample@<account name>.blob.core.windows.net/example/data/gutenberg/davinci.txt"`: adfsample is the blob container, example/data/Gutenberg is the folder, and davinci.txt is the blob. | -| output | Output file (including location) for the reducer. The output of the Hadoop Streaming job is written to the location specified for this property. | -| filePaths | Paths for the mapper and reducer executables. In the example: "adfsample/example/apps/wc.exe", adfsample is the blob container, example/apps is the folder, and wc.exe is the executable. | -| fileLinkedService | Azure Storage linked service that represents the Azure storage that contains the files specified in the filePaths section. | -| arguments | A list of comma-separated arguments for the MapReduce program. At runtime, you see a few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. To differentiate your arguments with the MapReduce arguments, consider using both option and value as arguments as shown in the following example (-s, --input, --output etc., are options immediately followed by their values) | -| getDebugInfo | An optional element. When it is set to Failure, the logs are downloaded only on failure. When it is set to All, logs are always downloaded irrespective of the execution status. | --> [!NOTE] -> You must specify an output dataset for the Hadoop Streaming Activity for the **outputs** property. 
This dataset can be just a dummy dataset that is required to drive the pipeline schedule (hourly, daily, etc.). If the activity doesn't take an input, you can skip specifying an input dataset for the activity for the **inputs** property. --## JSON example --```json -{ - "name": "HadoopStreamingPipeline", - "properties": { - "description": "Hadoop Streaming Demo", - "activities": [ - { - "type": "HDInsightStreaming", - "typeProperties": { - "mapper": "cat.exe", - "reducer": "wc.exe", - "input": "wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/gutenberg/davinci.txt", - "output": "wasb://<nameofthecluster>@spestore.blob.core.windows.net/example/data/StreamingOutput/wc.txt", - "filePaths": ["<nameofthecluster>/example/apps/wc.exe","<nameofthecluster>/example/apps/cat.exe"], - "fileLinkedService": "StorageLinkedService", - "getDebugInfo": "Failure" - }, - "outputs": [ - { - "name": "StreamingOutputDataset" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "RunHadoopStreamingJob", - "description": "Run a Hadoop streaming job", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2014-01-04T00:00:00", - "end": "2014-01-05T00:00:00" - } -} -``` --For more information, see [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) article. --## HDInsight Spark Activity -You can specify the following properties in a Spark Activity JSON definition. The type property for the activity must be: **HDInsightSpark**. You must create a HDInsight linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to HDInsightSpark: --| Property | Description | Required | -| -- | -- | -- | -| rootPath | The Azure Blob container and folder that contains the Spark file. The file name is case-sensitive. | Yes | -| entryFilePath | Relative path to the root folder of the Spark code/package. | Yes | -| className | Application's Java/Spark main class | No | -| arguments | A list of command-line arguments to the Spark program. | No | -| proxyUser | The user account to impersonate to execute the Spark program | No | -| sparkConfig | Spark configuration properties. | No | -| getDebugInfo | Specifies when the Spark log files are copied to the Azure storage used by HDInsight cluster (or) specified by sparkJobLinkedService. Allowed values: None, Always, or Failure. Default value: None. | No | -| sparkJobLinkedService | The Azure Storage linked service that holds the Spark job file, dependencies, and logs. If you do not specify a value for this property, the storage associated with HDInsight cluster is used. | No | --### JSON example --```json -{ - "name": "SparkPipeline", - "properties": { - "activities": [ - { - "type": "HDInsightSpark", - "typeProperties": { - "rootPath": "adfspark\\pyFiles", - "entryFilePath": "test.py", - "getDebugInfo": "Always" - }, - "outputs": [ - { - "name": "OutputDataset" - } - ], - "name": "MySparkActivity", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-02-05T00:00:00", - "end": "2017-02-06T00:00:00" - } -} -``` -Note the following points: --- The **type** property is set to **HDInsightSpark**.-- The **rootPath** is set to **adfspark\\pyFiles** where adfspark is the Azure Blob container and pyFiles is fine folder in that container. 
In this example, the Azure Blob Storage is the one that is associated with the Spark cluster. You can upload the file to a different Azure Storage. If you do so, create an Azure Storage linked service to link that storage account to the data factory. Then, specify the name of the linked service as a value for the **sparkJobLinkedService** property. See Spark Activity properties for details about this property and other properties supported by the Spark Activity.-- The **entryFilePath** is set to the **test.py**, which is the Python file.-- The **getDebugInfo** property is set to **Always**, which means the log files are always generated (success or failure).-- > [!IMPORTANT] - > We recommend that you do not set this property to Always in a production environment unless you are troubleshooting an issue. -- The **outputs** section has one output dataset. You must specify an output dataset even if the spark program does not produce any output. The output dataset drives the schedule for the pipeline (hourly, daily, etc.).--For more information about the activity, see [Spark Activity](data-factory-spark.md) article. --## ML Studio (classic) Batch Execution Activity ---You can specify the following properties in an ML Studio (classic) Batch Execution Activity JSON definition. The type property for the activity must be: **AzureMLBatchExecution**. You must create a Studio (classic) linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to AzureMLBatchExecution: --Property | Description | Required | -- | ---webServiceInput | The dataset to be passed as an input for the Studio (classic) web service. This dataset must also be included in the inputs for the activity. |Use either webServiceInput or webServiceInputs. | -webServiceInputs | Specify datasets to be passed as inputs for the Studio (classic) web service. If the web service takes multiple inputs, use the webServiceInputs property instead of using the webServiceInput property. Datasets that are referenced by the **webServiceInputs** must also be included in the Activity **inputs**. | Use either webServiceInput or webServiceInputs. | -webServiceOutputs | The datasets that are assigned as outputs for the Studio (classic) web service. The web service returns output data in this dataset. | Yes | -globalParameters | Specify values for the web service parameters in this section. | No | --### JSON example -In this example, the activity has the dataset **MLSqlInput** as input and **MLSqlOutput** as the output. The **MLSqlInput** is passed as an input to the web service by using the **webServiceInput** JSON property. The **MLSqlOutput** is passed as an output to the Web service by using the **webServiceOutputs** JSON property. 
--```json -{ - "name": "MLWithSqlReaderSqlWriter", - "properties": { - "description": "Azure ML model with sql azure reader/writer", - "activities": [{ - "name": "MLSqlReaderSqlWriterActivity", - "type": "AzureMLBatchExecution", - "description": "test", - "inputs": [ { "name": "MLSqlInput" }], - "outputs": [ { "name": "MLSqlOutput" } ], - "linkedServiceName": "MLSqlReaderSqlWriterDecisionTreeModel", - "typeProperties": - { - "webServiceInput": "MLSqlInput", - "webServiceOutputs": { - "output1": "MLSqlOutput" - }, - "globalParameters": { - "Database server name": "<myserver>.database.windows.net", - "Database name": "<database>", - "Server user account name": "<user name>", - "Server user account password": "<password>" - } - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - }], - "start": "2016-02-13T00:00:00", - "end": "2016-02-14T00:00:00" - } -} -``` --In the JSON example, the deployed Studio (classic) Web service uses a reader and a writer module to read/write data from/to an Azure SQL Database. This Web service exposes the following four parameters: Database server name, Database name, Server user account name, and Server user account password. --> [!NOTE] -> Only inputs and outputs of the AzureMLBatchExecution activity can be passed as parameters to the Web service. For example, in the above JSON snippet, MLSqlInput is an input to the AzureMLBatchExecution activity, which is passed as an input to the Web service via webServiceInput parameter. --## ML Studio (classic) Update Resource Activity -You can specify the following properties in an ML Studio (classic) Update Resource Activity JSON definition. The type property for the activity must be: **AzureMLUpdateResource**. You must create a Studio (classic) linked service first and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to AzureMLUpdateResource: --Property | Description | Required | -- | ---trainedModelName | Name of the retrained model. | Yes | -trainedModelDatasetName | Dataset pointing to the iLearner file returned by the retraining operation. | Yes | --### JSON example -The pipeline has two activities: **AzureMLBatchExecution** and **AzureMLUpdateResource**. The Studio (classic) Batch Execution activity takes the training data as input and produces an iLearner file as an output. The activity invokes the training web service (training experiment exposed as a web service) with the input training data and receives the ilearner file from the webservice. The placeholderBlob is just a dummy output dataset that is required by the Azure Data Factory service to run the pipeline. 
---```json -{ - "name": "pipeline", - "properties": { - "activities": [ - { - "name": "retraining", - "type": "AzureMLBatchExecution", - "inputs": [ - { - "name": "trainingData" - } - ], - "outputs": [ - { - "name": "trainedModelBlob" - } - ], - "typeProperties": { - "webServiceInput": "trainingData", - "webServiceOutputs": { - "output1": "trainedModelBlob" - } - }, - "linkedServiceName": "trainingEndpoint", - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "02:00:00" - } - }, - { - "type": "AzureMLUpdateResource", - "typeProperties": { - "trainedModelName": "trained model", - "trainedModelDatasetName" : "trainedModelBlob" - }, - "inputs": [{ "name": "trainedModelBlob" }], - "outputs": [{ "name": "placeholderBlob" }], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "name": "AzureML Update Resource", - "linkedServiceName": "updatableScoringEndpoint2" - } - ], - "start": "2016-02-13T00:00:00", - "end": "2016-02-14T00:00:00" - } -} -``` --## Data Lake Analytics U-SQL Activity -You can specify the following properties in a U-SQL Activity JSON definition. The type property for the activity must be: **DataLakeAnalyticsU-SQL**. You must create an Azure Data Lake Analytics linked service and specify the name of it as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to DataLakeAnalyticsU-SQL: --| Property | Description | Required | -|: |: |: | -| scriptPath |Path to folder that contains the U-SQL script. Name of the file is case-sensitive. |No (if you use script) | -| scriptLinkedService |Linked service that links the storage that contains the script to the data factory |No (if you use script) | -| script |Specify inline script instead of specifying scriptPath and scriptLinkedService. For example: "script": "CREATE DATABASE test". |No (if you use scriptPath and scriptLinkedService) | -| degreeOfParallelism |The maximum number of nodes simultaneously used to run the job. |No | -| priority |Determines which jobs out of all that are queued should be selected to run first. The lower the number, the higher the priority. |No | -| parameters |Parameters for the U-SQL script |No | --### JSON example --```json -{ - "name": "ComputeEventsByRegionPipeline", - "properties": { - "description": "This pipeline computes events for en-gb locale and date less than Feb 19, 2012.", - "activities": - [ - { - "type": "DataLakeAnalyticsU-SQL", - "typeProperties": { - "scriptPath": "scripts\\kona\\SearchLogProcessing.txt", - "scriptLinkedService": "StorageLinkedService", - "degreeOfParallelism": 3, - "priority": 100, - "parameters": { - "in": "/datalake/input/SearchLog.tsv", - "out": "/datalake/output/Result.tsv" - } - }, - "inputs": [ - { - "name": "DataLakeTable" - } - ], - "outputs": - [ - { - "name": "EventsByRegionTable" - } - ], - "policy": { - "timeout": "06:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "EventsByRegion", - "linkedServiceName": "AzureDataLakeAnalyticsLinkedService" - } - ], - "start": "2015-08-08T00:00:00", - "end": "2015-08-08T01:00:00", - "isPaused": false - } -} -``` --For more information, see [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md). --## Stored Procedure Activity -You can specify the following properties in a Stored Procedure Activity JSON definition. 
The type property for the activity must be: **SqlServerStoredProcedure**. You must create a one of the following linked services and specify the name of the linked service as a value for the **linkedServiceName** property: --- SQL Server-- Azure SQL Database-- Azure Synapse Analytics--The following properties are supported in the **typeProperties** section when you set the type of activity to SqlServerStoredProcedure: --| Property | Description | Required | -| | | | -| storedProcedureName |Specify the name of the stored procedure in Azure SQL Database or Azure Synapse Analytics that is represented by the linked service that the output table uses. |Yes | -| storedProcedureParameters |Specify values for stored procedure parameters. If you need to pass null for a parameter, use the syntax: "param1": null (all lower case). See the following sample to learn about using this property. |No | --If you do specify an input dataset, it must be available (in ‘Ready’ status) for the stored procedure activity to run. The input dataset cannot be consumed in the stored procedure as a parameter. It is only used to check the dependency before starting the stored procedure activity. You must specify an output dataset for a stored procedure activity. --Output dataset specifies the **schedule** for the stored procedure activity (hourly, weekly, monthly, etc.). The output dataset must use a **linked service** that refers to an Azure SQL Database or an Azure Synapse Analytics or a SQL Server Database in which you want the stored procedure to run. The output dataset can serve as a way to pass the result of the stored procedure for subsequent processing by another activity ([chaining activities](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline)) in the pipeline. However, Data Factory does not automatically write the output of a stored procedure to this dataset. It is the stored procedure that writes to a SQL table that the output dataset points to. In some cases, the output dataset can be a **dummy dataset**, which is used only to specify the schedule for running the stored procedure activity. --### JSON example --```json -{ - "name": "SprocActivitySamplePipeline", - "properties": { - "activities": [ - { - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "usp_sample", - "storedProcedureParameters": { - "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)" - } - }, - "outputs": [{ "name": "sprocsampleout" }], - "name": "SprocActivitySample" - } - ], - "start": "2016-08-02T00:00:00", - "end": "2016-08-02T05:00:00", - "isPaused": false - } -} -``` --For more information, see [Stored Procedure Activity](data-factory-stored-proc-activity.md) article. --## .NET custom activity -You can specify the following properties in a .NET custom activity JSON definition. The type property for the activity must be: **DotNetActivity**. You must create an Azure HDInsight linked service or an Azure Batch linked service, and specify the name of the linked service as a value for the **linkedServiceName** property. The following properties are supported in the **typeProperties** section when you set the type of activity to DotNetActivity: --| Property | Description | Required | -|: |: |: | -| AssemblyName | Name of the assembly. In the example, it is: **MyDotnetActivity.dll**. | Yes | -| EntryPoint |Name of the class that implements the IDotNetActivity interface. 
In the example, it is: **MyDotNetActivityNS.MyDotNetActivity** where MyDotNetActivityNS is the namespace and MyDotNetActivity is the class. | Yes | -| PackageLinkedService | Name of the Azure Storage linked service that points to the blob storage that contains the custom activity zip file. In the example, it is: **AzureStorageLinkedService**.| Yes | -| PackageFile | Name of the zip file. In the example, it is: **customactivitycontainer/MyDotNetActivity.zip**. | Yes | -| extendedProperties | Extended properties that you can define and pass on to the .NET code. In this example, the **SliceStart** variable is set to a value based on the SliceStart system variable. | No | --### JSON example --```json -{ - "name": "ADFTutorialPipelineCustom", - "properties": { - "description": "Use custom activity", - "activities": [ - { - "Name": "MyDotNetActivity", - "Type": "DotNetActivity", - "Inputs": [ - { - "Name": "InputDataset" - } - ], - "Outputs": [ - { - "Name": "OutputDataset" - } - ], - "LinkedServiceName": "AzureBatchLinkedService", - "typeProperties": { - "AssemblyName": "MyDotNetActivity.dll", - "EntryPoint": "MyDotNetActivityNS.MyDotNetActivity", - "PackageLinkedService": "AzureStorageLinkedService", - "PackageFile": "customactivitycontainer/MyDotNetActivity.zip", - "extendedProperties": { - "SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))" - } - }, - "Policy": { - "Concurrency": 2, - "ExecutionPriorityOrder": "OldestFirst", - "Retry": 3, - "Timeout": "00:30:00", - "Delay": "00:00:00" - } - } - ], - "start": "2016-11-16T00:00:00", - "end": "2016-11-16T05:00:00", - "isPaused": false - } -} -``` --For detailed information, see [Use custom activities in Data Factory](data-factory-use-custom-activities.md) article. --## Next steps -See the following tutorials: --- [Tutorial: create a pipeline with a copy activity](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md)-- [Tutorial: create a pipeline with a hive activity](data-factory-build-your-first-pipeline.md) |
data-factory | Data Factory Load Sql Data Warehouse | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-load-sql-data-warehouse.md | - Title: Load terabytes of data into Azure Synapse Analytics -description: Demonstrates how 1 TB of data can be loaded into Azure Synapse Analytics in under 15 minutes with Azure Data Factory ---- Previously updated : 04/12/2023----# Load 1 TB into Azure Synapse Analytics in under 15 minutes with Data Factory -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Copy data to or from Azure Synapse Analytics by using Data Factory](../connector-azure-sql-data-warehouse.md). ---[Azure Synapse Analytics](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-overview-what-is.md) is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on massively parallel processing (MPP) architecture, Azure Synapse Analytics is optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to scale storage and compute independently. --Getting started with Azure Synapse Analytics is now easier than ever by using **Azure Data Factory**. Azure Data Factory is a fully managed, cloud-based data integration service that you can use to populate Azure Synapse Analytics with data from your existing system, saving you valuable time while you evaluate Azure Synapse Analytics and build your analytics solutions. Here are the key benefits of loading data into Azure Synapse Analytics using Azure Data Factory: --* **Easy to set up**: 5-step intuitive wizard with no scripting required. -* **Rich data store support**: built-in support for a rich set of on-premises and cloud-based data stores. -* **Secure and compliant**: data is transferred over HTTPS or ExpressRoute, and global service presence ensures that your data never leaves the geographical boundary. -* **Unparalleled performance by using PolyBase** - Using PolyBase is the most efficient way to move data into Azure Synapse Analytics. By using the staging blob feature, you can achieve high load speeds from all types of data stores in addition to Azure Blob storage, which PolyBase supports by default. --This article shows you how to use the Data Factory Copy Wizard to load 1 TB of data from Azure Blob Storage into Azure Synapse Analytics in under 15 minutes, at over 1.2 GBps throughput. --This article provides step-by-step instructions for moving data into Azure Synapse Analytics by using the Copy Wizard. --> [!NOTE] -> For general information about the capabilities of Data Factory for moving data to and from Azure Synapse Analytics, see the [Move data to and from Azure Synapse Analytics using Azure Data Factory](data-factory-azure-sql-data-warehouse-connector.md) article. -> -> You can also build pipelines by using Visual Studio, PowerShell, and so on. See [Tutorial: Copy data from Azure Blob to Azure SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for a quick walkthrough with step-by-step instructions for using the Copy Activity in Azure Data Factory. -> -> --## Prerequisites -* Azure Blob Storage: this experiment uses Azure Blob Storage (GRS) for storing the TPC-H testing dataset. If you do not have an Azure storage account, learn [how to create a storage account](../../storage/common/storage-account-create.md). -* [TPC-H](http://www.tpc.org/tpch/) data: we are going to use TPC-H as the testing dataset. 
To do that, you need to use `dbgen` from TPC-H toolkit, which helps you generate the dataset. You can either download source code for `dbgen` from [TPC Tools](http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) and compile it yourself, or download the compiled binary from [GitHub](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/TPCHTools). Run dbgen.exe with the following commands to generate 1 TB flat file for `lineitem` table spread across 10 files: -- * `Dbgen -s 1000 -S **1** -C 10 -T L -v` - * `Dbgen -s 1000 -S **2** -C 10 -T L -v` - * … - * `Dbgen -s 1000 -S **10** -C 10 -T L -v` -- Now copy the generated files to Azure Blob. Refer to [Move data to and from an on-premises file system by using Azure Data Factory](data-factory-onprem-file-system-connector.md) for how to do that using ADF Copy. -* Azure Synapse Analytics: this experiment loads data into Azure Synapse Analytics created with 6,000 DWUs -- Refer to [Create an Azure Synapse Analytics](../../synapse-analytics/sql-data-warehouse/create-data-warehouse-portal.md) for detailed instructions on how to create an Azure Synapse Analytics database. To get the best possible load performance into Azure Synapse Analytics using Polybase, we choose maximum number of Data Warehouse Units (DWUs) allowed in the Performance setting, which is 6,000 DWUs. -- > [!NOTE] - > When loading from Azure Blob, the data loading performance is directly proportional to the number of DWUs you configure for Azure Synapse Analytics: - > - > Loading 1 TB into 1,000 DWU Azure Synapse Analytics takes 87 minutes (~200 MBps throughput) - > Loading 1 TB into 2,000 DWU Azure Synapse Analytics takes 46 minutes (~380 MBps throughput) - > Loading 1 TB into 6,000 DWU Azure Synapse Analytics takes 14 minutes (~1.2 GBps throughput) - > - > -- To create a dedicated SQL pool with 6,000 DWUs, move the Performance slider all the way to the right: -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/performance-slider.png" alt-text="Performance slider"::: -- For an existing database that is not configured with 6,000 DWUs, you can scale it up using Azure portal. Navigate to the database in Azure portal, and there is a **Scale** button in the **Overview** panel shown in the following image: -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/scale-button.png" alt-text="Scale button"::: -- Click the **Scale** button to open the following panel, move the slider to the maximum value, and click **Save** button. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/scale-dialog.png" alt-text="Scale dialog"::: -- This experiment loads data into Azure Synapse Analytics using `xlargerc` resource class. -- To achieve best possible throughput, copy needs to be performed using an Azure Synapse Analytics user belonging to `xlargerc` resource class. Learn how to do that by following [Change a user resource class example](../../synapse-analytics/sql-data-warehouse/resource-classes-for-workload-management.md). 
-* Create destination table schema in Azure Synapse Analytics database, by running the following DDL statement: -- ```SQL - CREATE TABLE [dbo].[lineitem] - ( - [L_ORDERKEY] [bigint] NOT NULL, - [L_PARTKEY] [bigint] NOT NULL, - [L_SUPPKEY] [bigint] NOT NULL, - [L_LINENUMBER] [int] NOT NULL, - [L_QUANTITY] [decimal](15, 2) NULL, - [L_EXTENDEDPRICE] [decimal](15, 2) NULL, - [L_DISCOUNT] [decimal](15, 2) NULL, - [L_TAX] [decimal](15, 2) NULL, - [L_RETURNFLAG] [char](1) NULL, - [L_LINESTATUS] [char](1) NULL, - [L_SHIPDATE] [date] NULL, - [L_COMMITDATE] [date] NULL, - [L_RECEIPTDATE] [date] NULL, - [L_SHIPINSTRUCT] [char](25) NULL, - [L_SHIPMODE] [char](10) NULL, - [L_COMMENT] [varchar](44) NULL - ) - WITH - ( - DISTRIBUTION = ROUND_ROBIN, - CLUSTERED COLUMNSTORE INDEX - ) - ``` - With the prerequisite steps completed, we are now ready to configure the copy activity using the Copy Wizard. --## Launch Copy Wizard -1. Log in to the [Azure portal](https://portal.azure.com). -2. Click **Create a resource** from the top-left corner, click **Intelligence + analytics**, and click **Data Factory**. -3. In the **New data factory** pane: -- 1. Enter **LoadIntoSQLDWDataFactory** for the **name**. - The name of the Azure data factory must be globally unique. If you receive the error: **Data factory name "LoadIntoSQLDWDataFactory" is not available**, change the name of the data factory (for example, yournameLoadIntoSQLDWDataFactory) and try creating again. See [Data Factory - Naming Rules](data-factory-naming-rules.md) topic for naming rules for Data Factory artifacts. - 2. Select your Azure **subscription**. - 3. For Resource Group, do one of the following steps: - 1. Select **Use existing** to select an existing resource group. - 2. Select **Create new** to enter a name for a resource group. - 4. Select a **location** for the data factory. - 5. Select **Pin to dashboard** check box at the bottom of the blade. - 6. Click **Create**. -4. After the creation is complete, you see the **Data Factory** blade as shown in the following image: -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/data-factory-home-page-copy-data.png" alt-text="Data factory home page"::: -5. On the Data Factory home page, click the **Copy data** tile to launch **Copy Wizard**. -- > [!NOTE] - > If you see that the web browser is stuck at "Authorizing...", disable/uncheck **Block third party cookies and site data** setting (or) keep it enabled and create an exception for **login.microsoftonline.com** and then try launching the wizard again. - > - > --## Step 1: Configure data loading schedule -The first step is to configure the data loading schedule. --In the **Properties** page: --1. Enter **CopyFromBlobToAzureSqlDataWarehouse** for **Task name** -2. Select **Run once now** option. -3. Click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/copy-wizard-properties-page.png" alt-text="Copy Wizard - Properties page"::: --## Step 2: Configure source -This section shows you the steps to configure the source: Azure Blob containing the 1-TB TPC-H line item files. --1. Select the **Azure Blob Storage** as the data store and click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/select-source-connection.png" alt-text="Copy Wizard - Select source page"::: --2. Fill in the connection information for the Azure Blob storage account, and click **Next**. 
-- :::image type="content" source="media/data-factory-load-sql-data-warehouse/source-connection-info.png" alt-text="Copy Wizard - Source connection information"::: --3. Choose the **folder** containing the TPC-H line item files and click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/select-input-folder.png" alt-text="Copy Wizard - select input folder"::: --4. Upon clicking **Next**, the file format settings are detected automatically. Check to make sure that column delimiter is '|' instead of the default comma ','. Click **Next** after you have previewed the data. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/file-format-settings.png" alt-text="Copy Wizard - file format settings"::: --## Step 3: Configure destination -This section shows you how to configure the destination: `lineitem` table in the Azure Synapse Analytics database. --1. Choose **Azure Synapse Analytics** as the destination store and click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/select-destination-data-store.png" alt-text="Copy Wizard - select destination data store"::: --2. Fill in the connection information for Azure Synapse Analytics. Make sure you specify the user that is a member of the role `xlargerc` (see the **prerequisites** section for detailed instructions), and click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/destination-connection-info.png" alt-text="Copy Wizard - destination connection info"::: --3. Choose the destination table and click **Next**. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/table-mapping-page.png" alt-text="Copy Wizard - table mapping page"::: --4. In Schema mapping page, leave "Apply column mapping" option unchecked and click **Next**. --## Step 4: Performance settings --**Allow polybase** is checked by default. Click **Next**. ---## Step 5: Deploy and monitor load results -1. Click **Finish** button to deploy. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/summary-page.png" alt-text="Copy Wizard - summary page 1"::: --2. After the deployment is complete, click `Click here to monitor copy pipeline` to monitor the copy run progress. Select the copy pipeline you created in the **Activity Windows** list. -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/select-pipeline-monitor-manage-app.png" alt-text="Copy Wizard - summary page 2"::: -- You can view the copy run details in the **Activity Window Explorer** in the right panel, including the data volume read from source and written into destination, duration, and the average throughput for the run. -- As you can see from the following screenshot, copying 1 TB from Azure Blob Storage into Azure Synapse Analytics took 14 minutes, effectively achieving 1.22 GBps throughput! -- :::image type="content" source="media/data-factory-load-sql-data-warehouse/succeeded-info.png" alt-text="Copy Wizard - succeeded dialog"::: --## Best practices -Here are a few best practices for running your Azure Synapse Analytics database: --* Use a larger resource class when loading into a CLUSTERED COLUMNSTORE INDEX. -* For more efficient joins, consider using hash distribution by a select column instead of default round robin distribution. -* For faster load speeds, consider using heap for transient data. -* Create statistics after you finish loading to Azure Synapse Analytics. 
--See [Best practices for Azure Synapse Analytics](../../synapse-analytics/sql/best-practices-dedicated-sql-pool.md) for details. --## Next steps -* [Data Factory Copy Wizard](data-factory-copy-wizard.md) - This article provides details about the Copy Wizard. -* [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md) - This article contains the reference performance measurements and tuning guide. |
data-factory | Data Factory Map Columns | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-map-columns.md | - Title: Mapping dataset columns in Azure Data Factory -description: 'Learn how to map source columns to destination columns.' ----- Previously updated : 04/12/2023----# Map source dataset columns to destination dataset columns -> [!NOTE] -> This article applies to version 1 of Data Factory. --Column mapping can be used to specify how columns specified in the "structure" of source table map to columns specified in the "structure" of sink table. The **columnMapping** property is available in the **typeProperties** section of the Copy activity. --Column mapping supports the following scenarios: --* All columns in the source dataset structure are mapped to all columns in the sink dataset structure. -* A subset of the columns in the source dataset structure is mapped to all columns in the sink dataset structure. --The following are error conditions that result in an exception: --* Either fewer columns or more columns in the "structure" of sink table than specified in the mapping. -* Duplicate mapping. -* SQL query result does not have a column name that is specified in the mapping. --> [!NOTE] -> The following samples are for Azure SQL and Azure Blob but are applicable to any data store that supports rectangular datasets. Adjust dataset and linked service definitions in examples to point to data in the relevant data source. --## Sample 1 - column mapping from Azure SQL to Azure blob -In this sample, the input table has a structure and it points to a SQL table in Azure SQL Database. --```json -{ - "name": "AzureSQLInput", - "properties": { - "structure": - [ - { "name": "userid"}, - { "name": "name"}, - { "name": "group"} - ], - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --In this sample, the output table has a structure and it points to a blob in an Azure blob storage. --```json -{ - "name": "AzureBlobOutput", - "properties": - { - "structure": - [ - { "name": "myuserid"}, - { "name": "myname" }, - { "name": "mygroup"} - ], - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder", - "fileName":"myfile.csv", - "format": - { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": - { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --The following JSON defines a copy activity in a pipeline. The columns from source mapped to columns in sink (**columnMappings**) by using the **Translator** property. 
--```json -{ - "name": "CopyActivity", - "description": "description", - "type": "Copy", - "inputs": [ { "name": "AzureSQLInput" } ], - "outputs": [ { "name": "AzureBlobOutput" } ], - "typeProperties": { - "source": - { - "type": "SqlSource" - }, - "sink": - { - "type": "BlobSink" - }, - "translator": - { - "type": "TabularTranslator", - "ColumnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - } -} -``` -**Column mapping flow:** ---## Sample 2 - column mapping with SQL query from Azure SQL to Azure blob -In this sample, a SQL query is used to extract data from Azure SQL instead of simply specifying the table name and the column names in "structure" section. --```json -{ - "name": "CopyActivity", - "description": "description", - "type": "CopyActivity", - "inputs": [ { "name": " AzureSQLInput" } ], - "outputs": [ { "name": " AzureBlobOutput" } ], - "typeProperties": - { - "source": - { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE StartDateTime = \\'{0:yyyyMMdd-HH}\\'', WindowStart)" - }, - "sink": - { - "type": "BlobSink" - }, - "Translator": - { - "type": "TabularTranslator", - "ColumnMappings": "UserId: MyUserId, Group: MyGroup,Name: MyName" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - } -} -``` -In this case, the query results are first mapped to columns specified in "structure" of source. Next, the columns from source "structure" are mapped to columns in sink "structure" with rules specified in columnMappings. Suppose the query returns 5 columns, two more columns than those specified in the "structure" of source. --**Column mapping flow** ---## Next steps -See the article for a tutorial on using Copy Activity: --- [Copy data from Blob Storage to SQL Database](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) |
data-factory | Data Factory Map Reduce | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-map-reduce.md | - Title: Invoke MapReduce Program from Azure Data Factory -description: Learn how to process data by running MapReduce programs on an Azure HDInsight cluster from an Azure data factory. ------ Previously updated : 04/12/2023---# Invoke MapReduce Programs from Data Factory -> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [Azure Machine Learning Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [Azure Machine Learning Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using MapReduce activity in Data Factory](../transform-data-using-hadoop-map-reduce.md). ---The HDInsight MapReduce activity in a Data Factory [pipeline](data-factory-create-pipelines.md) executes MapReduce programs on [your own](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or [on-demand](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) Windows/Linux-based HDInsight cluster. This article builds on the [data transformation activities](data-factory-data-transformation-activities.md) article, which presents a general overview of data transformation and the supported transformation activities. --> [!NOTE] -> If you are new to Azure Data Factory, read through [Introduction to Azure Data Factory](data-factory-introduction.md) and do the tutorial: [Build your first data pipeline](data-factory-build-your-first-pipeline.md) before reading this article. --## Introduction -A pipeline in an Azure data factory processes data in linked storage services by using linked compute services. It contains a sequence of activities where each activity performs a specific processing operation. This article describes using the HDInsight MapReduce Activity. --See [Pig](data-factory-pig-activity.md) and [Hive](data-factory-hive-activity.md) for details about running Pig/Hive scripts on a Windows/Linux-based HDInsight cluster from a pipeline by using HDInsight Pig and Hive activities. --## JSON for HDInsight MapReduce Activity -In the JSON definition for the HDInsight Activity: --1. Set the **type** of the **activity** to **HDInsight**. -2. Specify the name of the class for **className** property. -3. Specify the path to the JAR file including the file name for **jarFilePath** property. -4. Specify the linked service that refers to the Azure Blob Storage that contains the JAR file for **jarLinkedService** property. -5. Specify any arguments for the MapReduce program in the **arguments** section. At runtime, you see a few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. 
To differentiate your arguments with the MapReduce arguments, consider using both option and value as arguments as shown in the following example (-s, --input, --output etc., are options immediately followed by their values). -- ```json - { - "name": "MahoutMapReduceSamplePipeline", - "properties": { - "description": "Sample Pipeline to Run a Mahout Custom Map Reduce Jar. This job calcuates an Item Similarity Matrix to determine the similarity between 2 items", - "activities": [ - { - "type": "HDInsightMapReduce", - "typeProperties": { - "className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob", - "jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar", - "jarLinkedService": "StorageLinkedService", - "arguments": [ - "-s", - "SIMILARITY_LOGLIKELIHOOD", - "--input", - "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/input", - "--output", - "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/output/", - "--maxSimilaritiesPerItem", - "500", - "--tempDir", - "wasb://adfsamples@spestore.blob.core.windows.net/Mahout/temp/mahout" - ] - }, - "inputs": [ - { - "name": "MahoutInput" - } - ], - "outputs": [ - { - "name": "MahoutOutput" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MahoutActivity", - "description": "Custom Map Reduce to generate Mahout result", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-01-03T00:00:00Z", - "end": "2017-01-04T00:00:00Z" - } - } - ``` -- You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In the following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout JAR file. --## Sample on GitHub -You can download a sample for using the HDInsight MapReduce Activity from: [Data Factory Samples on GitHub](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/JSON/MapReduce_Activity_Sample). --## Running the Word Count program -The pipeline in this example runs the Word Count Map/Reduce program on your Azure HDInsight cluster. --### Linked Services -First, you create a linked service to link the Azure Storage that is used by the Azure HDInsight cluster to the Azure data factory. If you copy/paste the following code, do not forget to replace **account name** and **account key** with the name and key of your Azure Storage. --#### Azure Storage linked service --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>" - } - } -} -``` --#### Azure HDInsight linked service -Next, you create a linked service to link your Azure HDInsight cluster to the Azure data factory. If you copy/paste the following code, replace **HDInsight cluster name** with the name of your HDInsight cluster, and change user name and password values. --```JSON -{ - "name": "HDInsightLinkedService", - "properties": { - "type": "HDInsight", - "typeProperties": { - "clusterUri": "https://<HDInsight cluster name>.azurehdinsight.net", - "userName": "admin", - "password": "**********", - "linkedServiceName": "StorageLinkedService" - } - } -} -``` --### Datasets -#### Output dataset -The pipeline in this example does not take any inputs. You specify an output dataset for the HDInsight MapReduce Activity. 
This dataset is just a dummy dataset that is required to drive the pipeline schedule. --```JSON -{ - "name": "MROutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "fileName": "WordCountOutput1.txt", - "folderPath": "example/data/", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --### Pipeline -The pipeline in this example has only one activity that is of type: HDInsightMapReduce. Some of the important properties in the JSON are: --| Property | Notes | -|: |: | -| type |The type must be set to **HDInsightMapReduce**. | -| className |Name of the class is: **wordcount** | -| jarFilePath |Path to the jar file containing the class. If you copy/paste the following code, don't forget to change the name of the cluster. | -| jarLinkedService |Azure Storage linked service that contains the jar file. This linked service refers to the storage that is associated with the HDInsight cluster. | -| arguments |The wordcount program takes two arguments, an input and an output. The input file is the davinci.txt file. | -| frequency/interval |The values for these properties match the output dataset. | -| linkedServiceName |refers to the HDInsight linked service you had created earlier. | --```JSON -{ - "name": "MRSamplePipeline", - "properties": { - "description": "Sample Pipeline to Run the Word Count Program", - "activities": [ - { - "type": "HDInsightMapReduce", - "typeProperties": { - "className": "wordcount", - "jarFilePath": "<HDInsight cluster name>/example/jars/hadoop-examples.jar", - "jarLinkedService": "StorageLinkedService", - "arguments": [ - "/example/data/gutenberg/davinci.txt", - "/example/data/WordCountOutput1" - ] - }, - "outputs": [ - { - "name": "MROutput" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "MRActivity", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2014-01-03T00:00:00Z", - "end": "2014-01-04T00:00:00Z" - } -} -``` --## Run Spark programs -You can use MapReduce activity to run Spark programs on your HDInsight Spark cluster. See [Invoke Spark programs from Azure Data Factory](data-factory-spark.md) for details. --[developer-reference]: /previous-versions/azure/dn834987(v=azure.100) -[cmdlet-reference]: /powershell/resourcemanager/Azurerm.DataFactories/v2.2.0/Azurerm.DataFactories ---[adfgetstarted]: data-factory-copy-data-from-azure-blob-storage-to-sql-database.md -[adfgetstartedmonitoring]:data-factory-copy-data-from-azure-blob-storage-to-sql-database.md#monitor-pipelines --[Developer Reference]: /previous-versions/azure/dn834987(v=azure.100) --## See Also -* [Hive Activity](data-factory-hive-activity.md) -* [Pig Activity](data-factory-pig-activity.md) -* [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -* [Invoke Spark programs](data-factory-spark.md) -* [Invoke R scripts](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample) |
data-factory | Data Factory Monitor Manage App | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-monitor-manage-app.md | - Title: Monitor and manage data pipelines - Azure -description: Learn how to use the Monitoring and Management app to monitor and manage Azure data factories and pipelines. ------ Previously updated : 04/12/2023---# Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management app -> [!div class="op_single_selector"] -> * [Using Azure portal/Azure PowerShell](data-factory-monitor-manage-pipelines.md) -> * [Using Monitoring and Management app](data-factory-monitor-manage-app.md) -> -> --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [monitor and manage Data Factory pipelines in](../monitor-visually.md). --This article describes how to use the Monitoring and Management app to monitor, manage, and debug your Data Factory pipelines. You can get started with using the application by watching the following video: --> [!NOTE] -> The user interface shown in the video may not exactly match what you see in the portal. It's slightly older, but concepts remain the same. ---## Launch the Monitoring and Management app -To launch the Monitor and Management app, click the **Monitor & Manage** tile on the **Data Factory** blade for your data factory. ---You should see the Monitoring and Management app open in a separate window. ---> [!NOTE] -> If you see that the web browser is stuck at "Authorizing...", clear the **Block third-party cookies and site data** check box--or keep it selected, create an exception for **login.microsoftonline.com**, and then try to open the app again. ---In the Activity Windows list in the middle pane, you see an activity window for each run of an activity. For example, if you have the activity scheduled to run hourly for five hours, you see five activity windows associated with five data slices. If you don't see activity windows in the list at the bottom, do the following: - -- Update the **start time** and **end time** filters at the top to match the start and end times of your pipeline, and then click the **Apply** button. -- The Activity Windows list is not automatically refreshed. Click the **Refresh** button on the toolbar in the **Activity Windows** list. --If you don't have a Data Factory application to test these steps with, do the tutorial: [copy data from Blob Storage to SQL Database using Data Factory](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --## Understand the Monitoring and Management app -There are three tabs on the left: **Resource Explorer**, **Monitoring Views**, and **Alerts**. The first tab (**Resource Explorer**) is selected by default. --### Resource Explorer -You see the following: --* The Resource Explorer **tree view** in the left pane. -* The **Diagram View** at the top in the middle pane. -* The **Activity Windows** list at the bottom in the middle pane. -* The **Properties**, **Activity Window Explorer**, and **Script** tabs in the right pane. --In Resource Explorer, you see all resources (pipelines, datasets, linked services) in the data factory in a tree view. When you select an object in Resource Explorer: --* The associated Data Factory entity is highlighted in the Diagram View. -* [Associated activity windows](data-factory-scheduling-and-execution.md) are highlighted in the Activity Windows list at the bottom. 
-* The properties of the selected object are shown in the Properties window in the right pane. -* The JSON definition of the selected object is shown, if applicable. For example: a linked service, a dataset, or a pipeline. ---See the [Scheduling and Execution](data-factory-scheduling-and-execution.md) article for detailed conceptual information about activity windows. --### Diagram View -The Diagram View of a data factory provides a single pane of glass to monitor and manage a data factory and its assets. When you select a Data Factory entity (dataset/pipeline) in the Diagram View: --* The data factory entity is selected in the tree view. -* The associated activity windows are highlighted in the Activity Windows list. -* The properties of the selected object are shown in the Properties window. --When the pipeline is enabled (not in a paused state), it's shown with a green line: ---You can pause, resume, or terminate a pipeline by selecting it in the diagram view and using the buttons on the command bar. -- -There are three command bar buttons for the pipeline in the Diagram View. You can use the second button to pause the pipeline. Pausing doesn't terminate the currently running activities and lets them proceed to completion. The third button pauses the pipeline and terminates its existing executing activities. The first button resumes the pipeline. When your pipeline is paused, the color of the pipeline changes. For example, a paused pipeline looks like in the following image: ---You can multi-select two or more pipelines by using the Ctrl key. You can use the command bar buttons to pause/resume multiple pipelines at a time. --You can also right-click a pipeline and select options to suspend, resume, or terminate a pipeline. ---Click the **Open pipeline** option to see all the activities in the pipeline. ---In the opened pipeline view, you see all activities in the pipeline. In this example, there is only one activity: Copy Activity. ---To go back to the previous view, click the data factory name in the breadcrumb menu at the top. --In the pipeline view, when you select an output dataset or when you move your mouse over the output dataset, you see the Activity Windows pop-up window for that dataset. ---You can click an activity window to see details for it in the **Properties** window in the right pane. ---In the right pane, switch to the **Activity Window Explorer** tab to see more details. ---You also see **resolved variables** for each run attempt for an activity in the **Attempts** section. ---Switch to the **Script** tab to see the JSON script definition for the selected object. ---You can see activity windows in three places: --* The Activity Windows pop-up in the Diagram View (middle pane). -* The Activity Window Explorer in the right pane. -* The Activity Windows list in the bottom pane. --In the Activity Windows pop-up and Activity Window Explorer, you can scroll to the previous week and the next week by using the left and right arrows. ---At the bottom of the Diagram View, you see these buttons: Zoom In, Zoom Out, Zoom to Fit, Zoom 100%, Lock layout. The **Lock layout** button prevents you from accidentally moving tables and pipelines in the Diagram View. It's on by default. You can turn it off and move entities around in the diagram. When you turn it off, you can use the last button to automatically position tables and pipelines. You can also zoom in or out by using the mouse wheel. 
---### Activity Windows list -The Activity Windows list at the bottom of the middle pane displays all activity windows for the dataset that you selected in the Resource Explorer or the Diagram View. By default, the list is in descending order, which means that you see the latest activity window at the top. ---This list doesn't refresh automatically, so use the refresh button on the toolbar to manually refresh it. --Activity windows can be in one of the following statuses: --<table> -<tr> - <th align="left">Status</th><th align="left">Substatus</th><th align="left">Description</th> -</tr> -<tr> - <td rowspan="8">Waiting</td><td>ScheduleTime</td><td>The time hasn't come for the activity window to run.</td> -</tr> -<tr> -<td>DatasetDependencies</td><td>The upstream dependencies aren't ready.</td> -</tr> -<tr> -<td>ComputeResources</td><td>The compute resources aren't available.</td> -</tr> -<tr> -<td>ConcurrencyLimit</td> <td>All the activity instances are busy running other activity windows.</td> -</tr> -<tr> -<td>ActivityResume</td><td>The activity is paused and can't run the activity windows until it's resumed.</td> -</tr> -<tr> -<td>Retry</td><td>The activity execution is being retried.</td> -</tr> -<tr> -<td>Validation</td><td>Validation hasn't started yet.</td> -</tr> -<tr> -<td>ValidationRetry</td><td>Validation is waiting to be retried.</td> -</tr> -<tr> -<tr> -<td rowspan="2">InProgress</td><td>Validating</td><td>Validation is in progress.</td> -</tr> -<td>-</td> -<td>The activity window is being processed.</td> -</tr> -<tr> -<td rowspan="4">Failed</td><td>TimedOut</td><td>The activity execution took longer than what is allowed by the activity.</td> -</tr> -<tr> -<td>Canceled</td><td>The activity window was canceled by user action.</td> -</tr> -<tr> -<td>Validation</td><td>Validation has failed.</td> -</tr> -<tr> -<td>-</td><td>The activity window failed to be generated or validated.</td> -</tr> -<td>Ready</td><td>-</td><td>The activity window is ready for consumption.</td> -</tr> -<tr> -<td>Skipped</td><td>-</td><td>The activity window wasn't processed.</td> -</tr> -<tr> -<td>None</td><td>-</td><td>An activity window used to exist with a different status, but has been reset.</td> -</tr> -</table> ---When you click an activity window in the list, you see details about it in the **Activity Windows Explorer** or the **Properties** window on the right. ---### Refresh activity windows -The details aren't automatically refreshed, so use the refresh button (the second button) on the command bar to manually refresh the activity windows list. --### Properties window -The Properties window is in the right-most pane of the Monitoring and Management app. ---It displays properties for the item that you selected in the Resource Explorer (tree view), Diagram View, or Activity Windows list. --### Activity Window Explorer -The **Activity Window Explorer** window is in the right-most pane of the Monitoring and Management app. It displays details about the activity window that you selected in the Activity Windows pop-up window or the Activity Windows list. ---You can switch to another activity window by clicking it in the calendar view at the top. You can also use the left arrow/right arrow buttons at the top to see activity windows from the previous week or the next week. --You can use the toolbar buttons in the bottom pane to rerun the activity window or refresh the details in the pane. 
--### Script -You can use the **Script** tab to view the JSON definition of the selected Data Factory entity (linked service, dataset, or pipeline). ---## Use system views -The Monitoring and Management app includes pre-built system views (**Recent activity windows**, **Failed activity windows**, **In-Progress activity windows**) that allow you to view recent/failed/in-progress activity windows for your data factory. --Switch to the **Monitoring Views** tab on the left by clicking it. ---Currently, there are three system views that are supported. Select an option to see recent activity windows, failed activity windows, or in-progress activity windows in the Activity Windows list (at the bottom of the middle pane). --When you select the **Recent activity windows** option, you see all recent activity windows in descending order of the **last attempt time**. --You can use the **Failed activity windows** view to see all failed activity windows in the list. Select a failed activity window in the list to see details about it in the **Properties** window or the **Activity Window Explorer**. You can also download any logs for a failed activity window. --## Sort and filter activity windows -Change the **start time** and **end time** settings in the command bar to filter activity windows. After you change the start time and end time, click the button next to the end time to refresh the Activity Windows list. ---> [!NOTE] -> Currently, all times are in UTC format in the Monitoring and Management app. -> -> --In the **Activity Windows list**, click the name of a column (for example: Status). ---You can do the following: --* Sort in ascending order. -* Sort in descending order. -* Filter by one or more values (Ready, Waiting, and so on). --When you specify a filter on a column, you see the filter button enabled for that column, which indicates that the values in the column are filtered values. ---You can use the same pop-up window to clear filters. To clear all filters for the Activity Windows list, click the clear filter button on the command bar. ---## Perform batch actions -### Rerun selected activity windows -Select an activity window, click the down arrow for the first command bar button, and select **Rerun** / **Rerun with upstream in pipeline**. When you select the **Rerun with upstream in pipeline** option, it reruns all upstream activity windows as well. - :::image type="content" source="./media/data-factory-monitor-manage-app/ReRunSlice.png" alt-text="Rerun an activity window"::: --You can also select multiple activity windows in the list and rerun them at the same time. You might want to filter activity windows based on the status (for example: **Failed**)--and then rerun the failed activity windows after correcting the issue that causes the activity windows to fail. See the following section for details about filtering activity windows in the list. --### Pause/resume multiple pipelines -You can multiselect two or more pipelines by using the Ctrl key. You can use the command bar buttons (which are highlighted in the red rectangle in the following image) to pause/resume them. - |
data-factory | Data Factory Monitor Manage Pipelines | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-monitor-manage-pipelines.md | - Title: Monitor and manage pipelines by using the Azure portal and PowerShell -description: Learn how to use the Azure portal and Azure PowerShell to monitor and manage the Azure data factories and pipelines that you have created. ------ Previously updated : 04/12/2023----# Monitor and manage Azure Data Factory pipelines by using the Azure portal and PowerShell -> [!div class="op_single_selector"] -> * [Using Azure portal/Azure PowerShell](data-factory-monitor-manage-pipelines.md) -> * [Using Monitoring and Management app](data-factory-monitor-manage-app.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [monitor and manage Data Factory pipelines in](../monitor-visually.md). --This article describes how to monitor, manage, and debug your pipelines by using Azure portal and PowerShell. --> [!IMPORTANT] -> The monitoring & management application provides a better support for monitoring and managing your data pipelines, and troubleshooting any issues. For details about using the application, see [monitor and manage Data Factory pipelines by using the Monitoring and Management app](data-factory-monitor-manage-app.md). --> [!IMPORTANT] -> Azure Data Factory version 1 now uses the new [Azure Monitor alerting infrastructure](../../azure-monitor/alerts/alerts-metric.md). The old alerting infrastructure is deprecated. As a result, your existing alerts configured for version 1 data factories no longer work. Your existing alerts for v1 data factories are not migrated automatically. You have to recreate these alerts on the new alerting infrastructure. Log in to the Azure portal and select **Monitor** to create new alerts on metrics (such as failed runs or successful runs) for your version 1 data factories. ---## Understand pipelines and activity states -By using the Azure portal, you can: --* View your data factory as a diagram. -* View activities in a pipeline. -* View input and output datasets. --This section also describes how a dataset slice transitions from one state to another state. --### Navigate to your data factory -1. Sign in to the [Azure portal](https://portal.azure.com). -2. Click **Data factories** on the menu on the left. If you don't see it, click **More services >**, and then click **Data factories** under the **INTELLIGENCE + ANALYTICS** category. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/browseall-data-factories.png" alt-text="Browse all > Data factories"::: -3. On the **Data factories** blade, select the data factory that you're interested in. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/select-data-factory.png" alt-text="Select data factory"::: -- You should see the home page for the data factory. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/data-factory-blade.png" alt-text="Data factory blade"::: --#### Diagram view of your data factory -The **Diagram** view of a data factory provides a single pane of glass to monitor and manage the data factory and its assets. To see the **Diagram** view of your data factory, click **Diagram** on the home page for the data factory. ---You can zoom in, zoom out, zoom to fit, zoom to 100%, lock the layout of the diagram, and automatically position pipelines and datasets. 
You can also see the data lineage information (that is, show upstream and downstream items of selected items). --### Activities inside a pipeline -1. Right-click the pipeline, and then click **Open pipeline** to see all activities in the pipeline, along with input and output datasets for the activities. This feature is useful when your pipeline includes more than one activity and you want to understand the operational lineage of a single pipeline. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/open-pipeline-menu.png" alt-text="Open pipeline menu"::: -2. In the following example, you see a copy activity in the pipeline with an input and an output. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/activities-inside-pipeline.png" alt-text="Activities inside a pipeline"::: -3. You can navigate back to the home page of the data factory by clicking the **Data factory** link in the breadcrumb at the top-left corner. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/navigate-back-to-data-factory.png" alt-text="Navigate back to data factory"::: --### View the state of each activity inside a pipeline -You can view the current state of an activity by viewing the status of any of the datasets that are produced by the activity. --By double-clicking the **OutputBlobTable** in the **Diagram**, you can see all the slices that are produced by different activity runs inside a pipeline. You can see that the copy activity ran successfully for the last eight hours and produced the slices in the **Ready** state. ---The dataset slices in the data factory can have one of the following statuses: --<table> -<tr> - <th align="left">State</th><th align="left">Substate</th><th align="left">Description</th> -</tr> -<tr> - <td rowspan="8">Waiting</td><td>ScheduleTime</td><td>The time hasn't come for the slice to run.</td> -</tr> -<tr> -<td>DatasetDependencies</td><td>The upstream dependencies aren't ready.</td> -</tr> -<tr> -<td>ComputeResources</td><td>The compute resources aren't available.</td> -</tr> -<tr> -<td>ConcurrencyLimit</td> <td>All the activity instances are busy running other slices.</td> -</tr> -<tr> -<td>ActivityResume</td><td>The activity is paused and can't run the slices until the activity is resumed.</td> -</tr> -<tr> -<td>Retry</td><td>Activity execution is being retried.</td> -</tr> -<tr> -<td>Validation</td><td>Validation hasn't started yet.</td> -</tr> -<tr> -<td>ValidationRetry</td><td>Validation is waiting to be retried.</td> -</tr> -<tr> -<tr> -<td rowspan="2">InProgress</td><td>Validating</td><td>Validation is in progress.</td> -</tr> -<td>-</td> -<td>The slice is being processed.</td> -</tr> -<tr> -<td rowspan="4">Failed</td><td>TimedOut</td><td>The activity execution took longer than what is allowed by the activity.</td> -</tr> -<tr> -<td>Canceled</td><td>The slice was canceled by user action.</td> -</tr> -<tr> -<td>Validation</td><td>Validation has failed.</td> -</tr> -<tr> -<td>-</td><td>The slice failed to be generated and/or validated.</td> -</tr> -<td>Ready</td><td>-</td><td>The slice is ready for consumption.</td> -</tr> -<tr> -<td>Skipped</td><td>None</td><td>The slice isn't being processed.</td> -</tr> -<tr> -<td>None</td><td>-</td><td>A slice used to exist with a different status, but it has been reset.</td> -</tr> -</table> ----You can view the details about a slice by clicking a slice entry on the **Recently Updated Slices** blade. 
---If the slice has been executed multiple times, you see multiple rows in the **Activity runs** list. You can view details about an activity run by clicking the run entry in the **Activity runs** list. The list shows all the log files, along with an error message if there is one. This feature is useful to view and debug logs without having to leave your data factory. ---If the slice isn't in the **Ready** state, you can see the upstream slices that aren't ready and are blocking the current slice from executing in the **Upstream slices that are not ready** list. This feature is useful when your slice is in **Waiting** state and you want to understand the upstream dependencies that the slice is waiting on. ---### Dataset state diagram -After you deploy a data factory and the pipelines have a valid active period, the dataset slices transition from one state to another. Currently, the slice status follows the following state diagram: ---The dataset state transition flow in data factory is the following: Waiting -> In-Progress/In-Progress (Validating) -> Ready/Failed. --The slice starts in a **Waiting** state, waiting for preconditions to be met before it executes. Then, the activity starts executing, and the slice goes into an **In-Progress** state. The activity execution might succeed or fail. The slice is marked as **Ready** or **Failed**, based on the result of the execution. --You can reset the slice to go back from the **Ready** or **Failed** state to the **Waiting** state. You can also mark the slice state to **Skip**, which prevents the activity from executing and not processing the slice. --## Pause and resume pipelines -You can manage your pipelines by using Azure PowerShell. For example, you can pause and resume pipelines by running Azure PowerShell cmdlets. --> [!NOTE] -> The diagram view does not support pausing and resuming pipelines. If you want to use a user interface, use the monitoring and managing application. For details about using the application, see [monitor and manage Data Factory pipelines by using the Monitoring and Management app](data-factory-monitor-manage-app.md) article. --You can pause/suspend pipelines by using the **Suspend-AzDataFactoryPipeline** PowerShell cmdlet. This cmdlet is useful when you don't want to run your pipelines until an issue is fixed. --```powershell -Suspend-AzDataFactoryPipeline [-ResourceGroupName] <String> [-DataFactoryName] <String> [-Name] <String> -``` -For example: --```powershell -Suspend-AzDataFactoryPipeline -ResourceGroupName ADF -DataFactoryName productrecgamalbox1dev -Name PartitionProductsUsagePipeline -``` --After the issue has been fixed with the pipeline, you can resume the suspended pipeline by running the following PowerShell command: --```powershell -Resume-AzDataFactoryPipeline [-ResourceGroupName] <String> [-DataFactoryName] <String> [-Name] <String> -``` -For example: --```powershell -Resume-AzDataFactoryPipeline -ResourceGroupName ADF -DataFactoryName productrecgamalbox1dev -Name PartitionProductsUsagePipeline -``` --## Debug pipelines -Azure Data Factory provides rich capabilities for you to debug and troubleshoot pipelines by using the Azure portal and Azure PowerShell. --> [!NOTE] -> It is much easier to troubleshot errors using the Monitoring & Management App. For details about using the application, see [monitor and manage Data Factory pipelines by using the Monitoring and Management app](data-factory-monitor-manage-app.md) article. 
--### Find errors in a pipeline -If the activity run fails in a pipeline, the dataset that is produced by the pipeline is in an error state because of the failure. You can debug and troubleshoot errors in Azure Data Factory by using the following methods. --#### Use the Azure portal to debug an error -1. On the **Table** blade, click the problem slice that has the **Status** set to **Failed**. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/table-blade-with-error.png" alt-text="Table blade with problem slice"::: -2. On the **Data slice** blade, click the activity run that failed. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/dataslice-with-error.png" alt-text="Data slice with an error"::: -3. On the **Activity run details** blade, you can download the files that are associated with the HDInsight processing. Click **Download** for Status/stderr to download the error log file that contains details about the error. -- :::image type="content" source="./media/data-factory-monitor-manage-pipelines/activity-run-details-with-error.png" alt-text="Activity run details blade with error"::: --#### Use PowerShell to debug an error --1. Launch **PowerShell**. -2. Run the **Get-AzDataFactorySlice** command to see the slices and their statuses. You should see a slice with the status of **Failed**. -- ```powershell - Get-AzDataFactorySlice [-ResourceGroupName] <String> [-DataFactoryName] <String> [-DatasetName] <String> [-StartDateTime] <DateTime> [[-EndDateTime] <DateTime> ] [-Profile <AzureProfile> ] [ <CommonParameters>] - ``` -- For example: -- ```powershell - Get-AzDataFactorySlice -ResourceGroupName ADF -DataFactoryName LogProcessingFactory -DatasetName EnrichedGameEventsTable -StartDateTime 2014-05-04 20:00:00 - ``` -- Replace **StartDateTime** with start time of your pipeline. --3. Now, run the **Get-AzDataFactoryRun** cmdlet to get details about the activity run for the slice. -- ```powershell - Get-AzDataFactoryRun [-ResourceGroupName] <String> [-DataFactoryName] <String> [-DatasetName] <String> [-StartDateTime] - <DateTime> [-Profile <AzureProfile> ] [ <CommonParameters>] - ``` -- For example: -- ```powershell - Get-AzDataFactoryRun -ResourceGroupName ADF -DataFactoryName LogProcessingFactory -DatasetName EnrichedGameEventsTable -StartDateTime "5/5/2014 12:00:00 AM" - ``` -- The value of StartDateTime is the start time for the error/problem slice that you noted from the previous step. The date-time should be enclosed in double quotes. --4. You should see output with details about the error that is similar to the following: -- ```output - Id : 841b77c9-d56c-48d1-99a3-8c16c3e77d39 - ResourceGroupName : ADF - DataFactoryName : LogProcessingFactory3 - DatasetName : EnrichedGameEventsTable - ProcessingStartTime : 10/10/2014 3:04:52 AM - ProcessingEndTime : 10/10/2014 3:06:49 AM - PercentComplete : 0 - DataSliceStart : 5/5/2014 12:00:00 AM - DataSliceEnd : 5/6/2014 12:00:00 AM - Status : FailedExecution - Timestamp : 10/10/2014 3:04:52 AM - RetryAttempt : 0 - Properties : {} - ErrorMessage : Pig script failed with exit code '5'. See wasb:// adfjobs@spestore.blob.core.windows.net/PigQuery - Jobs/841b77c9-d56c-48d1-99a3- - 8c16c3e77d39/10_10_2014_03_04_53_277/Status/stderr' for - more details. - ActivityName : PigEnrichLogs - PipelineName : EnrichGameLogsPipeline - Type : - ``` --5. 
You can run the **Save-AzDataFactoryLog** cmdlet with the Id value that you see from the output, and download the log files by using the **-DownloadLogsoption** for the cmdlet. -- ```powershell - Save-AzDataFactoryLog -ResourceGroupName "ADF" -DataFactoryName "LogProcessingFactory" -Id "841b77c9-d56c-48d1-99a3-8c16c3e77d39" -DownloadLogs -Output "C:\Test" - ``` --## Rerun failures in a pipeline --> [!IMPORTANT] -> It's easier to troubleshoot errors and rerun failed slices by using the Monitoring & Management App. For details about using the application, see [monitor and manage Data Factory pipelines by using the Monitoring and Management app](data-factory-monitor-manage-app.md). --### Use the Azure portal -After you troubleshoot and debug failures in a pipeline, you can rerun failures by navigating to the error slice and clicking the **Run** button on the command bar. ---In case the slice has failed validation because of a policy failure (for example, if data isn't available), you can fix the failure and validate again by clicking the **Validate** button on the command bar. ---### Use Azure PowerShell -You can rerun failures by using the **Set-AzDataFactorySliceStatus** cmdlet. See the [Set-AzDataFactorySliceStatus](/powershell/module/az.datafactory/set-azdatafactoryslicestatus) topic for syntax and other details about the cmdlet. --**Example:** --The following example sets the status of all slices for the table 'DAWikiAggregatedData' to 'Waiting' in the Azure data factory 'WikiADF'. --The 'UpdateType' is set to 'UpstreamInPipeline', which means that statuses of each slice for the table and all the dependent (upstream) tables are set to 'Waiting'. The other possible value for this parameter is 'Individual'. --```powershell -Set-AzDataFactorySliceStatus -ResourceGroupName ADF -DataFactoryName WikiADF -DatasetName DAWikiAggregatedData -Status Waiting -UpdateType UpstreamInPipeline -StartDateTime 2014-05-21T16:00:00 -EndDateTime 2014-05-21T20:00:00 -``` -## Create alerts in the Azure portal --1. Log in to the Azure portal and select **Monitor -> Alerts** to open the Alerts page. -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image1.png" alt-text="Open the Alerts page."::: --2. Select **+ New Alert rule** to create a new alert. -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image2.png" alt-text="Create a new alert"::: --3. Define the **Alert condition**. (Make sure to select **Data factories** in the **Filter by resource type** field.) You can also specify values for **Dimensions**. -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image3.png" alt-text="Define the Alert Condition - Select target"::: -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image4.png" alt-text="Define the Alert Condition - Add alert criteria"::: -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image5.png" alt-text="Define the Alert Condition - Add alert logic"::: --4. Define the **Alert details**. -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image6.png" alt-text="Define the Alert Details"::: --5. Define the **Action group**. 
-- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image7.png" alt-text="Define the Action Group - create a new Action group"::: -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image8.png" alt-text="Define the Action Group - set properties"::: -- :::image type="content" source="media/data-factory-monitor-manage-pipelines/v1alerts-image9.png" alt-text="Define the Action Group - new action group created"::: --## Move a data factory to a different resource group or subscription -You can move a data factory to a different resource group or a different subscription by using the **Move** command bar button on the home page of your data factory. ---You can also move any related resources (such as alerts that are associated with the data factory), along with the data factory. - |
data-factory | Data Factory Move Data Between Onprem And Cloud | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-move-data-between-onprem-and-cloud.md | - Title: Move data - Data Management Gateway -description: Set up a data gateway to move data between on-premises and the cloud. Use Data Management Gateway in Azure Data Factory to move your data. ----- Previously updated : 04/12/2023---# Move data between on-premises sources and the cloud with Data Management Gateway -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [copy data between on-premises and cloud using Data Factory](../tutorial-hybrid-copy-powershell.md). --This article provides an overview of data integration between on-premises data stores and cloud data stores using Data Factory. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article and other data factory core concepts articles: [datasets](data-factory-create-datasets.md) and [pipelines](data-factory-create-pipelines.md). --## Data Management Gateway -You must install Data Management Gateway on your on-premises machine to enable moving data to/from an on-premises data store. The gateway can be installed on the same machine as the data store or on a different machine as long as the gateway can connect to the data store. --> [!IMPORTANT] -> See [Data Management Gateway](data-factory-data-management-gateway.md) article for details about Data Management Gateway. --The following walkthrough shows you how to create a data factory with a pipeline that moves data from an on-premises **SQL Server** database to an Azure blob storage. As part of the walkthrough, you install and configure the Data Management Gateway on your machine. --## Walkthrough: copy on-premises data to cloud -In this walkthrough you do the following steps: --1. Create a data factory. -2. Create a data management gateway. -3. Create linked services for source and sink data stores. -4. Create datasets to represent input and output data. -5. Create a pipeline with a copy activity to move the data. --## Prerequisites for the tutorial -Before you begin this walkthrough, you must have the following prerequisites: --* **Azure subscription**. If you don't have a subscription, you can create a free trial account in just a couple of minutes. See the [Free Trial](https://azure.microsoft.com/pricing/free-trial/) article for details. -* **Azure Storage Account**. You use the blob storage as a **destination/sink** data store in this tutorial. if you don't have an Azure storage account, see the [Create a storage account](../../storage/common/storage-account-create.md) article for steps to create one. -* **SQL Server**. You use a SQL Server database as a **source** data store in this tutorial. --## Create data factory -In this step, you use the Azure portal to create an Azure Data Factory instance named **ADFTutorialOnPremDF**. --1. Log in to the [Azure portal](https://portal.azure.com). -2. Click **Create a resource**, click **Intelligence + analytics**, and click **Data Factory**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/NewDataFactoryMenu.png" alt-text="New->DataFactory"::: -3. In the **New data factory** page, enter **ADFTutorialOnPremDF** for the Name. 
-- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OnPremNewDataFactoryAddToStartboard.png" alt-text="Add to Startboard"::: -- > [!IMPORTANT] - > The name of the Azure data factory must be globally unique. If you receive the error: **Data factory name "ADFTutorialOnPremDF" is not available**, change the name of the data factory (for example, yournameADFTutorialOnPremDF) and try creating again. Use this name in place of ADFTutorialOnPremDF while performing remaining steps in this tutorial. - > - > The name of the data factory may be registered as a **DNS** name in the future and hence become publicly visible. - > - > -4. Select the **Azure subscription** where you want the data factory to be created. -5. Select existing **resource group** or create a resource group. For the tutorial, create a resource group named: **ADFTutorialResourceGroup**. -6. Click **Create** on the **New data factory** page. -- > [!IMPORTANT] - > To create Data Factory instances, you must be a member of the [Data Factory Contributor](../../role-based-access-control/built-in-roles.md#data-factory-contributor) role at the subscription/resource group level. - > - > -7. After creation is complete, you see the **Data Factory** page as shown in the following image: -- :::image type="content" source="./mediataFactoryHomePage.png" alt-text="Data Factory Home Page"::: --## Create gateway -1. In the **Data Factory** page, click **Author and deploy** tile to launch the **Editor** for the data factory. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/author-deploy-tile.png" alt-text="Author and Deploy Tile"::: -2. In the Data Factory Editor, click **... More** on the toolbar and then click **New data gateway**. Alternatively, you can right-click **Data Gateways** in the tree view, and click **New data gateway**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/NewDataGateway.png" alt-text="New data gateway on toolbar"::: -3. In the **Create** page, enter **adftutorialgateway** for the **name**, and click **OK**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OnPremCreateGatewayBlade.png" alt-text="Create Gateway page"::: -- > [!NOTE] - > In this walkthrough, you create the logical gateway with only one node (on-premises Windows machine). You can scale out a data management gateway by associating multiple on-premises machines with the gateway. You can scale up by increasing number of data movement jobs that can run concurrently on a node. This feature is also available for a logical gateway with a single node. See [Scaling data management gateway in Azure Data Factory](data-factory-data-management-gateway-high-availability-scalability.md) article for details. -4. In the **Configure** page, click **Install directly on this computer**. This action downloads the installation package for the gateway, installs, configures, and registers the gateway on the computer. -- > [!NOTE] - > Use Internet Explorer or a Microsoft ClickOnce compatible web browser. - > - > If you are using Chrome, go to the [Chrome web store](https://chrome.google.com/webstore/), search with "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it. - > - > Do the same for Firefox (install add-in). 
Click **Open Menu** button on the toolbar (**three horizontal lines** in the top-right corner), click **Add-ons**, search with "ClickOnce" keyword, choose one of the ClickOnce extensions, and install it. - > - > -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OnPremGatewayConfigureBlade.png" alt-text="Gateway - Configure page"::: -- This way is the easiest way (one-click) to download, install, configure, and register the gateway in one single step. You can see the **Microsoft Data Management Gateway Configuration Manager** application is installed on your computer. You can also find the executable **ConfigManager.exe** in the folder: **C:\Program Files\Microsoft Data Management Gateway\2.0\Shared**. -- You can also download and install gateway manually by using the links in this page and register it using the key shown in the **NEW KEY** text box. -- See [Data Management Gateway](data-factory-data-management-gateway.md) article for all the details about the gateway. -- > [!NOTE] - > You must be an administrator on the local computer to install and configure the Data Management Gateway successfully. You can add additional users to the **Data Management Gateway Users** local Windows group. The members of this group can use the Data Management Gateway Configuration Manager tool to configure the gateway. - > - > -5. Wait for a couple of minutes or wait until you see the following notification message: -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/gateway-install-success.png" alt-text="Gateway installation successful"::: -6. Launch **Data Management Gateway Configuration Manager** application on your computer. In the **Search** window, type **Data Management Gateway** to access this utility. You can also find the executable **ConfigManager.exe** in the folder: **C:\Program Files\Microsoft Data Management Gateway\2.0\Shared** -- :::image type="content" source="./mediMGConfigurationManager.png" alt-text="Gateway Configuration Manager"::: -7. Confirm that you see `adftutorialgateway is connected to the cloud service` message. The status bar the bottom displays **Connected to the cloud service** along with a **green check mark**. -- On the **Home** tab, you can also do the following operations: -- * **Register** a gateway with a key from the Azure portal by using the Register button. - * **Stop** the Data Management Gateway Host Service running on your gateway machine. - * **Schedule updates** to be installed at a specific time of the day. - * View when the gateway was **last updated**. - * Specify time at which an update to the gateway can be installed. -8. Switch to the **Settings** tab. The certificate specified in the **Certificate** section is used to encrypt/decrypt credentials for the on-premises data store that you specify on the portal. (optional) Click **Change** to use your own certificate instead. By default, the gateway uses the certificate that is auto-generated by the Data Factory service. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/gateway-certificate.png" alt-text="Gateway certificate configuration"::: -- You can also do the following actions on the **Settings** tab: -- * View or export the certificate being used by the gateway. - * Change the HTTPS endpoint used by the gateway. - * Set an HTTP proxy to be used by the gateway. -9. 
(optional) Switch to the **Diagnostics** tab and check the **Enable verbose logging** option if you want verbose logging that you can use to troubleshoot any issues with the gateway. The logging information can be found in **Event Viewer** under **Applications and Services Logs** -> **Data Management Gateway** node. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/diagnostics-tab.png" alt-text="Diagnostics tab"::: -- You can also perform the following actions in the **Diagnostics** tab: -- * Use the **Test Connection** section to test the connection to an on-premises data source through the gateway. - * Click **View Logs** to see the Data Management Gateway log in an Event Viewer window. - * Click **Send Logs** to upload a zip file with logs of the last seven days to Microsoft to facilitate troubleshooting of your issues. -10. On the **Diagnostics** tab, in the **Test Connection** section, select **SqlServer** for the type of the data store, enter the name of the database server and the name of the database, specify the authentication type, enter the user name and password, and click **Test** to test whether the gateway can connect to the database. -11. Switch to the web browser, and in the **Azure portal**, click **OK** on the **Configure** page and then on the **New data gateway** page. -12. You should see **adftutorialgateway** under **Data Gateways** in the tree view on the left. If you click it, you should see the associated JSON. --## Create linked services -In this step, you create two linked services: one for the on-premises SQL Server database (the source) and one for the Azure storage account (the destination/sink). --#### Add a linked service to a SQL Server database -1. In the **Data Factory Editor**, click **New data store** on the toolbar and select **SQL Server**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/NewSQLServer.png" alt-text="New SQL Server linked service"::: -2. In the **JSON editor** on the right, do the following steps: -- 1. For the **gatewayName**, specify **adftutorialgateway**. - 2. In the **connectionString**, do the following steps: -- 1. For **servername**, enter the name of the server that hosts the SQL Server database. - 2. For **databasename**, enter the name of the database. - 3. Click the **Encrypt** button on the toolbar. You see the Credentials Manager application. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/credentials-manager-application.png" alt-text="Credentials Manager application"::: - 4. In the **Setting Credentials** dialog box, specify the authentication type, user name, and password, and click **OK**. If the connection is successful, the encrypted credentials are stored in the JSON and the dialog box closes. - 5. Close the empty browser tab that launched the dialog box if it is not automatically closed and get back to the tab with the Azure portal. -- On the gateway machine, these credentials are **encrypted** by using a certificate that the Data Factory service owns. If you want to use the certificate that is associated with the Data Management Gateway instead, see Set credentials securely. - 3. Click **Deploy** on the command bar to deploy the SQL Server linked service. You should see the linked service in the tree view. A sketch of what the deployed JSON typically looks like appears at the end of this article. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/sql-linked-service-in-tree-view.png" alt-text="SQL Server linked service in the tree view"::: --#### Add a linked service for an Azure storage account -1. In the **Data Factory Editor**, click **New data store** on the command bar and click **Azure storage**. -2. 
Enter the name of your Azure storage account for the **Account name**. -3. Enter the key for your Azure storage account for the **Account key**. -4. Click **Deploy** to deploy the **AzureStorageLinkedService**. --## Create datasets -In this step, you create input and output datasets that represent input and output data for the copy operation (On-premises SQL Server database => Azure blob storage). Before creating datasets, do the following steps (detailed steps follows the list): --* Create a table named **emp** in the SQL Server Database you added as a linked service to the data factory and insert a couple of sample entries into the table. -* Create a blob container named **adftutorial** in the Azure blob storage account you added as a linked service to the data factory. --### Prepare On-premises SQL Server for the tutorial -1. In the database you specified for the SQL Server linked service (**SqlServerLinkedService**), use the following SQL script to create the **emp** table in the database. -- ```sql - CREATE TABLE dbo.emp - ( - ID int IDENTITY(1,1) NOT NULL, - FirstName varchar(50), - LastName varchar(50), - CONSTRAINT PK_emp PRIMARY KEY (ID) - ) - GO - ``` --2. Insert some sample into the table: -- ```sql - INSERT INTO emp VALUES ('John', 'Doe') - INSERT INTO emp VALUES ('Jane', 'Doe') - ``` --### Create input dataset --1. In the **Data Factory Editor**, click **... More**, click **New dataset** on the command bar, and click **SQL Server table**. --2. Replace the JSON in the right pane with the following text: -- ```json - { - "name": "EmpOnPremSQLTable", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlServerLinkedService", - "typeProperties": { - "tableName": "emp" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } - } - ``` -- Note the following points: -- * **type** is set to **SqlServerTable**. - * **tableName** is set to **emp**. - * **linkedServiceName** is set to **SqlServerLinkedService** (you had created this linked service earlier in this walkthrough.). - * For an input dataset that is not generated by another pipeline in Azure Data Factory, you must set **external** to **true**. It denotes the input data is produced external to the Azure Data Factory service. You can optionally specify any external data policies using the **externalData** element in the **Policy** section. -- See [Move data to/from SQL Server](data-factory-sqlserver-connector.md) for details about JSON properties. --3. Click **Deploy** on the command bar to deploy the dataset. --### Create output dataset --1. In the **Data Factory Editor**, click **New dataset** on the command bar, and click **Azure Blob storage**. -2. Replace the JSON in the right pane with the following text: -- ```json - { - "name": "OutputBlobTable", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "adftutorial/outfromonpremdf", - "format": { - "type": "TextFormat", - "columnDelimiter": "," - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- Note the following points: -- * **type** is set to **AzureBlob**. - * **linkedServiceName** is set to **AzureStorageLinkedService** (you had created this linked service in Step 2). 
- * **folderPath** is set to **adftutorial/outfromonpremdf** where outfromonpremdf is the folder in the adftutorial container. Create the **adftutorial** container if it does not already exist. - * The **availability** is set to **hourly** (**frequency** set to **hour** and **interval** set to **1**). The Data Factory service generates an output data slice every hour in the **emp** table in the Azure SQL Database. -- If you do not specify a **fileName** for an **output table**, the generated files in the **folderPath** are named in the following format: `Data.<Guid>.txt` (for example: : Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt.). -- To set **folderPath** and **fileName** dynamically based on the **SliceStart** time, use the partitionedBy property. In the following example, folderPath uses Year, Month, and Day from the SliceStart (start time of the slice being processed) and fileName uses Hour from the SliceStart. For example, if a slice is being produced for 2014-10-20T08:00:00, the folderName is set to wikidatagateway/wikisampledataout/2014/10/20 and the fileName is set to 08.csv. -- ```JSON - "folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", - "fileName": "{Hour}.csv", - "partitionedBy": - [ -- { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } - ], - ``` -- See [Move data to/from Azure Blob Storage](data-factory-azure-blob-connector.md) for details about JSON properties. -3. Click **Deploy** on the command bar to deploy the dataset. Confirm that you see both the datasets in the tree view. --## Create pipeline -In this step, you create a **pipeline** with one **Copy Activity** that uses **EmpOnPremSQLTable** as input and **OutputBlobTable** as output. --1. In Data Factory Editor, click **... More**, and click **New pipeline**. -2. Replace the JSON in the right pane with the following text: -- ```JSON - { - "name": "ADFTutorialPipelineOnPrem", - "properties": { - "description": "This pipeline has one Copy activity that copies data from an on premises SQL to Azure blob", - "activities": [ - { - "name": "CopyFromSQLtoBlob", - "description": "Copy data from on premises SQL server to blob", - "type": "Copy", - "inputs": [ - { - "name": "EmpOnPremSQLTable" - } - ], - "outputs": [ - { - "name": "OutputBlobTable" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "sqlReaderQuery": "select * from emp" - }, - "sink": { - "type": "BlobSink" - } - }, - "Policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "style": "StartOfInterval", - "retry": 0, - "timeout": "01:00:00" - } - } - ], - "start": "2016-07-05T00:00:00Z", - "end": "2016-07-06T00:00:00Z", - "isPaused": false - } - } - ``` -- > [!IMPORTANT] - > Replace the value of the **start** property with the current day and **end** value with the next day. -- Note the following points: -- * In the activities section, there is only activity whose **type** is set to **Copy**. - * **Input** for the activity is set to **EmpOnPremSQLTable** and **output** for the activity is set to **OutputBlobTable**. - * In the **typeProperties** section, **SqlSource** is specified as the **source type** and **BlobSink** is specified as the **sink type**. 
- * SQL query `select * from emp` is specified for the **sqlReaderQuery** property of **SqlSource**. -- Both start and end datetimes must be in [ISO format](https://en.wikipedia.org/wiki/ISO_8601). For example: 2014-10-14T16:32:41Z. The **end** time is optional, but we use it in this tutorial. -- If you do not specify value for the **end** property, it is calculated as "**start + 48 hours**". To run the pipeline indefinitely, specify **9/9/9999** as the value for the **end** property. -- You are defining the time duration in which the data slices are processed based on the **Availability** properties that were defined for each Azure Data Factory dataset. -- In the example, there are 24 data slices as each data slice is produced hourly. -3. Click **Deploy** on the command bar to deploy the dataset (table is a rectangular dataset). Confirm that the pipeline shows up in the tree view under **Pipelines** node. -4. Now, click **X** twice to close the page to get back to the **Data Factory** page for the **ADFTutorialOnPremDF**. --**Congratulations!** You have successfully created an Azure data factory, linked services, datasets, and a pipeline and scheduled the pipeline. --#### View the data factory in a Diagram View -1. In the **Azure portal**, click **Diagram** tile on the home page for the **ADFTutorialOnPremDF** data factory. : -- :::image type="content" source="./mediiagramLink.png" alt-text="Diagram Link"::: -2. You should see the diagram similar to the following image: -- :::image type="content" source="./mediiagramView.png" alt-text="Diagram View"::: -- You can zoom in, zoom out, zoom to 100%, zoom to fit, automatically position pipelines and datasets, and show lineage information (highlights upstream and downstream items of selected items). You can double-click an object (input/output dataset or pipeline) to see properties for it. --## Monitor pipeline -In this step, you use the Azure portal to monitor what's going on in an Azure data factory. You can also use PowerShell cmdlets to monitor datasets and pipelines. For details about monitoring, see [Monitor and Manage Pipelines](data-factory-monitor-manage-pipelines.md). --1. In the diagram, double-click **EmpOnPremSQLTable**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OnPremSQLTableSlicesBlade.png" alt-text="EmpOnPremSQLTable slices"::: -2. Notice that all the data slices up are in **Ready** state because the pipeline duration (start time to end time) is in the past. It is also because you have inserted the data in the SQL Server database and it is there all the time. Confirm that no slices show up in the **Problem slices** section at the bottom. To view all the slices, click **See More** at the bottom of the list of slices. -3. Now, In the **Datasets** page, click **OutputBlobTable**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OutputBlobTableSlicesBlade.png" alt-text="OputputBlobTable slices"::: -4. Click any data slice from the list and you should see the **Data Slice** page. You see activity runs for the slice. You see only one activity run usually. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/DataSlice.png" alt-text="Data Slice Blade"::: -- If the slice is not in the **Ready** state, you can see the upstream slices that are not Ready and are blocking the current slice from executing in the **Upstream slices that are not ready** list. -5. 
Click the **activity run** from the list at the bottom to see the **activity run details**. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/ActivityRunDetailsBlade.png" alt-text="Activity Run Details page"::: -- You see information such as throughput, duration, and the gateway used to transfer the data. -6. Click **X** to close all the pages until you get back to the home page for the **ADFTutorialOnPremDF**. -7. (optional) Click **Pipelines**, click **ADFTutorialOnPremDF**, and drill through input tables (**Consumed**) or output datasets (**Produced**). -8. Use tools such as [Microsoft Storage Explorer](https://storageexplorer.com/) to verify that a blob/file is created for each hour. -- :::image type="content" source="./media/data-factory-move-data-between-onprem-and-cloud/OnPremAzureStorageExplorer.png" alt-text="Azure Storage Explorer"::: --## Next steps -* See the [Data Management Gateway](data-factory-data-management-gateway.md) article for all the details about the Data Management Gateway. -* See [Copy data from Azure Blob to Azure SQL](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) to learn how to use Copy Activity to move data from a source data store to a sink data store. |
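For reference, the SQL Server linked service that the walkthrough asks you to edit in the JSON editor typically ends up looking similar to the following sketch. This is only an illustration, not output captured from the tutorial: the server, database, and credential values are placeholders, and the credential portion is shown in plain text even though the **Encrypt** step stores it in encrypted form. See [Move data to/from SQL Server](data-factory-sqlserver-connector.md) for the authoritative list of properties.

```json
{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
            "gatewayName": "adftutorialgateway"
        }
    }
}
```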
data-factory | Data Factory Naming Rules | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-naming-rules.md | - Title: Rules for naming Azure Data Factory entities - version 1 -description: Describes naming rules for Data Factory v1 entities. ------ Previously updated : 04/12/2023---# Rules for naming Azure Data Factory entities -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [naming rules in Data Factory](../naming-rules.md). --The following table provides naming rules for Data Factory artifacts. --| Name | Name Uniqueness | Validation Checks | -|: |: |: | -| Data Factory |Unique across Microsoft Azure. Names are case-insensitive, that is, `MyDF` and `mydf` refer to the same data factory. |<ul><li>Each data factory is tied to exactly one Azure subscription.</li><li>Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character.</li><li>Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container names.</li><li>Name can be 3-63 characters long.</li></ul> | -| Linked Services/Tables/Pipelines |Unique within a data factory. Names are case-insensitive. |<ul><li>Maximum number of characters in a table name: 260.</li><li>Object names must start with a letter, number, or an underscore (_).</li><li>The following characters are not allowed: ".", "+", "?", "/", "<", ">","*","%","&",":","\\"</li></ul> | -| Resource Group |Unique across Microsoft Azure. Names are case-insensitive. |<ul><li>Maximum number of characters: 1000.</li><li>Name can contain letters, digits, and the following characters: "-", "_", "," and "."</li></ul> | - |
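For example, under these rules `ADFTutorialOnPremDF1` and `adf-tutorial-df` are valid data factory names, while `adf--tutorial` (consecutive dashes) and `-adftutorial` (does not start with a letter or number) are not. A linked service, table, or pipeline can be named `SqlServerLinkedService_1`, but not `SqlServer:LinkedService`, because the colon character is not allowed.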
data-factory | Data Factory Odata Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-odata-connector.md | - Title: Move data from OData sources -description: Learn about how to move data from OData sources using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data From an OData source using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-odata-connector.md) -> * [Version 2 (current version)](../connector-odata.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [OData connector in V2](../connector-odata.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an OData source. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an OData source to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from an OData source to other data stores, but not for moving data from other data stores to an OData source. --## Supported versions and authentication types -This OData connector support OData version 3.0 and 4.0, and you can copy data from both cloud OData and on-premises OData sources. For the latter, you need to install the Data Management Gateway. See [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for details about Data Management Gateway. --Below authentication types are supported: --* To access **cloud** OData feed, you can use anonymous, basic (user name and password), or Microsoft Entra ID based OAuth authentication. -* To access **on-premises** OData feed, you can use anonymous, basic (user name and password), or Windows authentication. --## Getting started -You can create a pipeline with a copy activity that moves data from an OData source by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. 
When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an OData source, see [JSON example: Copy data from OData source to Azure Blob](#json-example-copy-data-from-odata-source-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to OData source: --## Linked Service properties -The following table provides description for JSON elements specific to OData linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OData** |Yes | -| url |Url of the OData service. |Yes | -| authenticationType |Type of authentication used to connect to the OData source. <br/><br/> For cloud OData, possible values are Anonymous, Basic, and OAuth (note Azure Data Factory currently only support Microsoft Entra ID based OAuth). <br/><br/> For on-premises OData, possible values are Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic authentication. |Yes (only if you are using Basic authentication) | -| password |Specify password for the user account you specified for the username. |Yes (only if you are using Basic authentication) | -| authorizedCredential |If you are using OAuth, click **Authorize** button in the Data Factory Copy Wizard or Editor and enter your credential, then the value of this property will be auto-generated. |Yes (only if you are using OAuth authentication) | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises OData service. Specify only if you are copying data from on premises OData source. |No | --### Using Basic authentication -```json -{ - "name": "inputLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "https://services.odata.org/OData/OData.svc", - "authenticationType": "Basic", - "username": "username", - "password": "password" - } - } -} -``` --### Using Anonymous authentication -```json -{ - "name": "ODataLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "https://services.odata.org/OData/OData.svc", - "authenticationType": "Anonymous" - } - } -} -``` --### Using Windows authentication accessing on-premises OData source -```json -{ - "name": "inputLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "<endpoint of on-premises OData source e.g. Dynamics CRM>", - "authenticationType": "Windows", - "username": "domain\\user", - "password": "password", - "gatewayName": "mygateway" - } - } -} -``` --### Using OAuth authentication accessing cloud OData source -```json -{ - "name": "inputLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "<endpoint of cloud OData source e.g. https://<tenant>.crm.dynamics.com/XRMServices/2011/OrganizationData.svc>", - "authenticationType": "OAuth", - "authorizedCredential": "<auto generated by clicking the Authorize button on UI>" - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). 
--The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **ODataResource** (which includes OData dataset) has the following properties --| Property | Description | Required | -| | | | -| path |Path to the OData resource |No | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Properties available in the typeProperties section of the activity on the other hand vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When source is of type **RelationalSource** (which includes OData) the following properties are available in typeProperties section: --| Property | Description | Example | Required | -| | | | | -| query |Use the custom query to read data. |"?$select=Name, Description&$top=5" |No | --## Type Mapping for OData -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach. --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data from OData, the following mappings are used from OData types to .NET type. --| OData Data Type | .NET Type | -| | | -| Edm.Binary |Byte[] | -| Edm.Boolean |Bool | -| Edm.Byte |Byte[] | -| Edm.DateTime |DateTime | -| Edm.Decimal |Decimal | -| Edm.Double |Double | -| Edm.Single |Single | -| Edm.Guid |Guid | -| Edm.Int16 |Int16 | -| Edm.Int32 |Int32 | -| Edm.Int64 |Int64 | -| Edm.SByte |Int16 | -| Edm.String |String | -| Edm.Time |TimeSpan | -| Edm.DateTimeOffset |DateTimeOffset | --> [!Note] -> OData complex data types e.g. Object are not supported. --## JSON example: Copy data from OData source to Azure Blob -This example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from an OData source to an Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. The sample has the following Data Factory entities: --1. A linked service of type [OData](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [ODataResource](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from querying against an OData source to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --**OData linked service:** -This example uses the Anonymous authentication. 
See [OData linked service](#linked-service-properties) section for different types of authentication you can use. --```json -{ - "name": "ODataLinkedService", - "properties": - { - "type": "OData", - "typeProperties": - { - "url": "https://services.odata.org/OData/OData.svc", - "authenticationType": "Anonymous" - } - } -} -``` --**Azure Storage linked service:** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**OData input dataset:** --Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "ODataDataset", - "properties": - { - "type": "ODataResource", - "typeProperties": - { - "path": "Products" - }, - "linkedServiceName": "ODataLinkedService", - "structure": [], - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } -} -``` --Specifying **path** in the dataset definition is optional. --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobODataDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/odata/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with OData source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the latest (newest) data from the OData source. 
--```json -{ - "name": "CopyODataToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "?$select=Name, Description&$top=5", - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "ODataDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobODataDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "ODataToBlob" - } - ], - "start": "2017-02-01T18:00:00Z", - "end": "2017-02-03T19:00:00Z" - } -} -``` --Specifying **query** in the pipeline definition is optional. The **URL** that the Data Factory service uses to retrieve data is: URL specified in the linked service (required) + path specified in the dataset (optional) + query in the pipeline (optional). --### Type mapping for OData -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data from OData data stores, OData data types are mapped to .NET types. --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
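As a worked example of the URL composition rule described in this article (linked service URL + optional dataset path + optional pipeline query): with the sample values used above, the request that the service issues is effectively `https://services.odata.org/OData/OData.svc/Products?$select=Name, Description&$top=5`, that is, the linked service URL `https://services.odata.org/OData/OData.svc` followed by the dataset path `Products` and the query `?$select=Name, Description&$top=5`.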
data-factory | Data Factory Odbc Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-odbc-connector.md | - Title: Move data from ODBC data stores -description: Learn about how to move data from ODBC data stores using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data From ODBC data stores using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-odbc-connector.md) -> * [Version 2 (current version)](../connector-odbc.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [ODBC connector in V2](../connector-odbc.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises ODBC data store. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an ODBC data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from an ODBC data store to other data stores, but not for moving data from other data stores to an ODBC data store. ---## Enabling connectivity -Data Factory service supports connecting to on-premises ODBC sources using the Data Management Gateway. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and step-by-step instructions on setting up the gateway. Use the gateway to connect to an ODBC data store even if it is hosted in an Azure IaaS VM. --You can install the gateway on the same on-premises machine or the Azure VM as the ODBC data store. However, we recommend that you install the gateway on a separate machine/Azure IaaS VM to avoid resource contention and for better performance. When you install the gateway on a separate machine, the machine should be able to access the machine with the ODBC data store. --Apart from the Data Management Gateway, you also need to install the ODBC driver for the data store on the gateway machine. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Getting started -You can create a pipeline with a copy activity that moves data from an ODBC data store by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. 
Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an ODBC data store, see [JSON example: Copy data from ODBC data store to Azure Blob](#json-example-copy-data-from-odbc-data-store-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to ODBC data store: --## Linked service properties -The following table provides description for JSON elements specific to ODBC linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesOdbc** |Yes | -| connectionString |The non-access credential portion of the connection string and an optional encrypted credential. See examples in the following sections. <br/><br/>You can specify the connection string with pattern like `"Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;"`, or use the system DSN (Data Source Name) you set up on the gateway machine with `"DSN=<name of the DSN>;"` (you need still specify the credential portion in linked service accordingly). |Yes | -| credential |The access credential portion of the connection string specified in driver-specific property-value format. Example: `"Uid=<user ID>;Pwd=<password>;RefreshToken=<secret refresh token>;"`. |No | -| authenticationType |Type of authentication used to connect to the ODBC data store. Possible values are: Anonymous and Basic. |Yes | -| userName |Specify the user name if you're using Basic authentication. |No | -| password |Specify the password for the user account that you specified for the userName. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the ODBC data store. |Yes | --### Using Basic authentication --```json -{ - "name": "odbc", - "properties": - { - "type": "OnPremisesOdbc", - "typeProperties": - { - "authenticationType": "Basic", - "connectionString": "Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;", - "userName": "username", - "password": "password", - "gatewayName": "mygateway" - } - } -} -``` -### Using Basic authentication with encrypted credentials -You can encrypt the credentials using the [New-AzDataFactoryEncryptValue](/powershell/module/az.datafactory/new-azdatafactoryencryptvalue) (1.0 version of Azure PowerShell) cmdlet or [New-AzureDataFactoryEncryptValue](/previous-versions/azure/dn834940(v=azure.100)) (0.9 or earlier version of the Azure PowerShell). 
--```json -{ - "name": "odbc", - "properties": - { - "type": "OnPremisesOdbc", - "typeProperties": - { - "authenticationType": "Basic", - "connectionString": "Driver={SQL Server};Server=myserver.database.windows.net; Database=TestDatabase;;EncryptedCredential=eyJDb25uZWN0...........................", - "gatewayName": "mygateway" - } - } -} -``` --### Using Anonymous authentication --```json -{ - "name": "odbc", - "properties": - { - "type": "OnPremisesOdbc", - "typeProperties": - { - "authenticationType": "Anonymous", - "connectionString": "Driver={SQL Server};Server={servername}.database.windows.net; Database=TestDatabase;", - "credential": "UID={uid};PWD={pwd}", - "gatewayName": "mygateway" - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **RelationalTable** (which includes ODBC dataset) has the following properties --| Property | Description | Required | -| | | | -| tableName |Name of the table in the ODBC data store. |Yes | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --Properties available in the **typeProperties** section of the activity on the other hand vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --In copy activity, when source is of type **RelationalSource** (which includes ODBC), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: select * from MyTable. |Yes | ---## JSON example: Copy data from ODBC data store to Azure Blob -This example provides JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). It shows how to copy data from an ODBC source to an Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesOdbc](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). 
--The sample copies data from a query result in an ODBC data store to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, set up the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**ODBC linked service** -This example uses the Basic authentication. See [ODBC linked service](#linked-service-properties) section for different types of authentication you can use. --```json -{ - "name": "OnPremOdbcLinkedService", - "properties": - { - "type": "OnPremisesOdbc", - "typeProperties": - { - "authenticationType": "Basic", - "connectionString": "Driver={SQL Server};Server=Server.database.windows.net; Database=TestDatabase;", - "userName": "username", - "password": "password", - "gatewayName": "mygateway" - } - } -} -``` --**Azure Storage linked service** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**ODBC input dataset** --The sample assumes you have created a table "MyTable" in an ODBC database and it contains a column called "timestampcolumn" for time series data. --Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "ODBCDataSet", - "properties": { - "published": false, - "type": "RelationalTable", - "linkedServiceName": "OnPremOdbcLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobOdbcDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/odbc/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with ODBC source (RelationalSource) and Blob sink (BlobSink)** --The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. 
The SQL query specified for the **query** property selects the data in the past hour to copy. --```json -{ - "name": "CopyODBCToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "OdbcDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobOdbcDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "OdbcToBlob" - } - ], - "start": "2016-06-01T18:00:00Z", - "end": "2016-06-01T19:00:00Z" - } -} -``` -### Type mapping for ODBC -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data from ODBC data stores, ODBC data types are mapped to .NET types as mentioned in the [ODBC Data Type Mappings](/dotnet/framework/data/adonet/odbc-data-type-mappings) topic. --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Troubleshoot connectivity issues -To troubleshoot connection issues, use the **Diagnostics** tab of **Data Management Gateway Configuration Manager**. --1. Launch **Data Management Gateway Configuration Manager**. You can either run "C:\Program Files\Microsoft Data Management Gateway\1.0\Shared\ConfigManager.exe" directly (or) search for **Gateway** to find a link to **Microsoft Data Management Gateway** application as shown in the following image. -- :::image type="content" source="./media/data-factory-odbc-connector/search-gateway.png" alt-text="Search gateway"::: -2. Switch to the **Diagnostics** tab. -- :::image type="content" source="./media/data-factory-odbc-connector/data-factory-gateway-diagnostics.png" alt-text="Gateway diagnostics"::: -3. Select the **type** of data store (linked service). -4. Specify **authentication** and enter **credentials** (or) enter **connection string** that is used to connect to the data store. -5. Click **Test connection** to test the connection to the data store. --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
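In addition to the driver-based connection strings shown above, the linked service properties table notes that the **connectionString** can instead reference a system DSN that you set up on the gateway machine, with the credential portion supplied separately. The following is a minimal sketch under that assumption; the linked service name, the DSN name `mySqlServerDsn`, the user name, and the password are placeholders.

```json
{
    "name": "OdbcLinkedServiceViaDsn",
    "properties": {
        "type": "OnPremisesOdbc",
        "typeProperties": {
            "authenticationType": "Basic",
            "connectionString": "DSN=mySqlServerDsn;",
            "userName": "username",
            "password": "password",
            "gatewayName": "mygateway"
        }
    }
}
```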
data-factory | Data Factory On Premises Mongodb Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-on-premises-mongodb-connector.md | - Title: Move data from MongoDB -description: Learn about how to move data from MongoDB database using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data From MongoDB using Azure Data Factory --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-on-premises-mongodb-connector.md) -> * [Version 2 (current version)](../connector-mongodb.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [MongoDB connector in V2](../connector-mongodb.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises MongoDB database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises MongoDB data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from a MongoDB data store to other data stores, but not for moving data from other data stores to an MongoDB datastore. --## Prerequisites -For the Azure Data Factory service to be able to connect to your on-premises MongoDB database, you must install the following components: --- Supported MongoDB versions are: 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6.-- Data Management Gateway on the same machine that hosts the database or on a separate machine to avoid competing for resources with the database. Data Management Gateway is a software that connects on-premises data sources to cloud services in a secure and managed way. See [Data Management Gateway](data-factory-data-management-gateway.md) article for details about Data Management Gateway. See [Move data from on-premises to cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions on setting up the gateway a data pipeline to move data.-- When you install the gateway, it automatically installs a Microsoft MongoDB ODBC driver used to connect to MongoDB. -- > [!NOTE] - > You need to use the gateway to connect to MongoDB even if it is hosted in Azure IaaS VMs. If you are trying to connect to an instance of MongoDB hosted in cloud, you can also install the gateway instance in the IaaS VM. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises MongoDB data store by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises MongoDB data store, see [JSON example: Copy data from MongoDB to Azure Blob](#json-example-copy-data-from-mongodb-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to MongoDB source: --## Linked service properties -The following table provides description for JSON elements specific to **OnPremisesMongoDB** linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesMongoDb** |Yes | -| server |IP address or host name of the MongoDB server. |Yes | -| port |TCP port that the MongoDB server uses to listen for client connections. |Optional, default value: 27017 | -| authenticationType |Basic, or Anonymous. |Yes | -| username |User account to access MongoDB. |Yes (if basic authentication is used). | -| password |Password for the user. |Yes (if basic authentication is used). | -| authSource |Name of the MongoDB database that you want to use to check your credentials for authentication. |Optional (if basic authentication is used). default: uses the admin account and the database specified using databaseName property. | -| databaseName |Name of the MongoDB database that you want to access. |Yes | -| gatewayName |Name of the gateway that accesses the data store. |Yes | -| encryptedCredential |Credential encrypted by gateway. |Optional | --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **MongoDbCollection** has the following properties: --| Property | Description | Required | -| | | | -| collectionName |Name of the collection in MongoDB database. |Yes | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Properties available in the **typeProperties** section of the activity on the other hand vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. 
--When the source is of type **MongoDbSource**, the following properties are available in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL-92 query string. For example: select * from MyTable. |No (if **collectionName** of **dataset** is specified) | ----## JSON example: Copy data from MongoDB to Azure Blob -This example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). It shows how to copy data from an on-premises MongoDB database to Azure Blob storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesMongoDb](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [MongoDbCollection](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [MongoDbSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a query result in a MongoDB database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, set up the Data Management Gateway as per the instructions in the [Data Management Gateway](data-factory-data-management-gateway.md) article. --**MongoDB linked service:** --```json -{ - "name": "OnPremisesMongoDbLinkedService", - "properties": - { - "type": "OnPremisesMongoDb", - "typeProperties": - { - "authenticationType": "<Basic or Anonymous>", - "server": "< The IP address or host name of the MongoDB server >", - "port": "<The number of the TCP port that the MongoDB server uses to listen for client connections.>", - "username": "<username>", - "password": "<password>", - "authSource": "< The database that you want to use to check your credentials for authentication. >", - "databaseName": "<database name>", - "gatewayName": "<mygateway>" - } - } -} -``` --**Azure Storage linked service:** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**MongoDB input dataset:** -Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "MongoDbInputDataset", - "properties": { - "type": "MongoDbCollection", - "linkedServiceName": "OnPremisesMongoDbLinkedService", - "typeProperties": { - "collectionName": "<Collection name>" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). 
The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobOutputDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/frommongodb/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with MongoDB source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use the above input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **MongoDbSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data in the past hour to copy. --```json -{ - "name": "CopyMongoDBToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "MongoDbSource", - "query": "$$Text.Format('select * from MyTable where LastModifiedDate >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}} AND LastModifiedDate < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "MongoDbInputDataset" - } - ], - "outputs": [ - { - "name": "AzureBlobOutputDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MongoDBToAzureBlob" - } - ], - "start": "2016-06-01T18:00:00Z", - "end": "2016-06-01T19:00:00Z" - } -} -``` ---## Schema by Data Factory -Azure Data Factory service infers schema from a MongoDB collection by using the latest 100 documents in the collection. If these 100 documents do not contain full schema, some columns may be ignored during the copy operation. --## Type mapping for MongoDB -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to MongoDB the following mappings are used from MongoDB types to .NET types. 
--| MongoDB type | .NET Framework type | -| | | -| Binary |Byte[] | -| Boolean |Boolean | -| Date |DateTime | -| NumberDouble |Double | -| NumberInt |Int32 | -| NumberLong |Int64 | -| ObjectID |String | -| String |String | -| UUID |Guid | -| Object |Renormalized into flattened columns with "_" as the nested separator | --> [!NOTE] -> To learn about support for arrays using virtual tables, refer to the [Support for complex types using virtual tables](#support-for-complex-types-using-virtual-tables) section below. --Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression, Symbol, Timestamp, Undefined --## Support for complex types using virtual tables -Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your MongoDB database. For complex types such as arrays or objects with different types across the documents, the driver re-normalizes data into corresponding virtual tables. Specifically, if a table contains such columns, the driver generates the following virtual tables: --* A **base table**, which contains the same data as the real table except for the complex type columns. The base table uses the same name as the real table that it represents. -* A **virtual table** for each complex type column, which expands the nested data. The virtual tables are named using the name of the real table, a separator "_", and the name of the array or object. --Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See the Example section below for details. You can access the content of MongoDB arrays by querying and joining the virtual tables; an illustrative join query is sketched at the end of this article. --You can use the [Copy Wizard](data-factory-data-movement-activities.md#create-a-pipeline-with-copy-activity) to intuitively view the list of tables in the MongoDB database including the virtual tables, and preview the data inside. You can also construct a query in the Copy Wizard and validate to see the result. --### Example -For example, "ExampleTable" below is a MongoDB table that has one column with an array of Objects in each cell - Invoices, and one column with an array of Scalar types - Ratings. --| _id | Customer Name | Invoices | Service Level | Ratings | -| | | | | | -| 1111 |ABC |[{invoice_id:"123", item:"toaster", price:"456", discount:"0.2"}, {invoice_id:"124", item:"oven", price: "1235", discount: "0.2"}] |Silver |[5,6] | -| 2222 |XYZ |[{invoice_id:"135", item:"fridge", price: "12543", discount: "0.0"}] |Gold |[1,2] | --The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base table named "ExampleTable", shown below. The base table contains all the data of the original table, but the data from the arrays has been omitted and is expanded in the virtual tables. --| _id | Customer Name | Service Level | -| | | | -| 1111 |ABC |Silver | -| 2222 |XYZ |Gold | --The following tables show the virtual tables that represent the original arrays in the example. 
These tables contain the following: --* A reference back to the original primary key column corresponding to the row of the original array (via the _id column) -* An indication of the position of the data within the original array -* The expanded data for each element within the array --Table "ExampleTable_Invoices": --| _id | ExampleTable_Invoices_dim1_idx | invoice_id | item | price | Discount | -| | | | | | | -| 1111 |0 |123 |toaster |456 |0.2 | -| 1111 |1 |124 |oven |1235 |0.2 | -| 2222 |0 |135 |fridge |12543 |0.0 | --Table "ExampleTable_Ratings": --| _id | ExampleTable_Ratings_dim1_idx | ExampleTable_Ratings | -| | | | -| 1111 |0 |5 | -| 1111 |1 |6 | -| 2222 |0 |1 | -| 2222 |1 |2 | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. --## Next steps -See [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions for creating a data pipeline that moves data from an on-premises data store to an Azure data store. |
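To make the MongoDB virtual-table example above concrete, the following fragment is a hedged sketch of a copy activity **source** whose SQL-92 query joins the base table with the "ExampleTable_Invoices" virtual table on the _id reference column. The table and column names come from the example in this article; the join itself is illustrative only, so validate it with the Copy Wizard query preview, because the exact SQL surface accepted by the Microsoft MongoDB ODBC driver may differ.

```json
{
    "source": {
        "type": "MongoDbSource",
        "query": "select e._id, e.\"Customer Name\", i.invoice_id, i.item, i.price from ExampleTable e join ExampleTable_Invoices i on e._id = i._id"
    }
}
```

The result pairs each invoice row with its customer, mirroring the "ExampleTable_Invoices" table shown above.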
data-factory | Data Factory Onprem Cassandra Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-cassandra-connector.md | - Title: Move data from Cassandra using Data Factory -description: Learn how to move data from an on-premises Cassandra database using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from an on-premises Cassandra database using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-cassandra-connector.md) -> * [Version 2 (current version)](../connector-cassandra.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Cassandra connector in V2](../connector-cassandra.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Cassandra database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises Cassandra data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from a Cassandra data store to other data stores, not moving data from other data stores to a Cassandra data store. --## Supported versions -The Cassandra connector supports the following versions of Cassandra: 2.x and 3.x. For activities running on the Self-hosted Integration Runtime, Cassandra 3.x is supported in IR version 3.7 and later. --## Prerequisites -For the Azure Data Factory service to be able to connect to your on-premises Cassandra database, you must install a Data Management Gateway on the same machine that hosts the database or on a separate machine to avoid competing for resources with the database. Data Management Gateway is a component that connects on-premises data sources to cloud services in a secure and managed way. See the [Data Management Gateway](data-factory-data-management-gateway.md) article for details about Data Management Gateway. See the [Move data from on-premises to cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions on setting up the gateway and a data pipeline to move data. --You must use the gateway to connect to a Cassandra database even if the database is hosted in the cloud, for example, on an Azure IaaS VM. You can have the gateway on the same VM that hosts the database or on a separate VM as long as the gateway can connect to the database. --When you install the gateway, it automatically installs a Microsoft Cassandra ODBC driver used to connect to the Cassandra database. Therefore, you don't need to manually install any driver on the gateway machine when copying data from the Cassandra database. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises Cassandra data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. 
See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard.-- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity.--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Cassandra data store, see [JSON example: Copy data from Cassandra to Azure Blob](#json-example-copy-data-from-cassandra-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Cassandra data store: --## Linked service properties -The following table provides description for JSON elements specific to Cassandra linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesCassandra** |Yes | -| host |One or more IP addresses or host names of Cassandra servers.<br/><br/>Specify a comma-separated list of IP addresses or host names to connect to all servers concurrently. |Yes | -| port |The TCP port that the Cassandra server uses to listen for client connections. |No, default value: 9042 | -| authenticationType |Basic, or Anonymous |Yes | -| username |Specify user name for the user account. |Yes, if authenticationType is set to Basic. | -| password |Specify password for the user account. |Yes, if authenticationType is set to Basic. | -| gatewayName |The name of the gateway that is used to connect to the on-premises Cassandra database. |Yes | -| encryptedCredential |Credential encrypted by the gateway. |No | -->[!NOTE] ->Currently connection to Cassandra using TLS is not supported. --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **CassandraTable** has the following properties --| Property | Description | Required | -| | | | -| keyspace |Name of the keyspace or schema in Cassandra database. |Yes (If **query** for **CassandraSource** is not defined). | -| tableName |Name of the table in Cassandra database. |Yes (If **query** for **CassandraSource** is not defined). 
| --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When source is of type **CassandraSource**, the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL-92 query or CQL query. See [CQL reference](https://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlReferenceTOC.html). <br/><br/>When using SQL query, specify **keyspace name.table name** to represent the table you want to query. |No (if tableName and keyspace on dataset are defined). | -| consistencyLevel |The consistency level specifies how many replicas must respond to a read request before returning data to the client application. Cassandra checks the specified number of replicas for data to satisfy the read request. |ONE, TWO, THREE, QUORUM, ALL, LOCAL_QUORUM, EACH_QUORUM, LOCAL_ONE. See [Configuring data consistency](https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html) for details. |No. Default value is ONE. | --## JSON example: Copy data from Cassandra to Azure Blob -This example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). It shows how to copy data from an on-premises Cassandra database to an Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --* A linked service of type [OnPremisesCassandra](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [CassandraTable](#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [CassandraSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --**Cassandra linked service:** --This example uses the **Cassandra** linked service. See [Cassandra linked service](#linked-service-properties) section for the properties supported by this linked service. 
--```json -{ - "name": "CassandraLinkedService", - "properties": - { - "type": "OnPremisesCassandra", - "typeProperties": - { - "authenticationType": "Basic", - "host": "mycassandraserver", - "port": 9042, - "username": "user", - "password": "password", - "gatewayName": "mygateway" - } - } -} -``` --**Azure Storage linked service:** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**Cassandra input dataset:** --```json -{ - "name": "CassandraInput", - "properties": { - "linkedServiceName": "CassandraLinkedService", - "type": "CassandraTable", - "typeProperties": { - "tableName": "mytable", - "keySpace": "mykeyspace" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --Setting **external** to **true** informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). --```json -{ - "name": "AzureBlobOutput", - "properties": - { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": - { - "folderPath": "adfgetstarted/fromcassandra" - }, - "availability": - { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Copy activity in a pipeline with Cassandra source and Blob sink:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **CassandraSource** and **sink** type is set to **BlobSink**. --See [RelationalSource type properties](#copy-activity-properties) for the list of properties supported by the RelationalSource. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2016-06-01T18:00:00", - "end":"2016-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "CassandraToAzureBlob", - "description": "Copy from Cassandra to an Azure blob", - "type": "Copy", - "inputs": [ - { - "name": "CassandraInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "CassandraSource", - "query": "select id, firstname, lastname from mykeyspace.mytable" -- }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --### Type mapping for Cassandra -| Cassandra Type | .NET Based Type | -| | | -| ASCII |String | -| BIGINT |Int64 | -| BLOB |Byte[] | -| BOOLEAN |Boolean | -| DECIMAL |Decimal | -| DOUBLE |Double | -| FLOAT |Single | -| INET |String | -| INT |Int32 | -| TEXT |String | -| TIMESTAMP |DateTime | -| TIMEUUID |Guid | -| UUID |Guid | -| VARCHAR |String | -| VARINT |Decimal | --> [!NOTE] -> For collection types (map, set, list, etc.), refer to [Work with Cassandra collection types using virtual table](#work-with-collections-using-virtual-table) section. -> -> User-defined types are not supported. -> -> The length of Binary Column and String Column lengths cannot be greater than 4000. 
-> -> --## Work with collections using virtual table -Azure Data Factory uses a built-in ODBC driver to connect to and copy data from your Cassandra database. For collection types including map, set, and list, the driver renormalizes the data into corresponding virtual tables. Specifically, if a table contains any collection columns, the driver generates the following virtual tables: --* A **base table**, which contains the same data as the real table except for the collection columns. The base table uses the same name as the real table that it represents. -* A **virtual table** for each collection column, which expands the nested data. The virtual tables that represent collections are named using the name of the real table, a separator "_vt_", and the name of the column. --Virtual tables refer to the data in the real table, enabling the driver to access the denormalized data. See the Example section for details. You can access the content of Cassandra collections by querying and joining the virtual tables; an illustrative source query is sketched at the end of this article. --You can use the [Copy Wizard](data-factory-data-movement-activities.md#create-a-pipeline-with-copy-activity) to intuitively view the list of tables in the Cassandra database including the virtual tables, and preview the data inside. You can also construct a query in the Copy Wizard and validate to see the result. --### Example -For example, the following "ExampleTable" is a Cassandra database table that contains an integer primary key column named "pk_int", a text column named "Value", a list column, a map column, and a set column (named "StringSet"). --| pk_int | Value | List | Map | StringSet | -| | | | | | -| 1 |"sample value 1" |["1", "2", "3"] |{"S1": "a", "S2": "b"} |{"A", "B", "C"} | -| 3 |"sample value 3" |["100", "101", "102", "105"] |{"S1": "t"} |{"A", "E"} | --The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual table row corresponds to. --The first virtual table is the base table, named "ExampleTable", shown in the following table. The base table contains the same data as the original database table except for the collections, which are omitted from this table and expanded in other virtual tables. --| pk_int | Value | -| | | -| 1 |"sample value 1" | -| 3 |"sample value 3" | --The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns. The columns with names that end with "_index" or "_key" indicate the position of the data within the original list or map. The columns with names that end with "_value" contain the expanded data from the collection. --#### Table "ExampleTable_vt_List": -| pk_int | List_index | List_value | -| | | | -| 1 |0 |1 | -| 1 |1 |2 | -| 1 |2 |3 | -| 3 |0 |100 | -| 3 |1 |101 | -| 3 |2 |102 | -| 3 |3 |105 | --#### Table "ExampleTable_vt_Map": -| pk_int | Map_key | Map_value | -| | | | -| 1 |S1 |a | -| 1 |S2 |b | -| 3 |S1 |t | --#### Table "ExampleTable_vt_StringSet": -| pk_int | StringSet_value | -| | | -| 1 |A | -| 1 |B | -| 1 |C | -| 3 |A | -| 3 |E | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. 
You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
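As a complement to the Cassandra sample above, here is a hedged sketch of a **CassandraSource** that reads the "ExampleTable_vt_List" virtual table described earlier and raises the consistency level from the default of ONE. It assumes, for illustration only, that the example table lives in the sample's "mykeyspace" keyspace; adjust the keyspace, table, and column names for your environment and validate the query in the Copy Wizard before relying on it.

```json
{
    "source": {
        "type": "CassandraSource",
        "query": "select pk_int, List_index, List_value from mykeyspace.ExampleTable_vt_List",
        "consistencyLevel": "LOCAL_QUORUM"
    }
}
```

LOCAL_QUORUM trades some read latency for stronger consistency within the local data center, which can matter when the copy runs against a multi-datacenter cluster.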
data-factory | Data Factory Onprem Db2 Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-db2-connector.md | - Title: Move data from DB2 by using Azure Data Factory -description: Learn how to move data from an on-premises DB2 database by using Azure Data Factory Copy Activity ---- Previously updated : 04/12/2023----# Move data from DB2 by using Azure Data Factory Copy Activity -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-db2-connector.md) -> * [Version 2 (current version)](../connector-db2.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [DB2 connector in V2](../connector-db2.md). ---This article describes how you can use Copy Activity in Azure Data Factory to copy data from an on-premises DB2 database to a data store. You can copy data to any store that is listed as a supported sink in the [Data Factory data movement activities](data-factory-data-movement-activities.md#supported-data-stores-and-formats) article. This topic builds on the Data Factory article, which presents an overview of data movement by using Copy Activity and lists the supported data store combinations. --Data Factory currently supports only moving data from a DB2 database to a [supported sink data store](data-factory-data-movement-activities.md#supported-data-stores-and-formats). Moving data from other data stores to a DB2 database is not supported. --## Prerequisites -Data Factory supports connecting to an on-premises DB2 database by using the [data management gateway](data-factory-data-management-gateway.md). For step-by-step instructions to set up the gateway data pipeline to move your data, see the [Move data from on-premises to cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --A gateway is required even if the DB2 is hosted on Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store. If the gateway can connect to the database, you can install the gateway on a different VM. --The data management gateway provides a built-in DB2 driver, so you don't need to manually install a driver to copy data from DB2. --> [!NOTE] -> For tips on troubleshooting connection and gateway issues, see the [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) article. ---## Supported versions -The Data Factory DB2 connector supports the following IBM DB2 platforms and versions with Distributed Relational Database Architecture (DRDA) SQL Access Manager versions 9, 10, and 11: --* IBM DB2 for z/OS version 11.1 -* IBM DB2 for z/OS version 10.1 -* IBM DB2 for i (AS400) version 7.2 -* IBM DB2 for i (AS400) version 7.1 -* IBM DB2 for Linux, UNIX, and Windows (LUW) version 11 -* IBM DB2 for LUW version 10.5 -* IBM DB2 for LUW version 10.1 --> [!TIP] -> If you receive the error message "The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805," the reason is a necessary package is not created for the normal user on the OS. To resolve this issue, follow these instructions for your DB2 server type: -> - DB2 for i (AS400): Let a power user create the collection for the normal user before running Copy Activity. 
To create the collection, use the command: `create collection <username>` -> - DB2 for z/OS or LUW: Use a high privilege account--a power user or admin that has package authorities and BIND, BINDADD, GRANT EXECUTE TO PUBLIC permissions--to run the copy once. The necessary package is automatically created during the copy. Afterward, you can switch back to the normal user for your subsequent copy runs. --## Getting started -You can create a pipeline with a copy activity to move data from an on-premises DB2 data store by using different tools and APIs: --- The easiest way to create a pipeline is to use the Azure Data Factory Copy Wizard. For a quick walkthrough on creating a pipeline by using the Copy Wizard, see the [Tutorial: Create a pipeline by using the Copy Wizard](data-factory-copy-data-wizard-tutorial.md). -- You can also use tools to create a pipeline, including Visual Studio, Azure PowerShell, an Azure Resource Manager template, the .NET API, and the REST API. For step-by-step instructions to create a pipeline with a copy activity, see the [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md). --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create linked services to link input and output data stores to your data factory. -2. Create datasets to represent input and output data for the copy operation. -3. Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the Copy Wizard, JSON definitions for the Data Factory linked services, datasets, and pipeline entities are automatically created for you. When you use tools or APIs (except the .NET API), you define the Data Factory entities by using the JSON format. The JSON example: Copy data from DB2 to Azure Blob storage shows the JSON definitions for the Data Factory entities that are used to copy data from an on-premises DB2 data store. --The following sections provide details about the JSON properties that are used to define the Data Factory entities that are specific to a DB2 data store. --## DB2 linked service properties -The following table lists the JSON properties that are specific to a DB2 linked service. --| Property | Description | Required | -| | | | -| **type** |This property must be set to **OnPremisesDb2**. |Yes | -| **server** |The name of the DB2 server. |Yes | -| **database** |The name of the DB2 database. |Yes | -| **schema** |The name of the schema in the DB2 database. This property is case-sensitive. |No | -| **authenticationType** |The type of authentication that is used to connect to the DB2 database. The possible values are: Anonymous, Basic, and Windows. |Yes | -| **username** |The name for the user account if you use Basic or Windows authentication. |No | -| **password** |The password for the user account. |No | -| **gatewayName** |The name of the gateway that the Data Factory service should use to connect to the on-premises DB2 database. |Yes | --## Dataset properties -For a list of the sections and properties that are available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections, such as **structure**, **availability**, and the **policy** for a dataset JSON, are similar for all dataset types (Azure SQL, Azure Blob storage, Azure Table storage, and so on). 
--The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for a dataset of type **RelationalTable**, which includes the DB2 dataset, has the following property: --| Property | Description | Required | -| | | | -| **tableName** |The name of the table in the DB2 database instance that the linked service refers to. This property is case-sensitive. |No (if the **query** property of a copy activity of type **RelationalSource** is specified) | --## Copy Activity properties -For a list of the sections and properties that are available for defining copy activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Copy Activity properties, such as **name**, **description**, **inputs** table, **outputs** table, and **policy**, are available for all types of activities. The properties that are available in the **typeProperties** section of the activity vary for each activity type. For Copy Activity, the properties vary depending on the types of data sources and sinks. --For Copy Activity, when the source is of type **RelationalSource** (which includes DB2), the following properties are available in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| **query** |Use the custom query to read the data. |SQL query string. For example: `"query": "select * from "MySchema"."MyTable""` |No (if the **tableName** property of a dataset is specified) | --> [!NOTE] -> Schema and table names are case-sensitive. In the query statement, enclose property names by using "" (double quotes). --## JSON example: Copy data from DB2 to Azure Blob storage -This example provides sample JSON definitions that you can use to create a pipeline by using the [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md), or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). The example shows you how to copy data from a DB2 database to Blob storage. However, data can be copied to [any supported data store sink type](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Azure Data Factory Copy Activity. --The sample has the following Data Factory entities: --- A DB2 linked service of type [OnPremisesDb2](data-factory-onprem-db2-connector.md)-- An Azure Blob storage linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties)-- An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](data-factory-onprem-db2-connector.md#dataset-properties)-- An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties)-- A [pipeline](data-factory-create-pipelines.md) with a copy activity that uses the [RelationalSource](data-factory-onprem-db2-connector.md#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) properties--The sample copies data from a query result in a DB2 database to an Azure blob hourly. The JSON properties that are used in the sample are described in the sections that follow the entity definitions. --As a first step, install and configure a data gateway. Instructions are in the [Moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. 
--**DB2 linked service** --```json -{ - "name": "OnPremDb2LinkedService", - "properties": { - "type": "OnPremisesDb2", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` --**Azure Blob storage linked service** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorageLinkedService", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>" - } - } -} -``` --**DB2 input dataset** --The sample assumes that you have created a table in DB2 named "MyTable" that has a column labeled "timestamp" for the time series data. --The **external** property is set to "true." This setting informs the Data Factory service that this dataset is external to the data factory and is not produced by an activity in the data factory. Notice that the **type** property is set to **RelationalTable**. ---```json -{ - "name": "Db2DataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremDb2LinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset** --Data is written to a new blob every hour by setting the **frequency** property to "Hour" and the **interval** property to 1. The **folderPath** property for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time. --```json -{ - "name": "AzureBlobDb2DataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/db2/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline for the copy activity** --The pipeline contains a copy activity that is configured to use specified input and output datasets and which is scheduled to run every hour. In the JSON definition for the pipeline, the **source** type is set to **RelationalSource** and the **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data from the "Orders" table. 
--```json -{ - "name": "CopyDb2ToBlob", - "properties": { - "description": "pipeline for the copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from \"Orders\"" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [ - { - "name": "Db2DataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobDb2DataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "Db2ToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } -} -``` --## Type mapping for DB2 -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy Activity performs automatic type conversions from source type to sink type by using the following two-step approach: --1. Convert from a native source type to a .NET type -2. Convert from a .NET type to a native sink type --The following mappings are used when Copy Activity converts the data from a DB2 type to a .NET type: --| DB2 database type | .NET Framework type | -| | | -| SmallInt |Int16 | -| Integer |Int32 | -| BigInt |Int64 | -| Real |Single | -| Double |Double | -| Float |Double | -| Decimal |Decimal | -| DecimalFloat |Decimal | -| Numeric |Decimal | -| Date |DateTime | -| Time |TimeSpan | -| Timestamp |DateTime | -| Xml |Byte[] | -| Char |String | -| VarChar |String | -| LongVarChar |String | -| DB2DynArray |String | -| Binary |Byte[] | -| VarBinary |Byte[] | -| LongVarBinary |Byte[] | -| Graphic |String | -| VarGraphic |String | -| LongVarGraphic |String | -| Clob |String | -| Blob |Byte[] | -| DbClob |String | --## Map source to sink columns -To learn how to map columns in the source dataset to columns in the sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable reads from relational sources -When you copy data from a relational data store, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure the retry **policy** property for a dataset to rerun a slice when a failure occurs. Make sure that the same data is read no matter how many times the slice is rerun, and regardless of how you rerun the slice. For more information, see [Repeatable reads from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and tuning -Learn about key factors that affect the performance of Copy Activity and ways to optimize performance in the [Copy Activity Performance and Tuning Guide](data-factory-copy-activity-performance.md). |
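The DB2 pipeline above uses a static `select * from "Orders"` query, so every rerun of a slice reads the whole table. As a hedged sketch only: if the source table had a change-tracking timestamp column (the `LastModifiedDate` column below is hypothetical), the query could be scoped to the slice window with `$$Text.Format`, the same pattern used in the MongoDB sample earlier on this page, which supports the repeatable-read guidance above. The ODBC-style `{ts '...'}` timestamp escape is borrowed from that sample and may need adjusting for your DB2 environment.

```json
{
    "source": {
        "type": "RelationalSource",
        "query": "$$Text.Format('select * from \"Orders\" where \"LastModifiedDate\" >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}} AND \"LastModifiedDate\" < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd)"
    }
}
```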
data-factory | Data Factory Onprem File System Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-file-system-connector.md | - Title: Copy data to/from a file system using Azure Data Factory -description: Learn how to copy data to and from an on-premises file system by using Azure Data Factory. ---- Previously updated : 04/12/2023----# Copy data to and from an on-premises file system by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-file-system-connector.md) -> * [Version 2 (current version)](../connector-file-system.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [File System connector in V2](../connector-file-system.md). ---This article explains how to use the Copy Activity in Azure Data Factory to copy data to/from an on-premises file system. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. ---## Supported scenarios -You can copy data **from an on-premises file system** to the following data stores: ---You can copy data from the following data stores **to an on-premises file system**: ---> [!NOTE] -> Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline. --## Enabling connectivity -Data Factory supports connecting to and from an on-premises file system via **Data Management Gateway**. You must install the Data Management Gateway in your on-premises environment for the Data Factory service to connect to any supported on-premises data store including file system. To learn about Data Management Gateway and for step-by-step instructions on setting up the gateway, see [Move data between on-premises sources and the cloud with Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md). Apart from Data Management Gateway, no other binary files need to be installed to communicate to and from an on-premises file system. You must install and use the Data Management Gateway even if the file system is in Azure IaaS VM. For detailed information about the gateway, see [Data Management Gateway](data-factory-data-management-gateway.md). --To use a Linux file share, install [Samba](https://www.samba.org/) on your Linux server, and install Data Management Gateway on a Windows server. Installing Data Management Gateway on a Linux server is not supported. --## Getting started -You can create a pipeline with a copy activity that moves data to/from a file system by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to an on-premises file system, you create two linked services to link your on-premises file system and Azure storage account to your data factory. For linked service properties that are specific to an on-premises file system, see [linked service properties](#linked-service-properties) section. -3. Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the folder and file name (optional) in your file system. For dataset properties that are specific to on-premises file system, see [dataset properties](#dataset-properties) section. -4. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and FileSystemSink as a sink for the copy activity. Similarly, if you are copying from on-premises file system to Azure Blob Storage, you use FileSystemSource and BlobSink in the copy activity. For copy activity properties that are specific to on-premises file system, see [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from a file system, see [JSON examples](#json-examples-for-copying-data-to-and-from-file-system) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to file system: --## Linked service properties -You can link an on-premises file system to an Azure data factory with the **On-Premises File Server** linked service. The following table provides descriptions for JSON elements that are specific to the On-Premises File Server linked service. --| Property | Description | Required | -| | | | -| type |Ensure that the type property is set to **OnPremisesFileServer**. |Yes | -| host |Specifies the root path of the folder that you want to copy. Use the escape character ' \ ' for special characters in the string. See [Sample linked service and dataset definitions](#sample-linked-service-and-dataset-definitions) for examples. |Yes | -| userid |Specify the ID of the user who has access to the server. |No (if you choose encryptedCredential) | -| password |Specify the password for the user (userid). |No (if you choose encryptedCredential | -| encryptedCredential |Specify the encrypted credentials that you can get by running the New-AzDataFactoryEncryptValue cmdlet. 
|No (if you choose to specify userid and password in plain text) | -| gatewayName |Specifies the name of the gateway that Data Factory should use to connect to the on-premises file server. |Yes | ---### Sample linked service and dataset definitions -| Scenario | Host in linked service definition | folderPath in dataset definition | -| | | | -| Local folder on Data Management Gateway machine: <br/><br/>Examples: D:\\\* or D:\folder\subfolder\\\* |D:\\\\ (for Data Management Gateway 2.0 and later versions) <br/><br/> localhost (for earlier versions than Data Management Gateway 2.0) |.\\\\ or folder\\\\subfolder (for Data Management Gateway 2.0 and later versions) <br/><br/>D:\\\\ or D:\\\\folder\\\\subfolder (for gateway version below 2.0) | -| Remote shared folder: <br/><br/>Examples: \\\\myserver\\share\\\* or \\\\myserver\\share\\folder\\subfolder\\\* |\\\\\\\\myserver\\\\share |.\\\\ or folder\\\\subfolder | -->[!NOTE] ->When authoring via UI, you don't need to input double backslash (`\\`) to escape like you do via JSON, specify single backslash. --### Example: Using username and password in plain text --```JSON -{ - "Name": "OnPremisesFileServerLinkedService", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "\\\\Contosogame-Asia", - "userid": "Admin", - "password": "123456", - "gatewayName": "mygateway" - } - } -} -``` --### Example: Using encryptedcredential --```JSON -{ - "Name": " OnPremisesFileServerLinkedService ", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "D:\\", - "encryptedCredential": "WFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5xxxxxxxxxxxxxxxxx", - "gatewayName": "mygateway" - } - } -} -``` --## Dataset properties -For a full list of sections and properties that are available for defining datasets, see [Creating datasets](data-factory-create-datasets.md). Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. --The typeProperties section is different for each type of dataset. It provides information such as the location and format of the data in the data store. The typeProperties section for the dataset of type **FileShare** has the following properties: --| Property | Description | Required | -| | | | -| folderPath |Specifies the subpath to the folder. Use the escape character '\' for special characters in the string. Wildcard filter is not supported. See [Sample linked service and dataset definitions](#sample-linked-service-and-dataset-definitions) for examples.<br/><br/>You can combine this property with **partitionBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When **fileName** is not specified for an output dataset and **preserveHierarchy** is not specified in activity sink, the name of the generated file is in the following format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the folderPath rather than all files. <br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Example 1: "fileFilter": "*.log"<br/>Example 2: "fileFilter": 2014-1-?.txt"<br/><br/>Note that fileFilter is applicable for an input FileShare dataset. 
|No | -| partitionedBy |You can use partitionedBy to specify a dynamic folderPath/fileName for time-series data. An example is folderPath parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values. For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | --> [!NOTE] -> You cannot use fileName and fileFilter simultaneously. --### Using partitionedBy property -As mentioned in the previous section, you can specify a dynamic folderPath and filename for time series data with the **partitionedBy** property, [Data Factory functions, and the system variables](data-factory-functions-variables.md). --To understand more details on time-series datasets, scheduling, and slices, see [Creating datasets](data-factory-create-datasets.md), [Scheduling and execution](data-factory-scheduling-and-execution.md), and [Creating pipelines](data-factory-create-pipelines.md). --#### Sample 1: --```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` --In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the format (YYYYMMDDHH). SliceStart refers to start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104. --#### Sample 2: --```JSON -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": -[ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` --In this example, year, month, day, and time of SliceStart are extracted into separate variables that the folderPath and fileName properties use. --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output datasets, and policies are available for all types of activities. 
Whereas, properties available in the **typeProperties** section of the activity vary with each activity type. --For Copy activity, they vary depending on the types of sources and sinks. If you are moving data from an on-premises file system, you set the source type in the copy activity to **FileSystemSource**. Similarly, if you are moving data to an on-premises file system, you set the sink type in the copy activity to **FileSystemSink**. This section provides a list of properties supported by FileSystemSource and FileSystemSink. --**FileSystemSource** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| recursive |Indicates whether the data is read recursively from the subfolders or only from the specified folder. |True, False (default) |No | --**FileSystemSink** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| copyBehavior |Defines the copy behavior when the source is BlobSource or FileSystem. |**PreserveHierarchy:** Preserves the file hierarchy in the target folder. That is, the relative path of the source file to the source folder is the same as the relative path of the target file to the target folder.<br/><br/>**FlattenHierarchy:** All files from the source folder are created in the first level of target folder. The target files are created with an autogenerated name.<br/><br/>**MergeFiles:** Merges all files from the source folder to one file. If the file name/blob name is specified, the merged file name is the specified name. Otherwise, it is an auto-generated file name. |No | --### recursive and copyBehavior examples -This section describes the resulting behavior of the Copy operation for different combinations of values for the recursive and copyBehavior properties. --| recursive value | copyBehavior value | Resulting behavior | -| | | | -| true |preserveHierarchy |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the same structure as the source:<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5 | -| true |flattenHierarchy |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/> auto-generated name for File3<br/> auto-generated name for File4<br/> auto-generated name for File5 | -| true |mergeFiles |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target Folder1 is created with the following structure: <br/><br/>Folder1<br/> File1 + File2 + File3 + File4 + File 5 contents are merged into one file with an auto-generated file name. | -| false |preserveHierarchy |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/> File1<br/> File2<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. 
| -| false |flattenHierarchy |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/> auto-generated name for File1<br/> auto-generated name for File2<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. | -| false |mergeFiles |For a source folder Folder1 with the following structure,<br/><br/>Folder1<br/> File1<br/> File2<br/> Subfolder1<br/> File3<br/> File4<br/> File5<br/><br/>the target folder Folder1 is created with the following structure:<br/><br/>Folder1<br/> File1 + File2 contents are merged into one file with an auto-generated file name.<br/> Auto-generated name for File1<br/><br/>Subfolder1 with File3, File4, and File5 is not picked up. | --## Supported file and compression formats -See [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md) article on details. --## JSON examples for copying data to and from file system -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data to and from an on-premises file system and Azure Blob storage. However, you can copy data *directly* from any of the sources to any of the sinks listed in [Supported sources and sinks](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Copy Activity in Azure Data Factory. --### Example: Copy data from an on-premises file system to Azure Blob storage -This sample shows how to copy data from an on-premises file system to Azure Blob storage. The sample has the following Data Factory entities: --* A linked service of type [OnPremisesFileServer](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [FileShare](#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [FileSystemSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The following sample copies time-series data from an on-premises file system to Azure Blob storage every hour. The JSON properties that are used in these samples are described in the sections after the samples. --As a first step, set up Data Management Gateway as per the instructions in [Move data between on-premises sources and the cloud with Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md). --**On-Premises File Server linked service:** --```JSON -{ - "Name": "OnPremisesFileServerLinkedService", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "\\\\Contosogame-Asia.<region>.corp.<company>.com", - "userid": "Admin", - "password": "123456", - "gatewayName": "mygateway" - } - } -} -``` --We recommend using the **encryptedCredential** property instead the **userid** and **password** properties. See [File Server linked service](#linked-service-properties) for details about this linked service. 
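As a concrete illustration of that recommendation, the same linked service might instead carry an encrypted credential, along the lines of the **encryptedCredential** example shown earlier in this article. This is only a sketch: the credential value below is a placeholder, since a real value is generated for the gateway rather than typed by hand.

```JSON
{
    "Name": "OnPremisesFileServerLinkedService",
    "properties": {
        "type": "OnPremisesFileServer",
        "typeProperties": {
            "host": "\\\\Contosogame-Asia.<region>.corp.<company>.com",
            "encryptedCredential": "<encrypted credential>",
            "gatewayName": "mygateway"
        }
    }
}
```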
--**Azure Storage linked service:** --```JSON -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --**On-premises file system input dataset:** --Data is picked up from a new file every hour. The folderPath and fileName properties are determined based on the start time of the slice. --Setting `"external": "true"` informs Data Factory that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "OnpremisesFileSystemInput", - "properties": { - "type": " FileShare", - "linkedServiceName": " OnPremisesFileServerLinkedService ", - "typeProperties": { - "folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob storage output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses the year, month, day, and hour parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**A copy activity in a pipeline with File System source and Blob sink:** --The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **FileSystemSource**, and **sink** type is set to **BlobSink**. 
--```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2015-06-01T18:00:00", - "end":"2015-06-01T19:00:00", - "description":"Pipeline for copy activity", - "activities":[ - { - "name": "OnpremisesFileSystemtoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "OnpremisesFileSystemInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --### Example: Copy data from Azure SQL Database to an on-premises file system -The following sample shows: --* A linked service of type [AzureSqlDatabase.](data-factory-azure-sql-connector.md#linked-service-properties) -* A linked service of type [OnPremisesFileServer](#linked-service-properties). -* An input dataset of type [AzureSqlTable](data-factory-azure-sql-connector.md#dataset-properties). -* An output dataset of type [FileShare](#dataset-properties). -* A pipeline with a copy activity that uses [SqlSource](data-factory-azure-sql-connector.md#copy-activity-properties) and [FileSystemSink](#copy-activity-properties). --The sample copies time-series data from an Azure SQL table to an on-premises file system every hour. The JSON properties that are used in these samples are described in sections after the samples. --**Azure SQL Database linked service:** --```JSON -{ - "name": "AzureSqlLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } -} -``` --**On-Premises File Server linked service:** --```JSON -{ - "Name": "OnPremisesFileServerLinkedService", - "properties": { - "type": "OnPremisesFileServer", - "typeProperties": { - "host": "\\\\Contosogame-Asia.<region>.corp.<company>.com", - "userid": "Admin", - "password": "123456", - "gatewayName": "mygateway" - } - } -} -``` --We recommend using the **encryptedCredential** property instead of using the **userid** and **password** properties. See [File System linked service](#linked-service-properties) for details about this linked service. --**Azure SQL input dataset:** --The sample assumes that you've created a table "MyTable" in Azure SQL, and it contains a column called "timestampcolumn" for time-series data. --Setting ``"external": "true"`` informs Data Factory that the dataset is external to the data factory and is not produced by an activity in the data factory. --```JSON -{ - "name": "AzureSqlInput", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**On-premises file system output dataset:** --Data is copied to a new file every hour. The folderPath and fileName for the blob are determined based on the start time of the slice. 
--```JSON -{ - "name": "OnpremisesFileSystemOutput", - "properties": { - "type": "FileShare", - "linkedServiceName": " OnPremisesFileServerLinkedService ", - "typeProperties": { - "folderPath": "mysharedfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**A copy activity in a pipeline with SQL source and File System sink:** --The pipeline contains a copy activity that is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlSource**, and the **sink** type is set to **FileSystemSink**. The SQL query that is specified for the **SqlReaderQuery** property selects the data in the past hour to copy. --```JSON -{ - "name":"SamplePipeline", - "properties":{ - "start":"2015-06-01T18:00:00", - "end":"2015-06-01T20:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "AzureSQLtoOnPremisesFile", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureSQLInput" - } - ], - "outputs": [ - { - "name": "OnpremisesFileSystemOutput" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "FileSystemSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 3, - "timeout": "01:00:00" - } - } - ] - } -} -``` --You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and tuning - To learn about key factors that impact the performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it, see the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md). |
data-factory | Data Factory Onprem Mysql Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-mysql-connector.md | - Title: Move data from MySQL using Azure Data Factory -description: Learn how to move data from a MySQL database using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from MySQL using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-mysql-connector.md) -> * [Version 2 (current version)](../connector-mysql.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [MySQL connector in V2](../connector-mysql.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises MySQL database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises MySQL data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from a MySQL data store to other data stores, not moving data from other data stores to a MySQL data store. --## Prerequisites -The Data Factory service supports connecting to on-premises MySQL sources using the Data Management Gateway. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. --The gateway is required even if the MySQL database is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM as long as the gateway can connect to the database. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Supported versions and installation -For Data Management Gateway to connect to the MySQL database, you need to install the [MySQL Connector/NET for Microsoft Windows](https://dev.mysql.com/downloads/connector/net/) (version between 6.6.5 and 6.10.7) on the same system as the Data Management Gateway. This 32-bit driver is compatible with the 64-bit Data Management Gateway. MySQL version 5.1 and above is supported. --> [!TIP] -> If you hit the error "Authentication failed because the remote party has closed the transport stream.", consider upgrading the MySQL Connector/NET to a higher version. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises MySQL data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. -- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. 
See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except the .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises MySQL data store, see the [JSON example: Copy data from MySQL to Azure Blob](#json-example-copy-data-from-mysql-to-azure-blob) section of this article. --The following sections provide details about the JSON properties that are used to define Data Factory entities specific to a MySQL data store: --## Linked service properties -The following table describes the JSON elements specific to the MySQL linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesMySql** |Yes | -| server |Name of the MySQL server. |Yes | -| database |Name of the MySQL database. |Yes | -| schema |Name of the schema in the database. |No | -| authenticationType |Type of authentication used to connect to the MySQL database. Possible values are: `Basic`. |Yes | -| userName |Specify the user name to connect to the MySQL database. |Yes | -| password |Specify the password for the user account you specified. |Yes | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises MySQL database. |Yes | --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type **RelationalTable** (which includes the MySQL dataset) has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the MySQL Database instance that the linked service refers to. |No (if **query** of **RelationalSource** is specified) | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --Properties available in the **typeProperties** section of the activity, however, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks. 
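For reference, here is a minimal sketch (not part of this article's published sample) of a **RelationalTable** dataset that points at a specific MySQL table through **tableName**. The dataset name is illustrative, the linked service name matches the one used later in this article, and the availability settings follow the hourly pattern used in the samples. When a dataset specifies **tableName**, the copy activity source described next can omit its **query**.

```JSON
{
    "name": "MySqlTableDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremMySqlLinkedService",
        "typeProperties": {
            "tableName": "MyTable"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```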
--When source in copy activity is of type **RelationalSource** (which includes MySQL), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: select * from MyTable. |No (if **tableName** of **dataset** is specified) | ---## JSON example: Copy data from MySQL to Azure Blob -This example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). It shows how to copy data from an on-premises MySQL database to an Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesMySql](data-factory-onprem-mysql-connector.md#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](data-factory-onprem-mysql-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](data-factory-onprem-mysql-connector.md#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a query result in MySQL database to a blob hourly. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**MySQL linked service:** --```JSON - { - "name": "OnPremMySqlLinkedService", - "properties": { - "type": "OnPremisesMySql", - "typeProperties": { - "server": "<server name>", - "database": "<database name>", - "schema": "<schema name>", - "authenticationType": "<authentication type>", - "userName": "<user name>", - "password": "<password>", - "gatewayName": "<gateway>" - } - } - } -``` --**Azure Storage linked service:** --```JSON - { - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } - } -``` --**MySQL input dataset:** --The sample assumes you have created a table "MyTable" in MySQL and it contains a column called "timestampcolumn" for time series data. --Setting "external": "true" informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. 
--```JSON - { - "name": "MySqlDataSet", - "properties": { - "published": false, - "type": "RelationalTable", - "linkedServiceName": "OnPremMySqlLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } - } -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON - { - "name": "AzureBlobMySqlDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/mysql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } -``` --**Pipeline with Copy activity:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data in the past hour to copy. --```JSON - { - "name": "CopyMySqlToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "MySqlDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobMySqlDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "MySqlToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } - } -``` ---### Type mapping for MySQL -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to MySQL, the following mappings are used from MySQL types to .NET types. 
--| MySQL Database type | .NET Framework type | -| | | -| bigint unsigned |Decimal | -| bigint |Int64 | -| bit |Decimal | -| blob |Byte[] | -| bool |Boolean | -| char |String | -| date |Datetime | -| datetime |Datetime | -| decimal |Decimal | -| double precision |Double | -| double |Double | -| enum |String | -| float |Single | -| int unsigned |Int64 | -| int |Int32 | -| integer unsigned |Int64 | -| integer |Int32 | -| long varbinary |Byte[] | -| long varchar |String | -| longblob |Byte[] | -| longtext |String | -| mediumblob |Byte[] | -| mediumint unsigned |Int64 | -| mediumint |Int32 | -| mediumtext |String | -| numeric |Decimal | -| real |Double | -| set |String | -| smallint unsigned |Int32 | -| smallint |Int16 | -| text |String | -| time |TimeSpan | -| timestamp |Datetime | -| tinyblob |Byte[] | -| tinyint unsigned |Int16 | -| tinyint |Int16 | -| tinytext |String | -| varchar |String | -| year |Int | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
data-factory | Data Factory Onprem Oracle Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-oracle-connector.md | - Title: Copy data to or from Oracle by using Data Factory -description: Learn how to copy data to or from an on-premises Oracle database by using Azure Data Factory. ---- Previously updated : 04/12/2023----# Copy data to or from Oracle on-premises by using Azure Data Factory --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-oracle-connector.md) -> * [Version 2 (current version)](../connector-oracle.md) --> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you're using the current version of the Azure Data Factory service, see [Oracle connector in V2](../connector-oracle.md). ---This article explains how to use Copy Activity in Azure Data Factory to move data to or from an on-premises Oracle database. The article builds on [Data movement activities](data-factory-data-movement-activities.md), which presents a general overview of data movement by using Copy Activity. --## Supported scenarios --You can copy data *from an Oracle database* to the following data stores: ---You can copy data from the following data stores *to an Oracle database*: ---## Prerequisites --Data Factory supports connecting to on-premises Oracle sources by using Data Management Gateway. See [Data Management Gateway](data-factory-data-management-gateway.md) to learn more about Data Management Gateway. For step-by-step instructions on how to set up the gateway in a data pipeline to move data, see [Move data from on-premises to cloud](data-factory-move-data-between-onprem-and-cloud.md). --The gateway is required even if the Oracle database is hosted in an Azure infrastructure as a service (IaaS) VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database. --> [!NOTE] -> For tips on troubleshooting issues that are related to connection and the gateway, see [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues). --## Supported versions and installation --This Oracle connector supports two versions of drivers: --- **Microsoft driver for Oracle (recommended)**: Beginning in Data Management Gateway version 2.7, a Microsoft driver for Oracle is automatically installed with the gateway. You don't need to install or update the driver to establish connectivity to Oracle. You can also experience better copy performance by using this driver. These versions of Oracle databases are supported:- - Oracle 12c R1 (12.1) - - Oracle 11g R1, R2 (11.1, 11.2) - - Oracle 10g R1, R2 (10.1, 10.2) - - Oracle 9i R1, R2 (9.0.1, 9.2) - - Oracle 8i R3 (8.1.7) -- > [!NOTE] - > Oracle proxy server isn't supported. -- > [!IMPORTANT] - > Currently, the Microsoft driver for Oracle supports only copying data from Oracle. The driver doesn't support writing to Oracle. The test connection capability on the Data Management Gateway **Diagnostics** tab doesn't support this driver. Alternatively, you can use the Copy wizard to validate connectivity. - > --- **Oracle Data Provider for .NET**: You can use Oracle Data Provider to copy data from or to Oracle. This component is included in [Oracle Data Access Components for Windows](https://www.oracle.com/technetwork/topics/dotnet/downloads/). 
Install the relevant version (32-bit or 64-bit) on the machine where the gateway is installed. [Oracle Data Provider .NET 12.1](https://docs.oracle.com/database/121/ODPNT/InstallSystemRequirements.htm#ODPNT149) can access Oracle Database 10g Release 2 and later versions.-- If you select **XCopy Installation**, complete the steps that are described in the readme.htm file. We recommend selecting the installer that has the UI (not the XCopy installer). -- After you install the provider, restart the Data Management Gateway host service on your machine by using the Services applet or Data Management Gateway Configuration Manager. --If you use the Copy wizard to author the copy pipeline, the driver type is autodetermined. The Microsoft driver is used by default, unless your gateway version is earlier than version 2.7 or you select Oracle as the sink. --## Get started --You can create a pipeline that has a copy activity. The pipeline moves data to or from an on-premises Oracle database by using different tools or APIs. --The easiest way to create a pipeline is to use the Copy wizard. See [Tutorial: Create a pipeline by using the Copy wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline by using the Copy Data wizard. --You can also use one of the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, an **Azure Resource Manager template**, the **.NET API**, or the **REST API**. See the [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions on how to create a pipeline that has a copy activity. --Whether you use the tools or APIs, complete the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory can contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. For example, if you are copying data from an Oracle database to Azure Blob storage, create two linked services to link your Oracle database and Azure storage account to your data factory. For linked service properties that are specific to Oracle, see [Linked service properties](#linked-service-properties). -3. Create **datasets** to represent input and output data for the copy operation. In the example in the preceding step, you create a dataset to specify the table in your Oracle database that contains the input data. You create another dataset to specify the blob container and the folder that holds the data copied from the Oracle database. For dataset properties that are specific to Oracle, see [Dataset properties](#dataset-properties). -4. Create a **pipeline** that has a copy activity that takes a dataset as an input and a dataset as an output. In the preceding example, you use **OracleSource** as a source and **BlobSink** as a sink for the copy activity. Similarly, if you are copying from Azure Blob storage to an Oracle database, you use **BlobSource** and **OracleSink** in the copy activity. For Copy Activity properties that are specific to an Oracle database, see [Copy Activity properties](#copy-activity-properties). For details about how to use a data store as a source or sink, select the link for your data store in the preceding section. --When you use the wizard, JSON definitions for these Data Factory entities are automatically created for you: linked services, datasets, and the pipeline. 
When you use tools or APIs (except for the .NET API), you define these Data Factory entities by using the JSON format. For samples that have JSON definitions for Data Factory entities that you use to copy data to or from an on-premises Oracle database, see JSON examples. --The following sections provide details about JSON properties that you use to define Data Factory entities. --## Linked service properties --The following table describes JSON elements that are specific to the Oracle linked service: --| Property | Description | Required | -| | | | -| type |The **type** property must be set to **OnPremisesOracle**. |Yes | -| driverType | Specify which driver to use to copy data from or to an Oracle database. Allowed values are **Microsoft** and **ODP** (default). See [Supported version and installation](#supported-versions-and-installation) for driver details. | No | -| connectionString | Specify the information that's needed to connect to the Oracle database instance for the **connectionString** property. | Yes | -| gatewayName | The name of the gateway that's used to connect to the on-premises Oracle server. |Yes | --**Example: Using the Microsoft driver** --> [!TIP] -> If you see an error that says "ORA-01025: UPI parameter out of range" and your Oracle is version 8i, add `WireProtocolMode=1` to your connection string and try again: --```json -{ - "name": "OnPremisesOracleLinkedService", - "properties": { - "type": "OnPremisesOracle", - "typeProperties": { - "driverType": "Microsoft", - "connectionString":"Host=<host>;Port=<port>;Sid=<service ID>;User Id=<user name>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` --**Example: Using the ODP driver** --To learn about allowed formats, see [Oracle data provider for .NET ODP](https://www.oracle.com/in/database/technologies/appdev/dotnet/odp.html). --```json -{ - "name": "OnPremisesOracleLinkedService", - "properties": { - "type": "OnPremisesOracle", - "typeProperties": { - "connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host name>)(PORT=<port number>))(CONNECT_DATA=(SERVICE_NAME=<service ID>))); User Id=<user name>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` --## Dataset properties --For a full list of sections and properties that are available for defining datasets, see [Creating datasets](data-factory-create-datasets.md). --The sections of a dataset JSON file, such as structure, availability, and policy, are similar for all dataset types (for example, for Oracle, Azure Blob storage, and Azure Table storage). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for the dataset of type **OracleTable** has the following properties: --| Property | Description | Required | -| | | | -| tableName |The name of the table in the Oracle database that the linked service refers to. |No (if **oracleReaderQuery** or **OracleSource** is specified) | --## Copy Activity properties --For a full list of sections and properties that are available for defining activities, see [Creating pipelines](data-factory-create-pipelines.md). --Properties like name, description, input and output tables, and policy are available for all types of activities. --> [!NOTE] -> Copy Activity takes only one input and produces only one output. --Properties that are available in the **typeProperties** section of the activity vary with each activity type. 
Copy Activity properties vary depending on the type of source and sink. --### OracleSource --In Copy Activity, when the source is the **OracleSource** type, the following properties are available in the **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| oracleReaderQuery |Use the custom query to read data. |A SQL query string. For example, "select \* from **MyTable**". <br/><br/>If not specified, this SQL statement is executed: "select \* from **MyTable**" |No<br />(if **tableName** of **dataset** is specified) | --### OracleSink --**OracleSink** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| writeBatchTimeout |The wait time for the batch insert operation to complete before it times out. |**timespan**<br/><br/> Example: 00:30:00 (30 minutes) |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches the value of **writeBatchSize**. |Integer (number of rows) |No (default: 100) | -| sqlWriterCleanupScript |Specifies a query for Copy Activity to execute so that the data of a specific slice is cleaned up. |A query statement. |No | -| sliceIdentifierColumnName |Specifies the column name for Copy Activity to fill with an autogenerated slice identifier. The value for **sliceIdentifierColumnName** is used to clean up data of a specific slice when rerun. |The column name of a column that has data type of **binary(32)**. |No | --## JSON examples for copying data to and from the Oracle database --The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). The examples show how to copy data from or to an Oracle database and to or from Azure Blob storage. However, data can be copied to any of the sinks listed in [Supported data stores and formats](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Copy Activity in Azure Data Factory. --**Example: Copy data from Oracle to Azure Blob storage** --The sample has the following Data Factory entities: --* A linked service of type [OnPremisesOracle](data-factory-onprem-oracle-connector.md#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [OracleTable](data-factory-onprem-oracle-connector.md#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with a copy activity that uses [OracleSource](data-factory-onprem-oracle-connector.md#copy-activity-properties) as source and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) as sink. --The sample copies data from a table in an on-premises Oracle database to a blob hourly. For more information about various properties that are used in the sample, see the sections that follow the samples. 
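Before the sample entities, here is a hedged sketch of how the optional **OracleSink** properties described above might be combined in a copy activity's **typeProperties** section when writing to Oracle. This fragment is not part of the sample that follows; the batch values are illustrative, and **SliceId** is a hypothetical **binary(32)** column that would have to exist in the target table.

```json
"typeProperties": {
    "source": {
        "type": "BlobSource"
    },
    "sink": {
        "type": "OracleSink",
        "writeBatchSize": 1000,
        "writeBatchTimeout": "00:30:00",
        "sliceIdentifierColumnName": "SliceId"
    }
}
```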
--**Oracle linked service** --```json -{ - "name": "OnPremisesOracleLinkedService", - "properties": { - "type": "OnPremisesOracle", - "typeProperties": { - "driverType": "Microsoft", - "connectionString":"Host=<host>;Port=<port>;Sid=<service ID>;User Id=<username>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` --**Azure Blob storage linked service** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>" - } - } -} -``` --**Oracle input dataset** --The sample assumes that you have created a table named **MyTable** in Oracle. It contains a column called **timestampcolumn** for time series data. --Setting **external**: **true** informs the Data Factory service that the dataset is external to the data factory and that the dataset isn't produced by an activity in the data factory. --```json -{ - "name": "OracleInput", - "properties": { - "type": "OracleTable", - "linkedServiceName": "OnPremisesOracleLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "offset": "01:00:00", - "interval": "1", - "anchorDateTime": "2014-02-27T12:00:00", - "frequency": "Hour" - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset** --Data is written to a new blob every hour (**frequency**: **hour**, **interval**: **1**). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that's being processed. The folder path uses the year, month, day, and hour part of the start time. --```json -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline with a copy activity** --The pipeline contains a copy activity that's configured to use the input and output datasets and scheduled to run hourly. In the pipeline JSON definition, the **source** type is set to **OracleSource** and the **sink** type is set to **BlobSink**. The SQL query that you specify by using the **oracleReaderQuery** property selects the data in the past hour to copy. 
--```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline for a copy activity", - "activities":[ - { - "name": "OracletoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": " OracleInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "OracleSource", - "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --**Example: Copy data from Azure Blob storage to Oracle** --This sample shows how to copy data from an Azure Blob storage account to an on-premises Oracle database. However, you can copy data *directly* from any of the sources listed in [Supported data stores and formats](data-factory-data-movement-activities.md#supported-data-stores-and-formats) by using Copy Activity in Azure Data Factory. --The sample has the following Data Factory entities: --* A linked service of type [OnPremisesOracle](data-factory-onprem-oracle-connector.md#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [OracleTable](data-factory-onprem-oracle-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) that has a copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) as source [OracleSink](data-factory-onprem-oracle-connector.md#copy-activity-properties) as sink. --The sample copies data from a blob to a table in an on-premises Oracle database every hour. For more information about various properties that are used in the sample, see the sections that follow the samples. --**Oracle linked service** --```json -{ - "name": "OnPremisesOracleLinkedService", - "properties": { - "type": "OnPremisesOracle", - "typeProperties": { - "connectionString": "Data Source=(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host name>)(PORT=<port number>))(CONNECT_DATA=(SERVICE_NAME=<service ID>))); - User Id=<username>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` --**Azure Blob storage linked service** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>" - } - } -} -``` --**Azure blob input dataset** --Data is picked up from a new blob every hour (**frequency**: **hour**, **interval**: **1**). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that's being processed. The folder path uses the year, month, and day part of the start time. The file name uses the hour part of the start time. The setting **external**: **true** informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory. 
--```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Oracle output dataset** --The sample assumes you have created a table named **MyTable** in Oracle. Create the table in Oracle with the same number of columns that you expect the blob CSV file to contain. New rows are added to the table every hour. --```json -{ - "name": "OracleOutput", - "properties": { - "type": "OracleTable", - "linkedServiceName": "OnPremisesOracleLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "availability": { - "frequency": "Day", - "interval": "1" - } - } -} -``` --**Pipeline with a copy activity** --The pipeline contains a copy activity that's configured to use the input and output datasets and scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and the **sink** type is set to **OracleSink**. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-05T19:00:00", - "description":"pipeline with a copy activity", - "activities":[ - { - "name": "AzureBlobtoOracle", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "OracleOutput" - } - ], - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "OracleSink" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` ---## Troubleshooting tips --### Problem 1: .NET Framework Data Provider --**Error message** --```text -Copy activity met invalid parameters: 'UnknownParameterName', Detailed message: Unable to find the requested .NET Framework Data Provider. It may not be installed. -``` --**Possible causes** --* The .NET Framework Data Provider for Oracle wasn't installed. -* The .NET Framework Data Provider for Oracle was installed to .NET Framework 2.0 and isn't found in the .NET Framework 4.0 folders. --**Resolution** --* If you haven't installed the .NET Provider for Oracle, [install it](https://www.oracle.com/technetwork/topics/dotnet/downloads/), and then retry the scenario. -* If you see the error message even after you install the provider, complete the following steps: - 1. Open the machine config file for .NET 2.0 from the folder <system disk\>:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config. - 2. Search for **Oracle Data Provider for .NET**. 
You should be able to find an entry as shown in the following sample under **system.data** > **DbProviderFactories**: - `<add name="Oracle Data Provider for .NET" invariant="Oracle.DataAccess.Client" description="Oracle Data Provider for .NET" type="Oracle.DataAccess.Client.OracleClientFactory, Oracle.DataAccess, Version=2.112.3.0, Culture=neutral, PublicKeyToken=89b483f429c47342" />` -* Copy this entry to the machine.config file in the following .NET 4.0 folder: <system disk\>:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\machine.config. Then, change the version to 4.xxx.x.x. -* Install <ODP.NET Installed Path\>\11.2.0\client_1\odp.net\bin\4\Oracle.DataAccess.dll in the global assembly cache (GAC) by running **gacutil /i [provider path]**. --### Problem 2: Date/time formatting --**Error message** --```text -Message=Operation failed in Oracle Database with the following error: 'ORA-01861: literal does not match format string'.,Source=,''Type=Oracle.DataAccess.Client.OracleException,Message=ORA-01861: literal does not match format string,Source=Oracle Data Provider for .NET,'. -``` --**Resolution** --You might need to adjust the query string in your copy activity based on how dates are configured in your Oracle database. Here's an example (using the **to_date** function): --```console -"oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= to_date(\\'{0:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') AND timestampcolumn < to_date(\\'{1:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') ', WindowStart, WindowEnd)" -``` ---## Type mapping for Oracle --As mentioned in [Data movement activities](data-factory-data-movement-activities.md), Copy Activity performs automatic type conversions from source types to sink types by using the following two-step approach: --1. Convert from native source types to the .NET type. -2. Convert from the .NET type to the native sink type. --When you move data from Oracle, the following mappings are used from the Oracle data type to the .NET type and vice versa: --| Oracle data type | .NET Framework data type | -| | | -| BFILE |Byte[] | -| BLOB |Byte[]<br/>(only supported on Oracle 10g and later versions when you use a Microsoft driver) | -| CHAR |String | -| CLOB |String | -| DATE |DateTime | -| FLOAT |Decimal, String (if precision > 28) | -| INTEGER |Decimal, String (if precision > 28) | -| INTERVAL YEAR TO MONTH |Int32 | -| INTERVAL DAY TO SECOND |TimeSpan | -| LONG |String | -| LONG RAW |Byte[] | -| NCHAR |String | -| NCLOB |String | -| NUMBER |Decimal, String (if precision > 28) | -| NVARCHAR2 |String | -| RAW |Byte[] | -| ROWID |String | -| TIMESTAMP |DateTime | -| TIMESTAMP WITH LOCAL TIME ZONE |DateTime | -| TIMESTAMP WITH TIME ZONE |DateTime | -| UNSIGNED INTEGER |Number | -| VARCHAR2 |String | -| XML |String | --> [!NOTE] -> Data types **INTERVAL YEAR TO MONTH** and **INTERVAL DAY TO SECOND** aren't supported when you use a Microsoft driver. --## Map source to sink columns --To learn more about mapping columns in the source dataset to columns in the sink dataset, see [Mapping dataset columns in Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources --When you copy data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can manually rerun a slice. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. 
When a slice is rerun, either manually or by a retry policy, make sure that the same data is read no matter how many times a slice is run. For more information, see [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and tuning --See the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md) to learn about key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. You can also learn about various ways to optimize it. |
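For example, when Oracle is the source rather than the sink, one way to keep slice reruns repeatable is to bound the reader query to the slice window, reusing the **to_date** pattern from the troubleshooting section above. The following copy activity fragment is only a sketch; the **OracleSource** direction, table name, and timestamp column are assumptions you'd adapt to your own schema and date settings.

```json
"typeProperties": {
    "source": {
        "type": "OracleSource",
        "oracleReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= to_date(\\'{0:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\') AND timestampcolumn < to_date(\\'{1:MM-dd-yyyy HH:mm}\\',\\'MM/DD/YYYY HH24:MI\\')', WindowStart, WindowEnd)"
    },
    "sink": {
        "type": "BlobSink"
    }
}
```

Because the query is driven entirely by **WindowStart** and **WindowEnd**, rerunning a slice reads the same rows each time, provided the source data for that window hasn't changed.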
data-factory | Data Factory Onprem Postgresql Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-postgresql-connector.md | - Title: Move data from PostgreSQL using Azure Data Factory -description: Learn how to move data from a PostgreSQL database using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data from PostgreSQL using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-postgresql-connector.md) -> * [Version 2 (current version)](../connector-postgresql.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [PostgreSQL connector in V2](../connector-postgresql.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises PostgreSQL database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises PostgreSQL data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see [supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats). Data Factory currently supports moving data from a PostgreSQL database to other data stores, but not moving data from other data stores to a PostgreSQL database. --## Prerequisites --The Data Factory service supports connecting to on-premises PostgreSQL sources using the Data Management Gateway. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. --The gateway is required even if the PostgreSQL database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM, as long as the gateway can connect to the database. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Supported versions and installation -For Data Management Gateway to connect to the PostgreSQL database, install the [Npgsql data provider for PostgreSQL](https://go.microsoft.com/fwlink/?linkid=282716) with a version between 2.0.12 and 3.1.9 on the same system as the Data Management Gateway. PostgreSQL version 7.4 and above is supported. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises PostgreSQL data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard.-- You can also use the following tools to create a pipeline:- - Visual Studio - - Azure PowerShell - - Azure Resource Manager template - - .NET API - - REST API -- See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises PostgreSQL data store, see [JSON example: Copy data from PostgreSQL to Azure Blob](#json-example-copy-data-from-postgresql-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a PostgreSQL data store: --## Linked service properties -The following table provides description for JSON elements specific to PostgreSQL linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesPostgreSql** |Yes | -| server |Name of the PostgreSQL server. |Yes | -| database |Name of the PostgreSQL database. |Yes | -| schema |Name of the schema in the database. The schema name is case-sensitive. |No | -| authenticationType |Type of authentication used to connect to the PostgreSQL database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises PostgreSQL database. |Yes | --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **RelationalTable** (which includes PostgreSQL dataset) has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the PostgreSQL Database instance that linked service refers to. The tableName is case-sensitive. |No (if **query** of **RelationalSource** is specified) | --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When source is of type **RelationalSource** (which includes PostgreSQL), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. 
|SQL query string. For example: `"query": "select * from \"MySchema\".\"MyTable\""`. |No (if **tableName** of **dataset** is specified) | --> [!NOTE] -> Schema and table names are case-sensitive. Enclose them in `""` (double quotes) in the query. --**Example:** -- `"query": "select * from \"MySchema\".\"MyTable\""` --## JSON example: Copy data from PostgreSQL to Azure Blob -This example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from PostgreSQL database to Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesPostgreSql](data-factory-onprem-postgresql-connector.md#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](data-factory-onprem-postgresql-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. The [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](data-factory-onprem-postgresql-connector.md#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a query result in PostgreSQL database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, set up the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**PostgreSQL linked service:** --```json -{ - "name": "OnPremPostgreSqlLinkedService", - "properties": { - "type": "OnPremisesPostgreSql", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` -**Azure Blob storage linked service:** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>" - } - } -} -``` -**PostgreSQL input dataset:** --The sample assumes you have created a table "MyTable" in PostgreSQL and it contains a column called "timestamp" for time series data. --Setting `"external": true` informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. 
--```json -{ - "name": "PostgreSqlDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremPostgreSqlLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobPostgreSqlDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/postgresql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline with Copy activity:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data from the public.usstates table in the PostgreSQL database. --```json -{ - "name": "CopyPostgreSqlToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from \"public\".\"usstates\"" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [ - { - "name": "PostgreSqlDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobPostgreSqlDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "PostgreSqlToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } -} -``` -## Type mapping for PostgreSQL -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to PostgreSQL, the following mappings are used from PostgreSQL type to .NET type. 
--| PostgreSQL Database type | PostgresSQL aliases | .NET Framework type | -| | | | -| abstime | |Datetime | -| bigint |int8 |Int64 | -| bigserial |serial8 |Int64 | -| bit [(n)] | |Byte[], String | -| bit varying [ (n) ] |varbit |Byte[], String | -| boolean |bool |Boolean | -| box | |Byte[], String | -| bytea | |Byte[], String | -| character [(n)] |char [(n)] |String | -| character varying [(n)] |varchar [(n)] |String | -| cid | |String | -| cidr | |String | -| circle | |Byte[], String | -| date | |Datetime | -| daterange | |String | -| double precision |float8 |Double | -| inet | |Byte[], String | -| intarry | |String | -| int4range | |String | -| int8range | |String | -| integer |int, int4 |Int32 | -| interval [fields] [(p)] | |Timespan | -| json | |String | -| jsonb | |Byte[] | -| line | |Byte[], String | -| lseg | |Byte[], String | -| macaddr | |Byte[], String | -| money | |Decimal | -| numeric [(p, s)] |decimal [(p, s)] |Decimal | -| numrange | |String | -| oid | |Int32 | -| path | |Byte[], String | -| pg_lsn | |Int64 | -| point | |Byte[], String | -| polygon | |Byte[], String | -| real |float4 |Single | -| smallint |int2 |Int16 | -| smallserial |serial2 |Int16 | -| serial |serial4 |Int32 | -| text | |String | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
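As a variation on the sample above, if you want the copy activity to read an entire table instead of supplying a query, you can name the table in the dataset's **typeProperties**; per the dataset properties table, **tableName** is optional only when a **query** is specified on the source. A minimal sketch, assuming the same linked service as the sample (the dataset name is illustrative and **tableName** is case-sensitive):

```json
{
    "name": "PostgreSqlTableDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremPostgreSqlLinkedService",
        "typeProperties": {
            "tableName": "usstates"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```

With the table named here, the **RelationalSource** in the pipeline can omit the **query** property.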
data-factory | Data Factory Onprem Sybase Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-sybase-connector.md | - Title: Move data from Sybase using Azure Data Factory -description: Learn how to move data from a Sybase database using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data from Sybase using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-sybase-connector.md) -> * [Version 2 (current version)](../connector-sybase.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Sybase connector in V2](../connector-sybase.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Sybase database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises Sybase data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from a Sybase data store to other data stores, not moving data from other data stores to a Sybase data store. --## Prerequisites -The Data Factory service supports connecting to on-premises Sybase sources using the Data Management Gateway. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. --The gateway is required even if the Sybase database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Supported versions and installation -For Data Management Gateway to connect to the Sybase database, you need to install the [data provider for Sybase iAnywhere.Data.SQLAnywhere](https://go.microsoft.com/fwlink/?linkid=324846) version 16 or above on the same system as the Data Management Gateway. --SAP Sybase SQL Anywhere (ASA) version 16 and above is supported; IQ and ASE are not supported. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises Sybase data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. -- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Sybase data store, see [JSON example: Copy data from Sybase to Azure Blob](#json-example-copy-data-from-sybase-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Sybase data store: --## Linked service properties -The following table provides description for JSON elements specific to Sybase linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesSybase** |Yes | -| server |Name of the Sybase server. |Yes | -| database |Name of the Sybase database. |Yes | -| schema |Name of the schema in the database. |No | -| authenticationType |Type of authentication used to connect to the Sybase database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises Sybase database. |Yes | --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for dataset of type **RelationalTable** (which includes Sybase dataset) has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table in the Sybase Database instance that linked service refers to. |No (if **query** of **RelationalSource** is specified) | --## Copy activity properties -For a full list of sections & properties available for defining activities, see [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When the source is of type **RelationalSource** (which includes Sybase), the following properties are available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: select * from MyTable. 
|No (if **tableName** of **dataset** is specified) | ---## JSON example: Copy data from Sybase to Azure Blob -The following example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from Sybase database to Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesSybase](data-factory-onprem-sybase-connector.md#linked-service-properties). -2. A liked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](data-factory-onprem-sybase-connector.md#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. The [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](data-factory-onprem-sybase-connector.md#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a query result in Sybase database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**Sybase linked service:** --```JSON -{ - "name": "OnPremSybaseLinkedService", - "properties": { - "type": "OnPremisesSybase", - "typeProperties": { - "server": "<server>", - "database": "<database>", - "schema": "<schema>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` --**Azure Blob storage linked service:** --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorageLinkedService", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>" - } - } -} -``` --**Sybase input dataset:** --The sample assumes you have created a table "MyTable" in Sybase and it contains a column called "timestamp" for time series data. --Setting "external": true informs the Data Factory service that this dataset is external to the data factory and is not produced by an activity in the data factory. Notice that the **type** of the linked service is set to: **RelationalTable**. --```JSON -{ - "name": "SybaseDataSet", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "OnPremSybaseLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. 
The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobSybaseDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/sybase/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline with Copy activity:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data from the DBA.Orders table in the database. --```JSON -{ - "name": "CopySybaseToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "select * from DBA.Orders" - }, - "sink": { - "type": "BlobSink" - } - }, - "inputs": [ - { - "name": "SybaseDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobSybaseDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SybaseToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z" - } -} -``` --## Type mapping for Sybase -As mentioned in the [Data Movement Activities](data-factory-data-movement-activities.md) article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --Sybase supports T-SQL and T-SQL types. For a mapping table from sql types to .NET type, see [Azure SQL Connector](data-factory-azure-sql-connector.md) article. --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
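To keep Sybase slice reruns repeatable, you can scope the source query to the slice window rather than selecting the whole table. The fragment below is only a sketch: it assumes the "MyTable" table with a "timestamp" column described for the input dataset above, rather than the DBA.Orders table used in the pipeline sample.

```JSON
"typeProperties": {
    "source": {
        "type": "RelationalSource",
        "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart, SliceEnd)"
    },
    "sink": {
        "type": "BlobSink"
    }
}
```

Because the query is parameterized only by **SliceStart** and **SliceEnd**, a rerun of the slice reads the same rows, provided the source data for that window doesn't change.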
data-factory | Data Factory Onprem Teradata Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-onprem-teradata-connector.md | - Title: Move data from Teradata using Azure Data Factory -description: Learn about the Teradata connector for the Data Factory service, which lets you move data from a Teradata database ---- Previously updated : 04/12/2023----# Move data from Teradata using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-onprem-teradata-connector.md) -> * [Version 2 (current version)](../connector-teradata.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Teradata connector in V2](../connector-teradata.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises Teradata database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises Teradata data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data Factory currently supports only moving data from a Teradata data store to other data stores, not moving data from other data stores to a Teradata data store. --## Prerequisites -Data Factory supports connecting to on-premises Teradata sources via the Data Management Gateway. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. --The gateway is required even if the Teradata database is hosted in an Azure IaaS VM. You can install the gateway on the same IaaS VM as the data store or on a different VM as long as the gateway can connect to the database. --> [!NOTE] -> See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. --## Supported versions and installation -For Data Management Gateway to connect to the Teradata database, you need to install the [.NET Data Provider for Teradata](https://go.microsoft.com/fwlink/?LinkId=278886) version 14 or above on the same system as the Data Management Gateway. Teradata version 12 and above is supported. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises Teradata data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard.-- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. 
See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity.--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises Teradata data store, see [JSON example: Copy data from Teradata to Azure Blob](#json-example-copy-data-from-teradata-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Teradata data store: --## Linked service properties -The following table provides description for JSON elements specific to Teradata linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **OnPremisesTeradata** |Yes | -| server |Name of the Teradata server. |Yes | -| authenticationType |Type of authentication used to connect to the Teradata database. Possible values are: Anonymous, Basic, and Windows. |Yes | -| username |Specify user name if you are using Basic or Windows authentication. |No | -| password |Specify password for the user account you specified for the username. |No | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the on-premises Teradata database. |Yes | --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. Currently, there are no type properties supported for the Teradata dataset. --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When the source is of type **RelationalSource** (which includes Teradata), the following properties are available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |SQL query string. For example: select * from MyTable. 
|Yes | --### JSON example: Copy data from Teradata to Azure Blob -The following example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from Teradata to Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --The sample has the following data factory entities: --1. A linked service of type [OnPremisesTeradata](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. The [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a query result in Teradata database to a blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**Teradata linked service:** --```json -{ - "name": "OnPremTeradataLinkedService", - "properties": { - "type": "OnPremisesTeradata", - "typeProperties": { - "server": "<server>", - "authenticationType": "<authentication type>", - "username": "<username>", - "password": "<password>", - "gatewayName": "<gatewayName>" - } - } -} -``` --**Azure Blob storage linked service:** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorageLinkedService", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<AccountName>;AccountKey=<AccountKey>" - } - } -} -``` --**Teradata input dataset:** --The sample assumes you have created a table "MyTable" in Teradata and it contains a column called "timestamp" for time series data. --Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "TeradataDataSet", - "properties": { - "published": false, - "type": "RelationalTable", - "linkedServiceName": "OnPremTeradataLinkedService", - "typeProperties": { - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --**Azure Blob output dataset:** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. 
--```json -{ - "name": "AzureBlobTeradataDataSet", - "properties": { - "published": false, - "location": { - "type": "AzureBlobLocation", - "folderPath": "mycontainer/teradata/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "linkedServiceName": "AzureStorageLinkedService" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**Pipeline with Copy activity:** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run hourly. In the pipeline JSON definition, the **source** type is set to **RelationalSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data in the past hour to copy. --```json -{ - "name": "CopyTeradataToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "$$Text.Format('select * from MyTable where timestamp >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestamp < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart, SliceEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "TeradataDataSet" - } - ], - "outputs": [ - { - "name": "AzureBlobTeradataDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "TeradataToBlob" - } - ], - "start": "2014-06-01T18:00:00Z", - "end": "2014-06-01T19:00:00Z", - "isPaused": false - } -} -``` -## Type mapping for Teradata -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to Teradata, the following mappings are used from Teradata type to .NET type. 
--| Teradata Database type | .NET Framework type | -| | | -| Char |String | -| Clob |String | -| Graphic |String | -| VarChar |String | -| VarGraphic |String | -| Blob |Byte[] | -| Byte |Byte[] | -| VarByte |Byte[] | -| BigInt |Int64 | -| ByteInt |Int16 | -| Decimal |Decimal | -| Double |Double | -| Integer |Int32 | -| Number |Double | -| SmallInt |Int16 | -| Date |DateTime | -| Time |TimeSpan | -| Time With Time Zone |String | -| Timestamp |DateTime | -| Timestamp With Time Zone |DateTimeOffset | -| Interval Day |TimeSpan | -| Interval Day To Hour |TimeSpan | -| Interval Day To Minute |TimeSpan | -| Interval Day To Second |TimeSpan | -| Interval Hour |TimeSpan | -| Interval Hour To Minute |TimeSpan | -| Interval Hour To Second |TimeSpan | -| Interval Minute |TimeSpan | -| Interval Minute To Second |TimeSpan | -| Interval Second |TimeSpan | -| Interval Year |String | -| Interval Year To Month |String | -| Interval Month |String | -| Period(Date) |String | -| Period(Time) |String | -| Period(Time With Time Zone) |String | -| Period(Timestamp) |String | -| Period(Timestamp With Time Zone) |String | -| Xml |String | --## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
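Since the Teradata dataset has no type properties, one option is to describe the columns you expect in the dataset's **structure** section (one of the common dataset sections mentioned above) and then map them to sink columns as covered in the column-mapping article linked above. The sketch below is illustrative only; the column names and types are assumptions, not part of the original sample.

```json
{
    "name": "TeradataDataSet",
    "properties": {
        "type": "RelationalTable",
        "linkedServiceName": "OnPremTeradataLinkedService",
        "structure": [
            { "name": "OrderId", "type": "Int32" },
            { "name": "timestamp", "type": "Datetime" },
            { "name": "Amount", "type": "Decimal" }
        ],
        "typeProperties": {},
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```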
data-factory | Data Factory Pig Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-pig-activity.md | - Title: Transform data using Pig Activity in Azure Data Factory -description: Learn how you can use the Pig Activity in Azure Data Factory v1 to run Pig scripts on an on-demand/your own HDInsight cluster. ------ Previously updated : 04/12/2023---# Transform data using Pig Activity in Azure Data Factory -> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [transform data using Pig activity in Data Factory](../transform-data-using-hadoop-pig.md). ---The HDInsight Pig activity in a Data Factory [pipeline](data-factory-create-pipelines.md) executes Pig queries on [your own](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or [on-demand](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service) Windows/Linux-based HDInsight cluster. This article builds on the [data transformation activities](data-factory-data-transformation-activities.md) article, which presents a general overview of data transformation and the supported transformation activities. --> [!NOTE] -> If you are new to Azure Data Factory, read through [Introduction to Azure Data Factory](data-factory-introduction.md) and do the tutorial: [Build your first data pipeline](data-factory-build-your-first-pipeline.md) before reading this article. --## Syntax --```JSON -{ - "name": "HiveActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "Pig Activity", - "description": "description", - "type": "HDInsightPig", - "inputs": [ - { - "name": "input tables" - } - ], - "outputs": [ - { - "name": "output tables" - } - ], - "linkedServiceName": "MyHDInsightLinkedService", - "typeProperties": { - "script": "Pig script", - "scriptPath": "<pathtothePigscriptfileinAzureblobstorage>", - "defines": { - "param1": "param1Value" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } - } - ] - } -} -``` --## Syntax details --| Property | Description | Required | -| | | | -| name |Name of the activity |Yes | -| description |Text describing what the activity is used for |No | -| type |HDinsightPig |Yes | -| inputs |One or more inputs consumed by the Pig activity |No | -| outputs |One or more outputs produced by the Pig activity |Yes | -| linkedServiceName |Reference to the HDInsight cluster registered as a linked service in Data Factory |Yes | -| script |Specify the Pig script inline |No | -| scriptPath |Store the Pig script in an Azure blob storage and provide the path to the file. Use 'script' or 'scriptPath' property. Both cannot be used together. The file name is case-sensitive. 
|No | -| defines |Specify parameters as key/value pairs for referencing within the Pig script |No | --## Example -Let's consider an example of game log analytics where you want to identify the time spent by players playing games launched by your company. --The following sample game log is a comma (,) separated file. It contains the following fields - ProfileID, SessionStart, Duration, SrcIPAddress, and GameType. --``` -1809,2014-05-04 12:04:25.3470000,14,221.117.223.75,CaptureFlag -1703,2014-05-04 06:05:06.0090000,16,12.49.178.247,KingHill -1703,2014-05-04 10:21:57.3290000,10,199.118.18.179,CaptureFlag -1809,2014-05-04 05:24:22.2100000,23,192.84.66.141,KingHill -..... -``` --The **Pig script** to process this data: --``` -PigSampleIn = LOAD 'wasb://adfwalkthrough@anandsub14.blob.core.windows.net/samplein/' USING PigStorage(',') AS (ProfileID:chararray, SessionStart:chararray, Duration:int, SrcIPAddress:chararray, GameType:chararray); --GroupProfile = Group PigSampleIn all; --PigSampleOut = Foreach GroupProfile Generate PigSampleIn.ProfileID, SUM(PigSampleIn.Duration); --Store PigSampleOut into 'wasb://adfwalkthrough@anandsub14.blob.core.windows.net/sampleoutpig/' USING PigStorage (','); -``` --To execute this Pig script in a Data Factory pipeline, do the following steps: --1. Create a linked service to register [your own HDInsight compute cluster](data-factory-compute-linked-services.md#azure-hdinsight-linked-service) or configure [on-demand HDInsight compute cluster](data-factory-compute-linked-services.md#azure-hdinsight-on-demand-linked-service). Let's call this linked service **HDInsightLinkedService**. -2. Create a [linked service](data-factory-azure-blob-connector.md) to configure the connection to Azure Blob storage hosting the data. Let's call this linked service **StorageLinkedService**. -3. Create [datasets](data-factory-create-datasets.md) pointing to the input and the output data. Let's call the input dataset **PigSampleIn** and the output dataset **PigSampleOut**. -4. Copy the Pig script to a file in the Azure Blob storage configured in step #2. If the Azure storage that hosts the data is different from the one that hosts the script file, create a separate Azure Storage linked service. Refer to the linked service in the activity configuration. Use **scriptPath** to specify the path to the Pig script file and **scriptLinkedService** to reference the storage linked service that hosts it. - - > [!NOTE] - > You can also provide the Pig script inline in the activity definition by using the **script** property. However, we do not recommend this approach, as all special characters in the script need to be escaped, which may cause debugging issues. The best practice is to follow step #4. - > - > -5. Create the pipeline with the HDInsightPig activity. This activity processes the input data by running the Pig script on the HDInsight cluster. -- ```JSON - { - "name": "PigActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "PigActivitySample", - "type": "HDInsightPig", - "inputs": [ - { - "name": "PigSampleIn" - } - ], - "outputs": [ - { - "name": "PigSampleOut" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "typeproperties": { - "scriptPath": "adfwalkthrough\\scripts\\enrichlogs.pig", - "scriptLinkedService": "StorageLinkedService" - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } - } - ] - } - } - ``` -6. Deploy the pipeline. See the [Creating pipelines](data-factory-create-pipelines.md) article for details. -7. Monitor the pipeline using the data factory monitoring and management views. 
See [Monitoring and manage Data Factory pipelines](data-factory-monitor-manage-pipelines.md) article for details. --## Specifying parameters for a Pig script -Consider the following example: game logs are ingested daily into Azure Blob Storage and stored in a folder partitioned based on date and time. You want to parameterize the Pig script and pass the input folder location dynamically during runtime and also produce the output partitioned with date and time. --To use parameterized Pig script, do the following: --* Define the parameters in **defines**. -- ```JSON - { - "name": "PigActivitySamplePipeline", - "properties": { - "activities": [ - { - "name": "PigActivitySample", - "type": "HDInsightPig", - "inputs": [ - { - "name": "PigSampleIn" - } - ], - "outputs": [ - { - "name": "PigSampleOut" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "typeproperties": { - "scriptPath": "adfwalkthrough\\scripts\\samplepig.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Input": "$$Text.Format('wasb: //adfwalkthrough@<storageaccountname>.blob.core.windows.net/samplein/yearno={0: yyyy}/monthno={0:MM}/dayno={0: dd}/',SliceStart)", - "Output": "$$Text.Format('wasb://adfwalkthrough@<storageaccountname>.blob.core.windows.net/sampleout/yearno={0:yyyy}/monthno={0:MM}/dayno={0:dd}/', SliceStart)" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - } - } - ] - } - } - ``` -* In the Pig Script, refer to the parameters using '**$parameterName**' as shown in the following example: -- ``` - PigSampleIn = LOAD '$Input' USING PigStorage(',') AS (ProfileID:chararray, SessionStart:chararray, Duration:int, SrcIPAddress:chararray, GameType:chararray); - GroupProfile = Group PigSampleIn all; - PigSampleOut = Foreach GroupProfile Generate PigSampleIn.ProfileID, SUM(PigSampleIn.Duration); - Store PigSampleOut into '$Output' USING PigStorage (','); - ``` --## See Also -* [Hive Activity](data-factory-hive-activity.md) -* [MapReduce Activity](data-factory-map-reduce.md) -* [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -* [Invoke Spark programs](data-factory-spark.md) -* [Invoke R scripts](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample) |
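If you don't maintain your own cluster, the **HDInsightLinkedService** referenced in the samples above can instead point to on-demand HDInsight compute, which Data Factory creates when a slice runs and tears down after the time-to-live expires. The following is a sketch only; the version, cluster size, OS type, and time-to-live values are illustrative assumptions, and the full set of supported properties is in the compute linked services article linked above.

```JSON
{
    "name": "HDInsightLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "version": "3.5",
            "clusterSize": 1,
            "timeToLive": "00:05:00",
            "osType": "Linux",
            "linkedServiceName": "StorageLinkedService"
        }
    }
}
```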
data-factory | Data Factory Product Reco Usecase | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-product-reco-usecase.md | - Title: Data Factory Use Case - Product Recommendations -description: Learn about a use case implemented by using Azure Data Factory along with other services. ------ Previously updated : 04/12/2023---# Use Case - Product Recommendations -Azure Data Factory is one of many services used to implement the Cortana Intelligence Suite of solution accelerators. See the [Cortana Intelligence Suite](https://www.microsoft.com/cortanaanalytics) page for details about this suite. In this document, we describe a common use case that Azure users have already solved and implemented using Azure Data Factory and other Cortana Intelligence component services. --## Scenario -Online retailers commonly want to entice their customers to purchase products by presenting them with products they are most likely to be interested in, and therefore most likely to buy. To accomplish this, online retailers need to customize their users' online experience by using personalized product recommendations for that specific user. These personalized recommendations are based on their current and historical shopping behavior data, product information, newly introduced brands, and product and customer segmentation data. Additionally, they can provide the user with product recommendations based on analysis of overall usage behavior from all their users combined. --The goal of these retailers is to optimize for user click-to-sale conversions and earn higher sales revenue. They achieve this conversion by delivering contextual, behavior-based product recommendations based on customer interests and actions. For this use case, we use online retailers as an example of businesses that want to optimize for their customers. However, these principles apply to any business that wants to engage its customers around its goods and services and enhance their customers' buying experience with personalized product recommendations. --## Challenges -There are many challenges that online retailers face when trying to implement this type of use case. --First, data of different sizes and shapes must be ingested from multiple data sources, both on-premises and in the cloud. This data includes product data, historical customer behavior data, and user data as the user browses the online retail site. --Second, personalized product recommendations must be reasonably and accurately calculated and predicted. In addition to product, brand, and customer behavior and browser data, online retailers also need to include customer feedback on past purchases to factor into the determination of the best product recommendations for the user. --Third, the recommendations must be immediately deliverable to the user to provide a seamless browsing and purchasing experience, and provide the most recent and relevant recommendations. --Finally, retailers need to measure the effectiveness of their approach by tracking overall up-sell and cross-sell click-to-conversion sales successes, and adjust their future recommendations accordingly. --## Solution Overview -This example use case has been solved and implemented by real Azure users by using Azure Data Factory and other Cortana Intelligence component services, including [HDInsight](https://azure.microsoft.com/services/hdinsight/) and [Power BI](https://powerbi.microsoft.com/). 
--The online retailer uses an Azure Blob store, an on-premises SQL Server, Azure SQL Database, and a relational data mart as their data storage options throughout the workflow. The blob store contains customer information, customer behavior data, and product information data. The product information data includes product brand information and a product catalog stored in Azure Synapse Analytics. --All the data is combined and fed into a product recommendation system to deliver personalized recommendations based on customer interests and actions, while the user browses products in the catalog on the website. The customers also see products that are related to the product they are looking at based on overall website usage patterns that are not tied to any one user. ---Gigabytes of raw web log files are generated daily from the online retailer's website as semi-structured files. The raw web log files and the customer and product catalog information are ingested regularly into Azure Blob storage using Data Factory's globally deployed data movement as a service. The raw log files for the day are partitioned (by year and month) in blob storage for long-term storage. [Azure HDInsight](https://azure.microsoft.com/services/hdinsight/) is used to partition the raw log files in the blob store and process the ingested logs at scale using both Hive and Pig scripts. The partitioned web log data is then processed to extract the needed inputs for a machine learning recommendation system to generate the personalized product recommendations. --The recommendation system used for the machine learning in this example is an open source machine learning recommendation platform from [Apache Mahout](https://mahout.apache.org/). Any [Azure Machine Learning](https://azure.microsoft.com/services/machine-learning/) or custom model can be applied to the scenario. The Mahout model is used to predict the similarity between items on the website based on overall usage patterns, and to generate the personalized recommendations based on the individual user. --Finally, the result set of personalized product recommendations is moved to a relational data mart for consumption by the retailer website. The result set could also be accessed directly from blob storage by another application, or moved to additional stores for other consumers and use cases. --## Benefits -By optimizing their product recommendation strategy and aligning it with business goals, the solution met the online retailer's merchandising and marketing objectives. Additionally, they were able to operationalize and manage the product recommendation workflow in an efficient, reliable, and cost-effective manner. The approach made it easy for them to update their model and fine-tune its effectiveness based on the measures of sales click-to-conversion successes. By using Azure Data Factory, they were able to abandon their time-consuming and expensive manual cloud resource management and move to on-demand cloud resource management. As a result, they were able to save time and money, and to reduce their time to solution deployment. Data lineage views and operational service health became easy to visualize and troubleshoot with the intuitive Data Factory monitoring and management UI available from the Azure portal. Their solution can now be scheduled and managed so that finished data is reliably produced and delivered to users, and data and processing dependencies are automatically managed without human intervention. 
--By providing this personalized shopping experience, the online retailer created a more competitive, engaging customer experience and therefore increased sales and overall customer satisfaction. - |
data-factory | Data Factory Repeatable Copy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-repeatable-copy.md | - Title: Repeatable copy in Azure Data Factory -description: 'Learn how to avoid duplicates even though a slice that copies data is run more than once.' ---- Previously updated : 04/12/2023-----# Repeatable copy in Azure Data Factory --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. - -> [!NOTE] -> The following samples are for Azure SQL but are applicable to any data store that supports rectangular datasets. You may have to adjust the **type** of source and the **query** property (for example: query instead of sqlReaderQuery) for the data store. --Usually, when reading from relational stores, you want to read only the data corresponding to that slice. A way to do so would be by using the WindowStart and WindowEnd system variables available in Azure Data Factory. Read about the variables and functions in Azure Data Factory here in the [Azure Data Factory - Functions and System Variables](data-factory-functions-variables.md) article. Example: --```json -"source": { - "type": "SqlSource", - "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm\\'', WindowStart, WindowEnd)" -}, -``` -This query reads data that falls in the slice duration range (WindowStart -> WindowEnd) from the table MyTable. Rerun of this slice would also always ensure that the same data is read. --In other cases, you may wish to read the entire table and may define the sqlReaderQuery as follows: --```json -"source": -{ - "type": "SqlSource", - "sqlReaderQuery": "select * from MyTable" -}, -``` --## Repeatable write to SqlSink -When copying data to **Azure SQL/SQL Server** from other data stores, you need to keep repeatability in mind to avoid unintended outcomes. --When copying data to Azure SQL/SQL Server Database, the copy activity appends data to the sink table by default. Say, you are copying data from a CSV (comma-separated values) file containing two records to the following table in an Azure SQL/SQL Server Database. When a slice runs, the two records are copied to the SQL table. --``` -ID Product Quantity ModifiedDate -... ... ... ... -6 Flat Washer 3 2015-05-01 00:00:00 -7 Down Tube 2 2015-05-01 00:00:00 -``` --Suppose you found errors in source file and updated the quantity of Down Tube from 2 to 4. If you rerun the data slice for that period manually, you'll find two new records appended to Azure SQL/SQL Server Database. This example assumes that none of the columns in the table has the primary key constraint. --``` -ID Product Quantity ModifiedDate -... ... ... ... 
-6 Flat Washer 3 2015-05-01 00:00:00 -7 Down Tube 2 2015-05-01 00:00:00 -6 Flat Washer 3 2015-05-01 00:00:00 -7 Down Tube 4 2015-05-01 00:00:00 -``` --To avoid this behavior, you need to specify UPSERT semantics by using one of the following two mechanisms: --### Mechanism 1: using sqlWriterCleanupScript -You can use the **sqlWriterCleanupScript** property to clean up data from the sink table before inserting the data when a slice is run. --```json -"sink": -{ - "type": "SqlSink", - "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM table WHERE ModifiedDate >= \\'{0:yyyy-MM-dd HH:mm}\\' AND ModifiedDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" -} -``` --When a slice runs, the cleanup script is run first to delete data that corresponds to the slice from the SQL table. The copy activity then inserts data into the SQL Table. If the slice is rerun, the quantity is updated as desired. --``` -ID Product Quantity ModifiedDate -... ... ... ... -6 Flat Washer 3 2015-05-01 00:00:00 -7 Down Tube 4 2015-05-01 00:00:00 -``` --Suppose the Flat Washer record is removed from the original csv. Then rerunning the slice would produce the following result: --``` -ID Product Quantity ModifiedDate -... ... ... ... -7 Down Tube 4 2015-05-01 00:00:00 -``` --The copy activity ran the cleanup script to delete the corresponding data for that slice. Then it read the input from the csv (which then contained only one record) and inserted it into the Table. --### Mechanism 2: using sliceIdentifierColumnName -> [!IMPORTANT] -> Currently, sliceIdentifierColumnName is not supported for Azure Synapse Analytics. --The second mechanism to achieve repeatability is by having a dedicated column (sliceIdentifierColumnName) in the target Table. This column would be used by Azure Data Factory to ensure the source and destination stay synchronized. This approach works when there is flexibility in changing or defining the destination SQL Table schema. --This column is used by Azure Data Factory for repeatability purposes and in the process Azure Data Factory does not make any schema changes to the Table. Way to use this approach: --1. Define a column of type **binary (32)** in the destination SQL Table. There should be no constraints on this column. Let's name this column as AdfSliceIdentifier for this example. --- Source table: -- ```sql - CREATE TABLE [dbo].[Student]( - [Id] [varchar](32) NOT NULL, - [Name] [nvarchar](256) NOT NULL - ) - ``` -- Destination table: -- ```sql - CREATE TABLE [dbo].[Student]( - [Id] [varchar](32) NOT NULL, - [Name] [nvarchar](256) NOT NULL, - [AdfSliceIdentifier] [binary](32) NULL - ) - ``` --1. Use it in the copy activity as follows: - - ```json - "sink": - { - - "type": "SqlSink", - "sliceIdentifierColumnName": "AdfSliceIdentifier" - } - ``` --Azure Data Factory populates this column as per its need to ensure the source and destination stay synchronized. The values of this column should not be used outside of this context. --Similar to mechanism 1, Copy Activity automatically cleans up the data for the given slice from the destination SQL Table. It then inserts data from source in to the destination table. --## Next steps -Review the following connector articles that for complete JSON examples: --- [Azure SQL Database](data-factory-azure-sql-connector.md)-- [Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md)-- [SQL Server](data-factory-sqlserver-connector.md) |
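To make the repeatable-read pattern above concrete, the following sketch (not part of the original sample) shows how the parameterized `sqlReaderQuery` resolves for a hypothetical daily slice that covers 1 May 2015; the resolved query is inlined purely for illustration:

```json
"source":
{
    "type": "SqlSource",
    "sqlReaderQuery": "select * from MyTable where timestampcolumn >= '2015-05-01 00:00' AND timestampcolumn < '2015-05-02 00:00'"
},
```

Because the query is bounded by the slice window, rerunning that slice reads exactly the same rows every time.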
data-factory | Data Factory Salesforce Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-salesforce-connector.md | - Title: Move data from Salesforce by using Data Factory -description: Learn about how to move data from Salesforce by using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from Salesforce by using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-salesforce-connector.md) -> * [Version 2 (current version)](../connector-salesforce.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Salesforce connector in V2](../connector-salesforce.md). --This article outlines how you can use Copy Activity in an Azure data factory to copy data from Salesforce to any data store that is listed under the Sink column in the [supported sources and sinks](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. This article builds on the [data movement activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with Copy Activity and supported data store combinations. --Azure Data Factory currently supports only moving data from Salesforce to [supported sink data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats), but does not support moving data from other data stores to Salesforce. --## Supported versions -This connector supports the following editions of Salesforce: Developer Edition, Professional Edition, Enterprise Edition, or Unlimited Edition. And it supports copying from Salesforce production, sandbox and custom domain. --## Prerequisites -* API permission must be enabled. -* To copy data from Salesforce to on-premises data stores, you must have at least Data Management Gateway 2.0 installed in your on-premises environment. --## Salesforce request limits -Salesforce has limits for both total API requests and concurrent API requests. Note the following points: --- If the number of concurrent requests exceeds the limit, throttling occurs and you will see random failures.-- If the total number of requests exceeds the limit, the Salesforce account will be blocked for 24 hours.--You might also receive the "REQUEST_LIMIT_EXCEEDED" error in both scenarios. See the "API Request Limits" section in the [Salesforce Developer Limits](https://resources.docs.salesforce.com/200/20/en-us/sfdc/pdf/salesforce_app_limits_cheatsheet.pdf) article for details. --## Getting started -You can create a pipeline with a copy activity that moves data from Salesforce by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. 
Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from Salesforce, see [JSON example: Copy data from Salesforce to Azure Blob](#json-example-copy-data-from-salesforce-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to Salesforce: --## Linked service properties -The following table provides descriptions for JSON elements that are specific to the Salesforce linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **Salesforce**. |Yes | -| environmentUrl | Specify the URL of Salesforce instance. <br><br> - Default is "https:\//login.salesforce.com". <br> - To copy data from sandbox, specify "https://test.salesforce.com". <br> - To copy data from custom domain, specify, for example, "https://[domain].my.salesforce.com". |No | -| username |Specify a user name for the user account. |Yes | -| password |Specify a password for the user account. |Yes | -| securityToken |Specify a security token for the user account. See [Get security token](https://help.salesforce.com/apex/HTViewHelpDoc?id=user_security_token.htm) for instructions on how to reset/get a security token. To learn about security tokens in general, see [Security and the API](https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_concepts_security.htm). |Yes | --## Dataset properties -For a full list of sections and properties that are available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, and so on). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of the type **RelationalTable** has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table in Salesforce. |No (if a **query** of **RelationalSource** is specified) | --> [!IMPORTANT] -> The "__c" part of the API Name is needed for any custom object. ---## Copy activity properties -For a full list of sections and properties that are available for defining activities, see the [Creating pipelines](data-factory-create-pipelines.md) article. Properties like name, description, input and output tables, and various policies are available for all types of activities. --The properties that are available in the typeProperties section of the activity, on the other hand, vary with each activity type. For Copy Activity, they vary depending on the types of sources and sinks. 
--In copy activity, when the source is of the type **RelationalSource** (which includes Salesforce), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query |Use the custom query to read data. |A SQL-92 query or [Salesforce Object Query Language (SOQL)](https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql.htm) query. For example: `select * from MyTable__c`. |No (if the **tableName** of the **dataset** is specified) | --> [!IMPORTANT] -> The "__c" part of the API Name is needed for any custom object. ---## Query tips -### Retrieving data using where clause on DateTime column -When specify the SOQL or SQL query, pay attention to the DateTime format difference. For example: --* **SOQL sample**: `$$Text.Format('SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >= {0:yyyy-MM-ddTHH:mm:ssZ} AND LastModifiedDate < {1:yyyy-MM-ddTHH:mm:ssZ}', WindowStart, WindowEnd)` -* **SQL sample**: - * **Using copy wizard to specify the query:** `$$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\'{0:yyyy-MM-dd HH:mm:ss}\'}} AND LastModifiedDate < {{ts\'{1:yyyy-MM-dd HH:mm:ss}\'}}', WindowStart, WindowEnd)` - * **Using JSON editing to specify the query (escape char properly):** `$$Text.Format('SELECT * FROM Account WHERE LastModifiedDate >= {{ts\\'{0:yyyy-MM-dd HH:mm:ss}\\'}} AND LastModifiedDate < {{ts\\'{1:yyyy-MM-dd HH:mm:ss}\\'}}', WindowStart, WindowEnd)` --### Retrieving data from Salesforce Report -You can retrieve data from Salesforce reports by specifying query as `{call "<report name>"}`,for example,. `"query": "{call \"TestReport\"}"`. --### Retrieving deleted records from Salesforce Recycle Bin -To query the soft deleted records from Salesforce Recycle Bin, you can specify **"IsDeleted = 1"** in your query. For example, --* To query only the deleted records, specify "select * from MyTable__c **where IsDeleted= 1**" -* To query all the records including the existing and the deleted, specify "select * from MyTable__c **where IsDeleted = 0 or IsDeleted = 1**" --## JSON example: Copy data from Salesforce to Azure Blob -The following example provides sample JSON definitions that you can use to create a pipeline by using the [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from Salesforce to Azure Blob Storage. However, data can be copied to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --Here are the Data Factory artifacts that you'll need to create to implement the scenario. The sections that follow the list provide details about these steps. 
--* A linked service of the type [Salesforce](#linked-service-properties) -* A linked service of the type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties) -* An input [dataset](data-factory-create-datasets.md) of the type [RelationalTable](#dataset-properties) -* An output [dataset](data-factory-create-datasets.md) of the type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties) -* A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) --**Salesforce linked service** --This example uses the **Salesforce** linked service. See the [Salesforce linked service](#linked-service-properties) section for the properties that are supported by this linked service. See [Get security token](https://help.salesforce.com/apex/HTViewHelpDoc?id=user_security_token.htm) for instructions on how to reset/get the security token. --```json -{ - "name": "SalesforceLinkedService", - "properties": - { - "type": "Salesforce", - "typeProperties": - { - "username": "<user name>", - "password": "<password>", - "securityToken": "<security token>" - } - } -} -``` -**Azure Storage linked service** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Salesforce input dataset** --```json -{ - "name": "SalesforceInput", - "properties": { - "linkedServiceName": "SalesforceLinkedService", - "type": "RelationalTable", - "typeProperties": { - "tableName": "AllDataType__c" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` --Setting **external** to **true** informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --> [!IMPORTANT] -> The "__c" part of the API Name is needed for any custom object. ---**Azure blob output dataset** --Data is written to a new blob every hour (frequency: hour, interval: 1). --```json -{ - "name": "AzureBlobOutput", - "properties": - { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": - { - "folderPath": "adfgetstarted/alltypes_c" - }, - "availability": - { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline with Copy Activity** --The pipeline contains Copy Activity, which is configured to use the input and output datasets, and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource**, and the **sink** type is set to **BlobSink**. --See [RelationalSource type properties](#copy-activity-properties) for the list of properties that are supported by the RelationalSource. 
--```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2016-06-01T18:00:00", - "end":"2016-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "SalesforceToAzureBlob", - "description": "Copy from Salesforce to an Azure blob", - "type": "Copy", - "inputs": [ - { - "name": "SalesforceInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "SELECT Id, Col_AutoNumber__c, Col_Checkbox__c, Col_Currency__c, Col_Date__c, Col_DateTime__c, Col_Email__c, Col_Number__c, Col_Percent__c, Col_Phone__c, Col_Picklist__c, Col_Picklist_MultiSelect__c, Col_Text__c, Col_Text_Area__c, Col_Text_AreaLong__c, Col_Text_AreaRich__c, Col_URL__c, Col_Text_Encrypt__c, Col_Lookup__c FROM AllDataType__c" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -> [!IMPORTANT] -> The "__c" part of the API Name is needed for any custom object. ----### Type mapping for Salesforce --| Salesforce type | .NET-based type | -| | | -| Auto Number |String | -| Checkbox |Boolean | -| Currency |Decimal | -| Date |DateTime | -| Date/Time |DateTime | -| Email |String | -| Id |String | -| Lookup Relationship |String | -| Multi-Select Picklist |String | -| Number |Decimal | -| Percent |Decimal | -| Phone |String | -| Picklist |String | -| Text |String | -| Text Area |String | -| Text Area (Long) |String | -| Text Area (Rich) |String | -| Text (Encrypted) |String | -| URL |String | --> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). ---## Performance and tuning -See the [Copy Activity performance and tuning guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
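Returning to the pipeline example above: instead of selecting every row of the object, the source can use the slice-scoped SOQL query from the query tips section. A minimal sketch of such a source follows; `Account` and its columns come from the earlier SOQL sample and stand in for whatever object you actually copy:

```json
"source": {
    "type": "RelationalSource",
    "query": "$$Text.Format('SELECT Id, Name, BillingCity FROM Account WHERE LastModifiedDate >= {0:yyyy-MM-ddTHH:mm:ssZ} AND LastModifiedDate < {1:yyyy-MM-ddTHH:mm:ssZ}', WindowStart, WindowEnd)"
},
```

With this query, each hourly slice copies only the records modified during that window, which also keeps reruns of a slice repeatable.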
data-factory | Data Factory Samples | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-samples.md | - Title: Azure Data Factory - Samples -description: Provides details about samples that ship with the Azure Data Factory service. ------- Previously updated : 04/12/2023---# Azure Data Factory - Samples -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [PowerShell samples in Data Factory](../samples-powershell.md) and [code samples in the Azure Code Samples gallery](https://azure.microsoft.com/resources/samples/?service=data-factory). ---## Samples on GitHub -The [GitHub Azure-DataFactory repository](https://github.com/azure/azure-datafactory) contains several samples that help you quickly ramp up with Azure Data Factory service (or) modify the scripts and use it in own application. The Samples\JSON folder contains JSON snippets for common scenarios. --| Sample | Description | -|: |: | -| [ADF Walkthrough](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ADFWalkthrough) |This sample provides an end-to-end walkthrough for processing log files using Azure Data Factory to turn data from log files in to insights. <br/><br/>In this walkthrough, the Data Factory pipeline collects sample logs, processes and enriches the data from logs with reference data, and transforms the data to evaluate the effectiveness of a marketing campaign that was recently launched. | -| [JSON samples](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/JSON) |This sample provides JSON examples for common scenarios. | -| [Http Data Downloader Sample](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/HttpDataDownloaderSample) |This sample showcases downloading of data from an HTTP endpoint to Azure Blob Storage using custom .NET activity. | -| [Cross AppDomain Dot Net Activity Sample](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/CrossAppDomainDotNetActivitySample) |This sample allows you to author a custom .NET activity that is not constrained to assembly versions used by the ADF launcher (For example, WindowsAzure.Storage v4.3.0, Newtonsoft.Json v6.0.x, etc.). | -| [Run R script](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample) |This sample includes the Data Factory custom activity that can be used to invoke RScript.exe. This sample works only with your own (not on-demand) HDInsight cluster that already has R Installed on it. | -| [Invoke Spark jobs on HDInsight Hadoop cluster](../tutorial-transform-data-spark-portal.md) |This sample shows how to use MapReduce activity to invoke a Spark program. The spark program just copies data from one Azure Blob container to another. | -| [Twitter Analysis using ML Studio (classic) Batch Scoring Activity](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/TwitterAnalysisSample-AzureMLBatchScoringActivity) |This sample shows how to use AzureMLBatchScoringActivity to invoke an ML model that performs twitter sentiment analysis, scoring, prediction etc. | -| [Twitter Analysis using custom activity](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/TwitterAnalysisSample-CustomC%23Activity) |This sample shows how to use a custom .NET activity to invoke an ML Studio (classic) model that performs twitter sentiment analysis, scoring, prediction etc. 
| -| [Parameterized Pipelines for ML Studio (classic)](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ParameterizedPipelinesForAzureML) |The sample provides an end-to-end C# code to deploy N pipelines for scoring and retraining each with a different region parameter where the list of regions is coming from a parameters.txt file, which is included with this sample. | -| [Reference Data Refresh for Azure Stream Analytics jobs](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/ReferenceDataRefreshForASAJobs) |This sample shows how to use Azure Data Factory and Azure Stream Analytics together to run the queries with reference data and set up the refresh for reference data on a schedule. | -| [Hybrid Pipeline with On-premises Hortonworks Hadoop](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/HybridPipelineWithOnPremisesHortonworksHadoop) |The sample uses an on-premises Hadoop cluster as a compute target for running jobs in Data Factory just like you would add other compute targets like an HDInsight based Hadoop cluster in cloud. | -| [JSON Conversion Tool](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/JSONConversionTool) |This tool allows you to convert JSONs from version prior to 2015-07-01-preview to latest or 2015-07-01-preview (default). | -| [U-SQL sample input file](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/U-SQL%20Sample%20Input%20File) |This file is a sample file used by an U-SQL activity. | -| [Delete blob file](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/DeleteBlobFileFolderCustomActivity) | This sample showcases a C# file which can be used as part of ADF custom .NET activity to delete files from the source Azure Blob location once the files have been copied.| --## Azure Resource Manager templates -You can find the following Azure Resource Manager templates for Data Factory on GitHub. --| Template | Description | -| | | -| [Copy from Azure Blob Storage to Azure SQL Database](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-blob-to-sql-copy) |Deploying this template creates an Azure data factory with a pipeline that copies data from the specified Azure blob storage to Azure SQL Database | -| [Copy from Salesforce to Azure Blob Storage](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-salesforce-to-blob-copy) |Deploying this template creates an Azure data factory with a pipeline that copies data from the specified Salesforce account to the Azure blob storage. | -| [Transform data by running Hive script on an Azure HDInsight cluster](https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.datafactory/data-factory-hive-transformation) |Deploying this template creates an Azure data factory with a pipeline that transforms data by running the sample Hive script on an Azure HDInsight Hadoop cluster. | --## Samples in Azure portal -You can use the **Sample pipelines** tile on the home page of your data factory to deploy sample pipelines and their associated entities (datasets and linked services) in to your data factory. --1. Create a data factory or open an existing data factory. See [Copy data from Blob Storage to SQL Database using Data Factory](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for steps to create a data factory. -2. 
In the **DATA FACTORY** blade for the data factory, click the **Sample pipelines** tile. -- :::image type="content" source="./media/data-factory-samples/SamplePipelinesTile.png" alt-text="Sample pipelines tile"::: -3. In the **Sample pipelines** blade, click the **sample** that you want to deploy. -- :::image type="content" source="./media/data-factory-samples/SampleTile.png" alt-text="Sample pipelines blade"::: -4. Specify configuration settings for the sample. For example, your Azure storage account name and account key, logical SQL server name, database, User ID, and password, etc. -- :::image type="content" source="./media/data-factory-samples/SampleBlade.png" alt-text="Sample blade"::: -5. After you are done with specifying the configuration settings, click **Create** to create/deploy the sample pipelines and linked services/tables used by the pipelines. -6. You see the status of deployment on the sample tile you clicked earlier on the **Sample pipelines** blade. -- :::image type="content" source="./media/data-factory-samples/DeploymentStatus.png" alt-text="Deployment status"::: -7. When you see the **Deployment succeeded** message on the tile for the sample, close the **Sample pipelines** blade. -8. On **DATA FACTORY** blade, you see that linked services, data sets, and pipelines are added to your data factory. -- :::image type="content" source="./media/data-factory-samples/DataFactoryBladeAfter.png" alt-text="Data Factory blade"::: --## Samples in Visual Studio -### Prerequisites -You must have the following installed on your computer: --* Visual Studio 2013 or Visual Studio 2015 -* Download Azure SDK for Visual Studio 2013 or Visual Studio 2015. Navigate to [Azure Download Page](https://azure.microsoft.com/downloads/) and click **VS 2013** or **VS 2015** in the **.NET** section. -* Download the latest Azure Data Factory plugin for Visual Studio: [VS 2013](https://visualstudiogallery.msdn.microsoft.com/754d998c-8f92-4aa7-835b-e89c8c954aa5) or [VS 2015](https://visualstudiogallery.msdn.microsoft.com/371a4cf9-0093-40fa-b7dd-be3c74f49005). If you are using Visual Studio 2013, you can also update the plugin by doing the following steps: On the menu, click **Tools** -> **Extensions and Updates** -> **Online** -> **Visual Studio Gallery** -> **Microsoft Azure Data Factory Tools for Visual Studio** -> **Update**. --### Use Data Factory Templates -1. Click **File** on the menu, point to **New**, and click **Project**. -2. In the **New Project** dialog box, do the following steps: -- 1. Select **DataFactory** under **Templates**. - 2. Select **Data Factory Templates** in the right pane. - 3. Enter a **name** for the project. - 4. Select a **location** for the project. - 5. Click **OK**. -- :::image type="content" source="./media/data-factory-samples/vs-new-project-adf-templates.png" alt-text="New project dialog box"::: -3. In the **Data Factory Templates** dialog box, select the sample template from the **Use-Case Templates** section, and click **Next**. The following steps walk you through using the **Customer Profiling** template. Steps are similar for the other samples. -- :::image type="content" source="./media/data-factory-samples/vs-data-factory-templates-dialog.png" alt-text="Data Factory Templates dialog box"::: -4. In the **Data Factory Configuration** dialog, click **Next** on the **Data Factory Basics** page. -5. On the **Configure data factory** page, do the following steps: - 1. Select **Create New Data Factory**. You can also select **Use existing data factory**. - 2. 
Enter a **name** for the data factory. - 3. Select the **Azure subscription** in which you want the data factory to be created. - 4. Select the **resource group** for the data factory. - 5. Select the **West US**, **East US**, or **North Europe** for the **region**. - 6. Click **Next**. -6. In the **Configure data stores** page, specify an existing **database in Azure SQL Database** and **Azure storage account** (or) create database/storage, and click Next. -7. In the **Configure compute** page, select defaults, and click **Next**. -8. In the **Summary** page, review all settings, and click **Next**. -9. In the **Deployment Status** page, wait until the deployment is finished, and click **Finish**. -10. Right-click project in the Solution Explorer, and click **Publish**. -11. If you see **Sign in to your Microsoft account** dialog box, enter your credentials for the account that has Azure subscription, and click **sign in**. -12. You should see the following dialog box: -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-vs/publish.png" alt-text="Publish dialog box"::: -13. In the **Configure data factory** page, do the following steps: -- 1. Confirm that **Use existing data factory** option. - 2. Select the **data factory** you had select when using the template. - 3. Click **Next** to switch to the **Publish Items** page. (Press **TAB** to move out of the Name field to if the **Next** button is disabled.) -14. In the **Publish Items** page, ensure that all the Data Factories entities are selected, and click **Next** to switch to the **Summary** page. -15. Review the summary and click **Next** to start the deployment process and view the **Deployment Status**. -16. In the **Deployment Status** page, you should see the status of the deployment process. Click Finish after the deployment is done. --See [Build your first data factory (Visual Studio)](data-factory-build-your-first-pipeline-using-vs.md) for details about using Visual Studio to author Data Factory entities and publishing them to Azure. |
data-factory | Data Factory Sap Business Warehouse Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-sap-business-warehouse-connector.md | - Title: Move data from SAP Business Warehouse using Azure Data Factory -description: Learn about how to move data from SAP Business Warehouse using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data From SAP Business Warehouse using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-sap-business-warehouse-connector.md) -> * [Version 2 (current version)](../connector-sap-business-warehouse.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [SAP Business Warehouse connector in V2](../connector-sap-business-warehouse.md). ---This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP Business Warehouse (BW). It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises SAP Business Warehouse data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from an SAP Business Warehouse to other data stores, but not for moving data from other data stores to an SAP Business Warehouse. --## Supported versions and installation -This connector supports SAP Business Warehouse version 7.x. It supports copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries. --To enable the connectivity to the SAP BW instance, install the following components: -- **Data Management Gateway**: Data Factory service supports connecting to on-premises data stores (including SAP Business Warehouse) using a component called Data Management Gateway. To learn about Data Management Gateway and step-by-step instructions for setting up the gateway, see [Moving data between on-premises data store to cloud data store](data-factory-move-data-between-onprem-and-cloud.md) article. Gateway is required even if the SAP Business Warehouse is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM as long as the gateway can connect to the database.-- **SAP NetWeaver library** on the gateway machine. You can get the SAP Netweaver library from your SAP administrator, or directly from the [SAP Software Download Center](https://support.sap.com/swdc). Search for the **SAP Note #1025361** to get the download location for the most recent version. Make sure that the architecture for the SAP NetWeaver library (32-bit or 64-bit) matches your gateway installation. Then install all files included in the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the SAP Client Tools installation.--> [!TIP] -> Put the dlls extracted from the NetWeaver RFC SDK into system32 folder. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises Cassandra data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. 
See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. -- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises SAP Business Warehouse, see [JSON example: Copy data from SAP Business Warehouse to Azure Blob](#json-example-copy-data-from-sap-business-warehouse-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to an SAP BW data store: --## Linked service properties -The following table provides description for JSON elements specific to SAP Business Warehouse (BW) linked service. --Property | Description | Allowed values | Required | -- | -- | ---server | Name of the server on which the SAP BW instance resides. | string | Yes -systemNumber | System number of the SAP BW system. | Two-digit decimal number represented as a string. | Yes -clientId | Client ID of the client in the SAP W system. | Three-digit decimal number represented as a string. | Yes -username | Name of the user who has access to the SAP server | string | Yes -password | Password for the user. | string | Yes -gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises SAP BW instance. | string | Yes -encryptedCredential | The encrypted credential string. | string | No --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. There are no type-specific properties supported for the SAP BW dataset of type **RelationalTable**. ---## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, are policies are available for all types of activities. --Whereas, properties available in the **typeProperties** section of the activity vary with each activity type. 
For Copy activity, they vary depending on the types of sources and sinks. --When source in copy activity is of type **RelationalSource** (which includes SAP BW), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query | Specifies the MDX query to read data from the SAP BW instance. | MDX query. | Yes | ---## JSON example: Copy data from SAP Business Warehouse to Azure Blob -The following example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). This sample shows how to copy data from an on-premises SAP Business Warehouse to an Azure Blob Storage. However, data can be copied **directly** to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --1. A linked service of type [SapBw](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from an SAP Business Warehouse instance to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --### SAP Business Warehouse linked service -This linked service links your SAP BW instance to the data factory. The type property is set to **SapBw**. The typeProperties section provides connection information for the SAP BW instance. --```json -{ - "name": "SapBwLinkedService", - "properties": - { - "type": "SapBw", - "typeProperties": - { - "server": "<server name>", - "systemNumber": "<system number>", - "clientId": "<client id>", - "username": "<SAP user>", - "password": "<Password for SAP user>", - "gatewayName": "<gateway name>" - } - } -} -``` --### Azure Storage linked service -This linked service links your Azure Storage account to the data factory. The type property is set to **AzureStorage**. The typeProperties section provides connection information for the Azure Storage account. --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### SAP BW input dataset -This dataset defines the SAP Business Warehouse dataset. You set the type of the Data Factory dataset to **RelationalTable**. 
Currently, you do not specify any type-specific properties for an SAP BW dataset. The query in the Copy Activity definition specifies what data to read from the SAP BW instance. --Setting external property to true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --Frequency and interval properties defines the schedule. In this case, the data is read from the SAP BW instance hourly. --```json -{ - "name": "SapBwDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "SapBwLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` ----### Azure Blob output dataset -This dataset defines the output Azure Blob dataset. The type property is set to AzureBlob. The typeProperties section provides where the data copied from the SAP BW instance is stored. The data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/sapbw/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ---### Pipeline with Copy activity -The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource** (for SAP BW source) and **sink** type is set to **BlobSink**. The query specified for the **query** property selects the data in the past hour to copy. --```json -{ - "name": "CopySapBwToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "<MDX query for SAP BW>" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "SapBwDataset" - } - ], - "outputs": [ - { - "name": "AzureBlobDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SapBwToBlob" - } - ], - "start": "2017-03-01T18:00:00Z", - "end": "2017-03-01T19:00:00Z" - } -} -``` --### Type mapping for SAP BW -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: --1. Convert from native source types to .NET type -2. 
Convert from .NET type to native sink type --When moving data from SAP BW, the following mappings are used from SAP BW types to .NET types. --Data type in the ABAP Dictionary | .NET Data Type | ---ACCP | Int -CHAR | String -CLNT | String -CURR | Decimal -CUKY | String -DEC | Decimal -FLTP | Double -INT1 | Byte -INT2 | Int16 -INT4 | Int -LANG | String -LCHR | String -LRAW | Byte[] -PREC | Int16 -QUAN | Decimal -RAW | Byte[] -RAWSTRING | Byte[] -STRING | String -UNIT | String -DATS | String -NUMC | String -TIMS | String --> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). ---## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources) --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
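The pipeline definition above leaves the MDX query as a placeholder. The following sketch shows the general shape of a source section with a simple MDX query; the InfoCube name `SALES_CUBE` and the characteristic `0CALDAY` are hypothetical examples only and must be replaced with objects that exist in your SAP BW system:

```json
"source": {
    "type": "RelationalSource",
    "query": "SELECT NON EMPTY { [Measures].MEMBERS } ON COLUMNS, NON EMPTY { [0CALDAY].MEMBERS } ON ROWS FROM [SALES_CUBE]"
},
```

Validate any MDX query against your SAP BW system before using it in the copy activity.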
data-factory | Data Factory Sap Hana Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-sap-hana-connector.md | - Title: Move data from SAP HANA using Azure Data Factory -description: Learn about how to move data from SAP HANA using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data From SAP HANA using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-sap-hana-connector.md) -> * [Version 2 (current version)](../connector-sap-hana.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [SAP HANA connector in V2](../connector-sap-business-warehouse.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises SAP HANA. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. --You can copy data from an on-premises SAP HANA data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the [Supported data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats) table. Data factory currently supports only moving data from an SAP HANA to other data stores, but not for moving data from other data stores to an SAP HANA. --## Supported versions and installation -This connector supports any version of SAP HANA database. It supports copying data from HANA information models (such as Analytic and Calculation views) and Row/Column tables using SQL queries. --To enable the connectivity to the SAP HANA instance, install the following components: -- **Data Management Gateway**: Data Factory service supports connecting to on-premises data stores (including SAP HANA) using a component called Data Management Gateway. To learn about Data Management Gateway and step-by-step instructions for setting up the gateway, see [Moving data between on-premises data store to cloud data store](data-factory-move-data-between-onprem-and-cloud.md) article. Gateway is required even if the SAP HANA is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM as long as the gateway can connect to the database.-- **SAP HANA ODBC driver** on the gateway machine. You can download the SAP HANA ODBC driver from the [SAP Software Download Center](https://support.sap.com/swdc). Search with the keyword **SAP HANA CLIENT for Windows**. --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises SAP HANA data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. -- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. 
--Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises SAP HANA, see [JSON example: Copy data from SAP HANA to Azure Blob](#json-example-copy-data-from-sap-hana-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to an SAP HANA data store: --## Linked service properties -The following table provides description for JSON elements specific to SAP HANA linked service. --Property | Description | Allowed values | Required | -- | -- | ---server | Name of the server on which the SAP HANA instance resides. If your server is using a customized port, specify `server:port`. | string | Yes -authenticationType | Type of authentication. | string. "Basic" or "Windows" | Yes -username | Name of the user who has access to the SAP server | string | Yes -password | Password for the user. | string | Yes -gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises SAP HANA instance. | string | Yes -encryptedCredential | The encrypted credential string. | string | No --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. There are no type-specific properties supported for the SAP HANA dataset of type **RelationalTable**. ---## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, are policies are available for all types of activities. --Whereas, properties available in the **typeProperties** section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --When source in copy activity is of type **RelationalSource** (which includes SAP HANA), the following properties are available in typeProperties section: --| Property | Description | Allowed values | Required | -| | | | | -| query | Specifies the SQL query to read data from the SAP HANA instance. | SQL query. | Yes | --## JSON example: Copy data from SAP HANA to Azure Blob -The following sample provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). 
This sample shows how to copy data from an on-premises SAP HANA to an Azure Blob Storage. However, data can be copied **directly** to any of the sinks listed [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --1. A linked service of type [SapHana](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [RelationalTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [RelationalSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from an SAP HANA instance to an Azure blob hourly. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --### SAP HANA linked service -This linked service links your SAP HANA instance to the data factory. The type property is set to **SapHana**. The typeProperties section provides connection information for the SAP HANA instance. --```json -{ - "name": "SapHanaLinkedService", - "properties": - { - "type": "SapHana", - "typeProperties": - { - "server": "<server name>", - "authenticationType": "<Basic, or Windows>", - "username": "<SAP user>", - "password": "<Password for SAP user>", - "gatewayName": "<gateway name>" - } - } -} --``` --### Azure Storage linked service -This linked service links your Azure Storage account to the data factory. The type property is set to **AzureStorage**. The typeProperties section provides connection information for the Azure Storage account. --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` --### SAP HANA input dataset --This dataset defines the SAP HANA dataset. You set the type of the Data Factory dataset to **RelationalTable**. Currently, you do not specify any type-specific properties for an SAP HANA dataset. The query in the Copy Activity definition specifies what data to read from the SAP HANA instance. --Setting external property to true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory. --Frequency and interval properties defines the schedule. In this case, the data is read from the SAP HANA instance hourly. 
--```json -{ - "name": "SapHanaDataset", - "properties": { - "type": "RelationalTable", - "linkedServiceName": "SapHanaLinkedService", - "typeProperties": {}, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true - } -} -``` --### Azure Blob output dataset -This dataset defines the output Azure Blob dataset. The type property is set to AzureBlob. The typeProperties section provides where the data copied from the SAP HANA instance is stored. The data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```json -{ - "name": "AzureBlobDataSet", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/saphana/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ---### Pipeline with Copy activity --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **RelationalSource** (for SAP HANA source) and **sink** type is set to **BlobSink**. The SQL query specified for the **query** property selects the data in the past hour to copy. --```json -{ - "name": "CopySapHanaToBlob", - "properties": { - "description": "pipeline for copy activity", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "RelationalSource", - "query": "<SQL Query for HANA>" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "SapHanaDataset" - } - ], - "outputs": [ - { - "name": "AzureBlobDataSet" - } - ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1 - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SapHanaToBlob" - } - ], - "start": "2017-03-01T18:00:00Z", - "end": "2017-03-01T19:00:00Z" - } -} -``` ---### Type mapping for SAP HANA -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data from SAP HANA, the following mappings are used from SAP HANA types to .NET types. 
--SAP HANA Type | .NET Based Type -- | -TINYINT | Byte -SMALLINT | Int16 -INT | Int32 -BIGINT | Int64 -REAL | Single -DOUBLE | Single -DECIMAL | Decimal -BOOLEAN | Byte -VARCHAR | String -NVARCHAR | String -CLOB | Byte[] -ALPHANUM | String -BLOB | Byte[] -DATE | DateTime -TIME | TimeSpan -TIMESTAMP | DateTime -SECONDDATE | DateTime --## Known limitations -There are a few known limitations when copying data from SAP HANA: --- NVARCHAR strings are truncated to maximum length of 4000 Unicode characters-- SMALLDECIMAL is not supported-- VARBINARY is not supported-- Valid Dates are between 1899/12/30 and 9999/12/31--## Map source to sink columns -To learn about mapping columns in source dataset to columns in sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable read from relational sources -When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources) --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
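Tying the repeatable-read guidance above to this connector, one common approach is to bound the HANA query by the activity window so that rerunning a slice always reads the same rows. The following is a hedged sketch only: the MYSCHEMA schema, SALES table, and LOAD_TS column are hypothetical, while the `$$Text.Format` pattern with WindowStart and WindowEnd mirrors the copy samples in the scheduling article:

```json
"source": {
    "type": "RelationalSource",
    "query": "$$Text.Format('SELECT * FROM MYSCHEMA.SALES WHERE LOAD_TS >= \\'{0:yyyy-MM-dd HH:mm}\\' AND LOAD_TS < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}
```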
data-factory | Data Factory Scheduling And Execution | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-scheduling-and-execution.md | - Title: Scheduling and Execution with Data Factory -description: Learn about the scheduling and execution aspects of the Azure Data Factory application model. ------ Previously updated : 04/12/2023---# Data Factory scheduling and execution -> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see the [pipeline execution and triggers](../concepts-pipeline-execution-triggers.md) article. --This article explains the scheduling and execution aspects of the Azure Data Factory application model. This article assumes that you understand the basics of Data Factory application model concepts, including activities, pipelines, linked services, and datasets. For basic concepts of Azure Data Factory, see the following articles: --* [Introduction to Data Factory](data-factory-introduction.md) -* [Pipelines](data-factory-create-pipelines.md) -* [Datasets](data-factory-create-datasets.md) --## Start and end times of pipeline -A pipeline is active only between its **start** time and **end** time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed irrespective of its start and end time. For a pipeline to run, it should not be paused. You find these settings (start, end, paused) in the pipeline definition: --```json -"start": "2017-04-01T08:00:00Z", -"end": "2017-04-01T11:00:00Z", -"isPaused": false -``` --For more information about these properties, see the [create pipelines](data-factory-create-pipelines.md) article. ---## Specify schedule for an activity -It is not the pipeline that is executed. It is the activities in the pipeline that are executed in the overall context of the pipeline. You can specify a recurring schedule for an activity by using the **scheduler** section of activity JSON. For example, you can schedule an activity to run hourly as follows: --```json -"scheduler": { - "frequency": "Hour", - "interval": 1 -}, -``` --As shown in the following diagram, specifying a schedule for an activity creates a series of tumbling windows within the pipeline start and end times. Tumbling windows are a series of fixed-size, non-overlapping, contiguous time intervals. These logical tumbling windows for an activity are called **activity windows**. ---The **scheduler** property for an activity is optional. If you do specify this property, it must match the cadence you specify in the definition of the output dataset for the activity. Currently, the output dataset is what drives the schedule. Therefore, you must create an output dataset even if the activity does not produce any output. --## Specify schedule for a dataset -An activity in a Data Factory pipeline can take zero or more input **datasets** and produce one or more output datasets. For an activity, you can specify the cadence at which the input data is available or the output data is produced by using the **availability** section in the dataset definitions. --**Frequency** in the **availability** section specifies the time unit. The allowed values for frequency are: Minute, Hour, Day, Week, and Month. The **interval** property in the availability section specifies a multiplier for frequency. For example: if the frequency is set to Day and interval is set to 1 for an output dataset, the output data is produced daily.
If you specify the frequency as minute, we recommend that you set the interval to no less than 15. --In the following example, the input data is available hourly and the output data is produced hourly (`"frequency": "Hour", "interval": 1`). --**Input dataset:** --```json -{ - "name": "AzureSqlInput", - "properties": { - "published": false, - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` ---**Output dataset** --```json -{ - "name": "AzureBlobOutput", - "properties": { - "published": false, - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mypath/{Year}/{Month}/{Day}/{Hour}", - "format": { - "type": "TextFormat" - }, - "partitionedBy": [ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "HH" }} - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --Currently, **output dataset drives the schedule**. In other words, the schedule specified for the output dataset is used to run an activity at runtime. Therefore, you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. --In the following pipeline definition, the **scheduler** property is used to specify schedule for the activity. This property is optional. Currently, the schedule for the activity must match the schedule specified for the output dataset. - -```json -{ - "name": "SamplePipeline", - "properties": { - "description": "copy activity", - "activities": [ - { - "type": "Copy", - "name": "AzureSQLtoBlob", - "description": "copy activity", - "typeProperties": { - "source": { - "type": "SqlSource", - "sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 100000, - "writeBatchTimeout": "00:05:00" - } - }, - "inputs": [ - { - "name": "AzureSQLInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "scheduler": { - "frequency": "Hour", - "interval": 1 - } - } - ], - "start": "2017-04-01T08:00:00Z", - "end": "2017-04-01T11:00:00Z" - } -} -``` --In this example, the activity runs hourly between the start and end times of the pipeline. The output data is produced hourly for three-hour windows (8 AM - 9 AM, 9 AM - 10 AM, and 10 AM - 11 AM). --Each unit of data consumed or produced by an activity run is called a **data slice**. The following diagram shows an example of an activity with one input dataset and one output dataset: ---The diagram shows the hourly data slices for the input and output dataset. The diagram shows three input slices that are ready for processing. The 10-11 AM activity is in progress, producing the 10-11 AM output slice. 
--You can access the time interval associated with the current slice in the dataset JSON by using variables: [SliceStart](data-factory-functions-variables.md#data-factory-system-variables) and [SliceEnd](data-factory-functions-variables.md#data-factory-system-variables). Similarly, you can access the time interval associated with an activity window by using the WindowStart and WindowEnd. The schedule of an activity must match the schedule of the output dataset for the activity. Therefore, the SliceStart and SliceEnd values are the same as WindowStart and WindowEnd values respectively. For more information on these variables, see [Data Factory functions and system variables](data-factory-functions-variables.md#data-factory-system-variables) articles. --You can use these variables for different purposes in your activity JSON. For example, you can use them to select data from input and output datasets representing time series data (for example: 8 AM to 9 AM). This example also uses **WindowStart** and **WindowEnd** to select relevant data for an activity run and copy it to a blob with the appropriate **folderPath**. The **folderPath** is parameterized to have a separate folder for every hour. --In the preceding example, the schedule specified for input and output datasets is the same (hourly). If the input dataset for the activity is available at a different frequency, say every 15 minutes, the activity that produces this output dataset still runs once an hour as the output dataset is what drives the activity schedule. For more information, see [Model datasets with different frequencies](#model-datasets-with-different-frequencies). --## Dataset availability and policies -You have seen the usage of frequency and interval properties in the availability section of dataset definition. There are a few other properties that affect the scheduling and execution of an activity. --### Dataset availability -The following table describes properties you can use in the **availability** section: --| Property | Description | Required | Default | -| | | | | -| frequency |Specifies the time unit for dataset slice production.<br/><br/><b>Supported frequency</b>: Minute, Hour, Day, Week, Month |Yes |NA | -| interval |Specifies a multiplier for frequency<br/><br/>ΓÇ¥Frequency x intervalΓÇ¥ determines how often the slice is produced.<br/><br/>If you need the dataset to be sliced on an hourly basis, you set <b>Frequency</b> to <b>Hour</b>, and <b>interval</b> to <b>1</b>.<br/><br/><b>Note</b>: If you specify Frequency as Minute, we recommend that you set the interval to no less than 15 |Yes |NA | -| style |Specifies whether the slice should be produced at the start/end of the interval.<ul><li>StartOfInterval</li><li>EndOfInterval</li></ul><br/><br/>If Frequency is set to Month and style is set to EndOfInterval, the slice is produced on the last day of month. If the style is set to StartOfInterval, the slice is produced on the first day of month.<br/><br/>If Frequency is set to Day and style is set to EndOfInterval, the slice is produced in the last hour of the day.<br/><br/>If Frequency is set to Hour and style is set to EndOfInterval, the slice is produced at the end of the hour. For example, for a slice for 1 PM - 2 PM period, the slice is produced at 2 PM. |No |EndOfInterval | -| anchorDateTime |Defines the absolute position in time used by scheduler to compute dataset slice boundaries. 
<br/><br/><b>Note</b>: If the AnchorDateTime has date parts that are more granular than the frequency then the more granular parts are ignored. <br/><br/>For example, if the <b>interval</b> is <b>hourly</b> (frequency: hour and interval: 1) and the <b>AnchorDateTime</b> contains <b>minutes and seconds</b>, then the <b>minutes and seconds</b> parts of the AnchorDateTime are ignored. |No |01/01/0001 | -| offset |Timespan by which the start and end of all dataset slices are shifted. <br/><br/><b>Note</b>: If both anchorDateTime and offset are specified, the result is the combined shift. |No |NA | --### offset example -By default, daily (`"frequency": "Day", "interval": 1`) slices start at 12 AM UTC time (midnight). If you want the start time to be 6 AM UTC time instead, set the offset as shown in the following snippet: --```json -"availability": -{ - "frequency": "Day", - "interval": 1, - "offset": "06:00:00" -} -``` -### anchorDateTime example -In the following example, the dataset is produced once every 23 hours. The first slice starts at the time specified by the anchorDateTime, which is set to `2017-04-19T08:00:00` (UTC time). --```json -"availability": -{ - "frequency": "Hour", - "interval": 23, - "anchorDateTime":"2017-04-19T08:00:00" -} -``` --### offset/style Example -The following dataset is a monthly dataset and is produced on 3rd of every month at 8:00 AM (`3.08:00:00`): --```json -"availability": { - "frequency": "Month", - "interval": 1, - "offset": "3.08:00:00", - "style": "StartOfInterval" -} -``` --### Dataset policy -A dataset can have a validation policy defined that specifies how the data generated by a slice execution can be validated before it is ready for consumption. In such cases, after the slice has finished execution, the output slice status is changed to **Waiting** with a substatus of **Validation**. After the slices are validated, the slice status changes to **Ready**. If a data slice has been produced but did not pass the validation, activity runs for downstream slices that depend on this slice are not processed. [Monitor and manage pipelines](data-factory-monitor-manage-pipelines.md) covers the various states of data slices in Data Factory. --The **policy** section in dataset definition defines the criteria or the condition that the dataset slices must fulfill. The following table describes properties you can use in the **policy** section: --| Policy Name | Description | Applied To | Required | Default | -| | | | | | -| minimumSizeMB | Validates that the data in an **Azure blob** meets the minimum size requirements (in megabytes). |Azure Blob |No |NA | -| minimumRows | Validates that the data in **Azure SQL Database** or **Azure table** contains the minimum number of rows. |<ul><li>Azure SQL Database</li><li>Azure Table</li></ul> |No |NA | --#### Examples -**minimumSizeMB:** --```json -"policy": --{ - "validation": - { - "minimumSizeMB": 10.0 - } -} -``` --**minimumRows** --```json -"policy": -{ - "validation": - { - "minimumRows": 100 - } -} -``` --For more information about these properties and examples, see [Create datasets](data-factory-create-datasets.md) article. --## Activity policies -Policies affect the run-time behavior of an activity, specifically when the slice of a table is processed. The following table provides the details. 
--| Property | Permitted values | Default Value | Description | -| | | | -| concurrency |Integer <br/><br/>Max value: 10 |1 |Number of concurrent executions of the activity.<br/><br/>It determines the number of parallel activity executions that can happen on different slices. For example, if an activity needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. | -| executionPriorityOrder |NewestFirst<br/><br/>OldestFirst |OldestFirst |Determines the ordering of data slices that are being processed.<br/><br/>For example, suppose you have two slices (one at 4 PM and another at 5 PM), and both are pending execution. If you set executionPriorityOrder to NewestFirst, the slice at 5 PM is processed first. Similarly, if you set executionPriorityOrder to OldestFirst, the slice at 4 PM is processed first. | -| retry |Integer<br/><br/>Max value can be 10 |0 |Number of retries before the data processing for the slice is marked as Failure. Activity execution for a data slice is retried up to the specified retry count. The retry is done as soon as possible after the failure. | -| timeout |TimeSpan |00:00:00 |Timeout for the activity. Example: 00:10:00 (implies timeout 10 mins)<br/><br/>If a value is not specified or is 0, the timeout is infinite.<br/><br/>If the data processing time on a slice exceeds the timeout value, it is canceled, and the system attempts to retry the processing. The number of retries depends on the retry property. When timeout occurs, the status is set to TimedOut. | -| delay |TimeSpan |00:00:00 |Specify the delay before data processing of the slice starts.<br/><br/>The execution of the activity for a data slice is started after the delay is past the expected execution time.<br/><br/>Example: 00:10:00 (implies delay of 10 mins) | -| longRetry |Integer<br/><br/>Max value: 10 |1 |The number of long retry attempts before the slice execution is failed.<br/><br/>longRetry attempts are spaced by longRetryInterval. So if you need to specify a time between retry attempts, use longRetry. If both Retry and longRetry are specified, each longRetry attempt includes Retry attempts and the max number of attempts is Retry * longRetry.<br/><br/>For example, if we have the following settings in the activity policy:<br/>Retry: 3<br/>longRetry: 2<br/>longRetryInterval: 01:00:00<br/><br/>Assume there is only one slice to execute (status is Waiting) and the activity execution fails every time. Initially there would be 3 consecutive execution attempts. After each attempt, the slice status would be Retry. After the first 3 attempts are over, the slice status would be LongRetry.<br/><br/>After an hour (that is, longRetryInterval's value), there would be another set of 3 consecutive execution attempts. After that, the slice status would be Failed and no more retries would be attempted. Hence, overall 6 attempts were made.<br/><br/>If any execution succeeds, the slice status would be Ready and no more retries are attempted.<br/><br/>longRetry may be used in situations where dependent data arrives at non-deterministic times or the overall environment in which data processing occurs is flaky. In such cases, doing retries one after another may not help, whereas retrying after an interval of time produces the desired output.<br/><br/>Word of caution: do not set high values for longRetry or longRetryInterval. Typically, higher values imply other systemic issues.
| -| longRetryInterval |TimeSpan |00:00:00 |The delay between long retry attempts | --For more information, see [Pipelines](data-factory-create-pipelines.md) article. --## Parallel processing of data slices -You can set the start date for the pipeline in the past. When you do so, Data Factory automatically calculates (back fills) all data slices in the past and begins processing them. For example: if you create a pipeline with start date 2017-04-01 and the current date is 2017-04-10. If the cadence of the output dataset is daily, then Data Factory starts processing all the slices from 2017-04-01 to 2017-04-09 immediately because the start date is in the past. The slice from 2017-04-10 is not processed yet because the value of style property in the availability section is EndOfInterval by default. The oldest slice is processed first as the default value of executionPriorityOrder is OldestFirst. For a description of the style property, see [dataset availability](#dataset-availability) section. For a description of the executionPriorityOrder section, see the [activity policies](#activity-policies) section. --You can configure back-filled data slices to be processed in parallel by setting the **concurrency** property in the **policy** section of the activity JSON. This property determines the number of parallel activity executions that can happen on different slices. The default value for the concurrency property is 1. Therefore, one slice is processed at a time by default. The maximum value is 10. When a pipeline needs to go through a large set of available data, having a larger concurrency value speeds up the data processing. --## Rerun a failed data slice -When an error occurs while processing a data slice, you can find out why the processing of a slice failed by using Azure portal blades or Monitor and Manage App. See [Monitoring and managing pipelines using Azure portal blades](data-factory-monitor-manage-pipelines.md) or [Monitoring and Management app](data-factory-monitor-manage-app.md) for details. --Consider the following example, which shows two activities. Activity1 and Activity 2. Activity1 consumes a slice of Dataset1 and produces a slice of Dataset2, which is consumed as an input by Activity2 to produce a slice of the Final Dataset. ---The diagram shows that out of three recent slices, there was a failure producing the 9-10 AM slice for Dataset2. Data Factory automatically tracks dependency for the time series dataset. As a result, it does not start the activity run for the 9-10 AM downstream slice. --Data Factory monitoring and management tools allow you to drill into the diagnostic logs for the failed slice to easily find the root cause for the issue and fix it. After you have fixed the issue, you can easily start the activity run to produce the failed slice. For more information on how to rerun and understand state transitions for data slices, see [Monitoring and managing pipelines using Azure portal blades](data-factory-monitor-manage-pipelines.md) or [Monitoring and Management app](data-factory-monitor-manage-app.md). --After you rerun the 9-10 AM slice for **Dataset2**, Data Factory starts the run for the 9-10 AM dependent slice on the final dataset. ---## Multiple activities in a pipeline -You can have more than one activity in a pipeline. If you have multiple activities in a pipeline and the output of an activity is not an input of another activity, the activities may run in parallel if input data slices for the activities are ready. 
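To make the parallel case concrete, the following sketch (all dataset names are hypothetical) shows a pipeline with two copy activities that share no datasets. Because neither activity's output feeds the other's input, their slices can be processed independently as soon as the corresponding input slices are ready. Contrast this with the chained case described next.

```json
{
    "name": "TwoIndependentCopies",
    "properties": {
        "description": "Two copy activities with no dependency between them",
        "activities": [
            {
                "type": "Copy",
                "name": "CopyA",
                "inputs": [ { "name": "DatasetA1" } ],
                "outputs": [ { "name": "DatasetA2" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            },
            {
                "type": "Copy",
                "name": "CopyB",
                "inputs": [ { "name": "DatasetB1" } ],
                "outputs": [ { "name": "DatasetB2" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                },
                "scheduler": { "frequency": "Hour", "interval": 1 }
            }
        ],
        "start": "2017-04-01T08:00:00Z",
        "end": "2017-04-01T11:00:00Z"
    }
}
```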
--You can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. The activities can be in the same pipeline or in different pipelines. The second activity executes only when the first one finishes successfully. --For example, consider the following case where a pipeline has two activities: --1. Activity A1 that requires external input dataset D1, and produces output dataset D2. -2. Activity A2 that requires input from dataset D2, and produces output dataset D3. --In this scenario, activities A1 and A2 are in the same pipeline. The activity A1 runs when the external data is available and the scheduled availability frequency is reached. The activity A2 runs when the scheduled slices from D2 become available and the scheduled availability frequency is reached. If there is an error in one of the slices in dataset D2, A2 does not run for that slice until it becomes available. --The Diagram view with both activities in the same pipeline would look like the following diagram: ---As mentioned earlier, the activities could be in different pipelines. In such a scenario, the diagram view would look like the following diagram: ---See the copy sequentially section in the appendix for an example. --## Model datasets with different frequencies -In the samples, the frequencies for input and output datasets and the activity schedule window were the same. Some scenarios require the ability to produce output at a frequency different than the frequencies of one or more inputs. Data Factory supports modeling these scenarios. --### Sample 1: Produce a daily output report for input data that is available every hour -Consider a scenario in which you have input measurement data from sensors available every hour in Azure Blob storage. You want to produce a daily aggregate report with statistics such as mean, maximum, and minimum for the day with [Data Factory hive activity](data-factory-hive-activity.md). --Here is how you can model this scenario with Data Factory: --**Input dataset** --The hourly input files are dropped in the folder for the given day. Availability for input is set at **Hour** (frequency: Hour, interval: 1). --```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", - "partitionedBy": [ - { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, - { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, - { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} - ], - "format": { - "type": "TextFormat" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**Output dataset** --One output file is created every day in the day's folder. Availability of output is set at **Day** (frequency: Day and interval: 1). 
--```json -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", - "partitionedBy": [ - { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, - { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, - { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} - ], - "format": { - "type": "TextFormat" - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --**Activity: hive activity in a pipeline** --The hive script receives the appropriate *DateTime* information as parameters that use the **WindowStart** variable as shown in the following snippet. The hive script uses this variable to load the data from the correct folder for the day and run the aggregation to generate the output. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2015-01-01T08:00:00", - "end":"2015-01-01T11:00:00", - "description":"hive activity", - "activities": [ - { - "name": "SampleHiveActivity", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adftutorial\\hivequery.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Year": "$$Text.Format('{0:yyyy}',WindowStart)", - "Month": "$$Text.Format('{0:MM}',WindowStart)", - "Day": "$$Text.Format('{0:dd}',WindowStart)" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 2, - "timeout": "01:00:00" - } - } - ] - } -} -``` --The following diagram shows the scenario from a data-dependency point of view. ---The output slice for every day depends on 24 hourly slices from an input dataset. Data Factory computes these dependencies automatically by figuring out the input data slices that fall in the same time period as the output slice to be produced. If any of the 24 input slices is not available, Data Factory waits for the input slice to be ready before starting the daily activity run. --### Sample 2: Specify dependency with expressions and Data Factory functions -LetΓÇÖs consider another scenario. Suppose you have a hive activity that processes two input datasets. One of them has new data daily, but one of them gets new data every week. Suppose you wanted to do a join across the two inputs and produce an output every day. --The simple approach in which Data Factory automatically figures out the right input slices to process by aligning to the output data sliceΓÇÖs time period does not work. --You must specify that for every activity run, the Data Factory should use last weekΓÇÖs data slice for the weekly input dataset. You use Azure Data Factory functions as shown in the following snippet to implement this behavior. --**Input1: Azure blob** --The first input is the Azure blob being updated daily. 
--```json -{ - "name": "AzureBlobInputDaily", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", - "partitionedBy": [ - { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, - { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, - { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} - ], - "format": { - "type": "TextFormat" - } - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --**Input2: Azure blob** --Input2 is the Azure blob being updated weekly. --```json -{ - "name": "AzureBlobInputWeekly", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", - "partitionedBy": [ - { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, - { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, - { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} - ], - "format": { - "type": "TextFormat" - } - }, - "external": true, - "availability": { - "frequency": "Day", - "interval": 7 - } - } -} -``` --**Output: Azure blob** --One output file is created every day in the folder for the day. Availability of output is set to **day** (frequency: Day, interval: 1). --```json -{ - "name": "AzureBlobOutputDaily", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/{Year}/{Month}/{Day}/", - "partitionedBy": [ - { "name": "Year", "value": {"type": "DateTime","date": "SliceStart","format": "yyyy"}}, - { "name": "Month","value": {"type": "DateTime","date": "SliceStart","format": "MM"}}, - { "name": "Day","value": {"type": "DateTime","date": "SliceStart","format": "dd"}} - ], - "format": { - "type": "TextFormat" - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --**Activity: hive activity in a pipeline** --The hive activity takes the two inputs and produces an output slice every day. You can specify every dayΓÇÖs output slice to depend on the previous weekΓÇÖs input slice for weekly input as follows. 
--```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2015-01-01T08:00:00", - "end":"2015-01-01T11:00:00", - "description":"hive activity", - "activities": [ - { - "name": "SampleHiveActivity", - "inputs": [ - { - "name": "AzureBlobInputDaily" - }, - { - "name": "AzureBlobInputWeekly", - "startTime": "Date.AddDays(SliceStart, - Date.DayOfWeek(SliceStart))", - "endTime": "Date.AddDays(SliceEnd, -Date.DayOfWeek(SliceEnd))" - } - ], - "outputs": [ - { - "name": "AzureBlobOutputDaily" - } - ], - "linkedServiceName": "HDInsightLinkedService", - "type": "HDInsightHive", - "typeProperties": { - "scriptPath": "adftutorial\\hivequery.hql", - "scriptLinkedService": "StorageLinkedService", - "defines": { - "Year": "$$Text.Format('{0:yyyy}',WindowStart)", - "Month": "$$Text.Format('{0:MM}',WindowStart)", - "Day": "$$Text.Format('{0:dd}',WindowStart)" - } - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 2, - "timeout": "01:00:00" - } - } - ] - } -} -``` --See [Data Factory functions and system variables](data-factory-functions-variables.md) for a list of functions and system variables that Data Factory supports. --## Appendix --### Example: copy sequentially -It is possible to run multiple copy operations one after another in a sequential/ordered manner. For example, you might have two copy activities in a pipeline (CopyActivity1 and CopyActivity2) with the following input data output datasets: --CopyActivity1 --Input: Dataset. Output: Dataset2. --CopyActivity2 --Input: Dataset2. Output: Dataset3. --CopyActivity2 would run only if the CopyActivity1 has run successfully and Dataset2 is available. --Here is the sample pipeline JSON: --```json -{ - "name": "ChainActivities", - "properties": { - "description": "Run activities in sequence", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "BlobSink", - "copyBehavior": "PreserveHierarchy", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "Dataset1" - } - ], - "outputs": [ - { - "name": "Dataset2" - } - ], - "policy": { - "timeout": "01:00:00" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "CopyFromBlob1ToBlob2", - "description": "Copy data from a blob to another" - }, - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "Dataset2" - } - ], - "outputs": [ - { - "name": "Dataset3" - } - ], - "policy": { - "timeout": "01:00:00" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "CopyFromBlob2ToBlob3", - "description": "Copy data from a blob to another" - } - ], - "start": "2016-08-25T01:00:00Z", - "end": "2016-08-25T01:00:00Z", - "isPaused": false - } -} -``` --Notice that in the example, the output dataset of the first copy activity (Dataset2) is specified as input for the second activity. Therefore, the second activity runs only when the output dataset from the first activity is ready. --In the example, CopyActivity2 can have a different input, such as Dataset3, but you specify Dataset2 as an input to CopyActivity2, so the activity does not run until CopyActivity1 finishes. For example: --CopyActivity1 --Input: Dataset1. Output: Dataset2. --CopyActivity2 --Inputs: Dataset3, Dataset2. Output: Dataset4. 
--```json -{ - "name": "ChainActivities", - "properties": { - "description": "Run activities in sequence", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "BlobSink", - "copyBehavior": "PreserveHierarchy", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "Dataset1" - } - ], - "outputs": [ - { - "name": "Dataset2" - } - ], - "policy": { - "timeout": "01:00:00" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "CopyFromBlobToBlob", - "description": "Copy data from a blob to another" - }, - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ - { - "name": "Dataset3" - }, - { - "name": "Dataset2" - } - ], - "outputs": [ - { - "name": "Dataset4" - } - ], - "policy": { - "timeout": "01:00:00" - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "CopyFromBlob3ToBlob4", - "description": "Copy data from a blob to another" - } - ], - "start": "2017-04-25T01:00:00Z", - "end": "2017-04-25T01:00:00Z", - "isPaused": false - } -} -``` --Notice that in the example, two input datasets are specified for the second copy activity. When multiple inputs are specified, only the first input dataset is used for copying data, but other datasets are used as dependencies. CopyActivity2 would start only after the following conditions are met: --* CopyActivity1 has successfully completed and Dataset2 is available. This dataset is not used when copying data to Dataset4. It only acts as a scheduling dependency for CopyActivity2. -* Dataset3 is available. This dataset represents the data that is copied to the destination. |
data-factory | Data Factory Sftp Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-sftp-connector.md | - Title: Move data from SFTP server using Azure Data Factory -description: Learn how to move data from an on-premises or cloud SFTP server using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data from an SFTP server using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-sftp-connector.md) -> * [Version 2 (current version)](../connector-sftp.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [SFTP connector in V2](../connector-sftp.md). --This article outlines how to use the Copy Activity in Azure Data Factory to move data from an on-premises/cloud SFTP server to a supported sink data store. This article builds on the [data movement activities](data-factory-data-movement-activities.md) article that presents a general overview of data movement with the copy activity and the list of data stores supported as sources/sinks. --Data Factory currently supports only moving data from an SFTP server to other data stores, not moving data from other data stores to an SFTP server. It supports both on-premises and cloud SFTP servers. --> [!NOTE] -> Copy Activity does not delete the source file after it is successfully copied to the destination. If you need to delete the source file after a successful copy, create a custom activity to delete the file and use the activity in the pipeline. --## Supported scenarios and authentication types -You can use this SFTP connector to copy data from **both cloud SFTP servers and on-premises SFTP servers**. **Basic** and **SshPublicKey** authentication types are supported when connecting to the SFTP server. --When copying data from an on-premises SFTP server, you need to install a Data Management Gateway in the on-premises environment or Azure VM. See [Data Management Gateway](data-factory-data-management-gateway.md) for details on the gateway. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions on setting up the gateway and using it. --## Getting started -You can create a pipeline with a copy activity that moves data from an SFTP source by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard.--- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. For JSON samples to copy data from SFTP server to Azure Blob Storage, see [JSON Example: Copy data from SFTP server to Azure blob](#json-example-copy-data-from-sftp-server-to-azure-blob) section of this article.--## Linked service properties -The following table describes JSON elements specific to the SFTP linked service. --| Property | Description | Required | -| | | | -| type | The type property must be set to `Sftp`.
|Yes | -| host | Name or IP address of the SFTP server. |Yes | -| port |Port on which the SFTP server is listening. The default value is: 21 |No | -| authenticationType |Specify authentication type. Allowed values: **Basic**, **SshPublicKey**. <br><br> Refer to [Using basic authentication](#using-basic-authentication) and [Using SSH public key authentication](#using-ssh-public-key-authentication) sections on more properties and JSON samples respectively. |Yes | -| skipHostKeyValidation | Specify whether to skip host key validation. | No. The default value: false | -| hostKeyFingerprint | Specify the finger print of the host key. | Yes if the `skipHostKeyValidation` is set to false. | -| gatewayName |Name of the Data Management Gateway to connect to an on-premises SFTP server. | Yes if copying data from an on-premises SFTP server. | -| encryptedCredential | Encrypted credential to access the SFTP server. Auto-generated when you specify basic authentication (username + password) or SshPublicKey authentication (username + private key path or content) in copy wizard or the ClickOnce popup dialog. | No. Apply only when copying data from an on-premises SFTP server. | --### Using basic authentication --To use basic authentication, set `authenticationType` as `Basic`, and specify the following properties besides the SFTP connector generic ones introduced in the last section: --| Property | Description | Required | -| | | | -| username | User who has access to the SFTP server. |Yes | -| password | Password for the user (username). | Yes | --#### Example: Basic authentication -```json -{ - "name": "SftpLinkedService", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver", - "port": 22, - "authenticationType": "Basic", - "username": "xxx", - "password": "xxx", - "skipHostKeyValidation": false, - "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00", - "gatewayName": "mygateway" - } - } -} -``` --#### Example: Basic authentication with encrypted credential --```JSON -{ - "name": "SftpLinkedService", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver", - "port": 22, - "authenticationType": "Basic", - "username": "xxx", - "encryptedCredential": "xxxxxxxxxxxxxxxxx", - "skipHostKeyValidation": false, - "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00", - "gatewayName": "mygateway" - } - } -} -``` --### Using SSH public key authentication --To use SSH public key authentication, set `authenticationType` as `SshPublicKey`, and specify the following properties besides the SFTP connector generic ones introduced in the last section: --| Property | Description | Required | -| | | | -| username |User who has access to the SFTP server |Yes | -| privateKeyPath | Specify absolute path to the private key file that gateway can access. | Specify either the `privateKeyPath` or `privateKeyContent`. <br><br> Apply only when copying data from an on-premises SFTP server. | -| privateKeyContent | A serialized string of the private key content. The Copy Wizard can read the private key file and extract the private key content automatically. If you are using any other tool/SDK, use the privateKeyPath property instead. | Specify either the `privateKeyPath` or `privateKeyContent`. | -| passPhrase | Specify the pass phrase/password to decrypt the private key if the key file is protected by a pass phrase. | Yes if the private key file is protected by a pass phrase. 
| --> [!NOTE] -> The SFTP connector supports RSA/DSA OpenSSH keys. Make sure your key file content starts with "--BEGIN [RSA/DSA] PRIVATE KEY--". If the private key file is a ppk-format file, use the PuTTY tool to convert it from .ppk to OpenSSH format. --#### Example: SshPublicKey authentication using private key filePath --```json -{ - "name": "SftpLinkedServiceWithPrivateKeyPath", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver", - "port": 22, - "authenticationType": "SshPublicKey", - "username": "xxx", - "privateKeyPath": "D:\\privatekey_openssh", - "passPhrase": "xxx", - "skipHostKeyValidation": true, - "gatewayName": "mygateway" - } - } -} -``` --#### Example: SshPublicKey authentication using private key content --```json -{ - "name": "SftpLinkedServiceWithPrivateKeyContent", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver.westus.cloudapp.azure.com", - "port": 22, - "authenticationType": "SshPublicKey", - "username": "xxx", - "privateKeyContent": "<base64 string of the private key content>", - "passPhrase": "xxx", - "skipHostKeyValidation": true - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types. --The **typeProperties** section is different for each type of dataset. It provides information that is specific to the dataset type. The typeProperties section for a dataset of type **FileShare** has the following properties: --| Property | Description | Required | -| | | | -| folderPath |Subpath to the folder. Use escape character ' \ ' for special characters in the string. See Sample linked service and dataset definitions for examples.<br/><br/>You can combine this property with **partitionedBy** to have folder paths based on slice start/end date-times. |Yes | -| fileName |Specify the name of the file in the **folderPath** if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder.<br/><br/>When fileName is not specified for an output dataset, the name of the generated file would be in the following format: <br/><br/>`Data.<Guid>.txt` (Example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt) |No | -| fileFilter |Specify a filter to be used to select a subset of files in the folderPath rather than all files.<br/><br/>Allowed values are: `*` (multiple characters) and `?` (single character).<br/><br/>Example 1: `"fileFilter": "*.log"`<br/>Example 2: `"fileFilter": "2014-1-?.txt"`<br/><br/> fileFilter is applicable for an input FileShare dataset. This property is not supported with HDFS. |No | -| partitionedBy |partitionedBy can be used to specify a dynamic folderPath and fileName for time series data. For example, the folderPath can be parameterized for every hour of data. |No | -| format | The following format types are supported: **TextFormat**, **JsonFormat**, **AvroFormat**, **OrcFormat**, **ParquetFormat**. Set the **type** property under format to one of these values.
For more information, see [Text Format](data-factory-supported-file-and-compression-formats.md#text-format), [Json Format](data-factory-supported-file-and-compression-formats.md#json-format), [Avro Format](data-factory-supported-file-and-compression-formats.md#avro-format), [Orc Format](data-factory-supported-file-and-compression-formats.md#orc-format), and [Parquet Format](data-factory-supported-file-and-compression-formats.md#parquet-format) sections. <br><br> If you want to **copy files as-is** between file-based stores (binary copy), skip the format section in both input and output dataset definitions. |No | -| compression | Specify the type and level of compression for the data. Supported types are: **GZip**, **Deflate**, **BZip2**, and **ZipDeflate**. Supported levels are: **Optimal** and **Fastest**. For more information, see [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md#compression-support). |No | -| useBinaryTransfer |Specify whether to use the binary transfer mode: true for binary mode and false for ASCII. Default value: True. This property can only be used when the associated linked service type is FtpServer. |No | --> [!NOTE] -> filename and fileFilter cannot be used simultaneously. --### Using the partitionedBy property -As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time series data with partitionedBy. You can do so with the Data Factory macros and the system variables SliceStart and SliceEnd, which indicate the logical time period for a given data slice. --To learn about time series datasets, scheduling, and slices, see the [Creating Datasets](data-factory-create-datasets.md), [Scheduling & Execution](data-factory-scheduling-and-execution.md), and [Creating Pipelines](data-factory-create-pipelines.md) articles. --#### Sample 1: --```json -"folderPath": "wikidatagateway/wikisampledataout/{Slice}", -"partitionedBy": -[ - { "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } }, -], -``` -In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. Example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104. --#### Sample 2: --```json -"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}", -"fileName": "{Hour}.csv", -"partitionedBy": -[ - { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } }, - { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }, - { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }, - { "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } } -], -``` -In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by the folderPath and fileName properties. --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --In contrast, the properties available in the typeProperties section of the activity vary with each activity type.
For Copy activity, the type properties vary depending on the types of sources and sinks. ---## Supported file and compression formats -See the [File and compression formats in Azure Data Factory](data-factory-supported-file-and-compression-formats.md) article for details. --## JSON Example: Copy data from SFTP server to Azure blob -The following example provides sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). They show how to copy data from an SFTP source to Azure Blob Storage. However, data can be copied **directly** from any of the sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --> [!IMPORTANT] -> This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for step-by-step instructions. --The sample has the following data factory entities: --* A linked service of type [sftp](#linked-service-properties). -* A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -* An input [dataset](data-factory-create-datasets.md) of type [FileShare](#dataset-properties). -* An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -* A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [FileSystemSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from an SFTP server to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --**SFTP linked service** --This example uses basic authentication with a user name and password in plain text. You can also use one of the following ways: --* Basic authentication with encrypted credentials -* SSH public key authentication --See the [SFTP linked service](#linked-service-properties) section for the different types of authentication you can use. --```JSON --{ - "name": "SftpLinkedService", - "properties": { - "type": "Sftp", - "typeProperties": { - "host": "mysftpserver", - "port": 22, - "authenticationType": "Basic", - "username": "myuser", - "password": "mypassword", - "skipHostKeyValidation": false, - "hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00", - "gatewayName": "mygateway" - } - } -} -``` -**Azure Storage linked service** --```JSON -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**SFTP input dataset** --This dataset refers to the SFTP folder `mysharedfolder` and file `test.csv`. The pipeline copies the file to the destination. --Setting "external": "true" informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. 
--```JSON -{ - "name": "SFTPFileInput", - "properties": { - "type": "FileShare", - "linkedServiceName": "SftpLinkedService", - "typeProperties": { - "folderPath": "mysharedfolder", - "fileName": "test.csv" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Azure Blob output dataset** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. --```JSON -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/sftp/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - }, - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Pipeline with Copy activity** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **FileSystemSource** and **sink** type is set to **BlobSink**. --```JSON -{ - "name": "pipeline", - "properties": { - "activities": [{ - "name": "SFTPToBlobCopy", - "inputs": [{ - "name": "SFTPFileInput" - }], - "outputs": [{ - "name": "AzureBlobOutput" - }], - "type": "Copy", - "typeProperties": { - "source": { - "type": "FileSystemSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1, - "timeout": "00:05:00" - } - }], - "start": "2017-02-20T18:00:00Z", - "end": "2017-02-20T19:00:00Z" - } -} -``` --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. --## Next steps -See the following articles: --* [Copy Activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions for creating a pipeline with a Copy Activity. |
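As a complement to the SFTP-to-blob sample above, the following input dataset is a minimal sketch of how **fileFilter** and **partitionedBy** (both described in the dataset properties table) can be combined to pick up only the `.csv` files from a date-partitioned SFTP folder for each slice. The folder layout and dataset name are hypothetical and are not part of the original sample.

```json
{
    "name": "SftpPartitionedCsvInput",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": "SftpLinkedService",
        "typeProperties": {
            "folderPath": "mysharedfolder/{Year}/{Month}/{Day}",
            "fileFilter": "*.csv",
            "partitionedBy": [
                { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
                { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
                { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
            ]
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```

Because fileName and fileFilter cannot be used simultaneously, this sketch relies on fileFilter alone to select the files for each slice.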
data-factory | Data Factory Spark | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-spark.md | - Title: Invoke Spark programs from Azure Data Factory -description: Learn how to invoke Spark programs from an Azure data factory by using the MapReduce activity. ------ Previously updated : 04/12/2023---# Invoke Spark programs from Azure Data Factory pipelines --> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive activity](data-factory-hive-activity.md) -> * [Pig activity](data-factory-pig-activity.md) -> * [MapReduce activity](data-factory-map-reduce.md) -> * [Hadoop Streaming activity](data-factory-hadoop-streaming-activity.md) -> * [Spark activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored procedure activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL activity](data-factory-usql-activity.md) -> * [.NET custom activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Azure Data Factory, which is generally available. If you use the current version of the Data Factory service, see [Transform data by using the Apache Spark activity in Data Factory](../transform-data-using-spark.md). --## Introduction -The Spark activity is one of the [data transformation activities](data-factory-data-transformation-activities.md) supported by Data Factory. This activity runs the specified Spark program on your Spark cluster in Azure HDInsight. --> [!IMPORTANT] -> - The Spark activity doesn't support HDInsight Spark clusters that use Azure Data Lake Store as primary storage. -> - The Spark activity supports only existing (your own) HDInsight Spark clusters. It doesn't support an on-demand HDInsight linked service. --## Walkthrough: Create a pipeline with a Spark activity -Here are the typical steps to create a data factory pipeline with a Spark activity: --* Create a data factory. -* Create an Azure Storage linked service to link your storage that is associated with your HDInsight Spark cluster to the data factory. -* Create an HDInsight linked service to link your Spark cluster in HDInsight to the data factory. -* Create a dataset that refers to the Storage linked service. Currently, you must specify an output dataset for an activity even if there is no output being produced. -* Create a pipeline with Spark activity that refers to the HDInsight linked service you created. The activity is configured with the dataset you created in the previous step as an output dataset. The output dataset is what drives the schedule (hourly, daily). Therefore, you must specify the output dataset even though the activity doesn't really produce an output. --### Prerequisites -1. Create a general-purpose storage account by following the instructions in [Create a storage account](../../storage/common/storage-account-create.md). --1. Create a Spark cluster in HDInsight by following the instructions in the tutorial [Create a Spark cluster in HDInsight](../../hdinsight/spark/apache-spark-jupyter-spark-sql.md). Associate the storage account you created in step 1 with this cluster. --1. Download and review the Python script file **test.py** located at ```https://adftutorialfiles.blob.core.windows.net/sparktutorial/test.py```. --1. 
Upload **test.py** to the **pyFiles** folder in the **adfspark** container in your blob storage. Create the container and the folder if they don't exist. --### Create a data factory -To create a data factory, follow these steps: --1. Sign in to the [Azure portal](https://portal.azure.com/). --1. Select **New** > **Data + Analytics** > **Data Factory**. --1. On the **New data factory** blade, under **Name**, enter **SparkDF**. -- > [!IMPORTANT] - > The name of the Azure data factory must be globally unique. If you see the error "Data factory name SparkDF is not available," change the name of the data factory. For example, use yournameSparkDFdate, and create the data factory again. For more information on naming rules, see [Data Factory: Naming rules](data-factory-naming-rules.md). --1. Under **Subscription**, select the Azure subscription where you want the data factory to be created. --1. Select an existing resource group, or create an Azure resource group. --1. Select the **Pin to dashboard** check box. --1. Select **Create**. -- > [!IMPORTANT] - > To create Data Factory instances, you must be a member of the [Data Factory contributor](../../role-based-access-control/built-in-roles.md#data-factory-contributor) role at the subscription/resource group level. --1. You see the data factory as it is created in the dashboard of the Azure portal. --1. After the data factory is created, you see the **Data factory** page, which shows you the contents of the data factory. If you don't see the **Data factory** page, select the tile for your data factory on the dashboard. -- :::image type="content" source="./media/data-factory-spark/data-factory-blade.png" alt-text="Data Factory blade"::: --### Create linked services -In this step, you create two linked services. One service links your Spark cluster to your data factory, and the other service links your storage to your data factory. --#### Create a Storage linked service -In this step, you link your storage account to your data factory. A dataset you create in a step later in this walkthrough refers to this linked service. The HDInsight linked service that you define in the next step refers to this linked service too. --1. On the **Data factory** blade, select **Author and deploy**. The Data Factory Editor appears. --1. Select **New data store**, and choose **Azure Storage**. -- :::image type="content" source="./media/data-factory-spark/new-data-store-azure-storage-menu.png" alt-text="New data store"::: --1. The JSON script you use to create a Storage linked service appears in the editor. -- :::image type="content" source="./media/data-factory-build-your-first-pipeline-using-editor/azure-storage-linked-service.png" alt-text="AzureStorageLinkedService"::: --1. Replace **account name** and **account key** with the name and access key of your storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). --1. To deploy the linked service, select **Deploy** on the command bar. After the linked service is deployed successfully, the Draft-1 window disappears. You see **AzureStorageLinkedService** in the tree view on the left. --#### Create an HDInsight linked service -In this step, you create an HDInsight linked service to link your HDInsight Spark cluster to the data factory. The HDInsight cluster is used to run the Spark program specified in the Spark activity of the pipeline in this sample. --1. 
In the Data Factory Editor, select **More** > **New compute** > **HDInsight cluster**. -- :::image type="content" source="media/data-factory-spark/new-hdinsight-linked-service.png" alt-text="Create HDInsight linked service"::: --1. Copy and paste the following snippet to the Draft-1 window. In the JSON editor, take the following steps: -- 1. Specify the URI for the HDInsight Spark cluster. For example: `https://<sparkclustername>.azurehdinsight.net/`. -- 1. Specify the name of the user who has access to the Spark cluster. -- 1. Specify the password for the user. -- 1. Specify the Storage linked service that is associated with the HDInsight Spark cluster. In this example, it's AzureStorageLinkedService. -- ```json - { - "name": "HDInsightLinkedService", - "properties": { - "type": "HDInsight", - "typeProperties": { - "clusterUri": "https://<sparkclustername>.azurehdinsight.net/", - "userName": "admin", - "password": "**********", - "linkedServiceName": "AzureStorageLinkedService" - } - } - } - ``` -- > [!IMPORTANT] - > - The Spark activity doesn't support HDInsight Spark clusters that use Azure Data Lake Store as primary storage. - > - The Spark activity supports only existing (your own) HDInsight Spark clusters. It doesn't support an on-demand HDInsight linked service. -- For more information about the HDInsight linked service, see [HDInsight linked service](data-factory-compute-linked-services.md#azure-hdinsight-linked-service). --1. To deploy the linked service, select **Deploy** on the command bar. --### Create the output dataset -The output dataset is what drives the schedule (hourly, daily). Therefore, you must specify an output dataset for the Spark activity in the pipeline even though the activity doesn't produce any output. Specifying an input dataset for the activity is optional. --1. In the Data Factory Editor, select **More** > **New dataset** > **Azure Blob storage**. --1. Copy and paste the following snippet to the Draft-1 window. The JSON snippet defines a dataset called **OutputDataset**. In addition, you specify that the results are stored in the blob container called **adfspark** and the folder called **pyFiles/output**. As mentioned previously, this dataset is a dummy dataset. The Spark program in this example doesn't produce any output. The **availability** section specifies that the output dataset is produced daily. -- ```json - { - "name": "OutputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "sparkoutput.txt", - "folderPath": "adfspark/pyFiles/output", - "format": { - "type": "TextFormat", - "columnDelimiter": "\t" - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } - } - ``` -1. To deploy the dataset, select **Deploy** on the command bar. ---### Create a pipeline -In this step, you create a pipeline with an HDInsightSpark activity. Currently, the output dataset is what drives the schedule, so you must create an output dataset even if the activity doesn't produce any output. If the activity doesn't take any input, you can skip creating the input dataset. Therefore, no input dataset is specified in this example. --1. In the Data Factory Editor, select **More** > **New pipeline**. --1. 
Replace the script in the Draft-1 window with the following script: -- ```json - { - "name": "SparkPipeline", - "properties": { - "activities": [ - { - "type": "HDInsightSpark", - "typeProperties": { - "rootPath": "adfspark\\pyFiles", - "entryFilePath": "test.py", - "getDebugInfo": "Always" - }, - "outputs": [ - { - "name": "OutputDataset" - } - ], - "name": "MySparkActivity", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-02-05T00:00:00Z", - "end": "2017-02-06T00:00:00Z" - } - } - ``` -- Note the following points: -- 1. The **type** property is set to **HDInsightSpark**. -- 1. The **rootPath** property is set to **adfspark\\pyFiles** where adfspark is the blob container and pyFiles is file folder in that container. In this example, the blob storage is the one that is associated with the Spark cluster. You can upload the file to a different storage account. If you do so, create a Storage linked service to link that storage account to the data factory. Then, specify the name of the linked service as a value for the **sparkJobLinkedService** property. For more information about this property and other properties supported by the Spark activity, see [Spark activity properties](#spark-activity-properties). -- 1. The **entryFilePath** property is set to **test.py**, which is the Python file. -- 1. The **getDebugInfo** property is set to **Always**, which means the log files are always generated (success or failure). -- > [!IMPORTANT] - > We recommend not setting this property to `Always` in a production environment unless you're troubleshooting an issue. -- 1. The **outputs** section has one output dataset. You must specify an output dataset even if the Spark program doesn't produce any output. The output dataset drives the schedule for the pipeline (hourly, daily). -- For more information about the properties supported by the Spark activity, see the section [Spark activity properties](#spark-activity-properties). --1. To deploy the pipeline, select **Deploy** on the command bar. --### Monitor a pipeline -1. On the **Data factory** blade, select **Monitor & Manage** to start the monitoring application in another tab. -- :::image type="content" source="media/data-factory-spark/monitor-and-manage-tile.png" alt-text="Monitor & Manage tile"::: --1. Change the **Start time** filter at the top to **2/1/2017**, and select **Apply**. --1. Only one activity window appears because there is only one day between the start (2017-02-01) and end times (2017-02-02) of the pipeline. Confirm that the data slice is in the **Ready** state. -- :::image type="content" source="media/data-factory-spark/monitor-and-manage-app.png" alt-text="Monitor the pipeline"::: --1. In the **Activity windows** list, select an activity run to see details about it. If there is an error, you see details about it in the right pane. --### Verify the results --1. Start the Jupyter Notebook for your HDInsight Spark cluster by going to `https://CLUSTERNAME.azurehdinsight.net/jupyter`. You also can open a cluster dashboard for your HDInsight Spark cluster, and then start the Jupyter Notebook. --1. Select **New** > **PySpark** to start a new notebook. -- :::image type="content" source="media/data-factory-spark/jupyter-new-book.png" alt-text="Jupyter new notebook"::: --1. Run the following command by copying and pasting the text and pressing Shift+Enter at the end of the second statement: -- ```sql - %%sql -- SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = \"6/1/13\" - ``` -1. 
Confirm that you see the data from the hvac table. -- :::image type="content" source="media/data-factory-spark/jupyter-notebook-results.png" alt-text="Jupyter query results"::: --<!-- Removed bookmark #run-a-hive-query-using-spark-sql since it doesn't exist in the target article --> -For detailed instructions, see the section [Run a Spark SQL query](../../hdinsight/spark/apache-spark-jupyter-spark-sql.md). --### Troubleshooting -Because you set getDebugInfo to **Always**, you see a log subfolder in the pyFiles folder in your blob container. The log file in the log folder provides additional information. This log file is especially useful when there is an error. In a production environment, you might want to set it to **Failure**. --For further troubleshooting, take the following steps: ---1. Go to `https://<CLUSTERNAME>.azurehdinsight.net/yarnui/hn/cluster`. -- :::image type="content" source="media/data-factory-spark/yarnui-application.png" alt-text="YARN UI application"::: --1. Select **Logs** for one of the run attempts. -- :::image type="content" source="media/data-factory-spark/yarn-applications.png" alt-text="Application page"::: --1. You see the following additional error information in the log page: -- :::image type="content" source="media/data-factory-spark/yarnui-application-error.png" alt-text="Log error"::: --The following sections provide information about the data factory entities to use Spark cluster and Spark activity in your data factory. --## Spark activity properties -Here is the sample JSON definition of a pipeline with a Spark activity: --```json -{ - "name": "SparkPipeline", - "properties": { - "activities": [ - { - "type": "HDInsightSpark", - "typeProperties": { - "rootPath": "adfspark\\pyFiles", - "entryFilePath": "test.py", - "arguments": [ "arg1", "arg2" ], - "sparkConfig": { - "spark.python.worker.memory": "512m" - }, - "getDebugInfo": "Always" - }, - "outputs": [ - { - "name": "OutputDataset" - } - ], - "name": "MySparkActivity", - "description": "This activity invokes the Spark program", - "linkedServiceName": "HDInsightLinkedService" - } - ], - "start": "2017-02-01T00:00:00Z", - "end": "2017-02-02T00:00:00Z" - } -} -``` --The following table describes the JSON properties used in the JSON definition. --| Property | Description | Required | -| -- | -- | -- | -| name | Name of the activity in the pipeline. | Yes | -| description | Text that describes what the activity does. | No | -| type | This property must be set to HDInsightSpark. | Yes | -| linkedServiceName | Name of the HDInsight linked service on which the Spark program runs. | Yes | -| rootPath | The blob container and folder that contains the Spark file. The file name is case sensitive. | Yes | -| entryFilePath | Relative path to the root folder of the Spark code/package. | Yes | -| className | Application's Java/Spark main class. | No | -| arguments | A list of command-line arguments to the Spark program. | No | -| proxyUser | The user account to impersonate to execute the Spark program. | No | -| sparkConfig | Specify values for the Spark configuration properties listed in [Spark configuration: Application properties](https://spark.apache.org/docs/latest/configuration.html#available-properties). | No | -| getDebugInfo | Specifies when the Spark log files are copied to the storage used by the HDInsight cluster (or) specified by sparkJobLinkedService. Allowed values are None, Always, or Failure. The default value is None. 
| No | -| sparkJobLinkedService | The Storage linked service that holds the Spark job file, dependencies, and logs. If you don't specify a value for this property, the storage associated with the HDInsight cluster is used. | No | --## Folder structure -The Spark activity doesn't support an inline script as Pig and Hive activities do. Spark jobs are also more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies such as jar packages (placed in the Java CLASSPATH), Python files (placed on the PYTHONPATH), and any other files. --Create the following folder structure in the blob storage referenced by the HDInsight linked service. Then, upload dependent files to the appropriate subfolders in the root folder represented by **rootPath**. For example, upload Python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At runtime, the Data Factory service expects the following folder structure in the blob storage: --| Path | Description | Required | Type | -| - | -- | -- | - | -| . | The root path of the Spark job in the storage linked service. | Yes | Folder | -| <user defined > | The path that points to the entry file of the Spark job. | Yes | File | -| ./jars | All files under this folder are uploaded and placed on the Java classpath of the cluster. | No | Folder | -| ./pyFiles | All files under this folder are uploaded and placed on the PYTHONPATH of the cluster. | No | Folder | -| ./files | All files under this folder are uploaded and placed on the executor working directory. | No | Folder | -| ./archives | All files under this folder are uncompressed. | No | Folder | -| ./logs | The folder where logs from the Spark cluster are stored. | No | Folder | --Here is an example of storage that contains two Spark job folders in the blob storage referenced by the HDInsight linked service: --``` -SparkJob1 - main.jar - files - input1.txt - input2.txt - jars - package1.jar - package2.jar - logs --SparkJob2 - main.py - pyFiles - script1.py - script2.py - logs -``` |
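For a Java/Scala job laid out like the `SparkJob1` folder in the example above, a Spark activity would point **entryFilePath** at the jar and name the main class with **className**. The following snippet is only a sketch under that assumption; the container name `adfspark`, the class name, and the arguments are illustrative and not part of the original article.

```json
{
    "type": "HDInsightSpark",
    "typeProperties": {
        "rootPath": "adfspark\\SparkJob1",
        "entryFilePath": "main.jar",
        "className": "com.contoso.SparkJob1Main",
        "arguments": [ "wasb:///example/input", "wasb:///example/output" ],
        "getDebugInfo": "Failure"
    },
    "outputs": [ { "name": "OutputDataset" } ],
    "name": "MyJarSparkActivity",
    "linkedServiceName": "HDInsightLinkedService"
}
```

Per the folder structure table, the jar packages under `SparkJob1/jars` are uploaded and placed on the cluster classpath automatically, so only the entry jar and its main class need to be named in the activity.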
data-factory | Data Factory Sqlserver Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-sqlserver-connector.md | - Title: Move data to and from SQL Server -description: Learn how to move data to/from a SQL Server database that is on-premises or in an Azure VM by using Azure Data Factory. ----- Previously updated : 04/12/2023---# Move data to and from SQL Server using Azure Data Factory --> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-sqlserver-connector.md) -> * [Version 2 (current version)](../connector-sql-server.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [SQL Server connector in V2](../connector-sql-server.md). --This article explains how to use the Copy Activity in Azure Data Factory to move data to/from a SQL Server database. It builds on the [Data Movement Activities](data-factory-data-movement-activities.md) article, which presents a general overview of data movement with the copy activity. ---## Supported scenarios -You can copy data **from a SQL Server database** to the following data stores: ---You can copy data from the following data stores **to a SQL Server database**: ---## Supported SQL Server versions -This SQL Server connector supports copying data from/to the following versions of a SQL Server instance hosted on-premises or in Azure IaaS using both SQL authentication and Windows authentication: SQL Server 2016, SQL Server 2014, SQL Server 2012, SQL Server 2008 R2, SQL Server 2008, SQL Server 2005. --## Enabling connectivity -The concepts and steps needed for connecting with SQL Server hosted on-premises or in Azure IaaS (Infrastructure-as-a-Service) VMs are the same. In both cases, you need to use Data Management Gateway for connectivity. --See the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. Setting up a gateway instance is a prerequisite for connecting with SQL Server. --While you can install the gateway on the same on-premises machine or cloud VM instance as SQL Server, for better performance we recommend that you install them on separate machines. Having the gateway and SQL Server on separate machines reduces resource contention. --## Getting started -You can create a pipeline with a copy activity that moves data to/from a SQL Server database by using different tools/APIs. --The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. --You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create a **data factory**. A data factory may contain one or more pipelines. -2. Create **linked services** to link input and output data stores to your data factory. 
For example, if you are copying data from a SQL Server database to Azure Blob storage, you create two linked services to link your SQL Server database and Azure storage account to your data factory. For linked service properties that are specific to a SQL Server database, see the [linked service properties](#linked-service-properties) section. -3. Create **datasets** to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the SQL table in your SQL Server database that contains the input data. You create another dataset to specify the blob container and the folder that holds the data copied from the SQL Server database. For dataset properties that are specific to a SQL Server database, see the [dataset properties](#dataset-properties) section. -4. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use SqlSource as a source and BlobSink as a sink for the copy activity. Similarly, if you are copying from Azure Blob Storage to SQL Server Database, you use BlobSource and SqlSink in the copy activity. For copy activity properties that are specific to SQL Server Database, see the [copy activity properties](#copy-activity-properties) section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from a SQL Server database, see the [JSON examples](#json-examples-for-copying-data-from-and-to-sql-server) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to SQL Server: --## Linked service properties -You create a linked service of type **OnPremisesSqlServer** to link a SQL Server database to a data factory. The following table describes the JSON elements specific to the SQL Server linked service. --| Property | Description | Required | -| | | | -| type |The type property should be set to: **OnPremisesSqlServer**. |Yes | -| connectionString |Specify connectionString information needed to connect to the SQL Server database using either SQL authentication or Windows authentication. |Yes | -| gatewayName |Name of the gateway that the Data Factory service should use to connect to the SQL Server database. |Yes | -| username |Specify the user name if you are using Windows Authentication. Example: **domainname\\username**. |No | -| password |Specify the password for the user account you specified for the username. 
|No | --You can encrypt credentials using the **New-AzDataFactoryEncryptValue** cmdlet and use them in the connection string as shown in the following example (**EncryptedCredential** property): --```JSON -"connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;EncryptedCredential=<encrypted credential>", -``` --### Samples -**JSON for using SQL Authentication** --```json -{ - "name": "MyOnPremisesSQLDB", - "properties": - { - "type": "OnPremisesSqlServer", - "typeProperties": { - "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=False;User ID=<username>;Password=<password>;", - "gatewayName": "<gateway name>" - } - } -} -``` -**JSON for using Windows Authentication** --Data Management Gateway will impersonate the specified user account to connect to the SQL Server database. --```json -{ - "Name": " MyOnPremisesSQLDB", - "Properties": - { - "type": "OnPremisesSqlServer", - "typeProperties": { - "ConnectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;Integrated Security=True;", - "username": "<domain\\username>", - "password": "<password>", - "gatewayName": "<gateway name>" - } - } -} -``` --## Dataset properties -In the samples, you have used a dataset of type **SqlServerTable** to represent a table in a SQL Server database. --For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (SQL Server, Azure blob, Azure table, etc.). --The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The **typeProperties** section for the dataset of type **SqlServerTable** has the following properties: --| Property | Description | Required | -| | | | -| tableName |Name of the table or view in the SQL Server Database instance that linked service refers to. |Yes | --## Copy activity properties -If you are moving data from a SQL Server database, you set the source type in the copy activity to **SqlSource**. Similarly, if you are moving data to a SQL Server database, you set the sink type in the copy activity to **SqlSink**. This section provides a list of properties supported by SqlSource and SqlSink. --For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policies are available for all types of activities. --> [!NOTE] -> The Copy Activity takes only one input and produces only one output. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --### SqlSource -When source in a copy activity is of type **SqlSource**, the following properties are available in **typeProperties** section: --| Property | Description | Allowed values | Required | -| | | | | -| sqlReaderQuery |Use the custom query to read data. |SQL query string. For example: select * from MyTable. May reference multiple tables from the database referenced by the input dataset. If not specified, the SQL statement that is executed: select from MyTable. |No | -| sqlReaderStoredProcedureName |Name of the stored procedure that reads data from the source table. 
|Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | --If the **sqlReaderQuery** is specified for the SqlSource, the Copy Activity runs this query against the SQL Server Database source to get the data. --Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). --If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server Database. If the dataset definition does not have the structure, all columns are selected from the table. --> [!NOTE] -> When you use **sqlReaderStoredProcedureName**, you still need to specify a value for the **tableName** property in the dataset JSON. There are no validations performed against this table though. --### SqlSink -**SqlSink** supports the following properties: --| Property | Description | Allowed values | Required | -| | | | | -| writeBatchTimeout |Wait time for the batch insert operation to complete before it times out. |timespan<br/><br/> Example: "00:30:00" (30 minutes). |No | -| writeBatchSize |Inserts data into the SQL table when the buffer size reaches writeBatchSize. |Integer (number of rows) |No (default: 10000) | -| sqlWriterCleanupScript |Specify query for Copy Activity to execute such that data of a specific slice is cleaned up. For more information, see [repeatable copy](#repeatable-copy) section. |A query statement. |No | -| sliceIdentifierColumnName |Specify column name for Copy Activity to fill with auto generated slice identifier, which is used to clean up data of a specific slice when rerun. For more information, see [repeatable copy](#repeatable-copy) section. |Column name of a column with data type of binary(32). |No | -| sqlWriterStoredProcedureName |Name of the stored procedure that defines how to apply source data into target table, e.g. to do upserts or transform using your own business logic. <br/><br/>Note this stored procedure will be **invoked per batch**. If you want to do operation that only runs once and has nothing to do with source data e.g. delete/truncate, use `sqlWriterCleanupScript` property. |Name of the stored procedure. |No | -| storedProcedureParameters |Parameters for the stored procedure. |Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. |No | -| sqlWriterTableType |Specify table type name to be used in the stored procedure. Copy activity makes the data being moved available in a temp table with this table type. Stored procedure code can then merge the data being copied with existing data. |A table type name. |No | ---## JSON examples for copying data from and to SQL Server -The following examples provide sample JSON definitions that you can use to create a pipeline by using [Visual Studio](data-factory-copy-activity-tutorial-using-visual-studio.md) or [Azure PowerShell](data-factory-copy-activity-tutorial-using-powershell.md). The following samples show how to copy data to and from SQL Server and Azure Blob Storage. 
However, data can be copied **directly** from any of sources to any of the sinks stated [here](data-factory-data-movement-activities.md#supported-data-stores-and-formats) using the Copy Activity in Azure Data Factory. --## Example: Copy data from SQL Server to Azure Blob -The following sample shows: --1. A linked service of type [OnPremisesSqlServer](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [SqlServerTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. The [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [SqlSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies time-series data from a SQL Server table to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --As a first step, setup the data management gateway. The instructions are in the [moving data between on-premises locations and cloud](data-factory-move-data-between-onprem-and-cloud.md) article. --**SQL Server linked service** -```json -{ - "Name": "SqlServerLinkedService", - "properties": { - "type": "OnPremisesSqlServer", - "typeProperties": { - "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;", - "gatewayName": "<gatewayname>" - } - } -} -``` -**Azure Blob storage linked service** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**SQL Server input dataset** --The sample assumes you have created a table "MyTable" in SQL Server and it contains a column called "timestampcolumn" for time series data. You can query over multiple tables within the same database using a single dataset, but a single table must be used for the dataset's tableName typeProperty. --Setting "external": "true" informs Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "SqlServerInput", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlServerLinkedService", - "typeProperties": { - "tableName": "MyTable" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**Azure Blob output dataset** --Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time. 
--```json -{ - "name": "AzureBlobOutput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": "\t", - "rowDelimiter": "\n" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**Pipeline with Copy activity** --The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **SqlSource** and **sink** type is set to **BlobSink**. The SQL query specified for the **SqlReaderQuery** property selects the data in the past hour to copy. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2016-06-01T18:00:00", - "end":"2016-06-01T19:00:00", - "description":"pipeline for copy activity", - "activities":[ - { - "name": "SqlServertoBlob", - "description": "copy activity", - "type": "Copy", - "inputs": [ - { - "name": " SqlServerInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "SqlSource", - "SqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` -In this example, **sqlReaderQuery** is specified for the SqlSource. The Copy Activity runs this query against the SQL Server Database source to get the data. Alternatively, you can specify a stored procedure by specifying the **sqlReaderStoredProcedureName** and **storedProcedureParameters** (if the stored procedure takes parameters). The sqlReaderQuery can reference multiple tables within the database referenced by the input dataset. It is not limited to only the table set as the dataset's tableName typeProperty. --If you do not specify sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section are used to build a select query to run against the SQL Server Database. If the dataset definition does not have the structure, all columns are selected from the table. --See the [Sql Source](#sqlsource) section and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties) for the list of properties supported by SqlSource and BlobSink. --## Example: Copy data from Azure Blob to SQL Server -The following sample shows: --1. The linked service of type [OnPremisesSqlServer](#linked-service-properties). -2. The linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -4. 
An output [dataset](data-factory-create-datasets.md) of type [SqlServerTable](data-factory-sqlserver-connector.md#dataset-properties). -5. The [pipeline](data-factory-create-pipelines.md) with Copy activity that uses [BlobSource](data-factory-azure-blob-connector.md#copy-activity-properties) and SqlSink. --The sample copies time-series data from an Azure blob to a SQL Server table every hour. The JSON properties used in these samples are described in sections following the samples. --**SQL Server linked service** --```json -{ - "Name": "SqlServerLinkedService", - "properties": { - "type": "OnPremisesSqlServer", - "typeProperties": { - "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;", - "gatewayName": "<gatewayname>" - } - } -} -``` -**Azure Blob storage linked service** --```json -{ - "name": "StorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>" - } - } -} -``` -**Azure Blob input dataset** --Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. "external": "true" setting informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --```json -{ - "name": "AzureBlobInput", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}", - "fileName": "{Hour}.csv", - "partitionedBy": [ - { - "name": "Year", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy" - } - }, - { - "name": "Month", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "MM" - } - }, - { - "name": "Day", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "dd" - } - }, - { - "name": "Hour", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "HH" - } - } - ], - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": "\n" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "externalData": { - "retryInterval": "00:01:00", - "retryTimeout": "00:10:00", - "maximumRetry": 3 - } - } - } -} -``` -**SQL Server output dataset** --The sample copies data to a table named "MyTable" in SQL Server. Create the table in SQL Server with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour. --```json -{ - "name": "SqlServerOutput", - "properties": { - "type": "SqlServerTable", - "linkedServiceName": "SqlServerLinkedService", - "typeProperties": { - "tableName": "MyOutputTable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` -**Pipeline with Copy activity** --The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **BlobSource** and **sink** type is set to **SqlSink**. 
--```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "AzureBlobtoSQL", - "description": "Copy Activity", - "type": "Copy", - "inputs": [ - { - "name": "AzureBlobInput" - } - ], - "outputs": [ - { - "name": " SqlServerOutput " - } - ], - "typeProperties": { - "source": { - "type": "BlobSource", - "blobColumnSeparators": "," - }, - "sink": { - "type": "SqlSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --## Troubleshooting connection issues -1. Configure your SQL Server to accept remote connections. Launch **SQL Server Management Studio**, right-click **server**, and click **Properties**. Select **Connections** from the list and check **Allow remote connections to the server**. -- :::image type="content" source="./media/data-factory-sqlserver-connector/AllowRemoteConnections.png" alt-text="Enable remote connections"::: -- See [Configure the remote access Server Configuration Option](/sql/database-engine/configure-windows/configure-the-remote-access-server-configuration-option) for detailed steps. -2. Launch **SQL Server Configuration Manager**. Expand **SQL Server Network Configuration** for the instance you want, and select **Protocols for MSSQLSERVER**. You should see protocols in the right-pane. Enable TCP/IP by right-clicking **TCP/IP** and clicking **Enable**. -- :::image type="content" source="./media/data-factory-sqlserver-connector/EnableTCPProptocol.png" alt-text="Enable TCP/IP"::: -- See [Enable or Disable a Server Network Protocol](/sql/database-engine/configure-windows/enable-or-disable-a-server-network-protocol) for details and alternate ways of enabling TCP/IP protocol. -3. In the same window, double-click **TCP/IP** to launch **TCP/IP Properties** window. -4. Switch to the **IP Addresses** tab. Scroll down to see **IPAll** section. Note down the **TCP Port**(default is **1433**). -5. Create a **rule for the Windows Firewall** on the machine to allow incoming traffic through this port. -6. **Verify connection**: To connect to the SQL Server using fully qualified name, use SQL Server Management Studio from a different machine. For example: "\<machine\>.\<domain\>.corp.\<company\>.com,1433." -- > [!IMPORTANT] - > - > See [Move data between on-premises sources and the cloud with Data Management Gateway](data-factory-move-data-between-onprem-and-cloud.md) for detailed information. - > - > See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection/gateway related issues. ---## Identity columns in the target database -This section provides an example that copies data from a source table with no identity column to a destination table with an identity column. --**Source table:** --```sql -create table dbo.SourceTbl -( - name varchar(100), - age int -) -``` -**Destination table:** --```sql -create table dbo.TargetTbl -( - identifier int identity(1,1), - name varchar(100), - age int -) -``` --Notice that the target table has an identity column. 
--**Source dataset JSON definition** --```json -{ - "name": "SampleSource", - "properties": { - "published": false, - "type": " SqlServerTable", - "linkedServiceName": "TestIdentitySQL", - "typeProperties": { - "tableName": "SourceTbl" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": {} - } -} -``` -**Destination dataset JSON definition** --```json -{ - "name": "SampleTarget", - "properties": { - "structure": [ - { "name": "name" }, - { "name": "age" } - ], - "published": false, - "type": "AzureSqlTable", - "linkedServiceName": "TestIdentitySQLSource", - "typeProperties": { - "tableName": "TargetTbl" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": false, - "policy": {} - } -} -``` --Notice that as your source and target table have different schema (target has an additional column with identity). In this scenario, you need to specify **structure** property in the target dataset definition, which doesn't include the identity column. --## Invoke stored procedure from SQL sink -See [Invoke stored procedure for SQL sink in copy activity](data-factory-invoke-stored-procedure-from-copy-activity.md) article for an example of invoking a stored procedure from SQL sink in a copy activity of a pipeline. --## Type mapping for SQL server -As mentioned in the [data movement activities](data-factory-data-movement-activities.md) article, the Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach: --1. Convert from native source types to .NET type -2. Convert from .NET type to native sink type --When moving data to & from SQL server, the following mappings are used from SQL type to .NET type and vice versa. --The mapping is same as the SQL Server Data Type Mapping for ADO.NET. --| SQL Server Database Engine type | .NET Framework type | -| | | -| bigint |Int64 | -| binary |Byte[] | -| bit |Boolean | -| char |String, Char[] | -| date |DateTime | -| Datetime |DateTime | -| datetime2 |DateTime | -| Datetimeoffset |DateTimeOffset | -| Decimal |Decimal | -| FILESTREAM attribute (varbinary(max)) |Byte[] | -| Float |Double | -| image |Byte[] | -| int |Int32 | -| money |Decimal | -| nchar |String, Char[] | -| ntext |String, Char[] | -| numeric |Decimal | -| nvarchar |String, Char[] | -| real |Single | -| rowversion |Byte[] | -| smalldatetime |DateTime | -| smallint |Int16 | -| smallmoney |Decimal | -| sql_variant |Object * | -| text |String, Char[] | -| time |TimeSpan | -| timestamp |Byte[] | -| tinyint |Byte | -| uniqueidentifier |Guid | -| varbinary |Byte[] | -| varchar |String, Char[] | -| xml |Xml | --## Mapping source to sink columns -To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Repeatable copy -When copying data to SQL Server Database, the copy activity appends data to the sink table by default. To perform an UPSERT instead, See [Repeatable write to SqlSink](data-factory-repeatable-copy.md#repeatable-write-to-sqlsink) article. --When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun in either way, you need to make sure that the same data is read no matter how many times a slice is run. 
See [Repeatable read from relational sources](data-factory-repeatable-copy.md#repeatable-read-from-relational-sources). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
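As a minimal sketch of the repeatable-write pattern described above, a **SqlSink** can pair **sqlWriterCleanupScript** with the slice window so that a rerun first deletes the rows it wrote previously. The table and column names (`MyOutputTable`, `timestampcolumn`) are reused from the earlier samples only for illustration, and the snippet assumes the target table actually carries such a slice timestamp.

```json
"sink": {
    "type": "SqlSink",
    "sqlWriterCleanupScript": "$$Text.Format('DELETE FROM MyOutputTable WHERE timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:30:00"
}
```

For the other supported approaches, such as sliceIdentifierColumnName or an upsert through sqlWriterStoredProcedureName, see the [Repeatable write to SqlSink](data-factory-repeatable-copy.md#repeatable-write-to-sqlsink) article linked above.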
data-factory | Data Factory Stored Proc Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-stored-proc-activity.md | - Title: SQL Server Stored Procedure Activity -description: Learn how you can use the SQL Server Stored Procedure Activity to invoke a stored procedure in an Azure SQL Database or Azure Synapse Analytics from a Data Factory pipeline. --- Previously updated : 04/12/2023-----# SQL Server Stored Procedure Activity -> [!div class="op_single_selector" title1="Transformation Activities"] -> * [Hive Activity](data-factory-hive-activity.md) -> * [Pig Activity](data-factory-pig-activity.md) -> * [MapReduce Activity](data-factory-map-reduce.md) -> * [Hadoop Streaming Activity](data-factory-hadoop-streaming-activity.md) -> * [Spark Activity](data-factory-spark.md) -> * [ML Studio (classic) Batch Execution Activity](data-factory-azure-ml-batch-execution-activity.md) -> * [ML Studio (classic) Update Resource Activity](data-factory-azure-ml-update-resource-activity.md) -> * [Stored Procedure Activity](data-factory-stored-proc-activity.md) -> * [Data Lake Analytics U-SQL Activity](data-factory-usql-activity.md) -> * [.NET Custom Activity](data-factory-use-custom-activities.md) --> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see [transform data using stored procedure activity in Data Factory](../transform-data-using-stored-procedure.md). --## Overview -You use data transformation activities in a Data Factory [pipeline](data-factory-create-pipelines.md) to transform and process raw data into predictions and insights. The Stored Procedure Activity is one of the transformation activities that Data Factory supports. This article builds on the [data transformation activities](data-factory-data-transformation-activities.md) article, which presents a general overview of data transformation and the supported transformation activities in Data Factory. --You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in your enterprise or on an Azure virtual machine (VM): --- Azure SQL Database-- Azure Synapse Analytics-- SQL Server Database. If you are using SQL Server, install Data Management Gateway on the same machine that hosts the database or on a separate machine that has access to the database. Data Management Gateway is a component that connects data sources on-premises/on Azure VM with cloud services in a secure and managed way. See [Data Management Gateway](data-factory-data-management-gateway.md) article for details.--> [!IMPORTANT] -> When copying data into Azure SQL Database or SQL Server, you can configure the **SqlSink** in copy activity to invoke a stored procedure by using the **sqlWriterStoredProcedureName** property. For more information, see [Invoke stored procedure from copy activity](data-factory-invoke-stored-procedure-from-copy-activity.md). For details about the property, see following connector articles: [Azure SQL Database](data-factory-azure-sql-connector.md#copy-activity-properties), [SQL Server](data-factory-sqlserver-connector.md#copy-activity-properties). Invoking a stored procedure while copying data into Azure Synapse Analytics by using a copy activity is not supported. But, you can use the stored procedure activity to invoke a stored procedure in Azure Synapse Analytics. 
-> -> When copying data from Azure SQL Database or SQL Server or Azure Synapse Analytics, you can configure **SqlSource** in copy activity to invoke a stored procedure to read data from the source database by using the **sqlReaderStoredProcedureName** property. For more information, see the following connector articles: [Azure SQL Database](data-factory-azure-sql-connector.md#copy-activity-properties), [SQL Server](data-factory-sqlserver-connector.md#copy-activity-properties), [Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#copy-activity-properties) --The following walkthrough uses the Stored Procedure Activity in a pipeline to invoke a stored procedure in Azure SQL Database. --## Walkthrough -### Sample table and stored procedure --1. Create the following **table** in your Azure SQL Database using SQL Server Management Studio or any other tool you are comfortable with. The datetimestamp column is the date and time when the corresponding ID is generated. -- ```SQL - CREATE TABLE dbo.sampletable - ( - Id uniqueidentifier, - datetimestamp nvarchar(127) - ) - GO -- CREATE CLUSTERED INDEX ClusteredID ON dbo.sampletable(Id); - GO - ``` - `Id` is the unique identifier, and the `datetimestamp` column is the date and time when the corresponding ID is generated. - - :::image type="content" source="./media/data-factory-stored-proc-activity/sample-data.png" alt-text="Sample data"::: -- In this sample, the stored procedure is in an Azure SQL Database. If the stored procedure is in Azure Synapse Analytics and SQL Server Database, the approach is similar. For a SQL Server database, you must install a [Data Management Gateway](data-factory-data-management-gateway.md). - -2. Create the following **stored procedure** that inserts data in to the **sampletable**. -- ```SQL - CREATE PROCEDURE usp_sample @DateTime nvarchar(127) - AS -- BEGIN - INSERT INTO [sampletable] - VALUES (newid(), @DateTime) - END - ``` -- > [!IMPORTANT] - > **Name** and **casing** of the parameter (DateTime in this example) must match that of parameter specified in the pipeline/activity JSON. In the stored procedure definition, ensure that **\@** is used as a prefix for the parameter. --### Create a data factory -1. Log in to [Azure portal](https://portal.azure.com/). -2. Click **NEW** on the left menu, click **Intelligence + Analytics**, and click **Data Factory**. -- :::image type="content" source="media/data-factory-stored-proc-activity/new-data-factory.png" alt-text="New data factory 1"::: -3. In the **New data factory** blade, enter **SProcDF** for the Name. Azure Data Factory names are **globally unique**. You need to prefix the name of the data factory with your name, to enable the successful creation of the factory. -- :::image type="content" source="media/data-factory-stored-proc-activity/new-data-factory-blade.png" alt-text="New data factory 2"::: -4. Select your **Azure subscription**. -5. For **Resource Group**, do one of the following steps: - 1. Click **Create new** and enter a name for the resource group. - 2. Click **Use existing** and select an existing resource group. -6. Select the **location** for the data factory. -7. Select **Pin to dashboard** so that you can see the data factory on the dashboard next time you log in. -8. Click **Create** on the **New data factory** blade. -9. You see the data factory being created in the **dashboard** of the Azure portal. 
After the data factory has been created successfully, you see the data factory page, which shows you the contents of the data factory. -- :::image type="content" source="media/data-factory-stored-proc-activity/data-factory-home-page.png" alt-text="Data Factory home page"::: --### Create an Azure SQL linked service -After creating the data factory, you create an Azure SQL linked service that links your database in Azure SQL Database, which contains the sampletable table and usp_sample stored procedure, to your data factory. --1. Click **Author and deploy** on the **Data Factory** blade for **SProcDF** to launch the Data Factory Editor. -2. Click **New data store** on the command bar and choose **Azure SQL Database**. You should see the JSON script for creating an Azure SQL linked service in the editor. -- :::image type="content" source="media/data-factory-stored-proc-activity/new-data-store.png" alt-text="New data store 1"::: -3. In the JSON script, make the following changes: -- 1. Replace `<servername>` with the name of your server. - 2. Replace `<databasename>` with the database in which you created the table and the stored procedure. - 3. Replace `<username@servername>` with the user account that has access to the database. - 4. Replace `<password>` with the password for the user account. -- :::image type="content" source="media/data-factory-stored-proc-activity/azure-sql-linked-service.png" alt-text="New data store 2"::: -4. To deploy the linked service, click **Deploy** on the command bar. Confirm that you see the AzureSqlLinkedService in the tree view on the left. -- :::image type="content" source="media/data-factory-stored-proc-activity/tree-view.png" alt-text="tree view with linked service 1"::: --### Create an output dataset -You must specify an output dataset for a stored procedure activity even if the stored procedure does not produce any data. That's because it's the output dataset that drives the schedule of the activity (how often the activity is run - hourly, daily, etc.). The output dataset must use a **linked service** that refers to an Azure SQL Database, Azure Synapse Analytics, or a SQL Server database in which you want the stored procedure to run. The output dataset can serve as a way to pass the result of the stored procedure for subsequent processing by another activity ([chaining activities](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline) in the pipeline). However, Data Factory does not automatically write the output of a stored procedure to this dataset. It is the stored procedure that writes to a SQL table that the output dataset points to. In some cases, the output dataset can be a **dummy dataset** (a dataset that points to a table that does not really hold output of the stored procedure). This dummy dataset is used only to specify the schedule for running the stored procedure activity. --1. Click **... More** on the command bar, click **New dataset**, and then select **Azure SQL**. -- :::image type="content" source="media/data-factory-stored-proc-activity/new-dataset.png" alt-text="tree view with linked service 2"::: -2. Copy/paste the following JSON script into the JSON editor. -- ```JSON - { - "name": "sprocsampleout", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "sampletable" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -3. 
To deploy the dataset, click **Deploy** on the command bar. Confirm that you see the dataset in the tree view. -- :::image type="content" source="media/data-factory-stored-proc-activity/tree-view-2.png" alt-text="tree view with linked services"::: --### Create a pipeline with SqlServerStoredProcedure activity -Now, let's create a pipeline with a stored procedure activity. --Notice the following properties: --- The **type** property is set to **SqlServerStoredProcedure**.-- The **storedProcedureName** in type properties is set to **usp_sample** (name of the stored procedure).-- The **storedProcedureParameters** section contains one parameter named **DateTime**. Name and casing of the parameter in JSON must match the name and casing of the parameter in the stored procedure definition. If you need pass null for a parameter, use the syntax: `"param1": null` (all lowercase).--1. Click **... More** on the command bar and click **New pipeline**. -2. Copy/paste the following JSON snippet: -- ```JSON - { - "name": "SprocActivitySamplePipeline", - "properties": { - "activities": [ - { - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "usp_sample", - "storedProcedureParameters": { - "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)" - } - }, - "outputs": [ - { - "name": "sprocsampleout" - } - ], - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SprocActivitySample" - } - ], - "start": "2017-04-02T00:00:00Z", - "end": "2017-04-02T05:00:00Z", - "isPaused": false - } - } - ``` -3. To deploy the pipeline, click **Deploy** on the toolbar. --### Monitor the pipeline -1. Click **X** to close Data Factory Editor blades and to navigate back to the Data Factory blade, and click **Diagram**. -- :::image type="content" source="media/data-factory-stored-proc-activity/data-factory-diagram-tile.png" alt-text="diagram tile 1"::: -2. In the **Diagram View**, you see an overview of the pipelines, and datasets used in this tutorial. -- :::image type="content" source="media/data-factory-stored-proc-activity/data-factory-diagram-view.png" alt-text="diagram tile 2"::: -3. In the Diagram View, double-click the dataset `sprocsampleout`. You see the slices in Ready state. There should be five slices because a slice is produced for each hour between the start time and end time from the JSON. -- :::image type="content" source="media/data-factory-stored-proc-activity/data-factory-slices.png" alt-text="diagram tile 3"::: -4. When a slice is in **Ready** state, run a `select * from sampletable` query against the database to verify that the data was inserted in to the table by the stored procedure. -- :::image type="content" source="./media/data-factory-stored-proc-activity/output.png" alt-text="Output data"::: -- See [Monitor the pipeline](data-factory-monitor-manage-pipelines.md) for detailed information about monitoring Azure Data Factory pipelines. --## Specify an input dataset -In the walkthrough, stored procedure activity does not have any input datasets. If you specify an input dataset, the stored procedure activity does not run until the slice of input dataset is available (in Ready state). The dataset can be an external dataset (that is not produced by another activity in the same pipeline) or an internal dataset that is produced by an upstream activity (the activity that runs before this activity). You can specify multiple input datasets for the stored procedure activity. 
If you do so, the stored procedure activity runs only when all the input dataset slices are available (in Ready state). The input dataset cannot be consumed in the stored procedure as a parameter. It is only used to check the dependency before starting the stored procedure activity. --## Chaining with other activities -If you want to chain an upstream activity with this activity, specify the output of the upstream activity as an input of this activity. When you do so, the stored procedure activity does not run until the upstream activity completes and the output dataset of the upstream activity is available (in Ready status). You can specify output datasets of multiple upstream activities as input datasets of the stored procedure activity. When you do so, the stored procedure activity runs only when all the input dataset slices are available. --In the following example, the output of the copy activity is: OutputDataset, which is an input of the stored procedure activity. Therefore, the stored procedure activity does not run until the copy activity completes and the OutputDataset slice is available (in Ready state). If you specify multiple input datasets, the stored procedure activity does not run until all the input dataset slices are available (in Ready state). The input datasets cannot be used directly as parameters to the stored procedure activity. --For more information on chaining activities, see [multiple activities in a pipeline](data-factory-create-pipelines.md#multiple-activities-in-a-pipeline) --```json -{ - "name": "ADFTutorialPipeline", - "properties": { - "description": "Copy data from a blob to blob", - "activities": [ - { - "type": "Copy", - "typeProperties": { - "source": { - "type": "BlobSource" - }, - "sink": { - "type": "BlobSink", - "writeBatchSize": 0, - "writeBatchTimeout": "00:00:00" - } - }, - "inputs": [ { "name": "InputDataset" } ], - "outputs": [ { "name": "OutputDataset" } ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst" - }, - "name": "CopyFromBlobToSQL" - }, - { - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "SPSproc" - }, - "inputs": [ { "name": "OutputDataset" } ], - "outputs": [ { "name": "SQLOutputDataset" } ], - "policy": { - "timeout": "01:00:00", - "concurrency": 1, - "retry": 3 - }, - "name": "RunStoredProcedure" - } - ], - "start": "2017-04-12T00:00:00Z", - "end": "2017-04-13T00:00:00Z", - "isPaused": false, - } -} -``` --Similarly, to link the store procedure activity with **downstream activities** (the activities that run after the stored procedure activity completes), specify the output dataset of the stored procedure activity as an input of the downstream activity in the pipeline. --> [!IMPORTANT] -> When copying data into Azure SQL Database or SQL Server, you can configure the **SqlSink** in copy activity to invoke a stored procedure by using the **sqlWriterStoredProcedureName** property. For more information, see [Invoke stored procedure from copy activity](data-factory-invoke-stored-procedure-from-copy-activity.md). For details about the property, see the following connector articles: [Azure SQL Database](data-factory-azure-sql-connector.md#copy-activity-properties), [SQL Server](data-factory-sqlserver-connector.md#copy-activity-properties). 
-> -> When copying data from Azure SQL Database or SQL Server or Azure Synapse Analytics, you can configure **SqlSource** in copy activity to invoke a stored procedure to read data from the source database by using the **sqlReaderStoredProcedureName** property. For more information, see the following connector articles: [Azure SQL Database](data-factory-azure-sql-connector.md#copy-activity-properties), [SQL Server](data-factory-sqlserver-connector.md#copy-activity-properties), [Azure Synapse Analytics](data-factory-azure-sql-data-warehouse-connector.md#copy-activity-properties) --## JSON format -Here is the JSON format for defining a Stored Procedure Activity: --```JSON -{ - "name": "SQLSPROCActivity", - "description": "description", - "type": "SqlServerStoredProcedure", - "inputs": [ { "name": "inputtable" } ], - "outputs": [ { "name": "outputtable" } ], - "typeProperties": - { - "storedProcedureName": "<name of the stored procedure>", - "storedProcedureParameters": - { - "param1": "param1Value" - … - } - } -} -``` --The following table describes these JSON properties: --| Property | Description | Required | -| | | | -| name | Name of the activity |Yes | -| description |Text describing what the activity is used for |No | -| type | Must be set to: **SqlServerStoredProcedure** | Yes | -| inputs | Optional. If you do specify an input dataset, it must be available (in 'Ready' status) for the stored procedure activity to run. The input dataset cannot be consumed in the stored procedure as a parameter. It is only used to check the dependency before starting the stored procedure activity. |No | -| outputs | You must specify an output dataset for a stored procedure activity. Output dataset specifies the **schedule** for the stored procedure activity (hourly, weekly, monthly, etc.). <br/><br/>The output dataset must use a **linked service** that refers to an Azure SQL Database or Azure Synapse Analytics or a SQL Server Database in which you want the stored procedure to run. <br/><br/>The output dataset can serve as a way to pass the result of the stored procedure for subsequent processing by another activity ([chaining activities](data-factory-scheduling-and-execution.md#multiple-activities-in-a-pipeline) in the pipeline. However, Data Factory does not automatically write the output of a stored procedure to this dataset. It is the stored procedure that writes to a SQL table that the output dataset points to. <br/><br/>In some cases, the output dataset can be a **dummy dataset**, which is used only to specify the schedule for running the stored procedure activity. |Yes | -| storedProcedureName |Specify the name of the stored procedure in Azure SQL Database, Azure Synapse Analytics, or SQL Server that is represented by the linked service that the output table uses. |Yes | -| storedProcedureParameters |Specify values for stored procedure parameters. If you need to pass null for a parameter, use the syntax: "param1": null (all lower case). See the following sample to learn about using this property. |No | --## Passing a static value -Now, let's consider adding another column named 'Scenario' in the table containing a static value called 'Document sample'. 
---**Table:** --```SQL -CREATE TABLE dbo.sampletable2 -( - Id uniqueidentifier, - datetimestamp nvarchar(127), - scenario nvarchar(127) -) -GO --CREATE CLUSTERED INDEX ClusteredID ON dbo.sampletable2(Id); -``` --**Stored procedure:** --```SQL -CREATE PROCEDURE usp_sample2 @DateTime nvarchar(127), @Scenario nvarchar(127) --AS --BEGIN - INSERT INTO [sampletable2] - VALUES (newid(), @DateTime, @Scenario) -END -``` --Now, pass the **Scenario** parameter and its value from the stored procedure activity. The **typeProperties** section in the preceding sample looks like the following snippet: --```JSON -"typeProperties": -{ - "storedProcedureName": "usp_sample2", - "storedProcedureParameters": - { - "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)", - "Scenario": "Document sample" - } -} -``` --**Data Factory dataset:** --```JSON -{ - "name": "sprocsampleout2", - "properties": { - "published": false, - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { - "tableName": "sampletable2" - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --**Data Factory pipeline:** --```JSON -{ - "name": "SprocActivitySamplePipeline2", - "properties": { - "activities": [ - { - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "usp_sample2", - "storedProcedureParameters": { - "DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)", - "Scenario": "Document sample" - } - }, - "outputs": [ - { - "name": "sprocsampleout2" - } - ], - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "name": "SprocActivitySample" - } - ], - "start": "2016-10-02T00:00:00Z", - "end": "2016-10-02T05:00:00Z" - } -} -``` |
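To confirm that the static value was written, query the table after a slice completes, just like the earlier `select * from sampletable` check in the walkthrough. A minimal sketch, assuming the `sampletable2` table and `usp_sample2` procedure shown above:

```SQL
-- Check the most recent rows written by the stored procedure activity.
-- The scenario column should contain the static value 'Document sample'.
SELECT TOP 10 Id, datetimestamp, scenario
FROM dbo.sampletable2
ORDER BY datetimestamp DESC;
```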
data-factory | Data Factory Supported File And Compression Formats | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-supported-file-and-compression-formats.md | - Title: File and compression formats in Azure Data Factory -description: 'Learn about the file formats supported by Azure Data Factory.' ---- Previously updated : 04/12/2023-----# File and compression formats supported by Azure Data Factory -*This topic applies to the following connectors: [Amazon S3](data-factory-amazon-simple-storage-service-connector.md), [Azure Blob](data-factory-azure-blob-connector.md), [Azure Data Lake Store](data-factory-azure-datalake-connector.md), [File System](data-factory-onprem-file-system-connector.md), [FTP](data-factory-ftp-connector.md), [HDFS](data-factory-hdfs-connector.md), [HTTP](data-factory-http-connector.md), and [SFTP](data-factory-sftp-connector.md).* --> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see [supported file formats and compression codecs in Data Factory](../supported-file-formats-and-compression-codecs.md). --Azure Data Factory supports the following file format types: --* [Text format](#text-format) -* [JSON format](#json-format) -* [Avro format](#avro-format) -* [ORC format](#orc-format) -* [Parquet format](#parquet-format) --## Text format -If you want to read from a text file or write to a text file, set the `type` property in the `format` section of the dataset to **TextFormat**. You can also specify the following **optional** properties in the `format` section. See [TextFormat example](#textformat-example) section on how to configure. --| Property | Description | Allowed values | Required | -| | | | | -| columnDelimiter |The character used to separate columns in a file. You can consider to use a rare unprintable char that may not likely exists in your data. For example, specify "\u0001", which represents Start of Heading (SOH). |Only one character is allowed. The **default** value is **comma (',')**. <br/><br/>To use a Unicode character, refer to [Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters) to get the corresponding code for it. |No | -| rowDelimiter |The character used to separate rows in a file. |Only one character is allowed. The **default** value is any of the following values on read: **["\r\n", "\r", "\n"]** and **"\r\n"** on write. |No | -| escapeChar |The special character used to escape a column delimiter in the content of input file. <br/><br/>You cannot specify both escapeChar and quoteChar for a table. |Only one character is allowed. No default value. <br/><br/>Example: if you have comma (',') as the column delimiter but you want to have the comma character in the text (example: "Hello, world"), you can define '$' as the escape character and use string "Hello$, world" in the source. |No | -| quoteChar |The character used to quote a string value. The column and row delimiters inside the quote characters would be treated as part of the string value. This property is applicable to both input and output datasets.<br/><br/>You cannot specify both escapeChar and quoteChar for a table. |Only one character is allowed. No default value. <br/><br/>For example, if you have comma (',') as the column delimiter but you want to have comma character in the text (example: <Hello, world>), you can define " (double quote) as the quote character and use the string "Hello, world" in the source. 
|No | -| nullValue |One or more characters used to represent a null value. |One or more characters. The **default** values are **"\N" and "NULL"** on read and **"\N"** on write. |No | -| encodingName |Specify the encoding name. |A valid encoding name. see [Encoding.EncodingName Property](/dotnet/api/system.text.encoding). Example: windows-1250 or shift_jis. The **default** value is **UTF-8**. |No | -| firstRowAsHeader |Specifies whether to consider the first row as a header. For an input dataset, Data Factory reads first row as a header. For an output dataset, Data Factory writes first row as a header. <br/><br/>See [Scenarios for using `firstRowAsHeader` and `skipLineCount`](#scenarios-for-using-firstrowasheader-and-skiplinecount) for sample scenarios. |True<br/><b>False (default)</b> |No | -| skipLineCount |Indicates the number of rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file. <br/><br/>See [Scenarios for using `firstRowAsHeader` and `skipLineCount`](#scenarios-for-using-firstrowasheader-and-skiplinecount) for sample scenarios. |Integer |No | -| treatEmptyAsNull |Specifies whether to treat null or empty string as a null value when reading data from an input file. |**True (default)**<br/>False |No | --### TextFormat example -In the following JSON definition for a dataset, some of the optional properties are specified. --```json -"typeProperties": -{ - "folderPath": "mycontainer/myfolder", - "fileName": "myblobname", - "format": - { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": ";", - "quoteChar": "\"", - "NullValue": "NaN", - "firstRowAsHeader": true, - "skipLineCount": 0, - "treatEmptyAsNull": true - } -}, -``` --To use an `escapeChar` instead of `quoteChar`, replace the line with `quoteChar` with the following escapeChar: --```json -"escapeChar": "$", -``` --### Scenarios for using firstRowAsHeader and skipLineCount -* You are copying from a non-file source to a text file and would like to add a header line containing the schema metadata (for example: SQL schema). Specify `firstRowAsHeader` as true in the output dataset for this scenario. -* You are copying from a text file containing a header line to a non-file sink and would like to drop that line. Specify `firstRowAsHeader` as true in the input dataset. -* You are copying from a text file and want to skip a few lines at the beginning that contain no data or header information. Specify `skipLineCount` to indicate the number of lines to be skipped. If the rest of the file contains a header line, you can also specify `firstRowAsHeader`. If both `skipLineCount` and `firstRowAsHeader` are specified, the lines are skipped first and then the header information is read from the input file --## JSON format -To **import/export a JSON file as-is into/from Azure Cosmos DB**, the see [Import/export JSON documents](data-factory-azure-documentdb-connector.md#importexport-json-documents) section in [Move data to/from Azure Cosmos DB](data-factory-azure-documentdb-connector.md) article. --If you want to parse the JSON files or write the data in JSON format, set the `type` property in the `format` section to **JsonFormat**. You can also specify the following **optional** properties in the `format` section. See [JsonFormat example](#jsonformat-example) section on how to configure. 
--| Property | Description | Required | -| | | | -| filePattern |Indicate the pattern of data stored in each JSON file. Allowed values are: **setOfObjects** and **arrayOfObjects**. The **default** value is **setOfObjects**. See [JSON file patterns](#json-file-patterns) section for details about these patterns. |No | -| jsonNodeReference | If you want to iterate and extract data from the objects inside an array field with the same pattern, specify the JSON path of that array. This property is supported only when copying data from JSON files. | No | -| jsonPathDefinition | Specify the JSON path expression for each column mapping with a customized column name (start with lowercase). This property is supported only when copying data from JSON files, and you can extract data from object or array. <br/><br/> For fields under root object, start with root $; for fields inside the array chosen by `jsonNodeReference` property, start from the array element. See [JsonFormat example](#jsonformat-example) section on how to configure. | No | -| encodingName |Specify the encoding name. For the list of valid encoding names, see: [Encoding.EncodingName](/dotnet/api/system.text.encoding) Property. For example: windows-1250 or shift_jis. The **default** value is: **UTF-8**. |No | -| nestingSeparator |Character that is used to separate nesting levels. The default value is '.' (dot). |No | --### JSON file patterns --Copy activity can parse the following patterns of JSON files: --- **Type I: setOfObjects**-- Each file contains single object, or line-delimited/concatenated multiple objects. When this option is chosen in an output dataset, copy activity produces a single JSON file with each object per line (line-delimited). -- * **single object JSON example** -- ```json - { - "time": "2015-04-29T07:12:20.9100000Z", - "callingimsi": "466920403025604", - "callingnum1": "678948008", - "callingnum2": "567834760", - "switch1": "China", - "switch2": "Germany" - } - ``` -- * **line-delimited JSON example** -- ```json - {"time":"2015-04-29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","switch1":"China","switch2":"Germany"} - {"time":"2015-04-29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","switch1":"US","switch2":"UK"} - {"time":"2015-04-29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","switch1":"Germany","switch2":"UK"} - ``` -- * **concatenated JSON example** -- ```json - { - "time": "2015-04-29T07:12:20.9100000Z", - "callingimsi": "466920403025604", - "callingnum1": "678948008", - "callingnum2": "567834760", - "switch1": "China", - "switch2": "Germany" - } - { - "time": "2015-04-29T07:13:21.0220000Z", - "callingimsi": "466922202613463", - "callingnum1": "123436380", - "callingnum2": "789037573", - "switch1": "US", - "switch2": "UK" - } - { - "time": "2015-04-29T07:13:21.4370000Z", - "callingimsi": "466923101048691", - "callingnum1": "678901578", - "callingnum2": "345626404", - "switch1": "Germany", - "switch2": "UK" - } - ``` --- **Type II: arrayOfObjects**-- Each file contains an array of objects. 
-- ```json - [ - { - "time": "2015-04-29T07:12:20.9100000Z", - "callingimsi": "466920403025604", - "callingnum1": "678948008", - "callingnum2": "567834760", - "switch1": "China", - "switch2": "Germany" - }, - { - "time": "2015-04-29T07:13:21.0220000Z", - "callingimsi": "466922202613463", - "callingnum1": "123436380", - "callingnum2": "789037573", - "switch1": "US", - "switch2": "UK" - }, - { - "time": "2015-04-29T07:13:21.4370000Z", - "callingimsi": "466923101048691", - "callingnum1": "678901578", - "callingnum2": "345626404", - "switch1": "Germany", - "switch2": "UK" - } - ] - ``` --### JsonFormat example --**Case 1: Copying data from JSON files** --See the following two samples when copying data from JSON files. The generic points to note: --**Sample 1: extract data from object and array** --In this sample, you expect one root JSON object maps to single record in tabular result. If you have a JSON file with the following content: --```json -{ - "id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3", - "context": { - "device": { - "type": "PC" - }, - "custom": { - "dimensions": [ - { - "TargetResourceType": "Microsoft.Compute/virtualMachines" - }, - { - "ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3" - }, - { - "OccurrenceTime": "1/13/2017 11:24:37 AM" - } - ] - } - } -} -``` -and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects and array: --| ID | deviceType | targetResourceType | resourceManagementProcessRunId | occurrenceTime | -| | | | | | -| ed0e4960-d9c5-11e6-85dc-d7996816aad3 | PC | Microsoft.Compute/virtualMachines | 827f8aaa-ab72-437c-ba48-d8917a7336a3 | 1/13/2017 11:24:37 AM | --The input dataset with **JsonFormat** type is defined as follows: (partial definition with only the relevant parts). More specifically: --- `structure` section defines the customized column names and the corresponding data type while converting to tabular data. This section is **optional** unless you need to do column mapping. See [Map source dataset columns to destination dataset columns](data-factory-map-columns.md) section for more details.-- `jsonPathDefinition` specifies the JSON path for each column indicating where to extract the data from. To copy data from array, you can use **array[x].property** to extract value of the given property from the xth object, or you can use **array[*].property** to find the value from any object containing such property.--```json -"properties": { - "structure": [ - { - "name": "id", - "type": "String" - }, - { - "name": "deviceType", - "type": "String" - }, - { - "name": "targetResourceType", - "type": "String" - }, - { - "name": "resourceManagementProcessRunId", - "type": "String" - }, - { - "name": "occurrenceTime", - "type": "DateTime" - } - ], - "typeProperties": { - "folderPath": "mycontainer/myfolder", - "format": { - "type": "JsonFormat", - "filePattern": "setOfObjects", - "jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type", "targetResourceType": "$.context.custom.dimensions[0].TargetResourceType", "resourceManagementProcessRunId": "$.context.custom.dimensions[1].ResourceManagementProcessRunId", "occurrenceTime": " $.context.custom.dimensions[2].OccurrenceTime"} - } - } -} -``` --**Sample 2: cross apply multiple objects with the same pattern from array** --In this sample, you expect to transform one root JSON object into multiple records in tabular result. 
If you have a JSON file with the following content: --```json -{ - "ordernumber": "01", - "orderdate": "20170122", - "orderlines": [ - { - "prod": "p1", - "price": 23 - }, - { - "prod": "p2", - "price": 13 - }, - { - "prod": "p3", - "price": 231 - } - ], - "city": [ { "sanmateo": "No 1" } ] -} -``` -and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array and cross join with the common root info: --| ordernumber | orderdate | order_pd | order_price | city | -| | | | | | -| 01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}] | -| 01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}] | -| 01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}] | --The input dataset with **JsonFormat** type is defined as follows: (partial definition with only the relevant parts). More specifically: --- `structure` section defines the customized column names and the corresponding data type while converting to tabular data. This section is **optional** unless you need to do column mapping. See [Map source dataset columns to destination dataset columns](data-factory-map-columns.md) section for more details.-- `jsonNodeReference` indicates to iterate and extract data from the objects with the same pattern under **array** orderlines.-- `jsonPathDefinition` specifies the JSON path for each column indicating where to extract the data from. In this example, "ordernumber", "orderdate" and "city" are under root object with JSON path starting with "$.", while "order_pd" and "order_price" are defined with path derived from the array element without "$.".--```json -"properties": { - "structure": [ - { - "name": "ordernumber", - "type": "String" - }, - { - "name": "orderdate", - "type": "String" - }, - { - "name": "order_pd", - "type": "String" - }, - { - "name": "order_price", - "type": "Int64" - }, - { - "name": "city", - "type": "String" - } - ], - "typeProperties": { - "folderPath": "mycontainer/myfolder", - "format": { - "type": "JsonFormat", - "filePattern": "setOfObjects", - "jsonNodeReference": "$.orderlines", - "jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd": "prod", "order_price": "price", "city": " $.city"} - } - } -} -``` --**Note the following points:** --* If the `structure` and `jsonPathDefinition` are not defined in the Data Factory dataset, the Copy Activity detects the schema from the first object and flatten the whole object. -* If the JSON input has an array, by default the Copy Activity converts the entire array value into a string. You can choose to extract data from it using `jsonNodeReference` and/or `jsonPathDefinition`, or skip it by not specifying it in `jsonPathDefinition`. -* If there are duplicate names at the same level, the Copy Activity picks the last one. -* Property names are case-sensitive. Two properties with same name but different casings are treated as two separate properties. --**Case 2: Writing data to JSON file** --If you have the following table in SQL Database: --| ID | order_date | order_price | order_by | -| | | | | -| 1 | 20170119 | 2000 | David | -| 2 | 20170120 | 3500 | Patrick | -| 3 | 20170121 | 4000 | Jason | --and for each record, you expect to write to a JSON object in the following format: -```json -{ - "id": "1", - "order": { - "date": "20170119", - "price": 2000, - "customer": "David" - } -} -``` --The output dataset with **JsonFormat** type is defined as follows: (partial definition with only the relevant parts). 
More specifically, the `structure` section defines the customized property names in the destination file, and `nestingSeparator` (default is ".") is used to identify the nesting level from the name. This section is **optional** unless you want to change the property name compared with the source column name, or nest some of the properties. --```json -"properties": { - "structure": [ - { - "name": "id", - "type": "String" - }, - { - "name": "order.date", - "type": "String" - }, - { - "name": "order.price", - "type": "Int64" - }, - { - "name": "order.customer", - "type": "String" - } - ], - "typeProperties": { - "folderPath": "mycontainer/myfolder", - "format": { - "type": "JsonFormat" - } - } -} -``` --## AVRO format -If you want to parse Avro files or write the data in Avro format, set the `format` `type` property to **AvroFormat**. You do not need to specify any other properties in the `format` section within the typeProperties section. Example: --```json -"format": -{ - "type": "AvroFormat" -} -``` --To use Avro format in a Hive table, you can refer to [Apache Hive's tutorial](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe). --Note the following points: --* [Complex data types](https://avro.apache.org/docs/current/spec.html#schema_complex) are not supported (records, enums, arrays, maps, unions, and fixed). --## ORC format -If you want to parse ORC files or write the data in ORC format, set the `format` `type` property to **OrcFormat**. You do not need to specify any other properties in the `format` section within the typeProperties section. Example: --```json -"format": -{ - "type": "OrcFormat" -} -``` --> [!IMPORTANT] -> If you are not copying ORC files **as-is** between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine. A 64-bit gateway requires the 64-bit JRE and a 32-bit gateway requires the 32-bit JRE. You can find both versions from [here](https://go.microsoft.com/fwlink/?LinkId=808605). Choose the appropriate one. -> -> --Note the following points: --* Complex data types are not supported (STRUCT, MAP, LIST, UNION). -* An ORC file has three [compression-related options](https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/): NONE, ZLIB, SNAPPY. Data Factory supports reading data from an ORC file in any of these compressed formats. It uses the compression codec in the metadata to read the data. However, when writing to an ORC file, Data Factory chooses ZLIB, which is the default for ORC. Currently, there is no option to override this behavior. --## Parquet format -If you want to parse Parquet files or write the data in Parquet format, set the `format` `type` property to **ParquetFormat**. You do not need to specify any other properties in the `format` section within the typeProperties section. Example: --```json -"format": -{ - "type": "ParquetFormat" -} -``` -> [!IMPORTANT] -> If you are not copying Parquet files **as-is** between on-premises and cloud data stores, you need to install JRE 8 (Java Runtime Environment) on your gateway machine. A 64-bit gateway requires the 64-bit JRE and a 32-bit gateway requires the 32-bit JRE. You can find both versions from [here](https://go.microsoft.com/fwlink/?LinkId=808605). Choose the appropriate one. -> -> --Note the following points: --* Complex data types are not supported (MAP, LIST). -* A Parquet file has the following compression-related options: NONE, SNAPPY, GZIP, and LZO. Data Factory supports reading data from a Parquet file in any of these compressed formats.
It uses the compression codec in the metadata to read the data. However, when writing to a Parquet file, Data Factory chooses SNAPPY, which is the default for Parquet format. Currently, there is no option to override this behavior. --## Compression support -Processing large data sets can cause I/O and network bottlenecks. Therefore, compressed data in stores can not only speed up data transfer across the network and save disk space, but also bring significant performance improvements in processing big data. Currently, compression is supported for file-based data stores such as Azure Blob or On-premises File System. --To specify compression for a dataset, use the **compression** property in the dataset JSON as in the following example: --```json -{ - "name": "AzureBlobDataSet", - "properties": { - "availability": { - "frequency": "Day", - "interval": 1 - }, - "type": "AzureBlob", - "linkedServiceName": "StorageLinkedService", - "typeProperties": { - "fileName": "pagecounts.csv.gz", - "folderPath": "compression/file/", - "compression": { - "type": "GZip", - "level": "Optimal" - } - } - } -} -``` --Suppose the sample dataset is used as the output of a copy activity, the copy activity compresses the output data with GZIP codec using optimal ratio and then write the compressed data into a file named pagecounts.csv.gz in the Azure Blob Storage. --> [!NOTE] -> Compression settings are not supported for data in the **AvroFormat**, **OrcFormat**, or **ParquetFormat**. When reading files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in these formats, Data Factory chooses the default compression codec for that format. For example, ZLIB for OrcFormat and SNAPPY for ParquetFormat. --The **compression** section has two properties: --* **Type:** the compression codec, which can be **GZIP**, **Deflate**, **BZIP2**, or **ZipDeflate**. -* **Level:** the compression ratio, which can be **Optimal** or **Fastest**. -- * **Fastest:** The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. - * **Optimal**: The compression operation should be optimally compressed, even if the operation takes a longer time to complete. -- For more information, see [Compression Level](/dotnet/api/system.io.compression.compressionlevel) topic. --When you specify `compression` property in an input dataset JSON, the pipeline can read compressed data from the source; and when you specify the property in an output dataset JSON, the copy activity can write compressed data to the destination. Here are a few sample scenarios: --* Read GZIP compressed data from an Azure blob, decompress it, and write result data to Azure SQL Database. You define the input Azure Blob dataset with the `compression` `type` JSON property as GZIP. -* Read data from a plain-text file from on-premises File System, compress it using GZip format, and write the compressed data to an Azure blob. You define an output Azure Blob dataset with the `compression` `type` JSON property as GZip. -* Read .zip file from FTP server, decompress it to get the files inside, and land those files into Azure Data Lake Store. You define an input FTP dataset with the `compression` `type` JSON property as ZipDeflate. -* Read a GZIP-compressed data from an Azure blob, decompress it, compress it using BZIP2, and write result data to an Azure blob. 
You define the input Azure Blob dataset with `compression` `type` set to GZIP and the output dataset with `compression` `type` set to BZIP2 in this case. ---## Next steps -See the following articles for file-based data stores supported by Azure Data Factory: --- [Azure Blob Storage](data-factory-azure-blob-connector.md)-- [Azure Data Lake Store](data-factory-azure-datalake-connector.md)-- [FTP](data-factory-ftp-connector.md)-- [HDFS](data-factory-hdfs-connector.md)-- [File System](data-factory-onprem-file-system-connector.md)-- [Amazon S3](data-factory-amazon-simple-storage-service-connector.md) |
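To make the last scenario above concrete, here is a minimal sketch of the `typeProperties` sections for the two Azure Blob datasets. The folder paths and file names are illustrative, the other required dataset properties (type, linked service, availability) are omitted for brevity, and the `BZip2` casing mirrors the `GZip` value used in the sample dataset above.

Input dataset (reads GZIP-compressed data):

```json
"typeProperties": {
    "folderPath": "compression/input/",
    "fileName": "pagecounts.csv.gz",
    "compression": { "type": "GZip" }
}
```

Output dataset (writes BZIP2-compressed data):

```json
"typeProperties": {
    "folderPath": "compression/output/",
    "fileName": "pagecounts.csv.bz2",
    "compression": { "type": "BZip2", "level": "Optimal" }
}
```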
data-factory | Data Factory Troubleshoot Gateway Issues | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-troubleshoot-gateway-issues.md | - Title: Troubleshoot Data Management Gateway issues -description: Provides tips to troubleshoot issues related to Data Management Gateway. ---- Previously updated : 04/12/2023----# Troubleshoot issues with using Data Management Gateway -This article provides information on troubleshooting issues with using Data Management Gateway. --> [!NOTE] -> This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see [self-hosted integration runtime in Data Factory](../create-self-hosted-integration-runtime.md). --See the [Data Management Gateway](data-factory-data-management-gateway.md) article for detailed information about the gateway. See the [Move data between on-premises and cloud](data-factory-move-data-between-onprem-and-cloud.md) article for a walkthrough of moving data from a SQL Server database to Microsoft Azure Blob storage by using the gateway. --## Failed to install or register gateway -### 1. Problem -You see this error message when installing and registering a gateway, specifically, while downloading the gateway installation file. --`Unable to connect to the remote server". Please check your local settings (Error Code: 10003).` --#### Cause -The machine on which you are trying to install the gateway has failed to download the latest gateway installation file from the download center due to a network issue. --#### Resolution -Check your firewall proxy server settings to see whether the settings block the network connection from the computer to the [download center](https://download.microsoft.com/), and update the settings accordingly. --Alternatively, you can download the installation file for the latest gateway from the [download center](https://www.microsoft.com/download/details.aspx?id=39717) on other machines that can access the download center. You can then copy the installer file to the gateway host computer and run it manually to install and update the gateway. --### 2. Problem -You see this error when you're attempting to install a gateway by clicking **install directly on this computer** in the Azure portal. --`Error: Abort installing a new gateway on this computer because this computer has an existing installed gateway and a computer without any installed gateway is required for installing a new gateway.` --#### Cause -A gateway is already installed on the machine. --#### Resolution -Uninstall the existing gateway on the machine and click the **install directly on this computer** link again. --### 3. Problem -You might see this error when registering a new gateway. --`Error: The gateway has encountered an error during registration.` --#### Cause -You might see this message for one of the following reasons: --* The format of the gateway key is invalid. -* The gateway key has been invalidated. -* The gateway key has been regenerated from the portal. --#### Resolution -Verify whether you are using the right gateway key from the portal. If needed, regenerate a key and use the key to register the gateway. --### 4. Problem -You might see the following error message when you're registering a gateway. 
--`Error: The content or format of the gateway key "{gatewayKey}" is invalid, please go to azure portal to create one new gateway or regenerate the gateway key.` -----#### Cause -The content or format of the input gateway key is incorrect. One of the reasons can be that you copied only a portion of the key from the portal or you're using an invalid key. --#### Resolution -Generate a gateway key in the portal, and use the copy button to copy the whole key. Then paste it in this window to register the gateway. --### 5. Problem -You might see the following error message when you're registering a gateway. --`Error: The gateway key is invalid or empty. Specify a valid gateway key from the portal.` ---#### Cause -The gateway key has been regenerated or the gateway has been deleted in the Azure portal. It can also happen if the Data Management Gateway setup is not latest. --#### Resolution -Check if the Data Management Gateway setup is the latest version, you can find the latest version on the Microsoft [download center](https://go.microsoft.com/fwlink/p/?LinkId=271260). --If setup is current/ latest and gateway still exists on Portal, regenerate the gateway key in the Azure portal, and use the copy button to copy the whole key, and then paste it in this window to register the gateway. Otherwise, recreate the gateway and start over. --### 6. Problem -You might see the following error message when you're registering a gateway. --`Error: Gateway has been online for a while, then shows "Gateway is not registered" with the status "Gateway key is invalid"` ---#### Cause -This error might happen because either the gateway has been deleted or the associated gateway key has been regenerated. --#### Resolution -If the gateway has been deleted, re-create the gateway from the portal, click **Register**, copy the key from the portal, paste it, and try to register the gateway. --If the gateway still exists but its key has been regenerated, use the new key to register the gateway. If you don't have the key, regenerate the key again from the portal. --### 7. Problem -When you're registering a gateway, you might need to enter path and password for a certificate. ---#### Cause -The gateway has been registered on other machines before. During the initial registration of a gateway, an encryption certificate has been associated with the gateway. The certificate can either be self-generated by the gateway or provided by the user. This certificate is used to encrypt credentials of the data store (linked service). ---When restoring the gateway on a different host machine, the registration wizard asks for this certificate to decrypt credentials previously encrypted with this certificate. Without this certificate, the credentials cannot be decrypted by the new gateway and subsequent copy activity executions associated with this new gateway will fail. --#### Resolution -If you have exported the credential certificate from the original gateway machine by using the **Export** button on the **Settings** tab in Data Management Gateway Configuration Manager, use the certificate here. --You cannot skip this stage when recovering a gateway. If the certificate is missing, you need to delete the gateway from the portal and re-create a new gateway. In addition, update all linked services that are related to the gateway by reentering their credentials. --### 8. Problem -You might see the following error message. 
--`Error: The remote server returned an error: (407) Proxy Authentication Required.` --#### Cause -This error happens when your gateway is in an environment that requires an HTTP proxy to access Internet resources, or your proxy's authentication password is changed but it's not updated accordingly in your gateway. --#### Resolution -Follow the instructions in the Proxy server considerations section of this article, and configure proxy settings with Data Management Gateway Configuration Manager. --## Gateway is online with limited functionality -### 1. Problem -You see the status of the gateway as online with limited functionality. --#### Cause -You see the status of the gateway as online with limited functionality for one of the following reasons: --* Gateway cannot connect to cloud service through Azure Service Bus. -* Cloud service cannot connect to gateway through Service Bus. --When the gateway is online with limited functionality, you might not be able to use the Data Factory Copy Wizard to create data pipelines for copying data to or from on-premises data stores. As a workaround, you can use Data Factory Editor in the portal, Visual Studio, or Azure PowerShell. --#### Resolution -Resolution for this issue (online with limited functionality) is based on whether the gateway cannot connect to the cloud service or the other way. The following sections provide these resolutions. --### 2. Problem -You see the following error. --`Error: Gateway cannot connect to cloud service through service bus` ---#### Cause -Gateway cannot connect to the cloud service through Service Bus. --#### Resolution -Follow these steps to get the gateway back online: --1. Allow IP address outbound rules on the gateway machine and the corporate firewall. You can find IP addresses from the Windows Event Log (ID == 401): An attempt was made to access a socket in a way forbidden by its access permissions XX.XX.XX.XX:9350. -1. Configure proxy settings on the gateway. See the Proxy server considerations section for details. -1. Enable outbound ports 5671 and 9350-9354 on both the Windows Firewall on the gateway machine and the corporate firewall. See the Ports and firewall section for details. This step is optional, but we recommend it for performance consideration. --### 3. Problem -You see the following error. --`Error: Cloud service cannot connect to gateway through service bus.` --#### Cause -A transient error in network connectivity. --#### Resolution -Follow these steps to get the gateway back online: --1. Wait for a couple of minutes, the connectivity will be automatically recovered when the error is gone. -1. If the error persists, restart the gateway service. --## Failed to author linked service -### Problem -You might see this error when you try to use Credential Manager in the portal to input credentials for a new linked service, or update credentials for an existing linked service. --`Error: The data store '<Server>/<Database>' cannot be reached. Check connection settings for the data source.` --When you see this error, the settings page of Data Management Gateway Configuration Manager might look like the following screenshot. ---#### Cause -The TLS/SSL certificate might have been lost on the gateway machine. The gateway computer cannot load the certificate currently that is used for TLS encryption. You might also see an error message in the event log that is similar to the following message. -- `Unable to get the gateway settings from cloud service. Check the gateway key and the network connection. 
(Certificate with thumbprint cannot be loaded.)` --#### Resolution -Follow these steps to solve the problem: --1. Start Data Management Gateway Configuration Manager. -2. Switch to the **Settings** tab. -3. Click the **Change** button to change the TLS/SSL certificate. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/change-button-ssl-certificate.png" alt-text="Change certificate button"::: -4. Select a new certificate as the TLS/SSL certificate. You can use any TLS/SSL certificate that is generated by you or any organization. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/specify-http-end-point.png" alt-text="Specify certificate"::: --## Copy activity fails -### Problem -You might notice the following "UserErrorFailedToConnectToSqlserver" failure after you set up a pipeline in the portal. --`Error: Copy activity encountered a user error: ErrorCode=UserErrorFailedToConnectToSqlServer,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot connect to SQL Server` --#### Cause -This can happen for different reasons, and mitigation varies accordingly. --#### Resolution -Allow outbound TCP connections over port TCP/1433 on the Data Management Gateway client side before connecting to a SQL database. --If the target database is in Azure SQL Database, check SQL Server firewall settings for Azure as well. --See the following section to test the connection to the on-premises data store. --## Data store connection or driver-related errors -If you see data store connection or driver-related errors, complete the following steps: --1. Start Data Management Gateway Configuration Manager on the gateway machine. -2. Switch to the **Diagnostics** tab. -3. In **Test Connection**, add the gateway group values. -4. Click **Test** to see if you can connect to the on-premises data source from the gateway machine by using the connection information and credentials. If the test connection still fails after you install a driver, restart the gateway for it to pick up the latest change. ---## Gateway logs -### Send gateway logs to Microsoft -When you contact Microsoft Support to get help with troubleshooting gateway issues, you might be asked to share your gateway logs. With the release of the gateway, you can share required gateway logs with two button clicks in Data Management Gateway Configuration Manager. --1. Switch to the **Diagnostics** tab in Data Management Gateway Configuration Manager. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/data-management-gateway-diagnostics-tab.png" alt-text="Data Management Gateway Diagnostics tab"::: -2. Click **Send Logs** to see the following dialog box. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/data-management-gateway-send-logs-dialog.png" alt-text="Data Management Gateway Send logs"::: -3. (Optional) Click **view logs** to review logs in the event viewer. -4. (Optional) Click **privacy** to review Microsoft web services privacy statement. -5. When you are satisfied with what you are about to upload, click **Send Logs** to actually send the logs from the last seven days to Microsoft for troubleshooting. You should see the status of the send-logs operation as shown in the following screenshot. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/data-management-gateway-send-logs-status.png" alt-text="Screenshot that shows where to view the status of the send-logs operation."::: -6. 
After the operation is complete, you see a dialog box as shown in the following screenshot. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/data-management-gateway-send-logs-result.png" alt-text="Data Management Gateway Send logs status"::: -7. Save the **Report ID** and share it with Microsoft Support. The report ID is used to locate the gateway logs that you uploaded for troubleshooting. The report ID is also saved in the event viewer. You can find it by looking for event ID "25" and checking the date and time. -- :::image type="content" source="media/data-factory-troubleshoot-gateway-issues/data-management-gateway-send-logs-report-id.png" alt-text="Data Management Gateway Send logs report ID"::: --### Archive gateway logs on gateway host machine -There are some scenarios where you have gateway issues and you cannot share gateway logs directly: --* You manually install and register the gateway. -* You try to register the gateway with a regenerated key in Data Management Gateway Configuration Manager. -* You try to send logs and the gateway host service cannot be reached. --For these scenarios, you can save the gateway logs as a zip file and share it when you contact Microsoft support. For example, you might receive an error while you register the gateway, as shown in the following screenshot. ---Click the **Archive gateway logs** link to archive and save the logs, and then share the zip file with Microsoft support. ---### Locate gateway logs -You can find detailed gateway log information in the Windows event logs. --1. Start Windows **Event Viewer**. -2. Locate logs in the **Application and Services Logs** > **Data Management Gateway** folder. -- When you're troubleshooting gateway-related issues, look for error-level events in the event viewer. - |
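If you prefer to query these events from a script instead of browsing Event Viewer, a PowerShell query along the following lines can help. This is a minimal sketch: the channel name `Data Management Gateway` is an assumption based on the folder shown under **Application and Services Logs**, so confirm the exact name on your gateway machine.

```powershell
# List the most recent error-level events (Level 2) from the gateway's event log channel.
# The channel name is an assumption; verify it in Event Viewer on the gateway machine.
Get-WinEvent -FilterHashtable @{ LogName = 'Data Management Gateway'; Level = 2 } -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message

# Event ID 25 records the report ID generated by the Send Logs operation.
Get-WinEvent -FilterHashtable @{ LogName = 'Data Management Gateway'; Id = 25 } -MaxEvents 5 |
    Select-Object TimeCreated, Message
```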
data-factory | Data Factory Troubleshoot | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-troubleshoot.md | - Title: Troubleshoot Azure Data Factory issues -description: Learn how to troubleshoot issues when using Azure Data Factory. --- Previously updated : 04/12/2023-------# Troubleshoot Data Factory issues -> [!NOTE] -> This article applies to version 1 of Azure Data Factory. --This article provides troubleshooting tips for common issues when using Azure Data Factory. It does not list every possible issue, but it covers several common problems and general troubleshooting tips. ---## Troubleshooting tips -### Error: The subscription is not registered to use namespace 'Microsoft.DataFactory' -If you receive this error, the Azure Data Factory resource provider has not been registered for your subscription. Do the following: --1. Launch Azure PowerShell. -2. Sign in to your Azure account using the following command. -- ```powershell - Connect-AzAccount - ``` --3. Run the following command to register the Azure Data Factory provider. -- ```powershell - Register-AzResourceProvider -ProviderNamespace Microsoft.DataFactory - ``` --### Problem: Unauthorized error when running a Data Factory cmdlet -You are probably not using the right Azure account or subscription with Azure PowerShell. Use the following cmdlets to select the right account and subscription. --1. Connect-AzAccount - Sign in with the right user ID and password. -2. Get-AzSubscription - View all the subscriptions for the account. -3. Select-AzSubscription <subscription name> - Select the right subscription. Use the same one you used to create the data factory in the Azure portal. --### Problem: Fail to launch Data Management Gateway Express Setup from Azure portal -The Express Setup for the Data Management Gateway requires Internet Explorer or a Microsoft ClickOnce-compatible web browser. If the Express Setup fails to start, do one of the following: --* Use Internet Explorer or a Microsoft ClickOnce-compatible web browser. -- If you are using Chrome, go to the [Chrome web store](https://chrome.google.com/webstore/), search for "ClickOnce", choose one of the ClickOnce extensions, and install it. -- Do the same for Firefox (install the add-on). Click the Open Menu button on the toolbar (three horizontal lines in the top-right corner), click Add-ons, search for "ClickOnce", choose one of the ClickOnce extensions, and install it. -* Use the **Manual Setup** link shown on the same blade in the portal. With this approach, you download the installation file and run it manually. After the installation is successful, you see the Data Management Gateway Configuration dialog box. Copy the **key** from the portal screen and use it in the configuration manager to manually register the gateway with the service. --### Problem: Fail to connect to SQL Server -Launch **Data Management Gateway Configuration Manager** on the gateway machine and use the **Troubleshooting** tab to test the connection to SQL Server from the gateway machine. See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for tips on troubleshooting connection and gateway-related issues. --### Problem: Input slices are in Waiting state forever -Slices can remain in the **Waiting** state for various reasons. One of the most common is that the **external** property is not set to **true**. 
Any dataset that is produced outside the scope of Azure Data Factory should be marked with the **external** property. This property indicates that the data is external and not backed by any pipelines within the data factory. The data slices are marked as **Ready** once the data is available in the respective store. --See the following example for the usage of the **external** property. You can optionally specify the **externalData** policy settings when you set **external** to **true**. --See the [Datasets](data-factory-create-datasets.md) article for more details about this property. --```json -{ - "name": "CustomerTable", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "MyLinkedService", - "typeProperties": { - "folderPath": "MyContainer/MySubFolder/", - "format": { - "type": "TextFormat", - "columnDelimiter": ",", - "rowDelimiter": ";" - } - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - } - } -} -``` --To resolve the error, add the **external** property and the optional **externalData** section to the JSON definition of the input table and recreate the table. --### Problem: Hybrid copy operation fails -See [Troubleshoot gateway issues](data-factory-data-management-gateway.md#troubleshooting-gateway-issues) for steps to troubleshoot issues with copying to/from an on-premises data store using the Data Management Gateway. --### Problem: On-demand HDInsight provisioning fails -When using a linked service of type HDInsightOnDemand, you need to specify a linkedServiceName that points to an Azure Blob storage account. The Data Factory service uses this storage to store logs and supporting files for your on-demand HDInsight cluster. Sometimes provisioning of an on-demand HDInsight cluster fails with the following error: --``` -Failed to create cluster. Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind state: 'Error'. Message: 'StorageAccountNotColocated'. -``` --This error usually indicates that the storage account specified in the linkedServiceName is not in the same region where the HDInsight cluster is being provisioned. For example, if your data factory is in West US and the Azure storage account is in East US, the on-demand provisioning fails in West US. --Additionally, there is a second JSON property, additionalLinkedServiceNames, where additional storage accounts can be specified for on-demand HDInsight. Those additional linked storage accounts should be in the same location as the HDInsight cluster, or provisioning fails with the same error. --### Problem: Custom .NET activity fails -See [Debug a pipeline with custom activity](data-factory-use-custom-activities.md#troubleshoot-failures) for detailed steps. --## Use Azure portal to troubleshoot -### Using portal blades -See [Monitor pipeline](data-factory-monitor-manage-pipelines.md) for steps. --### Using Monitor and Manage App -See [Monitor and manage data factory pipelines using Monitor and Manage App](data-factory-monitor-manage-app.md) for details. --## Use Azure PowerShell to troubleshoot -### Use Azure PowerShell to troubleshoot an error -See [Monitor Data Factory pipelines using Azure PowerShell](data-factory-monitor-manage-pipelines.md) for details. 
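As a companion to the monitoring article linked above, the following sketch shows the general shape of checking slice and activity-run status from Azure PowerShell. The cmdlet names and parameters (Get-AzDataFactorySlice and Get-AzDataFactoryRun from the Az.DataFactory module) and the resource names are assumptions for illustration; refer to the linked monitoring article for the authoritative syntax.

```powershell
# Hypothetical names; replace with your own resource group, data factory, and dataset.
$rg      = "ADFTutorialResourceGroup"
$factory = "MyDataFactory"

# Inspect the slices for a dataset in a time window and look for any in a Failed state.
Get-AzDataFactorySlice -ResourceGroupName $rg -DataFactoryName $factory `
    -DatasetName "OutputDataset" -StartDateTime "2016-11-16T00:00:00Z" -EndDateTime "2016-11-16T05:00:00Z" |
    Format-List

# Drill into the activity runs for one slice (identified by its start time) to read the error message.
Get-AzDataFactoryRun -ResourceGroupName $rg -DataFactoryName $factory `
    -DatasetName "OutputDataset" -StartDateTime "2016-11-16T00:00:00Z" |
    Format-List
```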
--[adfgetstarted]: data-factory-copy-data-from-azure-blob-storage-to-sql-database.md -[use-custom-activities]: data-factory-use-custom-activities.md -[troubleshoot]: data-factory-troubleshoot.md -[developer-reference]: /previous-versions/azure/dn834987(v=azure.100) -[cmdlet-reference]: /powershell/resourcemanager/Azurerm.DataFactories/v2.2.0/Azurerm.DataFactories -[json-scripting-reference]: /previous-versions/azure/dn835050(v=azure.100) --[image-data-factory-troubleshoot-with-error-link]: ./media/data-factory-troubleshoot/DataFactoryWithErrorLink.png --[image-data-factory-troubleshoot-datasets-with-errors-blade]: ./media/data-factory-troubleshoot/DatasetsWithErrorsBlade.png --[image-data-factory-troubleshoot-table-blade-with-problem-slices]: ./media/data-factory-troubleshoot/TableBladeWithProblemSlices.png --[image-data-factory-troubleshoot-activity-run-with-error]: ./media/data-factory-troubleshoot/ActivityRunDetailsWithError.png --[image-data-factory-troubleshoot-dataslice-blade-with-active-runs]: ./media/data-factory-troubleshoot/DataSliceBladeWithActivityRuns.png --[image-data-factory-troubleshoot-walkthrough2-with-errors-link]: ./media/data-factory-troubleshoot/Walkthrough2WithErrorsLink.png --[image-data-factory-troubleshoot-walkthrough2-datasets-with-errors]: ./media/data-factory-troubleshoot/Walkthrough2DataSetsWithErrors.png --[image-data-factory-troubleshoot-walkthrough2-table-with-problem-slices]: ./media/data-factory-troubleshoot/Walkthrough2TableProblemSlices.png --[image-data-factory-troubleshoot-walkthrough2-slice-activity-runs]: ./media/data-factory-troubleshoot/Walkthrough2DataSliceActivityRuns.png --[image-data-factory-troubleshoot-activity-run-details]: ./media/data-factory-troubleshoot/Walkthrough2ActivityRunDetails.png |
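Related to the "subscription is not registered" tip in the troubleshooting article above, you can also confirm the provider's registration state after running `Register-AzResourceProvider`. A minimal sketch using the Az PowerShell module:

```powershell
# Check whether the Microsoft.DataFactory resource provider is registered for the current subscription.
Get-AzResourceProvider -ProviderNamespace Microsoft.DataFactory |
    Select-Object ProviderNamespace, RegistrationState
```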
data-factory | Data Factory Use Custom Activities | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-use-custom-activities.md | - Title: Use custom activities in an Azure Data Factory pipeline -description: Learn how to create custom activities and use them in an Azure Data Factory pipeline. --- Previously updated : 04/12/2023------# Use custom activities in an Azure Data Factory version 1 pipeline -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-use-custom-activities.md) -> * [Version 2 (current version)](../transform-data-using-dotnet-custom-activity.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Custom activities in V2](../transform-data-using-dotnet-custom-activity.md). --There are two types of activities that you can use in an Azure Data Factory pipeline. --- [Data Movement Activities](data-factory-data-movement-activities.md) to move data between [supported source and sink data stores](data-factory-data-movement-activities.md#supported-data-stores-and-formats).-- [Data Transformation Activities](data-factory-data-transformation-activities.md) to transform data using compute services such as Azure HDInsight and Azure Batch.--To move data to/from a data store that Data Factory does not support, create a **custom activity** with your own data movement logic and use the activity in a pipeline. Similarly, to transform/process data in a way that isn't supported by Data Factory, create a custom activity with your own data transformation logic and use the activity in a pipeline. --You can configure a custom activity to run on an **Azure Batch** pool of virtual machines. When using Azure Batch, you can use only an existing Azure Batch pool. --The following walkthrough provides step-by-step instructions for creating a custom .NET activity and using the custom activity in a pipeline. The walkthrough uses an **Azure Batch** linked service. --> [!IMPORTANT] -> - It is not possible to use a Data Management Gateway from a custom activity to access on-premises data sources. Currently, [Data Management Gateway](data-factory-data-management-gateway.md) supports only the copy activity and stored procedure activity in Data Factory. --## Walkthrough: create a custom activity -### Prerequisites -* Visual Studio 2012/2013/2015/2017 -* Download and install [Azure .NET SDK](https://azure.microsoft.com/downloads/) --### Azure Batch prerequisites -In the walkthrough, you run your custom .NET activities using Azure Batch as a compute resource. **Azure Batch** is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. Azure Batch schedules compute-intensive work to run on a managed **collection of virtual machines**, and can automatically scale compute resources to meet the needs of your jobs. See [Azure Batch basics][batch-technical-overview] article for a detailed overview of the Azure Batch service. --For the tutorial, create an Azure Batch account with a pool of VMs. Here are the steps: --1. Create an **Azure Batch account** using the [Azure portal](https://portal.azure.com). See [Create and manage an Azure Batch account][batch-create-account] article for instructions. -2. Note down the Azure Batch account name, account key, URI, and pool name. You need them to create an Azure Batch linked service. - 1. 
On the home page for Azure Batch account, you see a **URL** in the following format: `https://myaccount.westus.batch.azure.com`. In this example, **myaccount** is the name of the Azure Batch account. URI you use in the linked service definition is the URL without the name of the account. For example: `https://<region>.batch.azure.com`. - 2. Click **Keys** on the left menu, and copy the **PRIMARY ACCESS KEY**. - 3. To use an existing pool, click **Pools** on the menu, and note down the **ID** of the pool. If you don't have an existing pool, move to the next step. -2. Create an **Azure Batch pool**. -- 1. In the [Azure portal](https://portal.azure.com), click **Browse** in the left menu, and click **Batch Accounts**. - 2. Select your Azure Batch account to open the **Batch Account** blade. - 3. Click **Pools** tile. - 4. In the **Pools** blade, click Add button on the toolbar to add a pool. - 1. Enter an ID for the pool (Pool ID). Note the **ID of the pool**; you need it when creating the Data Factory solution. - 2. Specify **Windows Server 2012 R2** for the Operating System Family setting. - 3. Select a **node pricing tier**. - 4. Enter **2** as value for the **Target Dedicated** setting. - 5. Enter **2** as value for the **Max tasks per node** setting. - 5. Click **OK** to create the pool. - 6. Note down the **ID** of the pool. --### High-level steps -Here are the two high-level steps you perform as part of this walkthrough: --1. Create a custom activity that contains simple data transformation/processing logic. -2. Create an Azure data factory with a pipeline that uses the custom activity. --### Create a custom activity -To create a .NET custom activity, create a **.NET Class Library** project with a class that implements that **IDotNetActivity** interface. This interface has only one method: [Execute](/dotnet/api/microsoft.azure.management.datafactories.runtime.idotnetactivity) and its signature is: --```csharp -public IDictionary<string, string> Execute( - IEnumerable<LinkedService> linkedServices, - IEnumerable<Dataset> datasets, - Activity activity, - IActivityLogger logger) -``` --The method takes four parameters: --- **linkedServices**. This property is an enumerable list of Data Store linked services referenced by input/output datasets for the activity.-- **datasets**. This property is an enumerable list of input/output datasets for the activity. You can use this parameter to get the locations and schemas defined by input and output datasets.-- **activity**. This property represents the current activity. It can be used to access extended properties associated with the custom activity. See [Access extended properties](#access-extended-properties) for details.-- **logger**. This object lets you write debug comments that surface in the user log for the pipeline.--The method returns a dictionary that can be used to chain custom activities together in the future. This feature is not implemented yet, so return an empty dictionary from the method. --### Procedure --1. Create a **.NET Class Library** project. - - <ol type="a"> - <li>Launch Visual Studio.</li> - <li>Click <b>File</b>, point to <b>New</b>, and click <b>Project</b>.</li> - <li>Expand <b>Templates</b>, and select <b>Visual C#</b>. In this walkthrough, you use C#, but you can use any .NET language to develop the custom activity.</li> - <li>Select <b>Class Library</b> from the list of project types on the right. 
In Visual Studio, choose <b>Class Library (.NET Framework)</b> </li> - <li>Enter <b>MyDotNetActivity</b> for the <b>Name</b>.</li> - <li>Select <b>C:\ADFGetStarted</b> for the <b>Location</b>.</li> - <li>Click <b>OK</b> to create the project.</li> - </ol> --2. Click **Tools**, point to **NuGet Package Manager**, and click **Package Manager Console**. --3. In the Package Manager Console, execute the following command to import **Microsoft.Azure.Management.DataFactories**. -- ```powershell - Install-Package Microsoft.Azure.Management.DataFactories - ``` --4. Import the **Azure Storage** NuGet package in to the project. -- ```powershell - Install-Package WindowsAzure.Storage -Version 4.3.0 - ``` -- > [!IMPORTANT] - > Data Factory service launcher requires the 4.3 version of WindowsAzure.Storage. If you add a reference to a later version of Azure Storage assembly in your custom activity project, you see an error when the activity executes. To resolve the error, see [Appdomain isolation](#appdomain-isolation) section. -5. Add the following **using** statements to the source file in the project. -- ```csharp -- // Comment these lines if using VS 2017 - using System.IO; - using System.Globalization; - using System.Diagnostics; - using System.Linq; - // -- -- // Comment these lines if using <= VS 2015 - using System; - using System.Collections.Generic; - using System.Linq; - using System.Text; - using System.Threading.Tasks; - // -- using Microsoft.Azure.Management.DataFactories.Models; - using Microsoft.Azure.Management.DataFactories.Runtime; -- using Microsoft.WindowsAzure.Storage; - using Microsoft.WindowsAzure.Storage.Blob; - ``` --6. Change the name of the **namespace** to **MyDotNetActivityNS**. -- ```csharp - namespace MyDotNetActivityNS - ``` --7. Change the name of the class to **MyDotNetActivity** and derive it from the **IDotNetActivity** interface as shown in the following code snippet: -- ```csharp - public class MyDotNetActivity : IDotNetActivity - ``` --8. Implement (Add) the **Execute** method of the **IDotNetActivity** interface to the **MyDotNetActivity** class and copy the following sample code to the method. -- The following sample counts the number of occurrences of the search term ("Microsoft") in each blob associated with a data slice. -- ```csharp - /// <summary> - /// Execute method is the only method of IDotNetActivity interface you must implement. - /// In this sample, the method invokes the Calculate method to perform the core logic. 
- /// </summary> -- public IDictionary<string, string> Execute( - IEnumerable<LinkedService> linkedServices, - IEnumerable<Dataset> datasets, - Activity activity, - IActivityLogger logger) - { - // get extended properties defined in activity JSON definition - // (for example: SliceStart) - DotNetActivity dotNetActivity = (DotNetActivity)activity.TypeProperties; - string sliceStartString = dotNetActivity.ExtendedProperties["SliceStart"]; -- // to log information, use the logger object - // log all extended properties - IDictionary<string, string> extendedProperties = dotNetActivity.ExtendedProperties; - logger.Write("Logging extended properties if any..."); - foreach (KeyValuePair<string, string> entry in extendedProperties) - { - logger.Write("<key:{0}> <value:{1}>", entry.Key, entry.Value); - } -- // linked service for input and output data stores - // in this example, same storage is used for both input/output - AzureStorageLinkedService inputLinkedService; -- // get the input dataset - Dataset inputDataset = datasets.Single(dataset => dataset.Name == activity.Inputs.Single().Name); -- // declare variables to hold type properties of input/output datasets - AzureBlobDataset inputTypeProperties, outputTypeProperties; -- // get type properties from the dataset object - inputTypeProperties = inputDataset.Properties.TypeProperties as AzureBlobDataset; -- // log linked services passed in linkedServices parameter - // you will see two linked services of type: AzureStorage - // one for input dataset and the other for output dataset - foreach (LinkedService ls in linkedServices) - logger.Write("linkedService.Name {0}", ls.Name); -- // get the first Azure Storage linked service from linkedServices object - // using First method instead of Single since we are using the same - // Azure Storage linked service for input and output. - inputLinkedService = linkedServices.First( - linkedService => - linkedService.Name == - inputDataset.Properties.LinkedServiceName).Properties.TypeProperties - as AzureStorageLinkedService; -- // get the connection string in the linked service - string connectionString = inputLinkedService.ConnectionString; -- // get the folder path from the input dataset definition - string folderPath = GetFolderPath(inputDataset); - string output = string.Empty; // for use later. -- // create storage client for input. Pass the connection string. - CloudStorageAccount inputStorageAccount = CloudStorageAccount.Parse(connectionString); - CloudBlobClient inputClient = inputStorageAccount.CreateCloudBlobClient(); -- // initialize the continuation token before using it in the do-while loop. - BlobContinuationToken continuationToken = null; - do - { // get the list of input blobs from the input storage client object. - BlobResultSegment blobList = inputClient.ListBlobsSegmented(folderPath, - true, - BlobListingDetails.Metadata, - null, - continuationToken, - null, - null); -- // Calculate method returns the number of occurrences of - // the search term ("Microsoft") in each blob associated - // with the data slice. definition of the method is shown in the next step. -- output = Calculate(blobList, logger, folderPath, ref continuationToken, "Microsoft"); -- } while (continuationToken != null); -- // get the output dataset using the name of the dataset matched to a name in the Activity output collection. 
- Dataset outputDataset = datasets.Single(dataset => dataset.Name == activity.Outputs.Single().Name); -- // get type properties for the output dataset - outputTypeProperties = outputDataset.Properties.TypeProperties as AzureBlobDataset; -- // get the folder path from the output dataset definition - folderPath = GetFolderPath(outputDataset); -- // log the output folder path - logger.Write("Writing blob to the folder: {0}", folderPath); -- // create a storage object for the output blob. - CloudStorageAccount outputStorageAccount = CloudStorageAccount.Parse(connectionString); - // write the name of the file. - Uri outputBlobUri = new Uri(outputStorageAccount.BlobEndpoint, folderPath + "/" + GetFileName(outputDataset)); -- // log the output file name - logger.Write("output blob URI: {0}", outputBlobUri.ToString()); -- // create a blob and upload the output text. - CloudBlockBlob outputBlob = new CloudBlockBlob(outputBlobUri, outputStorageAccount.Credentials); - logger.Write("Writing {0} to the output blob", output); - outputBlob.UploadText(output); -- // The dictionary can be used to chain custom activities together in the future. - // This feature is not implemented yet, so just return an empty dictionary. -- return new Dictionary<string, string>(); - } - ``` --9. Add the following helper methods: -- ```csharp - /// <summary> - /// Gets the folderPath value from the input/output dataset. - /// </summary> -- private static string GetFolderPath(Dataset dataArtifact) - { - if (dataArtifact == null || dataArtifact.Properties == null) - { - return null; - } -- // get type properties of the dataset - AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; - if (blobDataset == null) - { - return null; - } -- // return the folder path found in the type properties - return blobDataset.FolderPath; - } -- /// <summary> - /// Gets the fileName value from the input/output dataset. - /// </summary> -- private static string GetFileName(Dataset dataArtifact) - { - if (dataArtifact == null || dataArtifact.Properties == null) - { - return null; - } -- // get type properties of the dataset - AzureBlobDataset blobDataset = dataArtifact.Properties.TypeProperties as AzureBlobDataset; - if (blobDataset == null) - { - return null; - } -- // return the blob/file name in the type properties - return blobDataset.FileName; - } -- /// <summary> - /// Iterates through each blob (file) in the folder, counts the number of instances of search term in the file, - /// and prepares the output text that is written to the output blob. 
- /// </summary> -- public static string Calculate(BlobResultSegment Bresult, IActivityLogger logger, string folderPath, ref BlobContinuationToken token, string searchTerm) - { - string output = string.Empty; - logger.Write("number of blobs found: {0}", Bresult.Results.Count<IListBlobItem>()); - foreach (IListBlobItem listBlobItem in Bresult.Results) - { - CloudBlockBlob inputBlob = listBlobItem as CloudBlockBlob; - if ((inputBlob != null) && (inputBlob.Name.IndexOf("$$$.$$$") == -1)) - { - string blobText = inputBlob.DownloadText(Encoding.ASCII, null, null, null); - logger.Write("input blob text: {0}", blobText); - string[] source = blobText.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' }, StringSplitOptions.RemoveEmptyEntries); - var matchQuery = from word in source - where word.ToLowerInvariant() == searchTerm.ToLowerInvariant() - select word; - int wordCount = matchQuery.Count(); - output += string.Format("{0} occurrences(s) of the search term \"{1}\" were found in the file {2}.\r\n", wordCount, searchTerm, inputBlob.Name); - } - } - return output; - } - ``` -- The GetFolderPath method returns the path to the folder that the dataset points to and the GetFileName method returns the name of the blob/file that the dataset points to. If you have folderPath defines using variables such as {Year}, {Month}, {Day} etc., the method returns the string as it is without replacing them with runtime values. See [Access extended properties](#access-extended-properties) section for details on accessing SliceStart, SliceEnd, etc. -- ```JSON - "name": "InputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "file.txt", - "folderPath": "adftutorial/inputfolder/", - ``` -- The Calculate method calculates the number of instances of keyword Microsoft in the input files (blobs in the folder). The search term ("Microsoft") is hard-coded in the code. --10. Compile the project. Click **Build** from the menu and click **Build Solution**. -- > [!IMPORTANT] - > Set 4.5.2 version of .NET Framework as the target framework for your project: right-click the project, and click **Properties** to set the target framework. Data Factory does not support custom activities compiled against .NET Framework versions later than 4.5.2. --11. Launch **Windows Explorer**, and navigate to **bin\debug** or **bin\release** folder depending on the type of build. --12. Create a zip file **MyDotNetActivity.zip** that contains all the binaries in the \<project folder\>\bin\Debug folder. Include the **MyDotNetActivity.pdb** file so that you get additional details such as line number in the source code that caused the issue if there was a failure. -- > [!IMPORTANT] - > All the files in the zip file for the custom activity must be at the **top level** with no sub folders. -- :::image type="content" source="./media/data-factory-use-custom-activities/Binaries.png" alt-text="Binary output files"::: --13. Create a blob container named **customactivitycontainer** if it does not already exist. --14. Upload MyDotNetActivity.zip as a blob to the customactivitycontainer in a **general-purpose** Azure blob storage (not hot/cool Blob storage) that is referred by AzureStorageLinkedService. 
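If you prefer to script the container creation and zip upload in the preceding steps, a sketch with the Az.Storage cmdlets might look like the following. The storage account placeholders and the local path to the zip file are assumptions based on the project location used earlier in this walkthrough.

```powershell
# Placeholders: use the general-purpose storage account referenced by AzureStorageLinkedService.
$ctx = New-AzStorageContext -StorageAccountName "<accountname>" -StorageAccountKey "<accountkey>"

# Create the container if it does not exist yet, then upload the custom activity package.
New-AzStorageContainer -Name "customactivitycontainer" -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File "C:\ADFGetStarted\MyDotNetActivity\bin\Debug\MyDotNetActivity.zip" `
    -Container "customactivitycontainer" -Blob "MyDotNetActivity.zip" -Context $ctx -Force
```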
--> [!IMPORTANT] -> If you add this .NET activity project to a solution in Visual Studio that contains a Data Factory project, and add a reference to .NET activity project from the Data Factory application project, you do not need to perform the last two steps of manually creating the zip file and uploading it to the general-purpose Azure blob storage. When you publish Data Factory entities using Visual Studio, these steps are automatically done by the publishing process. For more information, see [Data Factory project in Visual Studio](#data-factory-project-in-visual-studio) section. --## Create a pipeline with custom activity --You have created a custom activity and uploaded the zip file with binaries to a blob container in a **general-purpose** Azure Storage Account. In this section, you create an Azure data factory with a pipeline that uses the custom activity. --The input dataset for the custom activity represents blobs (files) in the customactivityinput folder of adftutorial container in the blob storage. The output dataset for the activity represents output blobs in the customactivityoutput folder of adftutorial container in the blob storage. --Create **file.txt** file with the following content and upload it to **customactivityinput** folder of the **adftutorial** container. Create the adftutorial container if it does not exist already. --``` -test custom activity Microsoft test custom activity Microsoft -``` --The input folder corresponds to a slice in Azure Data Factory even if the folder has two or more files. When each slice is processed by the pipeline, the custom activity iterates through all the blobs in the input folder for that slice. --You see one output file with in the adftutorial\customactivityoutput folder with one or more lines (same as number of blobs in the input folder): --``` -2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2016-11-16-00/file.txt. -``` ---Here are the steps you perform in this section: --1. Create a **data factory**. -2. Create **Linked services** for the Azure Batch pool of VMs on which the custom activity runs and the Azure Storage that holds the input/output blobs. -3. Create input and output **datasets** that represent input and output of the custom activity. -4. Create a **pipeline** that uses the custom activity. --> [!NOTE] -> Create the **file.txt** and upload it to a blob container if you haven't already done so. See instructions in the preceding section. --### Step 1: Create the data factory -1. After logging in to the Azure portal, do the following steps: - 1. Click **Create a resource** on the left menu. - 2. Click **Data + Analytics** in the **New** blade. - 3. Click **Data Factory** on the **Data analytics** blade. -- :::image type="content" source="media/data-factory-use-custom-activities/new-azure-data-factory-menu.png" alt-text="New Azure Data Factory menu"::: -2. In the **New data factory** blade, enter **CustomActivityFactory** for the Name. The name of the Azure data factory must be globally unique. If you receive the error: **Data factory name "CustomActivityFactory" is not available**, change the name of the data factory (for example, **yournameCustomActivityFactory**) and try creating again. -- :::image type="content" source="media/data-factory-use-custom-activities/new-azure-data-factory-blade.png" alt-text="New Azure Data Factory blade"::: -3. Click **RESOURCE GROUP NAME**, and select an existing resource group or create a resource group. -4. 
Verify that you are using the correct **subscription** and **region** where you want the data factory to be created. -5. Click **Create** on the **New data factory** blade. -6. You see the data factory being created in the **Dashboard** of the Azure portal. -7. After the data factory has been created successfully, you see the Data Factory blade, which shows you the contents of the data factory. -- :::image type="content" source="media/data-factory-use-custom-activities/data-factory-blade.png" alt-text="Data Factory blade"::: --### Step 2: Create linked services -Linked services link data stores or compute services to an Azure data factory. In this step, you link your Azure Storage account and Azure Batch account to your data factory. --#### Create Azure Storage linked service -1. Click the **Author and deploy** tile on the **DATA FACTORY** blade for **CustomActivityFactory**. You see the Data Factory Editor. -2. Click **New data store** on the command bar and choose **Azure storage**. You should see the JSON script for creating an Azure Storage linked service in the editor. -- :::image type="content" source="media/data-factory-use-custom-activities/new-data-store-menu.png" alt-text="New data store - Azure Storage"::: -3. Replace `<accountname>` with name of your Azure storage account and `<accountkey>` with access key of the Azure storage account. To learn how to get your storage access key, see [Manage storage account access keys](../../storage/common/storage-account-keys-manage.md). -- :::image type="content" source="media/data-factory-use-custom-activities/azure-storage-linked-service.png" alt-text="Azure Storage liked service"::: -4. Click **Deploy** on the command bar to deploy the linked service. --#### Create Azure Batch linked service -1. In the Data Factory Editor, click **... More** on the command bar, click **New compute**, and then select **Azure Batch** from the menu. -- :::image type="content" source="media/data-factory-use-custom-activities/new-azure-compute-batch.png" alt-text="New compute - Azure Batch"::: -2. Make the following changes to the JSON script: -- 1. Specify Azure Batch account name for the **accountName** property. The **URL** from the **Azure Batch account blade** is in the following format: `http://accountname.region.batch.azure.com`. For the **batchUri** property in the JSON, you need to remove `accountname.` from the URL and use the `accountname` for the `accountName` JSON property. - 2. Specify the Azure Batch account key for the **accessKey** property. - 3. Specify the name of the pool you created as part of prerequisites for the **poolName** property. You can also specify the ID of the pool instead of the name of the pool. - 4. Specify Azure Batch URI for the **batchUri** property. Example: `https://westus.batch.azure.com`. - 5. Specify the **AzureStorageLinkedService** for the **linkedServiceName** property. -- ```json - { - "name": "AzureBatchLinkedService", - "properties": { - "type": "AzureBatch", - "typeProperties": { - "accountName": "myazurebatchaccount", - "batchUri": "https://westus.batch.azure.com", - "accessKey": "<yourbatchaccountkey>", - "poolName": "myazurebatchpool", - "linkedServiceName": "AzureStorageLinkedService" - } - } - } - ``` -- For the **poolName** property, you can also specify the ID of the pool instead of the name of the pool. --### Step 3: Create datasets -In this step, you create datasets to represent input and output data. --#### Create input dataset -1. In the **Editor** for the Data Factory, click **... 
More** on the command bar, click **New dataset**, and then select **Azure Blob storage** from the drop-down menu. -2. Replace the JSON in the right pane with the following JSON snippet: -- ```json - { - "name": "InputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "folderPath": "adftutorial/customactivityinput/", - "format": { - "type": "TextFormat" - } - }, - "availability": { - "frequency": "Hour", - "interval": 1 - }, - "external": true, - "policy": {} - } - } - ``` -- You create a pipeline later in this walkthrough with start time: 2016-11-16T00:00:00Z and end time: 2016-11-16T05:00:00Z. It is scheduled to produce data hourly, so there are five input/output slices (between **00**:00:00 -> **05**:00:00). -- The **frequency** and **interval** for the input dataset is set to **Hour** and **1**, which means that the input slice is available hourly. In this sample, it is the same file (file.txt) in the intputfolder. -- Here are the start times for each slice, which is represented by SliceStart system variable in the above JSON snippet. -3. Click **Deploy** on the toolbar to create and deploy the **InputDataset**. Confirm that you see the **TABLE CREATED SUCCESSFULLY** message on the title bar of the Editor. --#### Create an output dataset -1. In the **Data Factory editor**, click **... More** on the command bar, click **New dataset**, and then select **Azure Blob storage**. -2. Replace the JSON script in the right pane with the following JSON script: -- ```JSON - { - "name": "OutputDataset", - "properties": { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": { - "fileName": "{slice}.txt", - "folderPath": "adftutorial/customactivityoutput/", - "partitionedBy": [ - { - "name": "slice", - "value": { - "type": "DateTime", - "date": "SliceStart", - "format": "yyyy-MM-dd-HH" - } - } - ] - }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -- Output location is **adftutorial/customactivityoutput/** and output file name is yyyy-MM-dd-HH.txt where yyyy-MM-dd-HH is the year, month, date, and hour of the slice being produced. See [Developer Reference][adf-developer-reference] for details. -- An output blob/file is generated for each input slice. Here is how an output file is named for each slice. All the output files are generated in one output folder: **adftutorial\customactivityoutput**. -- | Slice | Start time | Output file | - |: |: |: | - | 1 |2016-11-16T00:00:00 |2016-11-16-00.txt | - | 2 |2016-11-16T01:00:00 |2016-11-16-01.txt | - | 3 |2016-11-16T02:00:00 |2016-11-16-02.txt | - | 4 |2016-11-16T03:00:00 |2016-11-16-03.txt | - | 5 |2016-11-16T04:00:00 |2016-11-16-04.txt | -- Remember that all the files in an input folder are part of a slice with the start times mentioned above. When this slice is processed, the custom activity scans through each file and produces a line in the output file with the number of occurrences of search term ("Microsoft"). If there are three files in the input folder, there are three lines in the output file for each hourly slice: 2016-11-16-00.txt, 2016-11-16:01:00:00.txt, etc. -3. To deploy the **OutputDataset**, click **Deploy** on the command bar. --### Create and run a pipeline that uses the custom activity -1. In the Data Factory Editor, click **... More**, and then select **New pipeline** on the command bar. -2. 
Replace the JSON in the right pane with the following JSON script: -- ```JSON - { - "name": "ADFTutorialPipelineCustom", - "properties": { - "description": "Use custom activity", - "activities": [ - { - "Name": "MyDotNetActivity", - "Type": "DotNetActivity", - "Inputs": [ - { - "Name": "InputDataset" - } - ], - "Outputs": [ - { - "Name": "OutputDataset" - } - ], - "LinkedServiceName": "AzureBatchLinkedService", - "typeProperties": { - "AssemblyName": "MyDotNetActivity.dll", - "EntryPoint": "MyDotNetActivityNS.MyDotNetActivity", - "PackageLinkedService": "AzureStorageLinkedService", - "PackageFile": "customactivitycontainer/MyDotNetActivity.zip", - "extendedProperties": { - "SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))" - } - }, - "Policy": { - "Concurrency": 2, - "ExecutionPriorityOrder": "OldestFirst", - "Retry": 3, - "Timeout": "00:30:00", - "Delay": "00:00:00" - } - } - ], - "start": "2016-11-16T00:00:00Z", - "end": "2016-11-16T05:00:00Z", - "isPaused": false - } - } - ``` -- Note the following points: -- * **Concurrency** is set to **2** so that two slices are processed in parallel by 2 VMs in the Azure Batch pool. - * There is one activity in the activities section and it is of type: **DotNetActivity**. - * **AssemblyName** is set to the name of the DLL: **MyDotnetActivity.dll**. - * **EntryPoint** is set to **MyDotNetActivityNS.MyDotNetActivity**. - * **PackageLinkedService** is set to **AzureStorageLinkedService** that points to the blob storage that contains the custom activity zip file. If you are using different Azure Storage accounts for input/output files and the custom activity zip file, you create another Azure Storage linked service. This article assumes that you are using the same Azure Storage account. - * **PackageFile** is set to **customactivitycontainer/MyDotNetActivity.zip**. It is in the format: containerforthezip/nameofthezip.zip. - * The custom activity takes **InputDataset** as input and **OutputDataset** as output. - * The linkedServiceName property of the custom activity points to the **AzureBatchLinkedService**, which tells Azure Data Factory that the custom activity needs to run on Azure Batch VMs. - * **isPaused** property is set to **false** by default. The pipeline runs immediately in this example because the slices start in the past. You can set this property to true to pause the pipeline and set it back to false to restart. - * The **start** time and **end** times are **five** hours apart and slices are produced hourly, so five slices are produced by the pipeline. -3. To deploy the pipeline, click **Deploy** on the command bar. --### Monitor the pipeline -1. In the Data Factory blade in the Azure portal, click **Diagram**. -- :::image type="content" source="./media/data-factory-use-custom-activities/DataFactoryBlade.png" alt-text="Diagram tile"::: -2. In the Diagram View, now click the OutputDataset. -- :::image type="content" source="./media/data-factory-use-custom-activities/diagram.png" alt-text="Diagram view"::: -3. You should see that the five output slices are in the Ready state. If they are not in the Ready state, they haven't been produced yet. -- :::image type="content" source="./media/data-factory-use-custom-activities/OutputSlices.png" alt-text="Output slices"::: -4. Verify that the output files are generated in the blob storage in the **adftutorial** container. -- ![output from custom activity][image-data-factory-output-from-custom-activity] -5. 
If you open the output file, you should see the output similar to the following output: -- ``` - 2 occurrences(s) of the search term "Microsoft" were found in the file inputfolder/2016-11-16-00/file.txt. - ``` -6. Use the [Azure portal](https://portal.azure.com) or Azure PowerShell cmdlets to monitor your data factory, pipelines, and data sets. You can see messages from the **ActivityLogger** in the code for the custom activity in the logs (specifically user-0.log) that you can download from the portal or using cmdlets. -- ![download logs from custom activity][image-data-factory-download-logs-from-custom-activity] --See [Monitor and Manage Pipelines](data-factory-monitor-manage-pipelines.md) for detailed steps for monitoring datasets and pipelines. --## Data Factory project in Visual Studio -You can create and publish Data Factory entities by using Visual Studio instead of using Azure portal. For detailed information about creating and publishing Data Factory entities by using Visual Studio, See [Build your first pipeline using Visual Studio](data-factory-build-your-first-pipeline-using-vs.md) and [Copy data from Azure Blob to Azure SQL](data-factory-copy-activity-tutorial-using-visual-studio.md) articles. --Do the following additional steps if you are creating Data Factory project in Visual Studio: --1. Add the Data Factory project to the Visual Studio solution that contains the custom activity project. -2. Add a reference to the .NET activity project from the Data Factory project. Right-click Data Factory project, point to **Add**, and then click **Reference**. -3. In the **Add Reference** dialog box, select the **MyDotNetActivity** project, and click **OK**. -4. Build and publish the solution. -- > [!IMPORTANT] - > When you publish Data Factory entities, a zip file is automatically created for you and is uploaded to the blob container: customactivitycontainer. If the blob container does not exist, it is automatically created too. --## Data Factory and Batch integration -The Data Factory service creates a job in Azure Batch with the name: **adf-poolname: job-xxx**. Click **Jobs** from the left menu. ---A task is created for each activity run of a slice. If there are five slices ready to be processed, five tasks are created in this job. If there are multiple compute nodes in the Batch pool, two or more slices can run in parallel. If the maximum tasks per compute node is set to > 1, you can also have more than one slice running on the same compute. ---The following diagram illustrates the relationship between Azure Data Factory and Batch tasks. ---## Troubleshoot failures -Troubleshooting consists of a few basic techniques: --1. If you see the following error, you may be using a Hot/Cool blob storage instead of using a general-purpose Azure blob storage. Upload the zip file to a **general-purpose Azure Storage Account**. -- ``` - Error in Activity: Job encountered scheduling error. Code: BlobDownloadMiscError Category: ServerError Message: Miscellaneous error encountered while downloading one of the specified Azure Blob(s). - ``` -2. If you see the following error, confirm that the name of the class in the CS file matches the name you specified for the **EntryPoint** property in the pipeline JSON. In the walkthrough, name of the class is: MyDotNetActivity, and the EntryPoint in the JSON is: MyDotNetActivityNS.**MyDotNetActivity**. 
-- ``` - MyDotNetActivity assembly does not exist or doesn't implement the type Microsoft.DataFactories.Runtime.IDotNetActivity properly - ``` -- If the names do match, confirm that all the binaries are in the **root folder** of the zip file. That is, when you open the zip file, you should see all the files in the root folder, not in any sub folders. -3. If the input slice is not set to **Ready**, confirm that the input folder structure is correct and **file.txt** exists in the input folders. -3. In the **Execute** method of your custom activity, use the **IActivityLogger** object to log information that helps you troubleshoot issues. The logged messages show up in the user log files (one or more files named: user-0.log, user-1.log, user-2.log, etc.). -- In the **OutputDataset** blade, click the slice to see the **DATA SLICE** blade for that slice. You see **activity runs** for that slice. You should see one activity run for the slice. If you click Run in the command bar, you can start another activity run for the same slice. -- When you click the activity run, you see the **ACTIVITY RUN DETAILS** blade with a list of log files. You see logged messages in the user_0.log file. When an error occurs, you see three activity runs because the retry count is set to 3 in the pipeline/activity JSON. When you click the activity run, you see the log files that you can review to troubleshoot the error. -- In the list of log files, click the **user-0.log**. In the right panel are the results of using the **IActivityLogger.Write** method. If you don't see all messages, check if you have more log files named: user_1.log, user_2.log etc. Otherwise, the code may have failed after the last logged message. -- In addition, check **system-0.log** for any system error messages and exceptions. -4. Include the **PDB** file in the zip file so that the error details have information such as **call stack** when an error occurs. -5. All the files in the zip file for the custom activity must be at the **top level** with no sub folders. -6. Ensure that the **assemblyName** (MyDotNetActivity.dll), **entryPoint**(MyDotNetActivityNS.MyDotNetActivity), **packageFile** (customactivitycontainer/MyDotNetActivity.zip), and **packageLinkedService** (should point to the **general-purpose**Azure blob storage that contains the zip file) are set to correct values. -7. If you fixed an error and want to reprocess the slice, right-click the slice in the **OutputDataset** blade and click **Run**. -8. If you see the following error, you are using the Azure Storage package of version > 4.3.0. Data Factory service launcher requires the 4.3 version of WindowsAzure.Storage. See [Appdomain isolation](#appdomain-isolation) section for a work-around if you must use the later version of Azure Storage assembly. -- ``` - Error in Activity: Unknown error in module: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. > System.TypeLoadException: Could not load type 'Microsoft.WindowsAzure.Storage.Blob.CloudBlob' from assembly 'Microsoft.WindowsAzure.Storage, Version=4.3.0.0, Culture=neutral, - ``` -- If you can use the 4.3.0 version of Azure Storage package, remove the existing reference to Azure Storage package of version > 4.3.0. Then, run the following command from NuGet Package Manager Console. -- ```powershell - Install-Package WindowsAzure.Storage -Version 4.3.0 - ``` -- Build the project. Delete Azure.Storage assembly of version > 4.3.0 from the bin\Debug folder. 
Create a zip file with binaries and the PDB file. Replace the old zip file with this one in the blob container (customactivitycontainer). Rerun the slices that failed (right-click slice, and click Run). -8. The custom activity does not use the **app.config** file from your package. Therefore, if your code reads any connection strings from the configuration file, it does not work at runtime. The best practice when using Azure Batch is to hold any secrets in an **Azure KeyVault**, use a certificate-based service principal to protect the **keyvault**, and distribute the certificate to Azure Batch pool. The .NET custom activity then can access secrets from the KeyVault at runtime. This solution is a generic solution and can scale to any type of secret, not just connection string. -- There is an easier workaround (but not a best practice): you can create an **Azure SQL linked service** with connection string settings, create a dataset that uses the linked service, and chain the dataset as a dummy input dataset to the custom .NET activity. You can then access the linked service's connection string in the custom activity code. --## Update custom activity -If you update the code for the custom activity, build it, and upload the zip file that contains new binaries to the blob storage. --## Appdomain isolation -See [Cross AppDomain Sample](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/CrossAppDomainDotNetActivitySample) that shows you how to create a custom activity that is not constrained to assembly versions used by the Data Factory launcher (example: WindowsAzure.Storage v4.3.0, Newtonsoft.Json v6.0.x, etc.). --## Access extended properties -You can declare extended properties in the activity JSON as shown in the following sample: --```JSON -"typeProperties": { - "AssemblyName": "MyDotNetActivity.dll", - "EntryPoint": "MyDotNetActivityNS.MyDotNetActivity", - "PackageLinkedService": "AzureStorageLinkedService", - "PackageFile": "customactivitycontainer/MyDotNetActivity.zip", - "extendedProperties": { - "SliceStart": "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))", - "DataFactoryName": "CustomActivityFactory" - } -}, -``` --In the example, there are two extended properties: **SliceStart** and **DataFactoryName**. The value for SliceStart is based on the SliceStart system variable. See [System Variables](data-factory-functions-variables.md) for a list of supported system variables. The value for DataFactoryName is hard-coded to CustomActivityFactory. --To access these extended properties in the **Execute** method, use code similar to the following code: --```csharp -// to get extended properties (for example: SliceStart) -DotNetActivity dotNetActivity = (DotNetActivity)activity.TypeProperties; -string sliceStartString = dotNetActivity.ExtendedProperties["SliceStart"]; --// to log all extended properties -IDictionary<string, string> extendedProperties = dotNetActivity.ExtendedProperties; -logger.Write("Logging extended properties if any..."); -foreach (KeyValuePair<string, string> entry in extendedProperties) -{ - logger.Write("<key:{0}> <value:{1}>", entry.Key, entry.Value); -} -``` --## Auto-scaling of Azure Batch -You can also create an Azure Batch pool with **autoscale** feature. For example, you could create an azure batch pool with 0 dedicated VMs and an autoscale formula based on the number of pending tasks. --The sample formula here achieves the following behavior: When the pool is initially created, it starts with 1 VM. 
$PendingTasks metric defines the number of tasks in running + active (queued) state. The formula finds the average number of pending tasks in the last 180 seconds and sets TargetDedicated accordingly. It ensures that TargetDedicated never goes beyond 25 VMs. So, as new tasks are submitted, pool automatically grows and as tasks complete, VMs become free one by one and the autoscaling shrinks those VMs. startingNumberOfVMs and maxNumberofVMs can be adjusted to your needs. --Autoscale formula: --``` -startingNumberOfVMs = 1; -maxNumberofVMs = 25; -pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second); -pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second)); -$TargetDedicated=min(maxNumberofVMs,pendingTaskSamples); -``` --See [Automatically scale compute nodes in an Azure Batch pool](../../batch/batch-automatic-scaling.md) for details. --If the pool is using the default [autoScaleEvaluationInterval](/rest/api/batchservice/pool/enableautoscale), the Batch service could take 15-30 minutes to prepare the VM before running the custom activity. If the pool is using a different autoScaleEvaluationInterval, the Batch service could take autoScaleEvaluationInterval + 10 minutes. ---## Create a custom activity by using .NET SDK -In the walkthrough in this article, you create a data factory with a pipeline that uses the custom activity by using the Azure portal. The following code shows you how to create the data factory by using .NET SDK instead. You can find more details about using SDK to programmatically create pipelines in the [create a pipeline with copy activity by using .NET API](data-factory-copy-activity-tutorial-using-dotnet-api.md) article. --```csharp -using System; -using System.Configuration; -using System.Collections.ObjectModel; -using System.Threading; -using System.Threading.Tasks; --using Microsoft.Azure; -using Microsoft.Azure.Management.DataFactories; -using Microsoft.Azure.Management.DataFactories.Models; -using Microsoft.Azure.Management.DataFactories.Common.Models; --using Microsoft.IdentityModel.Clients.ActiveDirectory; -using System.Collections.Generic; --namespace DataFactoryAPITestApp -{ - class Program - { - static void Main(string[] args) - { - // create data factory management client -- // TODO: replace ADFTutorialResourceGroup with the name of your resource group. - string resourceGroupName = "ADFTutorialResourceGroup"; -- // TODO: replace APITutorialFactory with a name that is globally unique. 
For example: APITutorialFactory04212017 - string dataFactoryName = "APITutorialFactory"; -- TokenCloudCredentials aadTokenCredentials = new TokenCloudCredentials( - ConfigurationManager.AppSettings["SubscriptionId"], - GetAuthorizationHeader().Result); -- Uri resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]); -- DataFactoryManagementClient client = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri); -- Console.WriteLine("Creating a data factory"); - client.DataFactories.CreateOrUpdate(resourceGroupName, - new DataFactoryCreateOrUpdateParameters() - { - DataFactory = new DataFactory() - { - Name = dataFactoryName, - Location = "westus", - Properties = new DataFactoryProperties() - } - } - ); -- // create a linked service for input data store: Azure Storage - Console.WriteLine("Creating Azure Storage linked service"); - client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName, - new LinkedServiceCreateOrUpdateParameters() - { - LinkedService = new LinkedService() - { - Name = "AzureStorageLinkedService", - Properties = new LinkedServiceProperties - ( - // TODO: Replace <accountname> and <accountkey> with name and key of your Azure Storage account. - new AzureStorageLinkedService("DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>") - ) - } - } - ); -- // create a linked service for output data store: Azure SQL Database - Console.WriteLine("Creating Azure Batch linked service"); - client.LinkedServices.CreateOrUpdate(resourceGroupName, dataFactoryName, - new LinkedServiceCreateOrUpdateParameters() - { - LinkedService = new LinkedService() - { - Name = "AzureBatchLinkedService", - Properties = new LinkedServiceProperties - ( - // TODO: replace <batchaccountname> and <yourbatchaccountkey> with name and key of your Azure Batch account - new AzureBatchLinkedService("<batchaccountname>", "https://westus.batch.azure.com", "<yourbatchaccountkey>", "myazurebatchpool", "AzureStorageLinkedService") - ) - } - } - ); -- // create input and output datasets - Console.WriteLine("Creating input and output datasets"); - string Dataset_Source = "InputDataset"; - string Dataset_Destination = "OutputDataset"; -- Console.WriteLine("Creating input dataset of type: Azure Blob"); - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, -- new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Source, - Properties = new DatasetProperties() - { - LinkedServiceName = "AzureStorageLinkedService", - TypeProperties = new AzureBlobDataset() - { - FolderPath = "adftutorial/customactivityinput/", - Format = new TextFormat() - }, - External = true, - Availability = new Availability() - { - Frequency = SchedulePeriod.Hour, - Interval = 1, - }, -- Policy = new Policy() { } - } - } - }); -- Console.WriteLine("Creating output dataset of type: Azure Blob"); - client.Datasets.CreateOrUpdate(resourceGroupName, dataFactoryName, - new DatasetCreateOrUpdateParameters() - { - Dataset = new Dataset() - { - Name = Dataset_Destination, - Properties = new DatasetProperties() - { - LinkedServiceName = "AzureStorageLinkedService", - TypeProperties = new AzureBlobDataset() - { - FileName = "{slice}.txt", - FolderPath = "adftutorial/customactivityoutput/", - PartitionedBy = new List<Partition>() - { - new Partition() - { - Name = "slice", - Value = new DateTimePartitionValue() - { - Date = "SliceStart", - Format = "yyyy-MM-dd-HH" - } - } - } - }, - Availability = new Availability() - { - 
Frequency = SchedulePeriod.Hour, - Interval = 1, - }, - } - } - }); -- Console.WriteLine("Creating a custom activity pipeline"); - DateTime PipelineActivePeriodStartTime = new DateTime(2017, 3, 9, 0, 0, 0, 0, DateTimeKind.Utc); - DateTime PipelineActivePeriodEndTime = PipelineActivePeriodStartTime.AddMinutes(60); - string PipelineName = "ADFTutorialPipelineCustom"; -- client.Pipelines.CreateOrUpdate(resourceGroupName, dataFactoryName, - new PipelineCreateOrUpdateParameters() - { - Pipeline = new Pipeline() - { - Name = PipelineName, - Properties = new PipelineProperties() - { - Description = "Use custom activity", -- // Initial value for pipeline's active period. With this, you won't need to set slice status - Start = PipelineActivePeriodStartTime, - End = PipelineActivePeriodEndTime, - IsPaused = false, -- Activities = new List<Activity>() - { - new Activity() - { - Name = "MyDotNetActivity", - Inputs = new List<ActivityInput>() - { - new ActivityInput() { - Name = Dataset_Source - } - }, - Outputs = new List<ActivityOutput>() - { - new ActivityOutput() - { - Name = Dataset_Destination - } - }, - LinkedServiceName = "AzureBatchLinkedService", - TypeProperties = new DotNetActivity() - { - AssemblyName = "MyDotNetActivity.dll", - EntryPoint = "MyDotNetActivityNS.MyDotNetActivity", - PackageLinkedService = "AzureStorageLinkedService", - PackageFile = "customactivitycontainer/MyDotNetActivity.zip", - ExtendedProperties = new Dictionary<string, string>() - { - { "SliceStart", "$$Text.Format('{0:yyyyMMddHH-mm}', Time.AddMinutes(SliceStart, 0))"} - } - }, - Policy = new ActivityPolicy() - { - Concurrency = 2, - ExecutionPriorityOrder = "OldestFirst", - Retry = 3, - Timeout = new TimeSpan(0,0,30,0), - Delay = new TimeSpan() - } - } - } - } - } - }); - } -- public static async Task<string> GetAuthorizationHeader() - { - AuthenticationContext context = new AuthenticationContext(ConfigurationManager.AppSettings["ActiveDirectoryEndpoint"] + ConfigurationManager.AppSettings["ActiveDirectoryTenantId"]); - ClientCredential credential = new ClientCredential( - ConfigurationManager.AppSettings["ApplicationId"], - ConfigurationManager.AppSettings["Password"]); - AuthenticationResult result = await context.AcquireTokenAsync( - resource: ConfigurationManager.AppSettings["WindowsManagementUri"], - clientCredential: credential); -- if (result != null) - return result.AccessToken; -- throw new InvalidOperationException("Failed to acquire token"); - } - } -} -``` --## Debug custom activity in Visual Studio -The [Azure Data Factory - local environment](https://github.com/gbrueckl/Azure.DataFactory.LocalEnvironment) sample on GitHub includes a tool that allows you to debug custom .NET activities within Visual Studio. --## Sample custom activities on GitHub -| Sample | What custom activity does | -| | | -| [HTTP Data Downloader](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/HttpDataDownloaderSample). |Downloads data from an HTTP Endpoint to Azure Blob Storage using custom C# Activity in Data Factory. | -| [Run R Script](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/RunRScriptUsingADFSample). |Invokes R script by running RScript.exe on your HDInsight cluster that already has R Installed on it. 
| -| [Cross AppDomain .NET Activity](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/CrossAppDomainDotNetActivitySample) |Uses different assembly versions from ones used by the Data Factory launcher | -| [Reprocess a model in Azure Analysis Services](https://github.com/Azure/Azure-DataFactory/tree/master/SamplesV1/AzureAnalysisServicesProcessSample) | Reprocesses a model in Azure Analysis Services. | --[batch-net-library]: ../../batch/quick-run-dotnet.md -[batch-create-account]: ../../batch/batch-account-create-portal.md -[batch-technical-overview]:/azure/azure-sql/database/sql-database-paas-overview -[batch-get-started]: ../../batch/quick-run-dotnet.md -[use-custom-activities]: data-factory-use-custom-activities.md -[troubleshoot]: data-factory-troubleshoot.md -[data-factory-introduction]: data-factory-introduction.md -[azure-powershell-install]: https://github.com/Azure/azure-sdk-tools/releases ---[developer-reference]: /previous-versions/azure/dn834987(v=azure.100) -[cmdlet-reference]: /powershell/resourcemanager/Azurerm.DataFactories/v2.2.0/Azurerm.DataFactories --[new-azure-batch-account]: /previous-versions/azure/mt125880(v=azure.100) -[new-azure-batch-pool]: /previous-versions/azure/mt125936(v=azure.100) -[azure-batch-blog]: /archive/blogs/windowshpc/using-azure-powershell-to-manage-azure-batch-account --[nuget-package]: https://go.microsoft.com/fwlink/?LinkId=517478 -[adf-developer-reference]: /previous-versions/azure/dn834987(v=azure.100) --[adfgetstarted]: data-factory-copy-data-from-azure-blob-storage-to-sql-database.md -[hivewalkthrough]: data-factory-data-transformation-activities.md --[image-data-factory-output-from-custom-activity]: ./media/data-factory-use-custom-activities/OutputFilesFromCustomActivity.png --[image-data-factory-download-logs-from-custom-activity]: ./media/data-factory-use-custom-activities/DownloadLogsFromCustomActivity.png |
data-factory | Data Factory Usql Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-usql-activity.md | - Title: Transform data using U-SQL script - Azure -description: Learn how to process or transform data by running U-SQL scripts on Azure Data Lake Analytics compute service - version 1. --- Previously updated : 04/12/2023------# Transform data by running U-SQL scripts on Azure Data Lake Analytics -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-usql-activity.md) -> * [Version 2 (current version)](../transform-data-using-data-lake-analytics.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [U-SQL Activity in V2](../transform-data-using-data-lake-analytics.md). --A pipeline in an Azure data factory processes data in linked storage services by using linked compute services. It contains a sequence of activities where each activity performs a specific processing operation. This article describes the **Data Lake Analytics U-SQL Activity** that runs a **U-SQL** script on an **Azure Data Lake Analytics** compute linked service. --Create an Azure Data Lake Analytics account before creating a pipeline with a Data Lake Analytics U-SQL Activity. To learn about Azure Data Lake Analytics, see [Get started with Azure Data Lake Analytics](../../data-lake-analytics/data-lake-analytics-get-started-portal.md). --Review the [Build your first pipeline tutorial](data-factory-build-your-first-pipeline.md) for detailed steps to create a data factory, linked services, datasets, and a pipeline. Use JSON snippets with Data Factory Editor or Visual Studio or Azure PowerShell to create Data Factory entities. --## Supported authentication types -U-SQL activity supports below authentication types against Data Lake Analytics: -* Service principal authentication -* User credential (OAuth) authentication --We recommend that you use service principal authentication, especially for a scheduled U-SQL execution. Token expiration behavior can occur with user credential authentication. For configuration details, see the [Linked service properties](#azure-data-lake-analytics-linked-service) section. --## Azure Data Lake Analytics Linked Service -You create an **Azure Data Lake Analytics** linked service to link an Azure Data Lake Analytics compute service to an Azure data factory. The Data Lake Analytics U-SQL activity in the pipeline refers to this linked service. --The following table provides descriptions for the generic properties used in the JSON definition. You can further choose between service principal and user credential authentication. --| Property | Description | Required | -| | | | -| **type** |The type property should be set to: **AzureDataLakeAnalytics**. |Yes | -| **accountName** |Azure Data Lake Analytics Account Name. |Yes | -| **dataLakeAnalyticsUri** |Azure Data Lake Analytics URI. |No | -| **subscriptionId** |Azure subscription id |No (If not specified, subscription of the data factory is used). | -| **resourceGroupName** |Azure resource group name |No (If not specified, resource group of the data factory is used). | --### Service principal authentication (recommended) -To use service principal authentication, register an application entity in Microsoft Entra ID and grant it the access to Data Lake Store. 
For detailed steps, see [Service-to-service authentication](../../data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory.md). Make note of the following values, which you use to define the linked service: -* Application ID -* Application key -* Tenant ID --Use service principal authentication by specifying the following properties: --| Property | Description | Required | -|: |: |: | -| **servicePrincipalId** | Specify the application's client ID. | Yes | -| **servicePrincipalKey** | Specify the application's key. | Yes | -| **tenant** | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the upper-right corner of the Azure portal. | Yes | --**Example: Service principal authentication** -```json -{ - "name": "AzureDataLakeAnalyticsLinkedService", - "properties": { - "type": "AzureDataLakeAnalytics", - "typeProperties": { - "accountName": "adftestaccount", - "dataLakeAnalyticsUri": "azuredatalakeanalytics.net", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>", - "subscriptionId": "<optional, subscription id of ADLA>", - "resourceGroupName": "<optional, resource group name of ADLA>" - } - } -} -``` --### User credential authentication -Alternatively, you can use user credential authentication for Data Lake Analytics by specifying the following properties: --| Property | Description | Required | -|: |: |: | -| **authorization** | Click the **Authorize** button in the Data Factory Editor and enter your credential that assigns the autogenerated authorization URL to this property. | Yes | -| **sessionId** | OAuth session ID from the OAuth authorization session. Each session ID is unique and can be used only once. This setting is automatically generated when you use the Data Factory Editor. | Yes | --**Example: User credential authentication** -```json -{ - "name": "AzureDataLakeAnalyticsLinkedService", - "properties": { - "type": "AzureDataLakeAnalytics", - "typeProperties": { - "accountName": "adftestaccount", - "dataLakeAnalyticsUri": "azuredatalakeanalytics.net", - "authorization": "<authcode>", - "sessionId": "<session ID>", - "subscriptionId": "<optional, subscription id of ADLA>", - "resourceGroupName": "<optional, resource group name of ADLA>" - } - } -} -``` --#### Token expiration -The authorization code you generated by using the **Authorize** button expires after sometime. See the following table for the expiration times for different types of user accounts. You may see the following error message when the authentication **token expires**: Credential operation error: invalid_grant - AADSTS70002: Error validating credentials. AADSTS70008: The provided access grant is expired or revoked. Trace ID: d18629e8-af88-43c5-88e3-d8419eb1fca1 Correlation ID: fac30a0c-6be6-4e02-8d69-a776d2ffefd7 Timestamp: 2015-12-15 21:09:31Z --| User type | Expires after | -|: |: | -| User accounts NOT managed by Microsoft Entra ID (@hotmail.com, @live.com, etc.) |12 hours | -| Users accounts managed by Microsoft Entra ID |14 days after the last slice run. <br/><br/>90 days, if a slice based on OAuth-based linked service runs at least once every 14 days. | --To avoid/resolve this error, reauthorize using the **Authorize** button when the **token expires** and redeploy the linked service. 
You can also generate values for **sessionId** and **authorization** properties programmatically using code as follows: --```csharp -if (linkedService.Properties.TypeProperties is AzureDataLakeStoreLinkedService || - linkedService.Properties.TypeProperties is AzureDataLakeAnalyticsLinkedService) -{ - AuthorizationSessionGetResponse authorizationSession = this.Client.OAuth.Get(this.ResourceGroupName, this.DataFactoryName, linkedService.Properties.Type); -- WindowsFormsWebAuthenticationDialog authenticationDialog = new WindowsFormsWebAuthenticationDialog(null); - string authorization = authenticationDialog.AuthenticateAAD(authorizationSession.AuthorizationSession.Endpoint, new Uri("urn:ietf:wg:oauth:2.0:oob")); -- AzureDataLakeStoreLinkedService azureDataLakeStoreProperties = linkedService.Properties.TypeProperties as AzureDataLakeStoreLinkedService; - if (azureDataLakeStoreProperties != null) - { - azureDataLakeStoreProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeStoreProperties.Authorization = authorization; - } -- AzureDataLakeAnalyticsLinkedService azureDataLakeAnalyticsProperties = linkedService.Properties.TypeProperties as AzureDataLakeAnalyticsLinkedService; - if (azureDataLakeAnalyticsProperties != null) - { - azureDataLakeAnalyticsProperties.SessionId = authorizationSession.AuthorizationSession.SessionId; - azureDataLakeAnalyticsProperties.Authorization = authorization; - } -} -``` --See [AzureDataLakeStoreLinkedService Class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakestorelinkedservice), [AzureDataLakeAnalyticsLinkedService Class](/dotnet/api/microsoft.azure.management.datafactories.models.azuredatalakeanalyticslinkedservice), and [AuthorizationSessionGetResponse Class](/dotnet/api/microsoft.azure.management.datafactories.models.authorizationsessiongetresponse) topics for details about the Data Factory classes used in the code. Add a reference to: Microsoft.IdentityModel.Clients.ActiveDirectory.WindowsForms.dll for the WindowsFormsWebAuthenticationDialog class. --## Data Lake Analytics U-SQL Activity -The following JSON snippet defines a pipeline with a Data Lake Analytics U-SQL Activity. The activity definition has a reference to the Azure Data Lake Analytics linked service you created earlier. --```json -{ - "name": "ComputeEventsByRegionPipeline", - "properties": { - "description": "This is a pipeline to compute events for en-gb locale and date less than 2012/02/19.", - "activities": - [ - { - "type": "DataLakeAnalyticsU-SQL", - "typeProperties": { - "scriptPath": "scripts\\kona\\SearchLogProcessing.txt", - "scriptLinkedService": "StorageLinkedService", - "degreeOfParallelism": 3, - "priority": 100, - "parameters": { - "in": "/datalake/input/SearchLog.tsv", - "out": "/datalake/output/Result.tsv" - } - }, - "inputs": [ - { - "name": "DataLakeTable" - } - ], - "outputs": - [ - { - "name": "EventsByRegionTable" - } - ], - "policy": { - "timeout": "06:00:00", - "concurrency": 1, - "executionPriorityOrder": "NewestFirst", - "retry": 1 - }, - "scheduler": { - "frequency": "Day", - "interval": 1 - }, - "name": "EventsByRegion", - "linkedServiceName": "AzureDataLakeAnalyticsLinkedService" - } - ], - "start": "2015-08-08T00:00:00Z", - "end": "2015-08-08T01:00:00Z", - "isPaused": false - } -} -``` --The following table describes names and descriptions of properties that are specific to this activity. 
--| Property | Description | Required | -| : | : | : | -| type | The type property must be set to **DataLakeAnalyticsU-SQL**. | Yes | -| linkedServiceName | Reference to the Azure Data Lake Analytics registered as a linked service in Data Factory | Yes | -| scriptPath | Path to folder that contains the U-SQL script. Name of the file is case-sensitive. | No (if you use script) | -| scriptLinkedService | Linked service that links the storage that contains the script to the data factory | No (if you use script) | -| script | Specify inline script instead of specifying scriptPath and scriptLinkedService. For example: `"script": "CREATE DATABASE test"`. | No (if you use scriptPath and scriptLinkedService) | -| degreeOfParallelism | The maximum number of nodes simultaneously used to run the job. | No | -| priority | Determines which jobs out of all that are queued should be selected to run first. The lower the number, the higher the priority. | No | -| parameters | Parameters for the U-SQL script | No | -| runtimeVersion | Runtime version of the U-SQL engine to use | No | -| compilationMode | <p>Compilation mode of U-SQL. Must be one of these values:</p> <ul><li>**Semantic:** Only perform semantic checks and necessary sanity checks.</li><li>**Full:** Perform the full compilation, including syntax check, optimization, code generation, etc.</li><li>**SingleBox:** Perform the full compilation, with TargetType setting to SingleBox.</li></ul><p>If you don't specify a value for this property, the server determines the optimal compilation mode. </p> | No | --See [SearchLogProcessing.txt Script Definition](#sample-u-sql-script) for the script definition. --## Sample input and output datasets -### Input dataset -In this example, the input data resides in an Azure Data Lake Store (SearchLog.tsv file in the datalake/input folder). --```json -{ - "name": "DataLakeTable", - "properties": { - "type": "AzureDataLakeStore", - "linkedServiceName": "AzureDataLakeStoreLinkedService", - "typeProperties": { - "folderPath": "datalake/input/", - "fileName": "SearchLog.tsv", - "format": { - "type": "TextFormat", - "rowDelimiter": "\n", - "columnDelimiter": "\t" - } - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --### Output dataset -In this example, the output data produced by the U-SQL script is stored in an Azure Data Lake Store (datalake/output folder). --```json -{ - "name": "EventsByRegionTable", - "properties": { - "type": "AzureDataLakeStore", - "linkedServiceName": "AzureDataLakeStoreLinkedService", - "typeProperties": { - "folderPath": "datalake/output/" - }, - "availability": { - "frequency": "Day", - "interval": 1 - } - } -} -``` --### Sample Data Lake Store Linked Service -Here is the definition of the sample Azure Data Lake Store linked service used by the input/output datasets. --```json -{ - "name": "AzureDataLakeStoreLinkedService", - "properties": { - "type": "AzureDataLakeStore", - "typeProperties": { - "dataLakeUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1", - "servicePrincipalId": "<service principal id>", - "servicePrincipalKey": "<service principal key>", - "tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>", - } - } -} -``` --See [Move data to and from Azure Data Lake Store](data-factory-azure-datalake-connector.md) article for descriptions of JSON properties. 
--## Sample U-SQL Script --``` -@searchlog = - EXTRACT UserId int, - Start DateTime, - Region string, - Query string, - Duration int?, - Urls string, - ClickedUrls string - FROM @in - USING Extractors.Tsv(nullEscape:"#NULL#"); --@rs1 = - SELECT Start, Region, Duration - FROM @searchlog -WHERE Region == "en-gb"; --@rs1 = - SELECT Start, Region, Duration - FROM @rs1 - WHERE Start <= DateTime.Parse("2012/02/19"); --OUTPUT @rs1 - TO @out - USING Outputters.Tsv(quoting:false, dateTimeFormat:null); -``` --The values for **\@in** and **\@out** parameters in the U-SQL script are passed dynamically by ADF using the 'parameters' section. See the 'parameters' section in the pipeline definition. --You can specify other properties such as degreeOfParallelism and priority as well in your pipeline definition for the jobs that run on the Azure Data Lake Analytics service. --## Dynamic parameters -In the sample pipeline definition, in and out parameters are assigned with hard-coded values. --```json -"parameters": { - "in": "/datalake/input/SearchLog.tsv", - "out": "/datalake/output/Result.tsv" -} -``` --It is possible to use dynamic parameters instead. For example: --```json -"parameters": { - "in": "$$Text.Format('/datalake/input/{0:yyyy-MM-dd HH:mm:ss}.tsv', SliceStart)", - "out": "$$Text.Format('/datalake/output/{0:yyyy-MM-dd HH:mm:ss}.tsv', SliceStart)" -} -``` --In this case, input files are still picked up from the /datalake/input folder and output files are generated in the /datalake/output folder. The file names are dynamic based on the slice start time. |
data-factory | Data Factory Web Table Connector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/data-factory-web-table-connector.md | - Title: Move data from Web Table using Azure Data Factory -description: Learn about how to move data from a table in a Web page using Azure Data Factory. ---- Previously updated : 04/12/2023----# Move data from a Web table source using Azure Data Factory -> [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"] -> * [Version 1](data-factory-web-table-connector.md) -> * [Version 2 (current version)](../connector-web-table.md) --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Web table connector in V2](../connector-web-table.md). --This article outlines how to use the Copy Activity in Azure Data Factory to move data from a table in a Web page to a supported sink data store. This article builds on the [data movement activities](data-factory-data-movement-activities.md) article that presents a general overview of data movement with copy activity and the list of data stores supported as sources/sinks. --Data factory currently supports only moving data from a Web table to other data stores, but not moving data from other data stores to a Web table destination. --> [!IMPORTANT] -> This Web connector currently supports only extracting table content from an HTML page. To retrieve data from a HTTP/s endpoint, use [HTTP connector](data-factory-http-connector.md) instead. --## Prerequisites --To use this Web table connector, you need to set up a Self-hosted Integration Runtime (aka Data Management Gateway) and configure the `gatewayName` property in the sink linked service. For example, to copy from Web table to Azure Blob storage, configure the Azure Storage linked service as the following: --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>", - "gatewayName": "<gateway name>" - } - } -} -``` --## Getting started -You can create a pipeline with a copy activity that moves data from an on-premises Cassandra data store by using different tools/APIs. --- The easiest way to create a pipeline is to use the **Copy Wizard**. See [Tutorial: Create a pipeline using Copy Wizard](data-factory-copy-data-wizard-tutorial.md) for a quick walkthrough on creating a pipeline using the Copy data wizard. -- You can also use the following tools to create a pipeline: **Visual Studio**, **Azure PowerShell**, **Azure Resource Manager template**, **.NET API**, and **REST API**. See [Copy activity tutorial](data-factory-copy-data-from-azure-blob-storage-to-sql-database.md) for step-by-step instructions to create a pipeline with a copy activity. --Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store: --1. Create **linked services** to link input and output data stores to your data factory. -2. Create **datasets** to represent input and output data for the copy operation. -3. Create a **pipeline** with a copy activity that takes a dataset as an input and a dataset as an output. --When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. 
When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from a web table, see [JSON example: Copy data from Web table to Azure Blob](#json-example-copy-data-from-web-table-to-azure-blob) section of this article. --The following sections provide details about JSON properties that are used to define Data Factory entities specific to a Web table: --## Linked service properties -The following table provides description for JSON elements specific to Web linked service. --| Property | Description | Required | -| | | | -| type |The type property must be set to: **Web** |Yes | -| Url |URL to the Web source |Yes | -| authenticationType |Anonymous. |Yes | --### Using Anonymous authentication --```json -{ - "name": "web", - "properties": - { - "type": "Web", - "typeProperties": - { - "authenticationType": "Anonymous", - "url" : "https://en.wikipedia.org/wiki/" - } - } -} -``` --## Dataset properties -For a full list of sections & properties available for defining datasets, see the [Creating datasets](data-factory-create-datasets.md) article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.). --The **typeProperties** section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for dataset of type **WebTable** has the following properties --| Property | Description | Required | -|: |: |: | -| type |type of the dataset. must be set to **WebTable** |Yes | -| path |A relative URL to the resource that contains the table. |No. When path is not specified, only the URL specified in the linked service definition is used. | -| index |The index of the table in the resource. See [Get index of a table in an HTML page](#get-index-of-a-table-in-an-html-page) section for steps to getting index of a table in an HTML page. |Yes | --**Example:** --```json -{ - "name": "WebTableInput", - "properties": { - "type": "WebTable", - "linkedServiceName": "WebLinkedService", - "typeProperties": { - "index": 1, - "path": "AFI's_100_Years...100_Movies" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` --## Copy activity properties -For a full list of sections & properties available for defining activities, see the [Creating Pipelines](data-factory-create-pipelines.md) article. Properties such as name, description, input and output tables, and policy are available for all types of activities. --Whereas, properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks. --Currently, when the source in copy activity is of type **WebSource**, no additional properties are supported. ---## JSON example: Copy data from Web table to Azure Blob -The following sample shows: --1. A linked service of type [Web](#linked-service-properties). -2. A linked service of type [AzureStorage](data-factory-azure-blob-connector.md#linked-service-properties). -3. An input [dataset](data-factory-create-datasets.md) of type [WebTable](#dataset-properties). -4. An output [dataset](data-factory-create-datasets.md) of type [AzureBlob](data-factory-azure-blob-connector.md#dataset-properties). -5. 
A [pipeline](data-factory-create-pipelines.md) with Copy Activity that uses [WebSource](#copy-activity-properties) and [BlobSink](data-factory-azure-blob-connector.md#copy-activity-properties). --The sample copies data from a Web table to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples. --The following sample shows how to copy data from a Web table to an Azure blob. However, data can be copied directly to any of the sinks stated in the [Data Movement Activities](data-factory-data-movement-activities.md) article by using the Copy Activity in Azure Data Factory. --**Web linked service** -This example uses the Web linked service with anonymous authentication. See [Web linked service](#linked-service-properties) section for different types of authentication you can use. --```json -{ - "name": "WebLinkedService", - "properties": - { - "type": "Web", - "typeProperties": - { - "authenticationType": "Anonymous", - "url" : "https://en.wikipedia.org/wiki/" - } - } -} -``` --**Azure Storage linked service** --```json -{ - "name": "AzureStorageLinkedService", - "properties": { - "type": "AzureStorage", - "typeProperties": { - "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>", - "gatewayName": "<gateway name>" - } - } -} -``` --**WebTable input dataset** -Setting **external** to **true** informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory. --> [!NOTE] -> See [Get index of a table in an HTML page](#get-index-of-a-table-in-an-html-page) section for steps to getting index of a table in an HTML page. -> -> --```json -{ - "name": "WebTableInput", - "properties": { - "type": "WebTable", - "linkedServiceName": "WebLinkedService", - "typeProperties": { - "index": 1, - "path": "AFI's_100_Years...100_Movies" - }, - "external": true, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ---**Azure Blob output dataset** --Data is written to a new blob every hour (frequency: hour, interval: 1). --```json -{ - "name": "AzureBlobOutput", - "properties": - { - "type": "AzureBlob", - "linkedServiceName": "AzureStorageLinkedService", - "typeProperties": - { - "folderPath": "adfgetstarted/Movies" - }, - "availability": - { - "frequency": "Hour", - "interval": 1 - } - } -} -``` ----**Pipeline with Copy activity** --The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the **source** type is set to **WebSource** and **sink** type is set to **BlobSink**. --See WebSource type properties for the list of properties supported by the WebSource. --```json -{ - "name":"SamplePipeline", - "properties":{ - "start":"2014-06-01T18:00:00", - "end":"2014-06-01T19:00:00", - "description":"pipeline with copy activity", - "activities":[ - { - "name": "WebTableToAzureBlob", - "description": "Copy from a Web table to an Azure blob", - "type": "Copy", - "inputs": [ - { - "name": "WebTableInput" - } - ], - "outputs": [ - { - "name": "AzureBlobOutput" - } - ], - "typeProperties": { - "source": { - "type": "WebSource" - }, - "sink": { - "type": "BlobSink" - } - }, - "scheduler": { - "frequency": "Hour", - "interval": 1 - }, - "policy": { - "concurrency": 1, - "executionPriorityOrder": "OldestFirst", - "retry": 0, - "timeout": "01:00:00" - } - } - ] - } -} -``` --## Get index of a table in an HTML page -1. 
Launch **Excel 2016** and switch to the **Data** tab. -2. Click **New Query** on the toolbar, point to **From Other Sources** and click **From Web**. -- :::image type="content" source="./media/data-factory-web-table-connector/PowerQuery-Menu.png" alt-text="Power Query menu"::: -3. In the **From Web** dialog box, enter **URL** that you would use in linked service JSON (for example: https://en.wikipedia.org/wiki/) along with path you would specify for the dataset (for example: AFI%27s_100_Years...100_Movies), and click **OK**. -- :::image type="content" source="./media/data-factory-web-table-connector/FromWeb-DialogBox.png" alt-text="From Web dialog"::: -- URL used in this example: https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies -4. If you see **Access Web content** dialog box, select the right **URL**, **authentication**, and click **Connect**. -- :::image type="content" source="./media/data-factory-web-table-connector/AccessWebContentDialog.png" alt-text="Access Web content dialog box"::: -5. Click a **table** item in the tree view to see content from the table and then click **Edit** button at the bottom. -- :::image type="content" source="./media/data-factory-web-table-connector/Navigator-DialogBox.png" alt-text="Navigator dialog"::: -6. In the **Query Editor** window, click **Advanced Editor** button on the toolbar. -- :::image type="content" source="./media/data-factory-web-table-connector/QueryEditor-AdvancedEditorButton.png" alt-text="Advanced Editor button"::: -7. In the Advanced Editor dialog box, the number next to "Source" is the index. -- :::image type="content" source="./media/data-factory-web-table-connector/AdvancedEditor-Index.png" alt-text="Advanced Editor - Index"::: --If you are using Excel 2013, use [Microsoft Power Query for Excel](https://www.microsoft.com/download/details.aspx?id=39379) to get the index. See [Connect to a web page](https://support.office.com/article/Connect-to-a-web-page-Power-Query-b2725d67-c9e8-43e6-a590-c0a175bd64d8) article for details. The steps are similar if you are using [Microsoft Power BI for Desktop](https://powerbi.microsoft.com/desktop/). --> [!NOTE] -> To map columns from source dataset to columns from sink dataset, see [Mapping dataset columns in Azure Data Factory](data-factory-map-columns.md). --## Performance and Tuning -See [Copy Activity Performance & Tuning Guide](data-factory-copy-activity-performance.md) to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it. |
data-factory | How To Invoke Ssis Package Stored Procedure Activity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/data-factory/v1/how-to-invoke-ssis-package-stored-procedure-activity.md | - Title: Invoke SSIS package using Azure Data Factory - Stored Procedure Activity -description: This article describes how to invoke a SQL Server Integration Services (SSIS) package from an Azure Data Factory pipeline using the Stored Procedure Activity. ----- Previously updated : 04/12/2023---# Invoke an SSIS package using stored procedure activity in Azure Data Factory -This article describes how to invoke an SSIS package from an Azure Data Factory pipeline by using a stored procedure activity. --> [!NOTE] -> This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see [Invoke SSIS packages using stored procedure activity in](../how-to-invoke-ssis-package-stored-procedure-activity.md). --## Prerequisites --### Azure SQL Database -The walkthrough in this article uses Azure SQL Database. You can also use an Azure SQL Managed Instance. --### Create an Azure-SSIS integration runtime -Create an Azure-SSIS integration runtime if you don't have one by following the step-by-step instruction in the [Tutorial: Deploy SSIS packages](../tutorial-deploy-ssis-packages-azure.md). You cannot use Data Factory version 1 to create an Azure-SSIS integration runtime. --## Azure PowerShell -In this section you use Azure PowerShell to create a Data Factory pipeline with a stored procedure activity that invokes an SSIS package. ---Install the latest Azure PowerShell modules by following instructions in [How to install and configure Azure PowerShell](/powershell/azure/install-azure-powershell). --### Create a data factory -The following procedure provides steps to create a data factory. You create a pipeline with a stored procedure activity in this data factory. The stored procedure activity executes a stored procedure in the SSISDB database to run your SSIS package. --1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the following command text to PowerShell, specify a name for the [Azure resource group](../../azure-resource-manager/management/overview.md) in double quotes, and then run the command. For example: `"adfrg"`. -- ```powershell - $resourceGroupName = "ADFTutorialResourceGroup"; - ``` -- If the resource group already exists, you may not want to overwrite it. Assign a different value to the `$ResourceGroupName` variable and run the command again --2. To create the Azure resource group, run the following command: -- ```powershell - $ResGrp = New-AzResourceGroup $resourceGroupName -location 'eastus' - ``` -- If the resource group already exists, you may not want to overwrite it. Assign a different value to the `$ResourceGroupName` variable and run the command again. --3. Define a variable for the data factory name. -- > [!IMPORTANT] - > Update the data factory name to be globally unique. -- ```powershell - $DataFactoryName = "ADFTutorialFactory"; - ``` --5. To create the data factory, run the following **New-AzDataFactory** cmdlet, using the Location and ResourceGroupName property from the $ResGrp variable: - - ```powershell - $df = New-AzDataFactory -ResourceGroupName $ResourceGroupName -Name $dataFactoryName -Location "East US" - ``` --Note the following points: --* The name of the Azure data factory must be globally unique. If you receive the following error, change the name and try again. 
-- ``` - The specified Data Factory name 'ADFTutorialFactory' is already in use. Data Factory names must be globally unique. - ``` -* To create Data Factory instances, the user account you use to log in to Azure must be a member of **contributor** or **owner** roles, or an **administrator** of the Azure subscription. --### Create an Azure SQL Database linked service -Create a linked service to link your database in Azure SQL Database that hosts the SSIS catalog to your data factory. Data Factory uses information in this linked service to connect to SSISDB database, and executes a stored procedure to run an SSIS package. --1. Create a JSON file named **AzureSqlDatabaseLinkedService.json** in **C:\ADF\RunSSISPackage** folder with the following content: -- > [!IMPORTANT] - > Replace <servername>, <username>@<servername> and <password> with values of your Azure SQL Database before saving the file. -- ```json - { - "name": "AzureSqlDatabaseLinkedService", - "properties": { - "type": "AzureSqlDatabase", - "typeProperties": { - "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=SSISDB;User ID=<username>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30" - } - } - } - ``` -2. In **Azure PowerShell**, switch to the **C:\ADF\RunSSISPackage** folder. -3. Run the **New-AzDataFactoryLinkedService** cmdlet to create the linked service: **AzureSqlDatabaseLinkedService**. -- ```powershell - New-AzDataFactoryLinkedService $df -File ".\AzureSqlDatabaseLinkedService.json" - ``` --### Create an output dataset -This output dataset is a dummy dataset that drives the schedule of the pipeline. Notice that the frequency is set to Hour and interval is set to 1. Therefore, the pipeline runs once an hour within the pipeline start and end times. --1. Create an OutputDataset.json file with the following content: - - ```json - { - "name": "sprocsampleout", - "properties": { - "type": "AzureSqlTable", - "linkedServiceName": "AzureSqlLinkedService", - "typeProperties": { }, - "availability": { - "frequency": "Hour", - "interval": 1 - } - } - } - ``` -2. Run the **New-AzDataFactoryDataset** cmdlet to create a dataset. -- ```powershell - New-AzDataFactoryDataset $df -File ".\OutputDataset.json" - ``` --### Create a pipeline with stored procedure activity -In this step, you create a pipeline with a stored procedure activity. The activity invokes the sp_executesql stored procedure to run your SSIS package. --1. Create a JSON file named **MyPipeline.json** in the **C:\ADF\RunSSISPackage** folder with the following content: -- > [!IMPORTANT] - > Replace <folder name>, <project name>, <package name> with names of folder, project, and package in the SSIS catalog before saving the file. 
-- ```json - { - "name": "MyPipeline", - "properties": { - "activities": [{ - "name": "SprocActivitySample", - "type": "SqlServerStoredProcedure", - "typeProperties": { - "storedProcedureName": "sp_executesql", - "storedProcedureParameters": { - "stmt": "DECLARE @return_value INT, @exe_id BIGINT, @err_msg NVARCHAR(150) EXEC @return_value=[SSISDB].[catalog].[create_execution] @folder_name=N'<folder name>', @project_name=N'<project name>', @package_name=N'<package name>', @use32bitruntime=0, @runinscaleout=1, @useanyworker=1, @execution_id=@exe_id OUTPUT EXEC [SSISDB].[catalog].[set_execution_parameter_value] @exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED', @parameter_value=1 EXEC [SSISDB].[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0 IF(SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@exe_id)<>7 BEGIN SET @err_msg=N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20)) RAISERROR(@err_msg,15,1) END" - } - }, - "outputs": [{ - "name": "sprocsampleout" - }], - "scheduler": { - "frequency": "Hour", - "interval": 1 - } - }], - "start": "2017-10-01T00:00:00Z", - "end": "2017-10-01T05:00:00Z", - "isPaused": false - } - } - ``` --2. To create the pipeline: **RunSSISPackagePipeline**, run the **New-AzDataFactoryPipeline** cmdlet. -- ```powershell - $DFPipeLine = New-AzDataFactoryPipeline -DataFactoryName $DataFactory.DataFactoryName -ResourceGroupName $ResGrp.ResourceGroupName -Name "RunSSISPackagePipeline" -DefinitionFile ".\RunSSISPackagePipeline.json" - ``` --### Monitor the pipeline run --1. Run **Get-AzDataFactorySlice** to get details about all slices of the output dataset**, which is the output table of the pipeline. -- ```powershell - Get-AzDataFactorySlice $df -DatasetName sprocsampleout -StartDateTime 2017-10-01T00:00:00Z - ``` - Notice that the StartDateTime you specify here is the same start time specified in the pipeline JSON. -1. Run **Get-AzDataFactoryRun** to get the details of activity runs for a specific slice. -- ```powershell - Get-AzDataFactoryRun $df -DatasetName sprocsampleout -StartDateTime 2017-10-01T00:00:00Z - ``` -- You can keep running this cmdlet until you see the slice in **Ready** state or **Failed** state. -- You can run the following query against the SSISDB database in your server to verify that the package executed. -- ```sql - select * from catalog.executions - ``` --## Next steps -For details about the stored procedure activity, see the [Stored Procedure activity](data-factory-stored-proc-activity.md) article. |
defender-for-cloud | Data Aware Security Dashboard Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/defender-for-cloud/data-aware-security-dashboard-overview.md | The feature is turned on at the subscription level. ## Required permissions and roles -- To view the dashboard you must have either one of the following:+- To view the dashboard you must have either one of the following scenarios: - - permissions: + - **all of the following permissions**: - Microsoft.Security/assessments/read - Microsoft.Security/assessments/subassessments/read - Microsoft.Security/alerts/read - - the minimum required privileged RBAC role of **Security Reader**. + - **the minimum required privileged RBAC role** of **Security Reader**. - Each Azure subscription must be registered for the **Microsoft.Security** resource provider: |
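The dashboard requirements above come down to two operational steps: registering the **Microsoft.Security** resource provider on the subscription and holding at least the built-in **Security Reader** role. A minimal Azure PowerShell sketch of both steps follows, assuming the Az.Resources module and a signed-in session; the object ID and subscription ID are placeholders.

```powershell
# Sketch, assuming the Az.Resources module is installed and you are signed in.
# Register the Microsoft.Security resource provider on the current subscription.
Register-AzResourceProvider -ProviderNamespace "Microsoft.Security"

# Grant the built-in Security Reader role at subscription scope.
# <object-id-of-user-or-group> and <subscription-id> are placeholders.
New-AzRoleAssignment `
    -ObjectId "<object-id-of-user-or-group>" `
    -RoleDefinitionName "Security Reader" `
    -Scope "/subscriptions/<subscription-id>"
```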
devtest | Concepts Gitops Azure Devtest | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/concepts-gitops-azure-devtest.md | Title: GitOps & Azure Dev/Test offer description: Use GitOps in association with Azure Dev/Test +ms.technology: vs-subscriptions Previously updated : 10/20/2021 Last updated : 10/18/2023 # Using GitOps with Azure Dev/Test offer to optimize and secure applications To deploy a new application or update an existing one, you only need to update t ## Use GitOps with Dev/Test -GitOps as a process and framework should be applied to your non-production instances. It can be verified or used in your DevTest environments. You can use GitOps principles to improve your DevOps processes. Use your DevTest benefits and environments with GitOps principles to optimize your activities and maintain the security and reliability of your applications. +GitOps as a process and framework should be applied to your nonproduction instances. It can be verified or used in your DevTest environments. You can use GitOps principles to improve your DevOps processes. Use your DevTest benefits and environments with GitOps principles to optimize your activities and maintain the security and reliability of your applications. GitOps combines automation and commonly used collaboration frameworks like git. They can be combined to provide rapid delivery of cloud infrastructure while complying with enterprise security standards. |
devtest | Concepts Security Governance Devtest | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/concepts-security-governance-devtest.md | Title: Security, governance, and Azure Dev/Test subscriptions description: Manage security and governance within your organization's Dev/Test subscriptions. +ms.technology: vs-subscriptions Previously updated : 10/20/2021 Last updated : 10/18/2023 Important tools within Azure Dev/Test Subscriptions help you create secure acces ## Azure Management Groups -When enabling and setting up your Azure Dev/Test Subscriptions, Azure deploys a default resource hierarchy to manage identities and access to resources in a single Microsoft Entra domain. The resource hierarchy allows your organization to set up strong security perimeters for your resources and users. +When you enable and set up your Azure Dev/Test Subscriptions, Azure deploys a default resource hierarchy to manage identities and access to resources in a single Microsoft Entra domain. The resource hierarchy allows your organization to set up strong security perimeters for your resources and users. ![A screenshot of the Azure Management Groups](media/concepts-security-governance-devtest/access-management-groups.png "Azure default resource hierarchy.") -Your resources, resource groups, subscriptions, management groups, and tenant collectively make up your resource hierarchy. Updating and changing these settings in Azure custom roles or Azure policy assignments can effect every resource in your resource hierarchy. It's important to protect the resource hierarchy from changes that could negatively impact all resources. +Your resources, resource groups, subscriptions, management groups, and tenant collectively make up your resource hierarchy. Updating and changing these settings in Azure custom roles or Azure policy assignments can affect every resource in your resource hierarchy. It's important to protect the resource hierarchy from changes that could negatively affect all resources. [Azure Management Groups](../../governance/management-groups/overview.md) are an important aspect of governing access and protecting your resources in a single tenant. Azure Management Groups allows you to set quotas, Azure policies, and security to different types of subscriptions. These groups are a vital component of developing security for your organization's dev/test subscriptions. ![A screenshot of Azure org and governance groupings](media/concepts-security-governance-devtest/orgs-and-governance.png "How Azure Management Groups fit into overall governance.") -As you can see above, using management groups changes the default hierarchy and adds a level for the management groups. This behavior can potentially create unforeseen circumstances and holes in security if you don’t follow the [appropriate process to protect your resource hierarchy](../../governance/management-groups/how-to/protect-resource-hierarchy.md) +As you can see, using management groups changes the default hierarchy and adds a level for the management groups. This behavior can potentially create unforeseen circumstances and holes in security if you don’t follow the [appropriate process to protect your resource hierarchy](../../governance/management-groups/how-to/protect-resource-hierarchy.md) ## Why are Azure Management Groups useful? -When developing security policies for your organization's dev/test subscriptions, you may choose to have multiple dev/test subscriptions per organizational unit or line of business. 
You can see a visual of that management grouping below. +When developing security policies for your organization's dev/test subscriptions, you might choose to have multiple dev/test subscriptions per organizational unit or line of business. You can see a visual of that management grouping in the following diagram. ![A diagram of subscription management groupings for multiple subscriptions within an organization.](media/concepts-security-governance-devtest/access-management-groups.png "A diagram of management groupings for multiple subscriptions within an organization.") -You may also choose to have one dev/test subscription for all of your different units. +You might also choose to have one dev/test subscription for all of your different units. Your Azure Management Groups and dev/test subscriptions act as a security barrier within your organizational structure. This security barrier has two components: -- Identity and access: You may need to segment access to specific resources +- Identity and access: You might need to segment access to specific resources - Data: Different subscriptions for resources that access personal information <a name='using-azure-active-directory-tenants'></a> An easy way to do separate resources and users is creating a new Microsoft Entra ### Create a new Microsoft Entra tenant -If you don't have a Microsoft Entra tenant, or want to create a new one for development, see the [quick start guide](../../active-directory/fundamentals/active-directory-access-create-new-tenant.md) or follow the [directory creation experience](https://portal.azure.com/#create/Microsoft.AzureActiveDirectory). You will have to provide the following info to create your new tenant: +If you don't have a Microsoft Entra tenant, or want to create a new one for development, see the [quick start guide](../../active-directory/fundamentals/active-directory-access-create-new-tenant.md) or follow the [directory creation experience](https://portal.azure.com/#create/Microsoft.AzureActiveDirectory). You have to provide the following info to create your new tenant: - **Organization name** -- **Initial domain** - will be part of /*.onmicrosoft.com. You can customize the domain later. +- **Initial domain** - is part of /*.onmicrosoft.com. You can customize the domain later. - **Country or region** [Learn more about creating and setting up Microsoft Entra tenants](../../active-directory/develop/quickstart-create-new-tenant.md) ### Using Azure Lighthouse to manage multiple tenants -Azure Lighthouse enables cross- and multi-tenant management, allowing for higher automation, scalability, and enhanced governance across resources and tenants. Service providers can deliver managed services using comprehensive and robust management tooling built into the Azure platform. Customers maintain control over who accesses their tenant, which resources they access, and what actions can be taken. +Azure Lighthouse enables cross- and multitenant management, allowing for higher automation, scalability, and enhanced governance across resources and tenants. Service providers can deliver managed services using comprehensive and robust management tooling built into the Azure platform. Customers maintain control over who accesses their tenant, which resources they access, and what actions can be taken. A common scenario for Azure Lighthouse is managing resources in its customers’ Microsoft Entra tenants. 
However, the capabilities of Azure Lighthouse can also be used to simplify cross-tenant management within an enterprise that uses multiple Microsoft Entra tenants. For most organizations, management is easier with a single Microsoft Entra tenant. Having all resources within one tenant allows centralization of management tasks by designated users, user groups, or service principals within that tenant. -Where a multi-tenant architecture is required, Azure Lighthouse helps centralize and streamline management operations. By using Azure delegated resource management, users in one managing tenant can perform cross-tenant management functions in a centralized, scalable manner. +Where a multitenant architecture is required, Azure Lighthouse helps centralize and streamline management operations. By using Azure delegated resource management, users in one managing tenant can perform cross-tenant management functions in a centralized, scalable manner. [More Security Resources](../../security-center/security-center-introduction.md) |
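Azure Management Groups are the anchor for the governance model described above. A minimal Azure PowerShell sketch of creating a management group and placing a dev/test subscription under it follows, assuming the Az.Resources module; the group name and subscription ID are placeholders, and parameter names can vary between module versions.

```powershell
# Sketch, assuming the Az.Resources module; names and IDs are placeholders.
# Create a management group to hold dev/test subscriptions.
New-AzManagementGroup -GroupName "devtest-mg" -DisplayName "Dev/Test"

# Move an existing dev/test subscription under that management group.
New-AzManagementGroupSubscription -GroupName "devtest-mg" -SubscriptionId "<subscription-id>"
```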
devtest | How To Add Users Directory | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-add-users-directory.md | Title: Add users to your Azure Dev/Test developer directory tenant description: A how-to guide for adding users to your Azure credit subscription and managing their access with role-based controls. +ms.technology: vs-subscriptions Previously updated : 10/12/2021 Last updated : 10/18/2023 Step 1 is determining whether you need to add a new user to your subscription. H - At the business level, a single resource group needs to sign in to one computer but they don’t need access to the subscription - Provides more visibility and transparency for projects you're working on but still isolates certain parts of work if necessary - Add a consultant as a user or within a group to contribute -- You want to collaborate with someone to test and monitor pre-production +- You want to collaborate with someone to test and monitor preproduction ## Where do I add users and their roles within my subscription? -Within Microsoft Entra ID access management is a critical function. [Azure role-based access control](../../role-based-access-control/overview.md) \(Azure RBAC\) is the authorization system that provides fine-grained access management of Azure resources. +Within Microsoft Entra ID, access management is a critical function. [Azure role-based access control](../../role-based-access-control/overview.md) \(Azure RBAC\) is the authorization system that provides fine-grained access management of Azure resources. -Once you’ve determined you need to add a user, you must understand where you're adding them and what resources they need access to. The set of resources that a user can access is referred to as scope. +Once you’ve determined the need to add a user, you must understand where you're adding them and what resources they need access to. The set of resources that a user can access is referred to as scope. -If the project and tasks you're working on need to be monitored by IT to enable security protocols, you'll need that user to have an Admin role within the Management group to have full access and permissions within the subscription. +If the project and tasks you're working on need to be monitored by IT to enable security protocols, you need that user to have an Admin role within the Management group to have full access and permissions within the subscription. ![A screenshot of the access levels in Azure.](media/how-to-add-users-directory/access-management.png "Managing roles with management groups in Azure.") -However, if you're collaborating with another developer or a colleague, they may only need access at the resource or resource group level. +However, if you're collaborating with another developer or a colleague, they might only need access at the resource or resource group level. Read through the [Azure RBAC overview](../../role-based-access-control/overview.md) to better understand how Azure RBAC works and the purpose it serves within your subscription. |
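A collaborator who only needs resource-group-level access, as described above, can be given a narrowly scoped role assignment instead of subscription-wide rights. A minimal Azure PowerShell sketch follows, assuming the Az.Resources module and an existing resource group; the sign-in name and resource group name are placeholders.

```powershell
# Sketch, assuming the Az.Resources module; the account and resource group are placeholders.
# Grant a collaborator Contributor rights on a single resource group only,
# rather than on the whole dev/test subscription.
New-AzRoleAssignment `
    -SignInName "collaborator@contoso.com" `
    -RoleDefinitionName "Contributor" `
    -ResourceGroupName "rg-devtest-project"
```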
devtest | How To Change Directory Tenants Visual Studio Azure | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-change-directory-tenants-visual-studio-azure.md | Title: Change directory tenants with your individual VSS Azure subscriptions description: Change directory tenants with your Azure subscriptions. +ms.technology: vs-subscriptions Previously updated : 10/12/2021 Last updated : 10/18/2023 # Change Directory Tenants with your Azure Subscriptions -Organizations may have several Azure credit subscriptions. Each subscription an organization sets up is associated with an [Microsoft Entra ID](../../active-directory/fundamentals/active-directory-whatis.md). +Organizations might have several Azure credit subscriptions. Each subscription an organization sets up is associated with an [Microsoft Entra ID](../../active-directory/fundamentals/active-directory-whatis.md). Microsoft Entra ID is Microsoft’s cloud-based identity and access management service that helps your employees sign in and access internal and external resources. -You may need to change the Active Directory you’re working in or [transfer your subscription to another Active Directory](../../role-based-access-control/transfer-subscription.md). +You might need to change the Active Directory you’re working in or [transfer your subscription to another Active Directory](../../role-based-access-control/transfer-subscription.md). -When activating your subscription, your identity is created based on the email you use. That identity is either associated with your organization’s Active Directory tenant or a new directory tenant is created for that identity. You can see the identity you’re using in the upper right-hand side of your Azure portal. +When you activate your subscription, your identity is created based on the email you use. That identity is either associated with your organization’s Active Directory tenant or a new directory tenant is created for that identity. You can see the identity you’re using in the upper right-hand side of your Azure portal. ![A screenshot of the Azure portal with the identity highlighted in the upper right-hand corner.](media/how-to-change-directory-tenants-visual-studio-azure/identity.png "The logged in identity is in the upper right-hand corner of your Azure portal.") -From here, you can either switch identities or switch directories. You may need to change your identity to access certain directories. +From here, you can either switch identities or switch directories. You might need to change your identity to access certain directories. If the identity you’re logged in as is associated with multiple directories, switch by selecting “Switch directory.” You'll see the directories your current identity is associated with. -![A screenshot of the Directory and Subscription window in Azure portal.](media/how-to-change-directory-tenants-visual-studio-azure/switch-directory.png "Switch directories by clicking switch directory. Choose the directory you want.") +![A screenshot of the Directory and Subscription window in Azure portal.](media/how-to-change-directory-tenants-visual-studio-azure/switch-directory.png "Switch directories by selecting 'Switch directory'. Choose the directory you want.") -Your experience within the portal is highly dependent on the directory associated with the identity you used. To change directory tenants, an Admin will have to add your identity as a user within the target directory. 
+Your experience within the portal is highly dependent on the directory associated with the identity you used. To change directory tenants, an Admin has to add your identity as a user within the target directory. <a name='importance-of-changing-your-azure-active-directory-tenant'></a> When you set up your Azure Credit Subscription through a Visual Studio license, If you set up your credit subscription using a personal account, your identity and work are isolated from your organizationΓÇÖs active directory. LetΓÇÖs say you've been working on an app that needs its own subscription to test and learn before deployment. Now, you need access to your organizationΓÇÖs work or references. Changing your directoryΓÇÖs tenant lets you access your organizationΓÇÖs resources and lets them access yours. -Below is a simple diagram that shows the basic steps taken when you change or transfer your subscription. +The following diagram shows the basic steps to take when you change or transfer your subscription. ![A diagram illustrating changing or transferring your subscription.](media/how-to-change-directory-tenants-visual-studio-azure/change-diagram.png "A diagram of what happens when you change or transfer your subscription.") To access another Active Directory, you need an active account with the necessar Once youΓÇÖve been added and given proper permissions, you can switch directories within your subscription. 1. Sign in and select the subscription you want to use from the [Subscriptions page in Azure portal](https://portal.azure.com/#blade/Microsoft_Azure_Billing/SubscriptionsBlade) -2. Select ΓÇ£Change DirectoryΓÇ¥ +2. Select **Change Directory** ![A screenshot a Contoso Enterprise Subscription page with Change Directory highlighted.](media/how-to-change-directory-tenants-visual-studio-azure/change-directory.png "Select Change Directory.") 3. A box will surface to choose the new directory -4. Select ΓÇ£ChangeΓÇ¥ +4. Select **Change** > [!NOTE] > If you donΓÇÖt have access to the target directory it will not show. After the directory is changed for the subscription, you'll receive a success message. - ![A screenshot of the Change Directory validation page.](media/how-to-change-directory-tenants-visual-studio-azure/change-button.png "Select the directory from the dropdown and click the Change button.") + ![A screenshot of the Change Directory validation page.](media/how-to-change-directory-tenants-visual-studio-azure/change-button.png "Select the directory from the dropdown and select the Change button.") 5. Select ΓÇ£Switch DirectoriesΓÇ¥ on the subscription page to access the new directory - ![A screenshot of the subscriptions page with Switch Directories highlighted.](media/how-to-change-directory-tenants-visual-studio-azure/switch-directories-outlined.png "Click Switch Directories to access the new directory.") + ![A screenshot of the subscriptions page with Switch Directories highlighted.](media/how-to-change-directory-tenants-visual-studio-azure/switch-directories-outlined.png "Select Switch Directories to access the new directory.") You can also access a target directory and change your tenancy by going to the target directory and adding an Admin. Follow [these instructions](/visualstudio/subscriptions/cloud-admin) on how to add an Admin to your subscription. Once thatΓÇÖs done, the Admin has access to both directories and can change the tenant directory for you. |
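The directory and identity switches recorded above can also be verified from the Azure CLI. This is a minimal sketch, assuming your identity already has access to the target tenant; the tenant ID and subscription name are placeholders.

```azurecli
# Sign in directly to the target directory (tenant ID is a placeholder).
az login --tenant "aaaabbbb-0000-cccc-1111-dddd2222eeee"

# List the subscriptions this identity can see, then select one.
az account list --output table
az account set --subscription "My Azure Credit Subscription"

# Confirm which tenant and subscription are currently active.
az account show --query "{tenant:tenantId, subscription:name}" --output table
```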
devtest | How To Manage Monitor Devtest | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-manage-monitor-devtest.md | Title: Managing and monitoring your Azure Dev/Test subscriptions description: Manage your Azure Dev/Test subscriptions with the flexibility of Azure's cloud environment. This guide also covers Azure Monitor to help maximize availability and performance for applications and services. +ms.technology: vs-subscriptions Previously updated : 10/12/2021 Last updated : 10/18/2023 Azure Monitor allows you to use dev/test benefits to optimize your appsΓÇÖ time Monitoring allows you to maximize your dev/test benefits with net new applications and existing applications. -- Pre-Production with Net New Green Field Applications ΓÇô with new apps, you create and enable custom metrics with log analytics or smart alerts in pre-production that you'll use in production. Using monitoring early refines your performance data for your production services -- Post-Production with Existing Applications ΓÇô when deploying new features or adding new functionality with an API to existing apps, you can deploy this feature in pre-production and tweak your monitoring to ensure correct data feeds early. Using tracking in the new feature's pre-production gives you clear visibility and allows you to blend this monitoring with your overall monitoring system after production. Integrates recent performance data with existing logs to ensure monitoring is used -- Monitoring different stages of non-production deployment mirrors monitoring during production. You manage your costs and analyze your spending before production and in post-production. +- Pre-Production with Net New Green Field Applications ΓÇô with new apps, you create and enable custom metrics with log analytics or smart alerts in preproduction that you use in production. Using monitoring early refines your performance data for your production services +- Post-Production with Existing Applications ΓÇô when deploying new features or adding new functionality with an API to existing apps, you can deploy this feature in preproduction and tweak your monitoring to ensure correct data feeds early. Using tracking in the new feature's preproduction gives you clear visibility and allows you to blend this monitoring with your overall monitoring system after production. Integrates recent performance data with existing logs to ensure monitoring is used +- Monitoring different stages of nonproduction deployment mirrors monitoring during production. You manage your costs and analyze your spending before production and in post-production. ## Cost Management -[Azure Cost Management](../../cost-management-billing/cost-management-billing-overview.md) allows you to improve the technical performance of your business workloads significantly. It can also reduce your costs and the overhead required to manage organizational assets. +[Microsoft Cost Management](../../cost-management-billing/cost-management-billing-overview.md) allows you to improve the technical performance of your business workloads significantly. It can also reduce your costs and the overhead required to manage organizational assets. -Using monitoring, you can use cost-analysis tools in pre-production before you deploy your services to run an analysis of your current pre-production compute to forecast production costs and potentially save money. 
+Using monitoring, you can use cost-analysis tools in preproduction before you deploy your services to run an analysis of your current preproduction compute to forecast production costs and potentially save money. ## Performance Management -Besides monitoring and cost management, you'll also want to performance test pre-production compute against expected workloads. +Besides monitoring and cost management, you also have to run performance tests in preproduction to validate against expected workloads. -In pre-production, you may set up scaling to expand based on load. When you test your application in a dev/test environment under load, you'll get better cost-analysis and monitoring figures. You'll have a better idea whether you need to start at a higher or lower scale. +In preproduction, you might set up scaling to expand based on load. When you test your application in a dev/test environment under load, you get better cost-analysis and monitoring figures. You have a better idea whether you need to start at a higher or lower scale. -Load and duress analysis provide another data in pre-production so you can continue to optimize your time to market and the delivery of your application. +Load and duress analysis provide another data in preproduction so you can continue to optimize your time to market and the delivery of your application. As you perform load and duress testing with your application or service, the method for scaling up or out depends on your workloads. You can learn more about scaling your apps in Azure: Enable monitoring for your application with [Application Insights](../../azure-m ## Azure Automation -[Azure automation](../../automation/automation-intro.md) delivers a cloud-based automation and configuration service that supports consistent management across your Azure and non-Azure environments. This tool gives you control during deployment, operations, and decommission of workloads and resources. Azure Automation is always on. It works with existing resources. Azure Automation lets you create resources or subscriptions on demand. You only pay for what you use. +[Azure Automation](../../automation/automation-intro.md) delivers a cloud-based automation and configuration service that supports consistent management across your Azure and non-Azure environments. This tool gives you control during deployment, operations, and decommission of workloads and resources. Azure Automation is always on. It works with existing resources. Azure Automation lets you create resources or subscriptions on demand. You only pay for what you use. Example: If you're following a dev/test production deployment, some of the resources and stages need to be up and running all the time. Others only need to update and run a few times a year. -Azure Automation becomes important in this scenario. When you're going into a new round of app development and you submit your first pull request (PR), you can kick off an automation job. The job will deploy infrastructure as code through an Azure Resource Manager (ARM) template to create all your resources in your Azure dev/test subscription during pre-production. +Azure Automation becomes important in this scenario. When you're going into a new round of app development and you submit your first pull request (PR), you can kick off an automation job. The job deploys infrastructure as code through an Azure Resource Manager (ARM) template to create all your resources in your Azure dev/test subscription during preproduction. 
## Azure Resource Manager [Azure Resource Manager (ARM) templates](../../azure-resource-manager/templates/overview.md) implement infrastructure as code for your Azure solutions. The template defines the infrastructure and configuration for your project. You can automate your deployments. -You can deploy your configurations as many times as you want to update the pre-production environment and track your costs. Using Azure Automation you can run and delete your ARM templates as needed. +You can deploy your configurations as many times as you want to update the preproduction environment and track your costs. Using Azure Automation you can run and delete your ARM templates as needed. When a service or resource only needs to be updated twice a year, use DevOps tools to deploy your ARM template. Let the automation job turn off your resource, and then redeploy it as needed. |
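As a rough sketch of the deploy-and-tear-down pattern described in the Azure Automation and ARM template row above, the following Azure CLI commands redeploy an environment from a template and remove it afterward. The resource group name, location, and template file names are assumptions, not values from the article.

```azurecli
# Create (or recreate) the preproduction environment from an ARM template.
az group create --name preprod-rg --location eastus
az deployment group create \
  --resource-group preprod-rg \
  --template-file azuredeploy.json \
  --parameters azuredeploy.parameters.json

# Tear the environment down when the test pass is finished to stop the spend.
az group delete --name preprod-rg --yes --no-wait
```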
devtest | How To Manage Reliability Performance | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-manage-reliability-performance.md | Title: Manage reliability and performance with Azure Dev/Test subscriptions description: Build reliability into your applications with Dev/Test subscriptions. +ms.technology: vs-subscriptions Previously updated : 10/12/2021 Last updated : 10/18/2023 # Reliability Management -While production services won't be in a dev/test subscription, you may use other stages in your Azure dev/test subscription to ensure reliability in production. +While production services are not in a dev/test subscription, you might use other stages in your Azure dev/test subscription to ensure reliability in production. When using your organization dev/test subscriptions, you have to decide how youΓÇÖre going to: When using your organization dev/test subscriptions, you have to decide how you - Control security and access - Manage the uptime of that production system -Typically, there are different stages of deployment that you go through before production ΓÇô shared, QA, integration, staging, and failover. Depending on how your company defines these stages, your use of an organization dev/test subscription may change. +Typically, there are different stages of deployment that you go through before production ΓÇô shared, QA, integration, staging, and failover. Depending on how your company defines these stages, your use of an organization dev/test subscription might change. -If you're running mission-critical services like customer-facing applications, don't use a dev/test subscription. Dev/Test subscriptions don't carry a financially backed SLA. These subscriptions are for pre-production testing and development. +If you're running mission-critical services like customer-facing applications, don't use a dev/test subscription. Dev/Test subscriptions don't carry a financially backed SLA. These subscriptions are for preproduction testing and development. ## Site Reliability Engineering (SRE) To learn more about reliability engineering and management, consider site reliab How SRE and DevOps differ is still under discussion in the field. Some broadly agreed upon differences include: - SRE is an engineering discipline focused on reliability. DevOps is a cultural movement that emerged from the urge to break down the silos associated with Development and Operations organizations. -- SRE can be the name of a role as in "IΓÇÖm a site reliability engineer (SRE)". DevOps can't. +- SRE can be the name of a role, as in: *IΓÇÖm a site reliability engineer (SRE)*. DevOps can't. - SRE tends to be prescriptive. DevOps is intentionally not. Nearly universal adoption of continuous integration/continuous delivery, and Agile principles are the closest DevOps comes. If you want to learn more about the practice of SRE, check out these links: Whether you need [Monthly Azure Credits for Visual Studio subscribers](https://a Visual Studio Azure credits are an individual benefit, for individual Dev/Test and inner loop development. You canΓÇÖt pool credits between developers. Credit subscriptions are still Azure subscriptions, but a specific Azure offer. Manage your credit subscriptions in the same way you manage other Azure subscriptions so you can work within groups and teams. You can remove individual spending limits with a credit card, or if your enterprise Dev/Test subscription goes to your company's chosen procurement method. 
-Developer inner loop activities often use credits, but then switch to enterprise or organization Azure Dev/Test subscriptions, including pay as you go. This way as you follow DevOps processes, you can inner loop with your individual credit subscription. In the DevOps outer loop, non-production targets are in enterprise Dev/Test - prod goes to prod. +Developer inner loop activities often use credits, but then switch to enterprise or organization Azure Dev/Test subscriptions, including pay as you go. This way as you follow DevOps processes, you can inner-loop with your individual credit subscription. In the DevOps outer loop, nonproduction targets are in enterprise Dev/Test - prod goes to prod. Manage your credit subscriptions, enterprise dev/test subscriptions, and PAYG subscriptions and segment your developers using [management groups](../../governance/management-groups/how-to/protect-resource-hierarchy.md) that each have a unique hierarchy. If you need an organization Azure Dev/Test subscription, you have two offers to - [Pay-As-You-Go (PAYG) Dev/Test (0023P)](https://azure.microsoft.com/offers/ms-azr-0023p/) - [Enterprise Dev/Test (0148P)](https://azure.microsoft.com/offers/ms-azr-0148p/) -Each comes with their own set of discounts and require a Visual Studio Subscription. +Each option comes with their own set of discounts and requires a Visual Studio Subscription. -Each subscription offer allows you to get your team up and running with dev/test environments in the cloud using pre-configured virtual machines. Create multiple Azure subscriptions and manage them from one account. You can maintain isolated environments and a separate bill for different projects or teams. +Each subscription offer allows you to get your team up and running with dev/test environments in the cloud using preconfigured virtual machines. Create multiple Azure subscriptions and manage them from one account. You can maintain isolated environments and a separate bill for different projects or teams. Enterprise Dev/Test Subscriptions require an enterprise agreement (EA). Pay-As-You-Go Dev/Test Subscriptions don't require an EA but can be used with an enterprise agreement account. Enterprise Dev/Test Subscriptions require an enterprise agreement (EA). Pay-As-Y A PAYG Dev/Test offer might be the right fit to use as a Visual Studio subscriber. Unlike credit subscriptions for individual use, PAYG offers are great for team development and allow you to have multiple users within one subscription. A PAYG Dev/Test offer might be right for you if: - You don't have an enterprise agreement. In this case, you can only create a PAYG account with a Visual Studio license. -- You're creating an enterprise agreement, but you need to set up a subscription that doesn't use your organizationΓÇÖs agreement. You may have a unique project that requires its own subscription or to create an isolated environment billed separately for projects or teams. -- You prefer to keep identities isolated. You may need certain identities to remain separate from others to protect access to data, resources, and apps. +- You're creating an enterprise agreement, but you need to set up a subscription that doesn't use your organizationΓÇÖs agreement. You might have a unique project that requires its own subscription or to create an isolated environment billed separately for projects or teams. +- You prefer to keep identities isolated. You might need certain identities to remain separate from others to protect access to data, resources, and apps. |
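The management-group segmentation mentioned above can be scripted roughly as follows; the management group name and subscription ID are placeholders.

```azurecli
# Create a management group for nonproduction subscriptions.
az account management-group create --name "devtest" --display-name "Dev/Test"

# Move a credit or enterprise dev/test subscription under it.
az account management-group subscription add \
  --name "devtest" \
  --subscription "00000000-0000-0000-0000-000000000000"
```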
devtest | How To Remove Credit Limits | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-remove-credit-limits.md | Title: Removing credit limits and changing Azure Dev/Test offers description: How to remove credit limits and change Azure Dev/Test offers. Switch from pay-as-you-go to another offer. +ms.technology: vs-subscriptions Previously updated : 10/04/2021 Last updated : 10/18/2023 -# Changing Azure DevTest offers & remove limits +# Change Azure DevTest offers & remove limits -If you currently have a subscription with pay-as-you-go rates, you can switch your Azure subscription to another offer. Your reasons for changing offers may be specific to your organizational structure. But here are some common reasons you may want to change your subscription offer or remove credit limits. +If you currently have a subscription with pay-as-you-go rates, you can switch your Azure subscription to another offer. Your reasons for changing offers might be specific to your organizational structure. But here are some common reasons you might want to change your subscription offer or remove credit limits. - The credit limit allows you to run experiments, tests, and so on, without charging your credit card. Money can be spent quickly when running services. The credit limit puts guard rails around your subscription. You can create your desired architecture and run it as long as needed without overspending. -- Working with other organizations or changes within your organization may require you to change offer types. -- Your needs may change for an Azure subscription such that you need to continue to modify the current offer type. You may also need to change how you work on a project. In this case, you may need to switch from a credit limit structure to a pay-as-you-go structure to run the services you need. -- You have a critical project to run and need more time. You may remove limits if you need a little more than the $150 or $50 limit. +- Working with other organizations or changes within your organization might require you to change offer types. +- Your needs might change for an Azure subscription such that you need to continue to modify the current offer type. You might also need to change how you work on a project. In this case, you might need to switch from a credit limit structure to a pay-as-you-go structure to run the services you need. +- You have a critical project to run and need more time. You might remove limits if you need a little more than the $150 or $50 limit. - Your current offer has expired. Changing your subscription offer is easy for certain offer types. You can follow [our Switching Azure offers documentation](../../cost-management-billing/manage/switch-azure-offer.md), or contact support for more assistance. Changing your subscription offer is easy for certain offer types. You can follow Azure Monthly Credit Subscriptions [cost management](../../cost-management-billing/costs/cost-mgt-best-practices.md) provides you with the tools to plan for, analyze, and reduce your spending to maximize your cloud investment. Those tools give you the ability to remove and add back your spending limits. -The spending limit in Azure prevents overspending. While you canΓÇÖt change the limit, you can remove the spending limit. You'll either have no limit or a limit equal to the amount of the credit. +The spending limit in Azure prevents overspending. While you canΓÇÖt change the limit, you can remove the spending limit. 
You either have no limit or a limit equal to the amount of the credit. You can [remove the spending limit](../../cost-management-billing/manage/spending-limit.md) at any time as long as there's a valid payment method associated with your Azure subscription. More Cost Management docs: * [Understand Cost Management Data](../../cost-management-billing/costs/understand-cost-mgt-data.md) -* [Optimize Your Cloud Investment with Azure Cost Management](../../cost-management-billing/costs/cost-mgt-best-practices.md) +* [Optimize Your Cloud Investment with Microsoft Cost Management](../../cost-management-billing/costs/cost-mgt-best-practices.md) ## Payment Methods and Overages -When removing limits or changing offers, you may be required to [update or add a new credit](../../cost-management-billing/manage/change-credit-card.md). You won't be able to use your credit card for overages until your credit has been exhausted. +When removing limits or changing offers, you might be required to [update or add a new credit](../../cost-management-billing/manage/change-credit-card.md). You aren't able to use your credit card for overages until your credit has been exhausted. In the Azure portal, you can change your default payment method to a new credit card and update your credit card details. You must be an [Account Administrator](../../cost-management-billing/understand/subscription-transfer.md#whoisaa) or you must have the correct [MCA permissions](../../cost-management-billing/manage/understand-mca-roles.md) to make these changes. If you want to delete a credit card, see [Delete an Azure billing payment method](../../cost-management-billing/manage/delete-azure-payment-method.md) -You can also update your payment method to check/wire transfer instead of by credit card. You'll first need approval to pay by invoice. To get approved, see [Pay for Azure subscriptions by invoice](../../cost-management-billing/manage/pay-by-invoice.md) +You can also update your payment method to check/wire transfer instead of by credit card. You first need approval to pay by invoice. To get approved, see [Pay for Azure subscriptions by invoice](../../cost-management-billing/manage/pay-by-invoice.md) |
devtest | How To Sign Into Azure With Github | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/how-to-sign-into-azure-with-github.md | Title: Sign into Azure Dev/Test with your GitHub credentials description: Sign into an individual Monthly Azure Credit Subscription using GitHub credentials. Previously updated : 10/12/2021 Last updated : 10/18/2023 +ms.technology: vs-subscriptions After signing into GitHub and authorizing the Microsoft application, you'll get ## Sign in to Azure Credit Subscription -[Azure Credit Subscription for Visual Studio Subscribers](https://azure.microsoft.com/pricing/member-offers/credit-for-visual-studio-subscribers) offers a suite of services to create, deploy, and manage applications on your preferred platforms and devices. Use your credits in Azure for learning, development, and test services, plus collaboration tools and access to Azure DevOps Services. +[Azure Credit Subscription for Visual Studio Subscribers](https://azure.microsoft.com/pricing/member-offers/credit-for-visual-studio-subscribers) offers a suite of services to create, deploy, and manage applications on your preferred platforms and devices. Use your credits in Azure for learning, development, and test services, plus collaboration tools and access to Azure DevOps. Support for GitHub authentication makes it easier to experiment with Azure services. Choose from Virtual Machines, Websites, SQL Databases, and Mobile Services. Use [Azure Virtual Desktop](../../virtual-desktop/overview.md) to deploy and manage hundreds of VMs for development and testing purposes. If you havenΓÇÖt activated your credits yet, go to the [Monthly Azure Credits fo ![Screenshot of a Microsoft account sign in page.](media/how-to-sign-into-azure-with-github/sign-in-with-github.png "Sign in with Email, phone, or Skype.") -Once you complete the sign-in process, you'll be taken to the last Azure Visual Studio organization you visited. If youΓÇÖre brand new to Azure Visual Studio and Credits Subscription, youΓÇÖll land within a new organization created for you. +Once you complete the sign-in process, you're taken to the last Azure Visual Studio organization you visited. If youΓÇÖre brand new to Azure Visual Studio and Credits Subscription, you land within a new organization created for you. ## Access all of Microsoft online services Beside accessing developer services such as Azure Dev/Test, Monthly Credits, and Visual Studio, use your GitHub account to access all Microsoft online services, from Excel Online to Xbox. -When authenticating with those services, you can find your GitHub account after clicking on ΓÇ£Sign-in optionsΓÇ¥. +When authenticating with those services, you can find your GitHub account after selecting **Sign-in options**. ## Our commitment to your privacy -When you first use your GitHub account to sign in with Microsoft, GitHub will ask for permission to release your profile information. +When you first use your GitHub account to sign in with Microsoft, GitHub requests permission to release your profile information. -If you agree, GitHub shares the email addresses on your GitHub account and profile information. WeΓÇÖll use this data to check whether you have an account with us. We'll create an account if you donΓÇÖt. Connecting your GitHub identity to a Microsoft identity doesn't give Microsoft access to your repositories in GitHub. Apps like Visual Studio will request access to your repositories if they need to work with your code. 
YouΓÇÖll need to consent to that request separately. +If you agree, GitHub shares the email addresses on your GitHub account and profile information. This data is used to check whether you have an account with us. We'll create an account if you donΓÇÖt. Connecting your GitHub identity to a Microsoft identity doesn't give Microsoft access to your repositories in GitHub. Apps like Visual Studio will request access to your repositories if they need to work with your code. YouΓÇÖll need to consent to that request separately. -While your GitHub account is used to log into your Microsoft account, theyΓÇÖre separate accounts. One account just uses the other as a login method. Changes you make to your GitHub account (like changing the password or enabling two-factor authentication) wonΓÇÖt change your Microsoft account or the other way around. Manage your GitHub and Microsoft identities in yourΓÇ»[account management page](https://account.live.com/proofs/manage/). Just look under the Security tab. +While your GitHub account is used to log into your Microsoft account, theyΓÇÖre separate accounts. One account just uses the other as a sign-in method. Changes you make to your GitHub account (like changing the password or enabling two-factor authentication) donΓÇÖt change your Microsoft account or the other way around. Manage your GitHub and Microsoft identities in yourΓÇ»[account management page](https://account.live.com/proofs/manage/). Just look under the Security tab. ## Start exploring Azure Visual Studio and Monthly Credit Subscription now Go to the [Monthly Azure Credits for Visual Studio Subscribers](https://azure.microsoft.com/pricing/member-offers/credit-for-visual-studio-subscribers/) page to learn more to get started. -If you have questions, check outΓÇ»[the support page](https://support.microsoft.com/help/4501231/microsoft-account-link-your-github-account). Let us know what you think in the comments below. As always, weΓÇÖd love to hear any feedback or suggestions you have. +If you have questions, check outΓÇ»[the support page](https://support.microsoft.com/help/4501231/microsoft-account-link-your-github-account). |
devtest | Overview What Is Devtest Offer Visual Studio | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/overview-what-is-devtest-offer-visual-studio.md | Title: What is Azure Dev/Test offer? description: Use the Azure Dev/Test offer to get Azure credits for Visual Studio subscribers. +ms.technology: vs-subscriptions Previously updated : 10/12/2021 Last updated : 10/18/2023 adobe-target: true |
devtest | Quickstart Create Enterprise Devtest Subscriptions | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/quickstart-create-enterprise-devtest-subscriptions.md | Title: Creating Enterprise Azure Dev/Test subscriptions description: Create Enterprise and Organizational Azure Dev/Test subscriptions for teams and large organizations. +ms.technology: vs-subscriptions Previously updated : 10/20/2021 Last updated : 10/18/2023 To create an enterprise Azure dev/test subscription, you must have an account ow ### Important information before you add Account Owners -An EA Account Owner **cannot** use the same sign-in account for the EA Portal and other Azure offers. Suppose you were added to the EA Portal as an Account Owner and used the same credentials you used for your individual Visual Studio Benefits. In that case, that Visual Studio subscription will be converted to the EA Dev/Test offer. +An EA Account Owner **cannot** use the same sign-in account for the EA Portal and other Azure offers. Suppose you were added to the EA Portal as an Account Owner and used the same credentials you used for your individual Visual Studio Benefits. In that case, that Visual Studio subscription is converted to the EA Dev/Test offer. > [!Note] > The first time you sign in to the EA Portal as an account owner, you'll see a warning pop-up. ItΓÇÖs essential to read and understand. Your existing subscriptions will be converted to a billable EA offer. If you're a Visual Studio subscriber added as an Account Owner, you'll lose your individual monthly Azure credit until you take further action. To create your enterprise dev/test subscriptions, first choose an enrollment acc ![A screenshot of the Enterprise Agreement management portal page.](media/quickstart-create-enterprise-devtest-subscriptions/ea-manage-portal.png "Manage your company's enrollments from the Azure Enterprise Portal.") -3. Once youΓÇÖve selected the appropriate enrollment number, select the ΓÇ£AccountΓÇ¥ tab -4. From here, you can edit the Account(s) and toggle the dev/test flag to ΓÇ£Yes.ΓÇ¥ +3. Once youΓÇÖve selected the appropriate enrollment number, select the **Account** tab +4. From here, you can edit the Account(s) and toggle the dev/test flag to **Yes**. To add an account: To add an account: 6. Enter the Account Owner Email address to associate with the new Account. 7. Confirm the email address and then select Add. -![A screenshot of the Azure Enterprise Portal management page with +Add Account selected.](media/quickstart-create-enterprise-devtest-subscriptions/add-account.png "Click on Add Account.") +![A screenshot of the Azure Enterprise Portal management page with +Add Account selected.](media/quickstart-create-enterprise-devtest-subscriptions/add-account.png "Select Add Account.") ## Add your Azure Enterprise Dev/Test Subscription After you've chosen the account to create an enterprise Azure dev/test subscription, add the appropriate subscription offer. There are two ways. No matter what you pick, the experience is the same. -1. The subscription tab in each portal will be a ΓÇ£+/AddΓÇ¥ button. - 1. You'll go to a sign-in page. +1. The subscription tab in each portal has a **+/Add** button. ++ 1. Go to a sign-in page. 1. Sign in with your Microsoft Azure credentials.- 1. After you sign in, you're directed to a page that lists the offers available. Toggle **yes** in the EA portal to see enterprise offers. + 1. After you sign in, you're directed to a page that lists the offers available. 
Toggle **Yes** in the EA portal to see enterprise offers. 1. Select the enterprise dev/test offer and create your subscription.+ 1. You must have access and permissions associated with your identity. 1. You must designate the Account as a dev/test account within the enrollment portal. |
devtest | Quickstart Individual Credit | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/quickstart-individual-credit.md | Title: Start using individual Azure Dev/Test credit description: As a Visual Studio subscriber, learn how to access an Azure Credit subscription. Previously updated : 11/24/2021 Last updated : 10/18/2023 +ms.technology: vs-subscriptions # Get started with your individual Azure Credit Subscription -When accessing your Azure Credit Subscription, you have two ways to sign in and use your credits. +When you access your Azure Credit Subscription, you have two ways to sign in and use your credits. You have the opportunity to use a personal or corporate email address when signing in. You have the opportunity to use a personal or corporate email address when signi The email you use should follow your organizationΓÇÖs requirements ΓÇô either you use a personal email OR your organizationΓÇÖs provided email. **Using a Personal Email Account or Microsoft Account (MSA)** -If you sign in using a personal account like Gmail, Yahoo, or Hotmail, your organization won't have access to your work, nor will you have access to their Active Directory unless that email is added by an admin. When you sign in with a personal email, a new Azure Directory will be created along with an Azure Tenancy on your behalf. Your profile will be created as the Top-Level Admin with access for adding other admins and users. +If you sign in using a personal account like Gmail, Yahoo, or Hotmail, your organization doesn't have access to your work, nor do you have access to their Active Directory unless that email is added by an admin. When you sign in with a personal email, a new Azure Directory is created along with an Azure Tenancy on your behalf. Your profile is created as the Top-Level Admin with access for adding other admins and users. **Using a Corporate Email Account or Work School Account (WSA)** -If you sign in with your organizationΓÇÖs account, you'll be added as a user within your organization's Active Directory, giving your organization access to manage your account and work. +If you sign in with your organizationΓÇÖs account, you're added as a user within your organization's Active Directory, giving your organization access to manage your account and work. ## Activate your Azure Credit Subscription Once you receive a my.visualstudio.com license, you can sign in via my.visualstudio.com or portal.azure.com.-Remember, the account you sign in with will dictate what tenant your directory will be under. +Remember, the account you sign in with dictates what tenant your directory is under. ### Sign-In through [Visual Studio](https://my.visualstudio.com/benefits) - my.visualstudio.com/benefits -1. Locate the Azure tile in the Tools section on the Benefits page and select 'Activate' to access the appropriate subscription +1. Locate the Azure tile in the Tools section on the Benefits page and select **Activate** to access the appropriate subscription - ![A screenshot of the Azure Monthly credit activation page.](media/quickstart-individual-credit/activate.png "Click Activate to access your subscription.") -2. If you don't have an existing Azure subscription, you'll be asked to fill in the required information to create your Azure subscription. The first step is to provide your personal information and then select **Next**. 
+ ![A screenshot of the Azure Monthly credit activation page.](media/quickstart-individual-credit/activate.png "Select Activate to access your subscription.") +2. If you don't have an existing Azure subscription, you're asked to fill in the required information to create your Azure subscription. The first step is to provide your personal information and then select **Next**. - ![A screenshot of the About You subscription page.](media/quickstart-individual-credit/azure-about-you.png "Enter your information and click 'Next'.") -3. Next, you'll need to verify your identity using a simple verification code. Provide your phone number, and choose whether you'd like to receive the code by text or phone. Enter the code you receive, and select **Verify code**. + ![A screenshot of the About You subscription page.](media/quickstart-individual-credit/azure-about-you.png "Enter your information and select 'Next'.") +3. Next, you need to verify your identity using a simple verification code. Provide your phone number, and choose whether you'd like to receive the code by text or phone. Enter the code you receive, and select **Verify code**. ![A screenshot of the Identity verification by phone screen.](media/quickstart-individual-credit/azure-identity.png) 4. For the final step, select the checkbox to accept the terms, then select **Sign up**. ![A screenshot of the Agreement screen.](media/quickstart-individual-credit/azure-agreement.png) -5. The Azure Dashboard Quickstart Center will load. +5. The Azure Dashboard Quickstart Center loads. ![A screenshot of the Azure Dashboard Quickstart Center with a Welcome to Microsoft Azure pop up.](media/quickstart-individual-credit/azure-quick-start.png) 6. Bookmark the [Azure portal](https://portal.azure.com) for easy access in the future. For more details go to my.visualstudio.com, or follow this link: [Use Microsoft ![A screenshot of the Microsoft Azure pick an account screen.](media/quickstart-individual-credit/pick-an-account.png "Select an account to sign in to the Azure portal.") -2. Once youΓÇÖre logged in, go to Subscriptions under Azure Services. +2. Once youΓÇÖre logged in, go to **Subscriptions** under **Azure Services**. ![A screenshot of services available in the Azure portal](media/quickstart-individual-credit/azure-services.png "Select Subscriptions under Azure Services.") 3. Select **+ Add**. - ![A screenshot of a pop up window for adding a subscription](media/quickstart-individual-credit/click-add.png "Click the add button.") + ![A screenshot of a pop up window for adding a subscription.](media/quickstart-individual-credit/click-add.png "Select the add button.") + 4. This action takes you to a page where you can find the eligible offers.+ 5. Select the correct subscription offer to associate with your account. > [!NOTE] For more details go to my.visualstudio.com, or follow this link: [Use Microsoft <a name="maintain-a-subscription-to-use-monthly-credits"></a> ## Troubleshoot removed/expired subscriptions -If your Visual Studio subscription expires or is removed, all the subscription benefits, including the monthly Azure dev/test individual credit are no longer available. To learn how to continue using Azure with a monthly credit, read "[Renew an expired subscription, purchase a new one, or transfer your Azure resources](troubleshoot-expired-removed-subscription.md)". +If your Visual Studio subscription expires or is removed, all the subscription benefits, including the monthly Azure dev/test individual credit are no longer available. 
To learn how to continue using Azure with a monthly credit, read [Renew an expired subscription, purchase a new one, or transfer your Azure resources](troubleshoot-expired-removed-subscription.md). |
devtest | Troubleshoot Expired Removed Subscription | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/devtest/offer/troubleshoot-expired-removed-subscription.md | Title: Troubleshoot expired Visual Studio subscription description: Learn how to renew an expired subscription, purchase a new one, or transfer your Azure resources. Previously updated : 12/15/2021 Last updated : 10/18/2023 +ms.technology: vs-subscriptions # Renew an expired subscription, purchase a new one, or transfer your Azure resources -If your Visual Studio subscription expires or is removed, all the subscription benefits, including the monthly Azure dev/test individual credit are no longer available. To continue using Azure with a monthly credit, you will need to renew your subscription, purchase a new subscription, and/or transfer your Azure resources to a different Azure subscription that includes the Azure dev/test individual credit. +If your Visual Studio subscription expires or is removed, all the subscription benefits, including the monthly Azure dev/test individual credit, are no longer available. To continue using Azure with a monthly credit, you need to renew your subscription, purchase a new subscription, and/or transfer your Azure resources to a different Azure subscription that includes the Azure dev/test individual credit. > [!IMPORTANT] > You must transfer your resources to another Azure subscription before your current Azure subscription is disabled, or you will lose access to your data. |
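One hedged way to move resources into another subscription before the current one is disabled is `az resource move`, sketched below. The subscription IDs, resource group names, and resource ID are placeholders, and not every resource type supports cross-subscription moves.

```azurecli
# Move a resource to a subscription that still has dev/test credit (placeholder IDs).
az resource move \
  --destination-subscription-id "11111111-1111-1111-1111-111111111111" \
  --destination-group "target-rg" \
  --ids "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/source-rg/providers/Microsoft.Storage/storageAccounts/examplestorage001"
```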
event-grid | Communication Services Advanced Messaging Events | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/event-grid/communication-services-advanced-messaging-events.md | |
expressroute | Expressroute Howto Add Gateway Portal Resource Manager | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/expressroute/expressroute-howto-add-gateway-portal-resource-manager.md | The steps for this tutorial use the values in the following configuration refere * Gateway Public IP Name = "MyERGWVIP" * Gateway type = "ExpressRoute" This type is required for an ExpressRoute configuration. + > [!IMPORTANT] + > ExpressRoute Virtual Network Gateways no longer support the Basic Public IP SKU. Please associate a Standard IP to create the Virtual Network Gateway. + > + > + ## Create the gateway subnet 1. In the [portal](https://portal.azure.com), navigate to the Resource Manager virtual network for which you want to create a virtual network gateway. In this tutorial, you learned how to create a virtual network gateway. For more To learn how to link your virtual network to an ExpressRoute circuit, advance to the next tutorial. > [!div class="nextstepaction"]-> [Link a Virtual Network to an ExpressRoute circuit](expressroute-howto-linkvnet-portal-resource-manager.md) +> [Link a Virtual Network to an ExpressRoute circuit](expressroute-howto-linkvnet-portal-resource-manager.md) |
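For the Standard public IP requirement called out in the ExpressRoute gateway row above, the equivalent Azure CLI steps look roughly like this sketch. Only the public IP name (`MyERGWVIP`) and gateway type come from the article's configuration reference; the resource group, virtual network, gateway name, and gateway SKU are assumptions.

```azurecli
# Standard SKU public IP (Basic is no longer supported for ExpressRoute gateways).
az network public-ip create \
  --resource-group TestRG \
  --name MyERGWVIP \
  --sku Standard \
  --allocation-method Static

# ExpressRoute virtual network gateway that uses the Standard public IP.
az network vnet-gateway create \
  --resource-group TestRG \
  --name MyERGW \
  --vnet TestVNet \
  --gateway-type ExpressRoute \
  --sku Standard \
  --public-ip-addresses MyERGWVIP \
  --no-wait
```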
expressroute | Expressroute Locations Providers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/expressroute/expressroute-locations-providers.md | The following table shows connectivity locations and the service providers for e | **Abu Dhabi** | Etisalat KDC | 3 | UAE Central | Supported | | | **Amsterdam** | [Equinix AM5](https://www.equinix.com/locations/europe-colocation/netherlands-colocation/amsterdam-data-centers/am5/) | 1 | West Europe | Supported | Aryaka Networks<br/>AT&T NetBond<br/>British Telecom<br/>Colt<br/>Equinix<br/>euNetworks<br/>GÉANT<br/>InterCloud<br/>Interxion<br/>KPN<br/>IX Reach<br/>Level 3 Communications<br/>Megaport<br/>NTT Communications<br/>Orange<br/>Tata Communications<br/>Telefonica<br/>Telenor<br/>Telia Carrier<br/>Verizon<br/>Zayo | | **Amsterdam2** | [Interxion AMS8](https://www.interxion.com/Locations/amsterdam/schiphol/) | 1 | West Europe | Supported | BICS<br/>British Telecom<br/>CenturyLink Cloud Connect<br/>Colt<br/>DE-CIX<br/>Equinix<br/>euNetworks<br/>GÉANT<br/>Interxion<br/>Megaport<br/>NL-IX<br/>NOS<br/>NTT Global DataCenters EMEA<br/>Orange<br/>Vodafone |-| **Atlanta** | [Equinix AT2](https://www.equinix.com/locations/americas-colocation/united-states-colocation/atlanta-data-centers/at2/) | 1 | n/a | Supported | Equinix<br/>Megaport | +| **Atlanta** | [Equinix AT1](https://www.equinix.com/data-centers/americas-colocation/united-states-colocation/atlanta-data-centers/at1) | 1 | n/a | Supported | Equinix<br/>Megaport | | **Auckland** | [Vocus Group NZ Albany](https://www.vocus.co.nz/business/cloud-data-centres) | 2 | n/a | Supported | Devoli<br/>Kordia<br/>Megaport<br/>REANNZ<br/>Spark NZ<br/>Vocus Group NZ | | **Bangkok** | [AIS](https://business.ais.co.th/solution/en/azure-expressroute.html) | 2 | n/a | Supported | AIS<br/>National Telecom UIH | | **Berlin** | [NTT GDC](https://services.global.ntt/en-us/newsroom/ntt-ltd-announces-access-to-microsoft-azure-expressroute-at-ntts-berlin-1-data-center) | 1 | Germany North | Supported | Colt<br/>Equinix<br/>NTT Global DataCenters EMEA | |
firewall-manager | Secured Virtual Hub | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/firewall-manager/secured-virtual-hub.md | -A *secured* virtual hub is an [Azure Virtual WAN Hub](../virtual-wan/virtual-wan-about.md#resources) with associated security and routing policies configured by Azure Firewall Manager. Use secured virtual hubs to easily create hub-and-spoke and transitive architectures with native security services for traffic governance and protection. +A *secured* virtual hub is an [Azure Virtual WAN Hub](../virtual-wan/virtual-wan-about.md#resources) with associated security and routing policies configured by Azure Firewall Manager. Use secured virtual hubs to easily create hub-and-spoke and transitive architectures with native security services for traffic governance and protection. Multiple secured hubs can be deployed within the same region or across regions as part of the same Virtual WAN resource. Use Firewall Manager in the Azure portal to add additional secured hubs. > [!IMPORTANT] > Currently, Azure Firewall in secured virtual hubs (vWAN) is not supported in Qatar and Poland Central. |
firewall | Protect Azure Kubernetes Service | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/firewall/protect-azure-kubernetes-service.md | Azure Kubernetes Service (AKS) offers a managed Kubernetes cluster on Azure. For Despite AKS being a fully managed solution, it doesn't offer a built-in solution to secure ingress and egress traffic between the cluster and external networks. Azure Firewall offers a solution to this. -AKS clusters are deployed on a virtual network. This network can be managed (created by AKS) or custom (pre-configured by the user beforehand). In either case, the cluster has outbound dependencies on services outside of that virtual network (the service has no inbound dependencies). For management and operational purposes, nodes in an AKS cluster need to access [certain ports and fully qualified domain names (FQDNs)](../aks/outbound-rules-control-egress.md) describing these outbound dependencies. This is required for various functions including, but not limited to, the nodes that communicate with the Kubernetes API server. They download and install core Kubernetes cluster components and node security updates, or pull base system container images from Microsoft Container Registry (MCR), and so on. These outbound dependencies are almost entirely defined with FQDNs, which don't have static addresses behind them. The lack of static addresses means that Network Security Groups can't be used to lock down outbound traffic from an AKS cluster. For this reason, by default, AKS clusters have unrestricted outbound (egress) Internet access. This level of network access allows nodes and services you run to access external resources as needed. +AKS clusters are deployed on a virtual network. This network can be managed (created by AKS) or custom (preconfigured by the user beforehand). In either case, the cluster has outbound dependencies on services outside of that virtual network (the service has no inbound dependencies). For management and operational purposes, nodes in an AKS cluster need to access [certain ports and fully qualified domain names (FQDNs)](../aks/outbound-rules-control-egress.md) describing these outbound dependencies. This is required for various functions including, but not limited to, the nodes that communicate with the Kubernetes API server. They download and install core Kubernetes cluster components and node security updates, or pull base system container images from Microsoft Container Registry (MCR), and so on. These outbound dependencies are almost entirely defined with FQDNs, which don't have static addresses behind them. The lack of static addresses means that Network Security Groups can't be used to lock down outbound traffic from an AKS cluster. For this reason, by default, AKS clusters have unrestricted outbound (egress) Internet access. This level of network access allows nodes and services you run to access external resources as needed. -However, in a production environment, communications with a Kubernetes cluster should be protected to prevent against data exfiltration along with other vulnerabilities. All incoming and outgoing network traffic must be monitored and controlled based on a set of security rules. If you want to do this, you'll have to restrict egress traffic, but a limited number of ports and addresses must remain accessible to maintain healthy cluster maintenance tasks and satisfy those outbound dependencies previously mentioned. 
+However, in a production environment, communications with a Kubernetes cluster should be protected to prevent against data exfiltration along with other vulnerabilities. All incoming and outgoing network traffic must be monitored and controlled based on a set of security rules. If you want to do this, you have to restrict egress traffic, but a limited number of ports and addresses must remain accessible to maintain healthy cluster maintenance tasks and satisfy those outbound dependencies previously mentioned. The simplest solution uses a firewall device that can control outbound traffic based on domain names. A firewall typically establishes a barrier between a trusted network and an untrusted network, such as the Internet. Azure Firewall, for example, can restrict outbound HTTP and HTTPS traffic based on the FQDN of the destination, giving you fine-grained egress traffic control, but at the same time allows you to provide access to the FQDNs encompassing an AKS clusterΓÇÖs outbound dependencies (something that NSGs can't do). Likewise, you can control ingress traffic and improve security by enabling threat intelligence-based filtering on an Azure Firewall deployed to a shared perimeter network. This filtering can provide alerts, and deny traffic to and from known malicious IP addresses and domains. FWROUTE_NAME_INTERNET="${PREFIX}-fwinternet" ### Create a virtual network with multiple subnets -Provision a virtual network with two separate subnets, one for the cluster, one for the firewall. Optionally you could also create one for internal service ingress. +Create a virtual network with two separate subnets, one for the cluster, one for the firewall. Optionally you could also create one for internal service ingress. ![Empty network topology](../aks/media/limit-egress-traffic/empty-network.png) az network firewall create -g $RG -n $FWNAME -l $LOC --enable-dns-proxy true The IP address created earlier can now be assigned to the firewall frontend. + > [!NOTE] > Set up of the public IP address to the Azure Firewall may take a few minutes.-> To leverage FQDN on network rules we need DNS proxy enabled, when enabled the firewall will listen on port 53 and will forward DNS requests to the DNS server specified above. This will allow the firewall to translate that FQDN automatically. +> To leverage FQDN on network rules we need DNS proxy enabled, when enabled the firewall will listen on port 53 and will forward DNS requests to the DNS server specified previously. This will allow the firewall to translate that FQDN automatically. ```azurecli # Configure Firewall IP Config FWPRIVATE_IP=$(az network firewall show -g $RG -n $FWNAME --query "ipConfigurati Azure automatically routes traffic between Azure subnets, virtual networks, and on-premises networks. If you want to change any of Azure's default routing, you do so by creating a route table. -Create an empty route table to be associated with a given subnet. The route table will define the next hop as the Azure Firewall created above. Each subnet can have zero or one route table associated to it. +Create an empty route table to be associated with a given subnet. The route table will define the next hop as the Azure Firewall created previously. Each subnet can have zero or one route table associated to it. 
```azurecli # Create UDR and add a route for Azure Firewall az network route-table route create -g $RG --name $FWROUTE_NAME --route-table-na az network route-table route create -g $RG --name $FWROUTE_NAME_INTERNET --route-table-name $FWROUTE_TABLE_NAME --address-prefix $FWPUBLIC_IP/32 --next-hop-type Internet ``` -See [virtual network route table documentation](../virtual-network/virtual-networks-udr-overview.md#user-defined) about how you can override Azure's default system routes or add additional routes to a subnet's route table. +See [virtual network route table documentation](../virtual-network/virtual-networks-udr-overview.md#user-defined) about how you can override Azure's default system routes or add more routes to a subnet's route table. ### Adding firewall rules See [virtual network route table documentation](../virtual-network/virtual-netwo > For applications outside of the kube-system or gatekeeper-system namespaces that needs to talk to the API server, an additional network rule to allow TCP communication to port 443 for the API server IP in addition to adding application rule for fqdn-tag AzureKubernetesService is required. - You can use the following three network rules to configure your firewall. You may need to adapt these rules based on your deployment. The first rule allows access to port 9000 via TCP. The second rule allows access to port 1194 and 123 via UDP. Both these rules will only allow traffic destined to the Azure Region CIDR that we're using, in this case East US. + You can use the following three network rules to configure your firewall. You might need to adapt these rules based on your deployment. The first rule allows access to port 9000 via TCP. The second rule allows access to port 1194 and 123 via UDP. Both these rules only allow traffic destined to the Azure Region CIDR that we're using, in this case East US. Finally, we add a third network rule opening port 123 to an Internet time server FQDN (for example:`ntp.ubuntu.com`) via UDP. Adding an FQDN as a network rule is one of the specific features of Azure Firewall, and you need to adapt it when using your own options. -After setting the network rules, we'll also add an application rule using the `AzureKubernetesService` that covers the needed FQDNs accessible through TCP port 443 and port 80. In addition, you may need to configure additional network and application rules based on your deployment. For more information, see [Outbound network and FQDN rules for Azure Kubernetes Service (AKS) clusters](../aks/outbound-rules-control-egress.md#required-outbound-network-rules-and-fqdns-for-aks-clusters). +After setting the network rules, we'll also add an application rule using the `AzureKubernetesService` that covers the needed FQDNs accessible through TCP port 443 and port 80. In addition, you might need to configure more network and application rules based on your deployment. For more information, see [Outbound network and FQDN rules for Azure Kubernetes Service (AKS) clusters](../aks/outbound-rules-control-egress.md#required-outbound-network-rules-and-fqdns-for-aks-clusters). #### Add FW Network Rules See [Azure Firewall documentation](overview.md) to learn more about the Azure Fi ### Associate the route table to AKS -To associate the cluster with the firewall, the dedicated subnet for the cluster's subnet must reference the route table created above. Association can be done by issuing a command to the virtual network holding both the cluster and firewall to update the route table of the cluster's subnet. 
+To associate the cluster with the firewall, the dedicated subnet for the cluster's subnet must reference the route table created previously. Association can be done by issuing a command to the virtual network holding both the cluster and firewall to update the route table of the cluster's subnet. ```azurecli # Associate route table with next hop to Firewall to the AKS subnet az network vnet subnet update -g $RG --vnet-name $VNET_NAME --name $AKSSUBNET_NA ### Deploy AKS with outbound type of UDR to the existing network -Now an AKS cluster can be deployed into the existing virtual network. We'll also use [outbound type `userDefinedRouting`](../aks/egress-outboundtype.md), this feature ensures any outbound traffic is forced through the firewall and no other egress paths exist (by default the Load Balancer outbound type could be used). +Now an AKS cluster can be deployed into the existing virtual network. You also use [outbound type `userDefinedRouting`](../aks/egress-outboundtype.md), this feature ensures any outbound traffic is forced through the firewall and no other egress paths exist (by default the Load Balancer outbound type could be used). ![aks-deploy](../aks/media/limit-egress-traffic/aks-udr-fw.png) az aks get-credentials -g $RG -n $AKSNAME ## Restrict ingress traffic using Azure Firewall -You can now start exposing services and deploying applications to this cluster. In this example, we expose a public service, but you may also choose to expose an internal service via [internal load balancer](../aks/internal-lb.md). +You can now start exposing services and deploying applications to this cluster. In this example, we expose a public service, but you can also choose to expose an internal service via [internal load balancer](../aks/internal-lb.md). ![Public Service DNAT](../aks/media/limit-egress-traffic/aks-create-svc.png) -Deploy the Azure voting app application by copying the yaml below to a file named `example.yaml`. +Deploy the Azure voting app application by copying the following yaml to a file named `example.yaml`. ```yaml # voting-storage-deployment.yaml spec: spec: containers: - name: voting-storage- image: mcr.microsoft.com/aks/samples/voting/storage:2.0 + image: mcr.microsoft.com/azuredocs/voting/storage:2.0 args: ["--ignore-db-dir=lost+found"] resources: requests: spec: spec: containers: - name: voting-app- image: mcr.microsoft.com/aks/samples/voting/app:2.0 + image: mcr.microsoft.com/azuredocs/voting/app:2.0 imagePullPolicy: Always ports: - containerPort: 8080 spec: spec: containers: - name: voting-analytics- image: mcr.microsoft.com/aks/samples/voting/analytics:2.0 + image: mcr.microsoft.com/azuredocs/voting/analytics:2.0 imagePullPolicy: Always ports: - containerPort: 8080 You need to specify the internal IP address assigned to the load balancer create kubectl get services ``` -The IP address needed will be listed in the EXTERNAL-IP column, similar to the following. +The IP address needed is listed in the EXTERNAL-IP column, similar to the following. ```bash NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE |
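The AKS walkthrough above describes the three network rules, the application rule for the `AzureKubernetesService` FQDN tag, and the cluster deployment with outbound type `userDefinedRouting`, but the digest truncates the commands themselves. The following is a minimal, hedged sketch of what those steps typically look like, reusing the variables from earlier in the walkthrough (`$RG`, `$LOC`, `$FWNAME`, `$FWPUBLIC_IP`, `$VNET_NAME`, `$AKSNAME`) and assuming the AKS subnet variable is named `$AKSSUBNET_NAME`; adapt ports, priorities, and addresses to your own deployment.

```azurecli
# Hedged sketch of the network rules described above (adapt region, ports, and priorities to your deployment)
az network firewall network-rule create -g $RG -f $FWNAME --collection-name 'aksfwnr' -n 'apitcp' \
    --protocols 'TCP' --source-addresses '*' --destination-addresses "AzureCloud.$LOC" \
    --destination-ports 9000 --action allow --priority 100
az network firewall network-rule create -g $RG -f $FWNAME --collection-name 'aksfwnr' -n 'apiudp' \
    --protocols 'UDP' --source-addresses '*' --destination-addresses "AzureCloud.$LOC" \
    --destination-ports 1194 123
az network firewall network-rule create -g $RG -f $FWNAME --collection-name 'aksfwnr' -n 'time' \
    --protocols 'UDP' --source-addresses '*' --destination-fqdns 'ntp.ubuntu.com' \
    --destination-ports 123

# Application rule for the AzureKubernetesService FQDN tag over ports 80 and 443
az network firewall application-rule create -g $RG -f $FWNAME --collection-name 'aksfwar' -n 'fqdn' \
    --source-addresses '*' --protocols 'http=80' 'https=443' \
    --fqdn-tags "AzureKubernetesService" --action allow --priority 100

# Deploy the cluster with outbound type userDefinedRouting so all egress is forced through the firewall
SUBNETID=$(az network vnet subnet show -g $RG --vnet-name $VNET_NAME --name $AKSSUBNET_NAME --query id -o tsv)
az aks create -g $RG -n $AKSNAME -l $LOC \
    --node-count 3 --network-plugin azure \
    --outbound-type userDefinedRouting \
    --vnet-subnet-id $SUBNETID \
    --api-server-authorized-ip-ranges $FWPUBLIC_IP
```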
healthcare-apis | Dicom Services Conformance Statement V2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicom-services-conformance-statement-v2.md | -The Medical Imaging Server for DICOM® supports a subset of the DICOMweb™ Standard. Support includes: +The Medical Imaging Server for DICOM® supports a subset of the DICOMweb Standard. Support includes: * [Studies Service](#studies-service) * [Store (STOW-RS)](#store-stow-rs) The query API returns one of the following status codes in the response: ### Delete -This transaction isn't part of the official DICOMweb™ Standard. It uses the DELETE method to remove representations of Studies, Series, and Instances from the store. +This transaction isn't part of the official DICOMweb Standard. It uses the DELETE method to remove representations of Studies, Series, and Instances from the store. | Method | Path | Description | | :-- | : | :- | required to be present, required to not be present, required to be empty, or req found [in this table](https://dicom.nema.org/medical/dicom/current/output/html/part04.html#table_CC.2.5-3). > [!NOTE]-> Although the reference table says that SOP Instance UID shouldn't be present, this guidance is specific to the DIMSE protocol and is handled differently in DICOMWeb™. SOP Instance UID should be present in the dataset if not in the URI. > [!NOTE] > All the conditional requirement codes including 1C and 2C are treated as optional. |
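As a concrete illustration of the delete transaction this conformance statement describes, here's a hedged cURL sketch; `{Service URL}`, `v{version}`, and `{token value}` are placeholders in the same style the cURL tutorial later in this digest uses, and the study UID comes from the sample files referenced throughout these articles.

```bash
# Hedged example: delete every instance in a study (the same path pattern extends to /series/{series} and /instances/{instance})
curl --request DELETE "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420" \
  --header "Authorization: Bearer {token value}"
```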
healthcare-apis | Dicom Services Conformance Statement | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicom-services-conformance-statement.md | -The Medical Imaging Server for DICOM® supports a subset of the DICOMweb™ Standard. Support includes: +The Medical Imaging Server for DICOM® supports a subset of the DICOMweb Standard. Support includes: * [Studies Service](#studies-service) * [Store (STOW-RS)](#store-stow-rs) The query API returns one of the following status codes in the response: ### Delete -This transaction isn't part of the official DICOMweb™ Standard. It uses the DELETE method to remove representations of Studies, Series, and Instances from the store. +This transaction isn't part of the official DICOMweb Standard. It uses the DELETE method to remove representations of Studies, Series, and Instances from the store. | Method | Path | Description | | :-- | : | :- | required to be present, required to not be present, required to be empty, or req found [in this table](https://dicom.nema.org/medical/dicom/current/output/html/part04.html#table_CC.2.5-3). > [!NOTE]-> Although the reference table says that SOP Instance UID shouldn't be present, this guidance is specific to the DIMSE protocol and is handled differently in DICOMWeb™. SOP Instance UID should be present in the dataset if not in the URI. > [!NOTE] > All the conditional requirement codes including 1C and 2C are treated as optional. |
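For the retrieve (WADO-RS) transaction covered by this conformance statement, a hedged cURL sketch of fetching a single instance follows; the accept header mirrors the one shown in the cURL tutorial later in this digest, and `{study}`, `{series}`, and `{instance}` stand in for real UIDs.

```bash
# Hedged example: retrieve one instance as application/dicom, accepting any transfer syntax
curl --request GET "{Service URL}/v{version}/studies/{study}/series/{series}/instances/{instance}" \
  --header "Accept: multipart/related; type=\"application/dicom\"; transfer-syntax=*" \
  --header "Authorization: Bearer {token value}" \
  --output instance-response.txt
```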
healthcare-apis | Dicomweb Standard Apis C Sharp | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicomweb-standard-apis-c-sharp.md | Title: Using DICOMweb™Standard APIs with C# - Azure Health Data Services -description: In this tutorial, you'll learn how to use DICOMweb Standard APIs with C#. + Title: Use C# and DICOMweb Standard APIs in Azure Health Data Services +description: Learn how to use C# and DICOMweb Standard APIs to store, retrieve, search, and delete DICOM files in the DICOM service. Previously updated : 05/26/2022 Last updated : 10/18/2023 -# Using DICOMweb™ Standard APIs with C# +# Use C# and DICOMweb Standard APIs -This tutorial uses C# to demonstrate working with the DICOM service. +This article shows how to work with the DICOMweb service using C# and [sample .dcm DICOM® files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). -In this tutorial, we'll use the following [sample .dcm DICOM files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). +Use these sample files: -* blue-circle.dcm -* dicom-metadata.csv -* green-square.dcm -* red-triangle.dcm +- blue-circle.dcm +- dicom-metadata.csv +- green-square.dcm +- red-triangle.dcm -The file name, studyUID, seriesUID, and instanceUID of the sample DICOM files is as follows: +The filename, studyUID, seriesUID, and instanceUID of the sample DICOM files are: | File | StudyUID | SeriesUID | InstanceUID | | | | | | The file name, studyUID, seriesUID, and instanceUID of the sample DICOM files is |blue-circle.dcm|1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420|1.2.826.0.1.3680043.8.498.77033797676425927098669402985243398207|1.2.826.0.1.3680043.8.498.13273713909719068980354078852867170114| > [!NOTE]-> Each of these files represent a single instance and are part of the same study. Also, the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. +> Each of these files represents a single instance and is part of the same study. Also, the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. ## Prerequisites -To use the DICOMweb™ Standard APIs, you must have an instance of the DICOM service deployed. If you haven't already deployed an instance of the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). +To use the DICOMweb Standard APIs, you need an instance of the DICOM service deployed. If you haven't already deployed an instance of the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). -After you've deployed an instance of the DICOM service, retrieve the URL for your App service: +After you deploy an instance of the DICOM service, retrieve the URL for your app service: 1. Sign in to the [Azure portal](https://portal.azure.com). 1. Search **Recent resources** and select your DICOM service instance.-1. Copy the **Service URL** of your DICOM service. Make sure to specify the version as part of the url when making requests. More information can be found in the [API Versioning for DICOM service Documentation](api-versioning-dicom-service.md). +1. Copy the **Service URL** of your DICOM service. Make sure to specify the version as part of the url when making requests. For more information, see [API versioning for the DICOM service](api-versioning-dicom-service.md). 
In your application, install the following NuGet packages: In your application, install the following NuGet packages: ## Create a DicomWebClient -After you've deployed your DICOM service, you'll create a DicomWebClient. Run the following code snippet to create DicomWebClient, which we'll be using for the rest of this tutorial. Ensure you have both NuGet packages installed as mentioned previously. If you haven't already obtained a token, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md). +After you deploy your DICOM service, you create a DicomWebClient. Run the code snippet to create DicomWebClient, which you use for the rest of this tutorial. Ensure you have both NuGet packages installed. If you haven't already obtained a token, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md). ```c# string webServerUrl ="{Your DicomWeb Server URL}" client.HttpClient.DefaultRequestHeaders.Authorization = new System.Net.Http.Head ``` With the DicomWebClient, we can now perform the Store, Retrieve, Search, and Delete operations. -## Store DICOM Instances (STOW) +## Store DICOM instances (STOW) Using the DicomWebClient that we've created, we can now store DICOM files. DicomWebResponse response = await client.StoreAsync(new[] { dicomFile }, "1.2.82 Before moving on to the next part of the tutorial, upload the `green-square.dcm` file using either of the preceding methods. -## Retrieving DICOM instance(s) (WADO) +## Retrieve DICOM instance (WADO) -The following code snippets will demonstrate how to perform each of the retrieve queries using the DicomWebClient created previously. +The code snippets show how to perform each of the retrieve queries using the DicomWebClient created previously. -The following variables will be used throughout the rest of the examples: +The variables are used throughout the rest of the examples: ```c# string studyInstanceUid = "1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420"; //StudyInstanceUID for all 3 examples string sopInstanceUid = "1.2.826.0.1.3680043.8.498.47359123102728459884412887463 ### Retrieve all instances within a study -Retrieve all instances within a study retrieves all instances within a single study. - _Details:_ * GET /studies/{study} _Details:_ DicomWebResponse response = await client.DeleteSeriesAsync(studyInstanceUid, seriesInstanceUid); ``` -This response deletes the green-square instance (it's the only element left in the series) from the server. If it's successful, the response status code will contain no content. +This response deletes the green-square instance (it's the only element left in the series) from the server. If it's successful, the response status code contains no content. ### Delete a specific study DicomWebResponse response = await client.DeleteStudyAsync(studyInstanceUid); This response deletes the blue-circle instance (it's the only element left in the series) from the server. If it's successful, the response status code contains no content. -### Next Steps --For information about the DICOM service, see -->[!div class="nextstepaction"] ->[Overview of the DICOM service](dicom-services-overview.md) |
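The C# walkthrough above assumes you already have a bearer token before constructing `DicomWebClient`. If you follow the linked Azure CLI article, the call is typically a one-liner; the resource URI below is the default DICOM audience and is an assumption you should verify against your own environment.

```azurecli
# Hedged sketch: fetch a bearer token for the DICOM service with the Azure CLI
az account get-access-token --resource=https://dicom.healthcareapis.azure.com --query accessToken --output tsv
```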
healthcare-apis | Dicomweb Standard Apis Curl | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicomweb-standard-apis-curl.md | Title: Using DICOMweb™Standard APIs with cURL - Azure Health Data Services -description: In this tutorial, you'll learn how to use DICOMweb Standard APIs with cURL. + Title: Use cURL and DICOMweb Standard APIs in Azure Health Data Services +description: Use cURL and DICOMweb Standard APIs to store, retrieve, search, and delete DICOM files in the DICOM service. Previously updated : 02/15/2022 Last updated : 10/18/2023 -# Using DICOMWeb™ Standard APIs with cURL +# Use DICOMweb Standard APIs with cURL -This tutorial uses cURL to demonstrate working with the DICOM service. +This article shows how to work with the DICOMweb service using cURL and [sample .dcm DICOM® files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). -In this tutorial, we'll use the following [sample .dcm DICOM files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). +Use these sample files: -* blue-circle.dcm -* dicom-metadata.csv -* green-square.dcm -* red-triangle.dcm +- blue-circle.dcm +- dicom-metadata.csv +- green-square.dcm +- red-triangle.dcm -The file name, studyUID, seriesUID, and instanceUID of the sample DICOM files is as follows: +The filename, studyUID, seriesUID, and instanceUID of the sample DICOM files are: | File | StudyUID | SeriesUID | InstanceUID | | | | | | The file name, studyUID, seriesUID, and instanceUID of the sample DICOM files is |blue-circle.dcm|1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420|1.2.826.0.1.3680043.8.498.77033797676425927098669402985243398207|1.2.826.0.1.3680043.8.498.13273713909719068980354078852867170114| >[!NOTE]->Each of these files represent a single instance and are part of the same study. Also, the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. +>Each of these files represents a single instance and is part of the same study. Also, the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. ## Prerequisites -To use the DICOMWeb™ Standard APIs, you must have an instance of the DICOM service deployed. If you haven't already deployed an instance of the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). +To use the DICOM Standard APIs, you must have an instance of the DICOM service deployed. If you haven't already deployed an instance of the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). Once you've deployed an instance of the DICOM service, retrieve the URL for your App service: Once you've deployed an instance of the DICOM service, retrieve the URL for your 3. Copy the **Service URL** of your DICOM service. 4. If you haven't already obtained a token, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md). -For this code, we'll be accessing an Public Preview Azure service. It's important that you don't upload any private health information (PHI). +For this code, we access a Public Preview Azure service. It's important that you don't upload any private health information (PHI). -## Working with the DICOM service +## Work with the DICOM service -The DICOMweb™ Standard makes heavy use of `multipart/related` HTTP requests combined with DICOM specific accept headers. 
Developers familiar with other REST-based APIs often find working with the DICOMweb™ Standard awkward. However, once you've it up and running, it's easy to use. It just takes a little familiarity to get started. +The DICOMweb Standard makes heavy use of `multipart/related` HTTP requests combined with DICOM specific accept headers. Developers familiar with other REST-based APIs often find working with the DICOMweb Standard awkward. However, after you get it up and running, it's easy to use. It just takes a little familiarity to get started. The cURL commands each contain at least one, and sometimes two, variables that must be replaced. To simplify running the commands, search and replace the following variables by replacing them with your specific values: -* {Service URL} This is the URL to access your DICOM service that you provisioned in the Azure portal, for example, ```https://<workspacename-dicomservicename>.dicom.azurehealthcareapis.com```. Make sure to specify the version as part of the url when making requests. More information can be found in the [API Versioning for DICOM service Documentation](api-versioning-dicom-service.md). +* {Service URL} The service URL is the URL to access your DICOM service that you provisioned in the Azure portal, for example, ```https://<workspacename-dicomservicename>.dicom.azurehealthcareapis.com```. Make sure to specify the version as part of the url when making requests. More information can be found in the [API Versioning for DICOM service Documentation](api-versioning-dicom-service.md). * {path-to-dicoms} - The path to the directory that contains the red-triangle.dcm file, such as `C:/dicom-server/docs/dcms` * Ensure to use forward slashes as separators and end the directory _without_ a trailing forward slash. -## Uploading DICOM Instances (STOW) +## Upload DICOM instances (STOW) ### Store-instances-using-multipart/related This request intends to demonstrate how to upload DICOM files using multipart/related. >[!NOTE]->The DICOM service is more lenient than the DICOM standard. However, the example below demonstrates a POST request that complies tightly to the standard. +>The DICOM service is more lenient than the DICOM standard. However, the example demonstrates a POST request that complies tightly to the standard. _Details:_ _Details:_ * Body: * Content-Type: application/dicom for each file uploaded, separated by a boundary value -Some programming languages and tools behave differently. For instance, some require you to define your own boundary. For those, you may need to use a slightly modified Content-Type header. The following have been used successfully. +Some programming languages and tools behave differently. For instance, some require you to define your own boundary. For those tools, you might need to use a slightly modified Content-Type header. The following have been used successfully. * Content-Type: multipart/related; type="application/dicom"; boundary=ABCD1234 * Content-Type: multipart/related; boundary=ABCD1234 * Content-Type: multipart/related _Details:_ * Body: * Content-Type: application/dicom for each file uploaded, separated by a boundary value -Some programming languages and tools behave differently. For instance, some require you to define your own boundary. For those, you may need to use a slightly modified Content-Type header. The following have been used successfully. +Some programming languages and tools behave differently. For instance, some require you to define your own boundary. 
For those languages and tools, you might need to use a slightly modified Content-Type header. The following have been used successfully. * Content-Type: multipart/related; type="application/dicom"; boundary=ABCD1234 * Content-Type: multipart/related; boundary=ABCD1234 curl --location --request POST "{Service URL}/v{version}/studies" --data-binary "@{path-to-dicoms}/green-square.dcm" ``` -## Retrieving DICOM (WADO) +## Retrieve DICOM (WADO) ### Retrieve all instances within a study curl --request GET "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498.1 --output "suppressWarnings.txt" ``` -This cURL command will show the downloaded bytes in the output file (suppressWarnings.txt), but these aren't direct DICOM files, only a text representation of the multipart/related download. +This cURL command shows the downloaded bytes in the output file (suppressWarnings.txt), but they aren't direct DICOM files, only a text representation of the multipart/related download. ### Retrieve metadata of all instances in study _Details:_ * Accept: application/dicom+json * Authorization: Bearer {token value} -This cURL command will show the downloaded bytes in the output file (suppressWarnings.txt), but these aren't direct DICOM files, only a text representation of the multipart/related download. +This cURL command shows the downloaded bytes in the output file (suppressWarnings.txt), but they aren't direct DICOM files, only a text representation of the multipart/related download. ``` curl --request GET "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420/metadata" _Details:_ * Accept: multipart/related; type="application/dicom"; transfer-syntax=* * Authorization: Bearer {token value} -This cURL command will show the downloaded bytes in the output file (suppressWarnings.txt), but it's not the DICOM file, only a text representation of the multipart/related download. +This cURL command shows the downloaded bytes in the output file (suppressWarnings.txt), but it's not the DICOM file, only a text representation of the multipart/related download. ``` curl --request GET "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420/series/1.2.826.0.1.3680043.8.498.45787841905473114233124723359129632652" curl --request GET "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498.1 ## Query DICOM (QIDO) -In the following examples, we'll search for items using their unique identifiers. You can also search for other attributes, such as `PatientName`. +In the following examples, we search for items using their unique identifiers. You can also search for other attributes, such as `PatientName`. ### Search for studies curl--request DELETE "{Service URL}/v{version}/studies/1.2.826.0.1.3680043.8.498 --header "Authorization: Bearer {token value}" ``` -### Next Steps --For information about the DICOM service, see -->[!div class="nextstepaction"] ->[Overview of the DICOM service](dicom-services-overview.md) |
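The QIDO search section of this cURL tutorial is truncated in the digest. For completeness, here's a hedged sketch of searching for the sample study by UID, following the same placeholder conventions as the other commands in the article.

```bash
# Hedged example: QIDO-RS search for a study by StudyInstanceUID, returned as DICOM JSON
curl --request GET "{Service URL}/v{version}/studies?StudyInstanceUID=1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420" \
  --header "Accept: application/dicom+json" \
  --header "Authorization: Bearer {token value}"
```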
healthcare-apis | Dicomweb Standard Apis Python | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicomweb-standard-apis-python.md | Title: Using DICOMweb Standard APIs with Python - Azure Health Data Services -description: This tutorial describes how to use DICOMweb Standard APIs with Python. + Title: Use Python and DICOMweb Standard APIs in Azure Health Data Services +description: Use Python and DICOMweb Standard APIs to store, retrieve, search, and delete DICOM files in the DICOM service. Last updated 02/15/2022 -# Using DICOMWeb™ Standard APIs with Python +# Use DICOMweb Standard APIs with Python -This tutorial uses Python to demonstrate working with the DICOM Service. +This article shows how to work with the DICOMweb service using Python and [sample .dcm DICOM® files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). -In the tutorial, we'll use the following [sample .dcm DICOM files](https://github.com/microsoft/dicom-server/tree/main/docs/dcms). +Use these sample files: -* blue-circle.dcm -* dicom-metadata.csv -* green-square.dcm -* red-triangle.dcm +- blue-circle.dcm +- dicom-metadata.csv +- green-square.dcm +- red-triangle.dcm - The file name, studyUID, seriesUID, and instanceUID of the sample DICOM files is as follows: +The filename, studyUID, seriesUID, and instanceUID of the sample DICOM files are: | File | StudyUID | SeriesUID | InstanceUID | | | | | | In the tutorial, we'll use the following [sample .dcm DICOM files](https://githu |blue-circle.dcm|1.2.826.0.1.3680043.8.498.13230779778012324449356534479549187420|1.2.826.0.1.3680043.8.498.77033797676425927098669402985243398207|1.2.826.0.1.3680043.8.498.13273713909719068980354078852867170114| > [!NOTE]-> Each of these files represent a single instance and are part of the same study. Also,the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. +> Each of these files represents a single instance and is part of the same study. Also,the green-square and red-triangle are part of the same series, while the blue-circle is in a separate series. ## Prerequisites -To use the DICOMWeb™ Standard APIs, you must have an instance of the DICOM service deployed. If you haven't already deployed the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). +To use the DICOMweb Standard APIs, you must have an instance of the DICOM service deployed. If you haven't already deployed the DICOM service, see [Deploy DICOM service using the Azure portal](deploy-dicom-services-in-azure.md). -After you've deployed an instance of the DICOM service, retrieve the URL for your App service: +After you deploy an instance of the DICOM service, retrieve the URL for your App service: 1. Sign in to the [Azure portal](https://portal.azure.com). 1. Search **Recent resources** and select your DICOM service instance. 1. Copy the **Service URL** of your DICOM service. 2. If you haven't already obtained a token, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md). -For this code, we'll be accessing a Public Preview Azure service. It's important that you don't upload any private health information (PHI). +For this code, you access a Public Preview Azure service. It's important that you don't upload any private health information (PHI). 
-## Working with the DICOM service +## Work with the DICOM service -The DICOMweb™ Standard makes heavy use of `multipart/related` HTTP requests combined with DICOM specific accept headers. Developers familiar with other REST-based APIs often find working with the DICOMweb™ standard awkward. However, once you've it up and running, it's easy to use. It just takes a little familiarity to get started. +The DICOMweb Standard makes heavy use of `multipart/related` HTTP requests combined with DICOM specific accept headers. Developers familiar with other REST-based APIs often find working with the DICOMweb standard awkward. However, after it's up and running, it's easy to use. It just takes a little familiarity to get started. -### Import the appropriate Python libraries +### Import the Python libraries First, import the necessary Python libraries. from urllib3.filepost import encode_multipart_formdata, choose_boundary from azure.identity import DefaultAzureCredential ``` -### Configure user-defined variables to be used throughout +### Configure user-defined variables -Replace all variable values wrapped in { } with your own values. Additionally, validate that any constructed variables are correct. For instance, `base_url` is constructed using the Service URL and then appended with the version of the REST API being used. The Service URL of your DICOM service will be: ```https://<workspacename-dicomservicename>.dicom.azurehealthcareapis.com```. You can use the Azure portal to navigate to the DICOM service and obtain your Service URL. You can also visit the [API Versioning for DICOM service Documentation](api-versioning-dicom-service.md) for more information on versioning. If you're using a custom URL, you'll need to override that value with your own. +Replace all variable values wrapped in { } with your own values. Additionally, validate that any constructed variables are correct. For instance, `base_url` is constructed using the Service URL and then appended with the version of the REST API being used. The Service URL of your DICOM service is: ```https://<workspacename-dicomservicename>.dicom.azurehealthcareapis.com```. You can use the Azure portal to navigate to the DICOM service and obtain your Service URL. You can also visit the [API Versioning for DICOM service Documentation](api-versioning-dicom-service.md) for more information on versioning. If you're using a custom URL, you need to override that value with your own. ```python dicom_service_name = "{server-name}" instance_uid = "1.2.826.0.1.3680043.8.498.47359123102728459884412887463296905395 ### Authenticate to Azure and get a token -`DefaultAzureCredential` allows us to get a variety of ways to get tokens to log into the service. We'll use the `AzureCliCredential` to get a token to log into the service. There are other credential providers such as `ManagedIdentityCredential` and `EnvironmentCredential` that are also possible to use. In order to use the AzureCliCredential, you must have logged into Azure from the CLI prior to running this code. (For more information, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md).) Alternatively, you can simply copy and paste the token retrieved while logging in from the CLI. +`DefaultAzureCredential` allows us to use various ways to get tokens to log into the service. In this example, use the `AzureCliCredential` to get a token to log into the service. 
There are other credential providers such as `ManagedIdentityCredential` and `EnvironmentCredential` that are also possible to use. In order to use the AzureCliCredential, you must have logged into Azure from the CLI prior to running this code. (For more information, see [Get access token for the DICOM service using Azure CLI](dicom-get-access-token-azure-cli.md).) Alternatively, you can copy and paste the token retrieved while logging in from the CLI. > [!NOTE] > `DefaultAzureCredential` returns several different Credential objects. We reference the `AzureCliCredential` as the 5th item in the returned collection. This may not be consistent. If so, uncomment the `print(credential.credential)` line. This will list all the items. Find the correct index, recalling that Python uses zero-based indexing. bearer_token = f'Bearer {token.token}' ### Create supporting methods to support `multipart\related` -The `Requests` libraries (and most Python libraries) don't work with `multipart\related` in a way that supports DICOMweb™. Because of these libraries, we must add a few methods to support working with DICOM files. +The `Requests` libraries (and most Python libraries) don't work with `multipart\related` in a way that supports DICOMweb. Because of these libraries, we must add a few methods to support working with DICOM files. `encode_multipart_related` takes a set of fields (in the DICOM case, these libraries are generally Part 10 dam files) and an optional user-defined boundary. It returns both the full body, along with the content_type, which it can be used. def encode_multipart_related(fields, boundary=None): ### Create a `requests` session -Creates a `requests` session, called `client`, that will be used to communicate with the DICOM service. +Creates a `requests` session, called `client` that is used to communicate with the DICOM service. ```python client = requests.session() ### Verify authentication is configured correctly -Call the changefeed API endpoint, which will return a 200 if authentication is successful. +Call the changefeed API endpoint, which returns a 200 if authentication is successful. ```python headers = {"Authorization":bearer_token} if (response.status_code != 200): print('Error! Likely not authenticated!') ``` -## Uploading DICOM Instances (STOW) +## Upload DICOM instances (STOW) The following examples highlight persisting DICOM files. ### Store instances using `multipart/related` -This example demonstrates how to upload a single DICOM file, and it uses a bit of a Python to pre-load the DICOM file (as bytes) into memory. By passing an array of files to the fields parameter of `encode_multipart_related`, multiple files can be uploaded in a single POST. It's sometimes used to upload several instances inside a complete series or study. +This example demonstrates how to upload a single DICOM file, and it uses a bit of a Python to preload the DICOM file (as bytes) into memory. When an array of files is passed to the fields parameter of `encode_multipart_related`, multiple files can be uploaded in a single POST. It's sometimes used to upload several instances inside a complete series or study. _Details:_ _Details:_ * Body: * Content-Type: application/dicom for each file uploaded, separated by a boundary value -Some programming languages and tools behave differently. For instance, some of them require you to define your own boundary. For those, you may need to use a slightly modified Content-Type header. The following have been used successfully. 
+Some programming languages and tools behave differently. For example, some require you to define your own boundary. For those languages and tools, you might need to use a slightly modified Content-Type header. The following have been used successfully. * Content-Type: multipart/related; type="application/dicom"; boundary=ABCD1234 * Content-Type: multipart/related; boundary=ABCD1234 * Content-Type: multipart/related response = client.post(url, body, headers=headers, verify=False) ### Store instances for a specific study -This example demonstrates how to upload multiple DICOM files into the specified study. It uses a bit of a Python to pre-load the DICOM file (as bytes) into memory. +This example demonstrates how to upload multiple DICOM files into the specified study. It uses a bit of a Python to preload the DICOM file (as bytes) into memory. -By passing an array of files to the fields parameter of `encode_multipart_related`, multiple files can be uploaded in a single POST. It's sometimes used to upload a complete series or study. +When an array of files is passed to the fields parameter of `encode_multipart_related`, multiple files can be uploaded in a single POST. It's sometimes used to upload a complete series or study. _Details:_ * Path: ../studies/{study} headers = {'Accept':'application/dicom+json', "Content-Type":content_type, "Auth url = f'{base_url}/studies' response = client.post(url, body, headers=headers, verify=False) ```-### Store single instance (non-standard) +### Store single instance (nonstandard) -The following code example demonstrates how to upload a single DICOM file. It's a non-standard API endpoint that simplifies uploading a single file as binary bytes sent in the body of a request +The following code example demonstrates how to upload a single DICOM file. It's a nonstandard API endpoint that simplifies uploading a single file as binary bytes sent in the body of a request _Details:_ * Path: ../studies response = client.post(url, body, headers=headers, verify=False) response # response should be a 409 Conflict if the file was already uploaded in the above request ``` -## Retrieve DICOM Instances (WADO) +## Retrieve DICOM instances (WADO) The following examples highlight retrieving DICOM instances. response = client.get(url, headers=headers) #, verify=False) ### Use the retrieved instances -The instances are retrieved as binary bytes. You can loop through the returned items and convert the bytes into a file-like that can be read by `pydicom`. +The instances are retrieved as binary bytes. You can loop through the returned items and convert the bytes into a file that `pydicom` can read. ```python _Details:_ * Headers: * Authorization: Bearer $token -This code example deletes the green-square instance (it's the only element left in the series) from the server. If it's successful, the response status code won't delete content. +This code example deletes the green-square instance (it's the only element left in the series) from the server. If it's successful, the response status code doesn't delete content. ```python headers = {"Authorization":bearer_token} url = f'{base_url}/studies/{study_uid}' response = client.delete(url, headers=headers) ``` -### Next Steps --For information about the DICOM service, see -->[!div class="nextstepaction"] ->[Overview of the DICOM service](dicom-services-overview.md) |
healthcare-apis | Dicomweb Standard Apis With Dicom Services | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/dicomweb-standard-apis-with-dicom-services.md | Title: Using DICOMweb - Standard APIs with Azure Health Data Services DICOM service + Title: Use DICOMweb Standard APIs with the DICOM service in Azure Health Data Services description: This tutorial describes how to use DICOMweb Standard APIs with the DICOM service. In addition to the subset of DICOMweb APIs, the DICOM service supports two custo The DICOM service provides a web-based interface that follows REST (representational state transfer) principles. The REST API allows different applications or systems to communicate with each other using standard methods like GET, POST, PUT, and DELETE. Use any programming language that supports HTTP requests and responses to interact with the DICOM service. ----To see language-specific examples, refer to the examples below. You can view Postman collection examples in several languages including: +For language-specific examples, refer to the examples. You can view Postman collection examples in several languages including: * Go * Java To see language-specific examples, refer to the examples below. You can view Pos ### C# -Refer to the [Using DICOMweb™ Standard APIs with C#](dicomweb-standard-apis-c-sharp.md) tutorial to learn how to use C# with DICOM service. +Refer to [Use DICOMweb Standard APIs with C#](dicomweb-standard-apis-c-sharp.md) to learn how to use C# with DICOM service. ### cURL -cURL is a common command-line tool for calling web endpoints that is available for nearly any operating system. [Download cURL](https://curl.haxx.se/download.html) to get started. +cURL is a common command-line tool for calling web endpoints and is available for most operating systems. To get started, [download cURL](https://curl.haxx.se/download.html). To learn how to use cURL with DICOM service, see [Using DICOMWeb™ Standard APIs with cURL](dicomweb-standard-apis-curl.md) tutorial. Refer to the [Using DICOMWeb™ Standard APIs with Python](dicomweb-standard-api ### Postman -Postman is an excellent tool for designing, building, and testing REST APIs. [Download Postman](https://www.postman.com/downloads/) to get started. You can learn how to effectively use Postman at the [Postman learning site](https://learning.postman.com/). --One important caveat with Postman and DICOMweb™ Standard is that Postman can only support uploading DICOM files using the single part payload defined in the DICOM standard. This reason is because Postman can't support custom separators in a multipart/related POST request. For more information, see [Multipart POST not working for me # 576](https://github.com/postmanlabs/postman-app-support/issues/576). Thus, all examples in the Postman collection for uploading DICOM documents using a multipart request are prefixed with [won't work - see description]. The examples for uploading using a single part request are included in the collection and are prefixed with "Store-Single-Instance". +Postman is an excellent tool for designing, building, and testing REST APIs. [Download Postman](https://www.postman.com/downloads/) to get started. Learn how to use Postman at the [Postman learning site](https://learning.postman.com/). -To use the Postman collection, you'll need to download the collection locally and import the collection through Postman. 
To access this collection, see [Postman Collection Examples](https://github.com/microsoft/dicom-server/blob/main/docs/resources/Conformance-as-Postman.postman_collection.json). +One important caveat with Postman and the DICOMweb Standard is that Postman only supports uploading DICOM files using the single part payload defined in the DICOM standard. This reason is because Postman can't support custom separators in a multipart/related POST request. For more information, see [Multipart POST not working for me # 576](https://github.com/postmanlabs/postman-app-support/issues/576). All examples in the Postman collection for uploading DICOM documents using a multipart request are prefixed with **[won't work - see description]**. The examples for uploading using a single-part request are included in the collection and are prefixed with **Store-Single-Instance**. -## Summary +To use the Postman collection, you need to download the collection locally and import the collection through Postman. To access this collection, see [Postman Collection Examples](https://github.com/microsoft/dicom-server/blob/main/docs/resources/Conformance-as-Postman.postman_collection.json). -This tutorial provided an overview of the APIs supported by DICOM service. Get started using these APIs with the following tools: +### Next steps +Check out these examples of how to use the APIs: - [Using DICOMwebΓäó Standard APIs with C#](dicomweb-standard-apis-c-sharp.md) - [Using DICOMWebΓäó Standard APIs with cURL](dicomweb-standard-apis-curl.md) - [Using DICOMWebΓäó Standard APIs with Python](dicomweb-standard-apis-python.md)-- [Use DICOMWebΓäó Standard APIs with Postman Example Collection](https://github.com/microsoft/dicom-server/blob/main/docs/resources/Conformance-as-Postman.postman_collection.json)--### Next steps +- [Use DICOMWebΓäó Standard APIs with the Postman Example Collection](https://github.com/microsoft/dicom-server/blob/main/docs/resources/Conformance-as-Postman.postman_collection.json) -To learn more about our support of DICOM Web Standard APIs, see the [DICOM Conformance Statement](dicom-services-conformance-statement-v2.md) reference document. +Learn more about Microsoft support for DICOM Web Standard APIs in the [DICOM Conformance Statement](dicom-services-conformance-statement-v2.md). -For language-agnostic information on working with the service, see [DICOM Services Conformance Statement](dicom-services-conformance-statement-v2.md). |
healthcare-apis | Import Files | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/dicom/import-files.md | + + Title: Import DICOM files into the DICOM service +description: Learn how to import DICOM files using bulk import in Azure Health Data Services ++++ Last updated : 10/05/2023++++# Import DICOM files (Preview) ++Bulk import is a quick way to add data to the DICOM® service. Importing DICOM files with the bulk import capability enables: ++- **Back-up and migration**. For example, your organization might have many DICOM instances stored in local or on-premises systems that you want to back up or migrate to the cloud for better security, scalability, and availability. Rather than uploading the data one by one, use bulk import to transfer the data faster and more efficiently. ++- **Machine learning development**. For example, your organization might have a large dataset of DICOM instances that you want to use for training machine learning models. With bulk import, you can upload the data to the DICOM service and then access it from [Microsoft Fabric](get-started-with-analytics-dicom.md), [Azure Machine Learning](../../machine-learning/overview-what-is-azure-machine-learning.md), or other tools. ++## Prerequisites ++- **Deploy an instance of the DICOM service**. For more information, see [Deploy the DICOM service](deploy-dicom-services-in-azure.md). + +- **Deploy the events capability for the DICOM service**. For more information, see [Deploy events using the Azure portal](../events/events-deploy-portal.md). ++## Enable a system-assigned managed identity ++Before you perform a bulk import, you need to enable a system-assigned managed identity. ++1. In the Azure portal, go to the DICOM instance and then select **Identity** from the left pane. ++2. On the **Identity** page, select the **System assigned** tab, and then set the **Status** field to **On**. Choose **Save**. +++## Enable bulk import ++You need to enable bulk import before you import data. ++#### Use the Azure portal ++1. In the Azure portal, go to the DICOM service and then select **Bulk Import** from the left pane. ++2. On the **Bulk Import** page, in the **Bulk Import** field, select **Enabled**. Choose **Save**. +++#### Use an Azure Resource Manager (ARM) template ++When you use an ARM template, enable bulk import with the property named `bulkImportConfiguration`. ++Here's an example of how to configure bulk import in an ARM template: ++``` json +{ +    "type": "Microsoft.HealthcareApis/workspaces/dicomservices", +    "apiVersion": "2023-02-01-preview", +    "name": "[parameters('dicomservicename')]", +    "location": "[parameters('regionname')]", +    "identity": { +        "type": "SystemAssigned" +    }, +    "properties": { +        "bulkImportConfiguration": { +            "enabled": true +        } +    } +} +``` ++## Import data ++After you enable bulk import, a resource group is provisioned in your Azure subscription. The name of the resource group begins with the prefix `AHDS_`, followed by the workspace and DICOM service name. For example, for the DICOM service named `mydicom` in the workspace `contoso`, the resource group would be named `AHDS_contoso-mydicom`. ++Within the new resource group, two resources are created: ++- A randomly named storage account that has two precreated containers (`import-container` and `error-container`), and two queues (`import-queue` and `error-queue`). 
+- An [Azure Event Grid system topic](/azure/event-grid/create-view-manage-system-topics) named `dicom-bulk-import`. ++DICOM images are added to the DICOM service by copying them into the `import-container`. Bulk import monitors this container for new images and adds them to the DICOM service. If there are errors that prevent a file from being added successfully, the errors are copied to the `error-container` and an error message is written to the `error-queue`. ++#### Grant write access to the import container ++The user or account that adds DICOM images to the import container needs write access to the container using the `Data Owner` role. For more information, see [Assign Azure roles using the Azure portal](../../role-based-access-control/role-assignments-portal.md). ++#### Upload DICOM images to the import container ++Data is uploaded to Azure storage containers in many ways: ++- [Upload a blob with Azure Storage Explorer](../../storage/blobs/quickstart-storage-explorer.md#upload-blobs-to-the-container) +- [Upload a blob with AzCopy](../../storage/common/storage-use-azcopy-blobs-upload.md) +- [Upload a blob with Azure CLI](../../storage/blobs/storage-quickstart-blobs-cli.md#upload-a-blob) |
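Of the upload options listed above, the Azure CLI route can be scripted in one command. A hedged sketch follows; `<storage-account-name>` is the randomly named account that bulk import provisions (look it up in the `AHDS_` resource group first), `./dcms` is an assumed local folder of `.dcm` files, and `--auth-mode login` relies on the Data Owner role assignment described earlier.

```azurecli
# Hedged sketch: upload a local folder of DICOM files into the bulk-import container
az storage blob upload-batch \
    --account-name <storage-account-name> \
    --destination import-container \
    --source ./dcms \
    --auth-mode login
```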
healthcare-apis | Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/healthcare-apis/iot/overview.md | The MedTech service enables you to easily modify and extend the capabilities of ### Integration -The MedTech service may also be integrated for ingesting device data from these wearables using our [open-source projects](git-projects.md): +The MedTech service can also be integrated for ingesting device data from these wearables using our [open-source projects](git-projects.md): * Fitbit® The MedTech service may also be integrated for ingesting device data from these The following Microsoft solutions can use MedTech service for extra functionality: -* [**Microsoft Azure IoT Hub**](../../iot-hub/iot-concepts-and-iot-hub.md) - enhances workflow and ease of use. +* [**Azure IoT Hub**](../../iot-hub/iot-concepts-and-iot-hub.md) - enhances workflow and ease of use. * [**Azure Machine Learning Service**](concepts-machine-learning.md) - helps build, deploy, and manage models, integrate tools, and increase open-source operability. |
machine-learning | Azure Machine Learning Release Notes Cli V2 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/azure-machine-learning-release-notes-cli-v2.md | In this article, learn about Azure Machine Learning CLI (v2) releases. __RSS feed__: Get notified when this page is updated by copying and pasting the following URL into your feed reader: `https://learn.microsoft.com/api/search/rss?search=%22Azure+machine+learning+release+notes-v2%22&locale=en-us` +## 2023-10-18 ++### Azure Machine Learning CLI (v2) v2.21.1 ++ - pydash dependency version was upgraded to >=6.0.0 to patch [security vulnerability in versions below 6.0.0](https://github.com/advisories/GHSA-8mjr-6c96-39w8) ++ ## 2023-05-09 ### Azure Machine Learning CLI (v2) v2.16.0 |
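To pick up the 2.21.1 release (and its patched pydash pin) on a machine that already has the `ml` extension installed, updating the extension is usually enough; a hedged sketch:

```azurecli
# Hedged example: update the Azure ML CLI extension and confirm the installed version
az extension update --name ml
az extension show --name ml --query version --output tsv
```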
machine-learning | Concept Package Models | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/concept-package-models.md | You can provide model packages as inputs to online endpoints. Use of model packa :::image type="content" source="media/model-packaging/model-package-targets.png" alt-text="Screenshot that shows all the possible targets for a model package."::: -### Specify model package before deployment +### Package before deployment The simplest way to deploy using a model package is by specifying to Azure Machine Learning to deploy a model package, before executing the deployment. When using the Azure CLI, Azure Machine Learning SDK, or Azure Machine Learning studio to create a deployment in an online endpoint, you can specify the use of model packaging as follows: Azure Machine Learning packages the model first and then executes the deployment > [!NOTE] > When using packages, if you indicate a base environment with `conda` or `pip` dependencies, you don't need to include the dependencies of the inference server (`azureml-inference-server-http`). Rather, these dependencies are automatically added for you. +### Deploy a packaged model ++You can deploy a model that has been packaged directly to an Online Endpoint. This practice ensures reproducibility of results and it's a best practice. See [Package and deploy models to Online Endpoints](how-to-package-models-moe.md). ++If you want to deploy the package outside of Azure Machine Learning, see [Package and deploy models outside Azure Machine Learning](how-to-package-models-app-service.md). ## Next step > [!div class="nextstepaction"]-> [Create your first model package](how-to-package-models.md) +> [Create your first model package](how-to-package-models.md) |
machine-learning | How To Deploy Online Endpoints | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/how-to-deploy-online-endpoints.md | Before following the steps in this article, make sure you have the following pre For managed online endpoints, Azure Machine Learning reserves 20% of your compute resources for performing upgrades on some VM SKUs. If you request a given number of instances in a deployment, you must have a quota for `ceil(1.2 * number of instances requested for deployment) * number of cores for the VM SKU` available to avoid getting an error. For example, if you request 10 instances of a [Standard_DS3_v2](/azure/virtual-machines/dv2-dsv2-series) VM (that comes with 4 cores) in a deployment, you should have a quota for 48 cores (`12 instances * 4 cores`) available. To view your usage and request quota increases, see [View your usage and quotas in the Azure portal](how-to-manage-quotas.md#view-your-usage-and-quotas-in-the-azure-portal). +There are certain VM SKUs that are exempted from extra quota reservation. To view the full list, see [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md). Azure Machine Learning provides a [shared quota](how-to-manage-quotas.md#azure-machine-learning-shared-quota) pool from which all users can access quota to perform testing for a limited time. When you use the studio to deploy Llama models (from the model catalog) to a managed online endpoint, Azure Machine Learning allows you to access this shared quota for a short time. |
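As a quick sanity check of the quota rule above, the reservation can be computed in a couple of lines of shell; the numbers below are the article's own Standard_DS3_v2 example and are illustrative only.

```bash
# ceil(1.2 * requested instances) * cores per VM: 10 instances of a 4-core SKU -> 48 cores of quota
instances=10
cores_per_instance=4
reserved=$(( (instances * 12 + 9) / 10 * cores_per_instance ))
echo "Quota needed: $reserved cores"   # prints: Quota needed: 48 cores
```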
machine-learning | How To Package Models | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/how-to-package-models.md | The following code shows how to configure `copy` in a model package: __package-external.yml__ # [Python](#tab/sdk) |
machine-learning | How To Train Distributed Gpu | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/how-to-train-distributed-gpu.md | Azure Machine Learning will set the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `DeepSpeed` can be enabled using either Pytorch distribution or MPI for running distributed training. Azure Machine Learning supports the `DeepSpeed` launcher to launch distributed training as well as autotuning to get optimal `ds` configuration. -You can use a [curated environment](resource-curated-environments.md#azure-container-for-pytorch-acpt) for an out of the box environment with the latest state of art technologies including `DeepSpeed`, `ORT`, `MSSCCL`, and `Pytorch` for your DeepSpeed training jobs. +You can use a [curated environment](resource-curated-environments.md) for an out of the box environment with the latest state of art technologies including `DeepSpeed`, `ORT`, `MSSCCL`, and `Pytorch` for your DeepSpeed training jobs. ### DeepSpeed example |
machine-learning | Reference Checkpoint Performance For Large Models | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/reference-checkpoint-performance-for-large-models.md | With Nebula you can: * An Azure subscription and an Azure Machine Learning workspace. See [Create workspace resources](./quickstart-create-resources.md) for more information about workspace resource creation * An Azure Machine Learning compute target. See [Manage training & deploy computes](./how-to-create-attach-compute-studio.md) to learn more about compute target creation * A training script that uses **PyTorch**.-* ACPT-curated (Azure Container for PyTorch) environment. See [Curated environments](resource-curated-environments.md#azure-container-for-pytorch-acpt) to obtain the ACPT image. Learn how to [use the curated environment](./how-to-use-environments.md) +* ACPT-curated (Azure Container for PyTorch) environment. See [Curated environments](resource-curated-environments.md) to obtain the ACPT image. Learn how to [use the curated environment](./how-to-use-environments.md) ## How to Use Nebula The steps to quick start Nebula include: ### Using ACPT environment -[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a preinstalled, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md#azure-container-for-pytorch-acpt) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image. +[Azure Container for PyTorch (ACPT)](how-to-manage-environments-v2.md?tabs=cli#curated-environments), a curated environment for PyTorch model training, includes Nebula as a preinstalled, dependent Python package. See [Azure Container for PyTorch (ACPT)](resource-curated-environments.md) to view the curated environment, and [Enabling Deep Learning with Azure Container for PyTorch in Azure Machine Learning](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/enabling-deep-learning-with-azure-container-for-pytorch-in-azure/ba-p/3650489) to learn more about the ACPT image. ### Initializing Nebula If you used ``nebula_checkpoint_callback`` in your ``Trainer()``, your original ## Next steps * [Track ML experiments and models with MLflow](how-to-use-mlflow-cli-runs.md)-* [Log and view metrics](how-to-log-view-metrics.md) +* [Log and view metrics](how-to-log-view-metrics.md) |
machine-learning | Reference Managed Online Endpoints Vm Sku List | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/reference-managed-online-endpoints-vm-sku-list.md | This table shows the VM SKUs that are supported for Azure Machine Learning manag | X-Large | Standard_D32a_v4 </br> Standard_D32as_v4 </br> Standard_D48a_v4 </br> Standard_D48as_v4 </br> Standard_D64a_v4 </br> Standard_D64as_v4 </br> Standard_D96a_v4 </br> Standard_D96as_v4 | Standard_F32s_v2 <br/> Standard_F48s_v2 <br/> Standard_F64s_v2 <br/> Standard_F72s_v2 <br/> Standard_FX24mds <br/> Standard_FX36mds <br/> Standard_FX48mds | Standard_E32s_v3 <br/> Standard_E48s_v3 <br/> Standard_E64s_v3 | Standard_NC48ads_A100_v4 </br> Standard_NC96ads_A100_v4 </br> Standard_ND96asr_v4 </br> Standard_ND96amsr_A100_v4 </br> Standard_ND40rs_v2 | > [!CAUTION]-> `Standard_DS1_v2` and `Standard_F2s_v2` may be too small for bigger models and may lead to container termination due to insufficient memory, not enough space on the disk, or probe failure as it takes too long to initiate the container. If you face [OutOfQuota errors](how-to-troubleshoot-online-endpoints.md?tabs=cli#error-outofquota) or [ReourceNotReady errors](how-to-troubleshoot-online-endpoints.md?tabs=cli#error-resourcenotready), try bigger VM SKUs. If you want to reduce the cost of deploying multiple models with managed online endpoint, see [the example for multi models](how-to-deploy-online-endpoints.md#use-more-than-one-model-in-a-deployment). +> `Standard_DS1_v2` and `Standard_F2s_v2` may be too small for bigger models and may lead to container termination due to insufficient memory, not enough space on the disk, or probe failure as it takes too long to initiate the container. If you face [OutOfQuota errors](how-to-troubleshoot-online-endpoints.md?tabs=cli#error-outofquota) or [ReourceNotReady errors](how-to-troubleshoot-online-endpoints.md?tabs=cli#error-resourcenotready), try bigger VM SKUs. If you want to reduce the cost of deploying multiple models with managed online endpoint, see [the example for multi models](how-to-deploy-online-endpoints.md#use-more-than-one-model-in-a-deployment). ++> [!NOTE] +> We recommend having more than 3 instances for deployments in production scenarios. In addition, Azure Machine Learning reserves 20% of your compute resources for performing upgrades on some VM SKUs as described in [Virtual machine quota allocation for deployment](how-to-deploy-online-endpoints.md#virtual-machine-quota-allocation-for-deployment). VM SKUs that are exempted from this extra quota reservation are listed below: +> - Standard_NC24ads_A100_v4 +> - Standard_NC48ads_A100_v4 +> - Standard_NC96ads_A100_v4 +> - Standard_ND96asr_v4 +> - Standard_ND96amsr_A100_v4 +> - Standard_ND40rs_v2 |
machine-learning | Resource Curated Environments | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/machine-learning/resource-curated-environments.md | This article lists the curated environments with latest framework versions in Az * Only have required dependencies and access right in the image/container. >[!IMPORTANT]-> To view more information about curated environment packages and versions, visit the Environments tab in the Azure Machine Learning [studio](./how-to-manage-environments-in-studio.md). +> For more information about curated environment packages and versions, see [How to manage environments in the Azure Machine Learning studio](./how-to-manage-environments-in-studio.md). -## Curated environments --### Azure Container for PyTorch (ACPT) --**Description**: Recommended environment for Deep Learning with PyTorch on Azure. It contains the Azure Machine Learning SDK with the latest compatible versions of Ubuntu, Python, PyTorch, CUDA\RocM, and NebulaML. It also provides optimizers like ORT Training, +DeepSpeed+MSCCL+ORT MoE, and checkpointing using NebulaML and more. --To learn more, see [Azure Container for PyTorch (ACPT)](resource-azure-container-for-pytorch.md). --> [!NOTE] -> Currently, due to underlying cuda and cluster incompatibilities, on [NC series](../virtual-machines/nc-series.md) only acpt-pytorch-1.11-cuda11.3 with cuda 11.3 and torch 1.11 can be used. --### PyTorch --**Name**: AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu -**Description**: An environment for deep learning with PyTorch containing the Azure Machine Learning Python SDK and other Python packages. -* GPU: Cuda11 -* OS: Ubuntu18.04 -* PyTorch: 1.10 --Other available PyTorch environments: -* AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu -* AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu -* AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu ---### LightGBM --**Name**: AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu -**Description**: An environment for machine learning with Scikit-learn, LightGBM, XGBoost, Dask containing the Azure Machine Learning Python SDK and other packages. -* OS: Ubuntu18.04 -* Dask: 2021.6 -* LightGBM: 3.2 -* Scikit-learn: 0.24 -* XGBoost: 1.4 ---### Sklearn -**Name**: AzureML-sklearn-1.0-ubuntu20.04-py38-cpu -**Description**: An environment for tasks such as regression, clustering, and classification with Scikit-learn. Contains the Azure Machine Learning Python SDK and other Python packages. -* OS: Ubuntu20.04 -* Scikit-learn: 1.0 --Other available Sklearn environments: -* AzureML-sklearn-0.24-ubuntu18.04-py37-cpu ---### TensorFlow --**Name**: AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu -**Description**: An environment for deep learning with TensorFlow containing the Azure Machine Learning Python SDK and other Python packages. -* GPU: Cuda11 -* Horovod: 2.4.1 -* OS: Ubuntu18.04 -* TensorFlow: 2.4 ---## Automated ML (AutoML) --Azure Machine Learning pipeline training workflows that use AutoML automatically selects a curated environment based on the compute type and whether DNN is enabled. AutoML provides the following curated environments: --| Name | Compute Type | DNN enabled | -| | | | -|AzureML-AutoML | CPU | No | -|AzureML-AutoML-DNN | CPU | Yes | -| AzureML-AutoML-GPU | GPU | No | -| AzureML-AutoML-DNN-GPU | GPU | Yes | --For more information on AutoML and Azure Machine Learning pipelines, see [use automated ML in an Azure Machine Learning pipeline in Python SDK v1](v1/how-to-use-automlstep-in-pipelines.md). 
--## Support -Version updates for supported environments, including the base images they reference, are released every quarter to address vulnerabilities. Based on usage, some environments may be deprecated (hidden from the product but usable) to support more common machine learning scenarios. |
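If you prefer the CLI to the studio Environments tab, here's a hedged sketch for enumerating curated environments, assuming the Azure ML CLI v2 is installed and that curated images are surfaced through the shared `azureml` registry.

```bash
# Sketch: list curated environments from the shared "azureml" registry and filter for ACPT images.
az ml environment list --registry-name azureml --query "[].name" --output tsv | grep -i acpt
```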
mariadb | Whats Happening To Mariadb | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/mariadb/whats-happening-to-mariadb.md | -As part of this retirement, there is no extended support for creating new MariaDB server instances from the Azure portal beginning **December 19, 2023**, if you still need to create MariaDB instances to meet business continuity needs, you can use [Azure CLI](/azure/mysql/single-server/quickstart-create-mysql-server-database-using-azure-cli) until **March 19, 2024**. +As part of this retirement, there is no extended support for creating new MariaDB server instances from the Azure portal beginning **January 19, 2024**. If you still need to create MariaDB instances to meet business continuity needs, you can use [Azure CLI](/azure/mysql/single-server/quickstart-create-mysql-server-database-using-azure-cli) until **March 19, 2024**. We're investing in our flagship offering of Azure Database for MySQL - Flexible Server, which is better suited for mission-critical workloads. Azure Database for MySQL - Flexible Server has better features, performance, an improved architecture, and more controls to manage costs across all service tiers compared to Azure Database for MariaDB. We encourage you to migrate to Azure Database for MySQL - Flexible Server before retirement to experience the new capabilities of Azure Database for MySQL - Flexible Server. A. Unfortunately, we don't plan to support Azure Database for MariaDB beyond the **Q. After the Azure Database for MariaDB retirement announcement, what if I still need to create a new MariaDB server to meet my business needs?** -A. As part of this retirement, we'll no longer support creating new MariaDB instances from the Azure portal beginning **December 19, 2023**. Suppose you still need to create MariaDB instances to meet business continuity needs. In that case, you can use [Azure CLI](/azure/mysql/single-server/quickstart-create-mysql-server-database-using-azure-cli) until **March 19, 2024**. +A. As part of this retirement, we'll no longer support creating new MariaDB instances from the Azure portal beginning **January 19, 2024**. Suppose you still need to create MariaDB instances to meet business continuity needs. In that case, you can use [Azure CLI](/azure/mysql/single-server/quickstart-create-mysql-server-database-using-azure-cli) until **March 19, 2024**. ++**Q. Will I be able to restore instances of Azure Database for MariaDB after March 19, 2024?** ++A. Yes, you will be able to restore your MariaDB instances from your existing servers until September 19, 2025. **Q. How does the Azure Database for MySQL flexible server's 99.99% availability SLA differ from MariaDB?** |
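For reference, a minimal sketch of the CLI path mentioned above for creating a MariaDB server before the March 19, 2024 cutoff; all names, the region, the SKU, and the engine version here are placeholders.

```bash
# Sketch only: names, region, SKU, version, and credentials are placeholders; run before March 19, 2024.
az mariadb server create \
  --resource-group my-rg \
  --name my-mariadb-server \
  --location eastus \
  --admin-user myadmin \
  --admin-password '<secure-password>' \
  --sku-name GP_Gen5_2 \
  --version 10.3
```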
migrate | Concepts Vmware Agentless Migration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/migrate/concepts-vmware-agentless-migration.md | Once the migration is successful and the VM boots up in Azure, ensure that you s ## Replication cycles +> [!Note] +> Ensure you check for snapshots present from earlier replication attempts or from other third party apps. Change tracking cannot be enabled on the VM if snapshots are already present for the VM. Delete the existing snapshots or enable change block tracking on the VM. + Replication cycles refer to the periodic process of transferring data from on-premises environment to Azure managed disks. A full replication cycle consists of the following steps: 1. Create VMware snapshot for each disk associated with the VM Source disk is divided into sectors of 512 bytes. Every sector in the source dis ## Security -The Azure Migrate appliance compresses data and encrypts before uploading. Data is transmitted over a secure communication channel over https and uses TLS 1.2 or later. Additionally, Azure Storage automatically encrypts your data when it is persisted it to the cloud (encryption-at-rest). +The Azure Migrate appliance compresses data and encrypts before uploading. Data is transmitted over a secure communication channel over https and uses TLS 1.2 or later. Additionally, Azure Storage automatically encrypts your data when it's persisted it to the cloud (encryption-at-rest). ## Replication status When a VM undergoes replication (data copy), there are a few possible states:-- **Initial replication queued**: The VM is queued for replication (or migration) as there may be other VMs that are consuming the on-premises resources (during replication or migration). Once the resources are free, this VM will be processed.+- **Initial replication queued**: The VM is queued for replication (or migration) as there might be other VMs that are consuming the on-premises resources (during replication or migration). Once the resources are free, this VM will be processed. - **Initial replication in progress**: The VM is being scheduled for initial replication. - **Initial replication**: The VM is undergoing initial replication. When the VM is undergoing initial replication, you can't proceed with test migration and migration. You can only stop replication at this stage. - **Initial replication (x%)**: The initial replication is active and has progressed by x%. -- **Delta sync**: The VM may be undergoing a delta replication cycle that replicates the remaining data churn since the last replication cycle.+- **Delta sync**: The VM might be undergoing a delta replication cycle that replicates the remaining data churn since the last replication cycle. - **Pause in progress**: The VM is undergoing an active delta replication cycle and will be paused in some time. - **Paused**: The replication cycles have been paused. The replication cycles can be resumed by performing a resume replication operation. - **Resume queued**: The VM is queued for resuming replication as there are other VMs that are currently consuming the on-premises resources. |
migrate | Troubleshoot Changed Block Tracking Replication | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/migrate/troubleshoot-changed-block-tracking-replication.md | This article describes some common issues and specific errors that you might enc When you replicate a VMware virtual machine using the agentless replication method, data from the virtual machine's disks (vmdks) are replicated to replica managed disks in your Azure subscription. When replication starts for a VM, an initial replication cycle occurs, in which full copies of the disks are replicated. After the initial replication completes, incremental replication cycles are scheduled periodically to transfer any changes that have occurred since the previous replication cycle. -You may occasionally see replication cycles failing for a VM. These failures can happen due to reasons ranging from issues in on-premises network configuration to issues at the Azure Migrate Cloud Service backend. In this article, we will: +You might occasionally see replication cycles failing for a VM. These failures can happen due to reasons ranging from issues in on-premises network configuration to issues at the Azure Migrate Cloud Service backend. In this article, we will: - Show you how you can monitor replication status and resolve errors. - List some of the commonly occurring replication errors and suggest steps to remediate them. You may occasionally see replication cycles failing for a VM. These failures can Use the following steps to monitor the replication status for your virtual machines: - 1. Go to the **Servers, databases and web apps** page in Azure Migrate on the Azure portal. - ![Screenshot of the Get started screen of Azure Migrate.](./media/troubleshoot-changed-block-tracking-replication/Overview.png) - 1. In the **Migration and modernization** tile, under **Replications**, select the number next to **Azure VM** . - ![Screenshot of the Migration and modernization screen.](./media/troubleshoot-changed-block-tracking-replication/replicating-servers.png) - 1. You'll see a list of replicating servers along with additional information such as status, health, last sync time, etc. The **Replication health** column indicates the current replication health of the VM. A *Critical* or *Warning* value typically indicates that the previous replication cycle for the VM failed. To get more details, right-click on the VM, and select **Health error Details**. The **Error Details** page contains information on the error and additional details on how to troubleshoot. - ![Screenshot of Health error details option in the Replication machines screen.](./media/troubleshoot-changed-block-tracking-replication/health-error-details.png) - 1. Select **Recent Events** to see the previous replication cycle failures for the VM. In the events page, look for the most recent event of type *Replication cycle failed* or *Replication cycle failed* for disk" for the VM. - ![Image 4](./media/troubleshoot-changed-block-tracking-replication/image3.png) - 1. Select the event to understand the possible causes of the error and recommended remediation steps. Use the information provided to troubleshoot and remediate the error. - ![Screenshot of error message in the Error details screen.](./media/troubleshoot-changed-block-tracking-replication/error-details.png) +1. Go to the **Servers, databases and web apps** page in Azure Migrate on the Azure portal. 
+ + :::image type="content" source="./media/troubleshoot-changed-block-tracking-replication/Overview.png" alt-text="Screenshot of the Get started screen of Azure Migrate."::: ++1. In the **Migration and modernization** tile, under **Replications**, select the number next to **Azure VM**. ++ :::image type="content" source="./media/troubleshoot-changed-block-tracking-replication/replicating-servers.png" alt-text="Screenshot of the Migration and Modernization screen."::: ++1. You'll see a list of replicating servers along with additional information such as status, health, last sync time, etc. The **Replication health** column indicates the current replication health of the VM. A *Critical* or *Warning* value typically indicates that the previous replication cycle for the VM failed. To get more details, right-click on the VM, and select **Health error Details**. The **Error Details** page contains information on the error and additional details on how to troubleshoot. ++ :::image type="content" source="./media/troubleshoot-changed-block-tracking-replication/health-error-details.png" alt-text="Screenshot of Health error details option in the Replication machines screen."::: ++1. Select **Recent Events** to see the previous replication cycle failures for the VM. In the events page, look for the most recent event of type *Replication cycle failed* or *Replication cycle failed for disk* for the VM. ++ :::image type="content" source="./media/troubleshoot-changed-block-tracking-replication/recent-events.png" alt-text="Screenshot of Recent Events option."::: ++1. Select the event to understand the possible causes of the error and recommended remediation steps. Use the information provided to troubleshoot and remediate the error. ++ :::image type="content" source="./media/troubleshoot-changed-block-tracking-replication/error-details.png" alt-text="Screenshot of error message in the Error details screen."::: ## Common Replication Errors This section describes some of the common errors, and how you can troubleshoot t **Error:** "Key Vault operation failed. Operation: Generate shared access signature definition, Key Vault: Key-vault-name, Storage Account: storage account name failed with the error:" -![Key Vault](./media/troubleshoot-changed-block-tracking-replication/key-vault.png) This error typically occurs because the User Access Policy for the Key Vault doesn't give the currently logged in user the necessary permissions to configure storage accounts to be Key Vault managed. To check for user access policy on the key vault, go to the Key vault page on the portal for the Key vault and select Access policies. When the portal creates the key vault, it also adds a user access policy grantin - The logged in user is a remote principal on the customer's Azure tenant (CSP subscription - and the logged in user is the partner admin). The work-around in this case is to delete the key vault, sign out from the portal, and then sign in with a user account from the customer's tenant (not a remote principal) and retry the operation. The CSP partner will typically have a user account in the customer's Microsoft Entra tenant that they can use. If not, they can create a new user account for themselves in the customer's Microsoft Entra tenant, sign in to the portal as the new user, and then retry the replicate operation. The account used must have either Owner or Contributor+User Access Administrator permissions granted to the account on the resource group (Migrate project resource group). 
-- The other case where this may happen is when one user (user1) attempted to set up replication initially and encountered a failure, but the key vault has already been created (and user access policy appropriately assigned to this user). Now at a later point a different user (user2) tries to set up replication, but the Configure Managed Storage Account or Generate SAS definition operation fails as there's no user access policy corresponding to user2 in the key vault.+- The other case where this might happen is when one user (user1) attempted to set up replication initially and encountered a failure, but the key vault has already been created (and user access policy appropriately assigned to this user). Now at a later point a different user (user2) tries to set up replication, but the Configure Managed Storage Account or Generate SAS definition operation fails as there's no user access policy corresponding to user2 in the key vault. **Resolution**: To work around this issue, create a user access policy for user2 in the key vault granting user2 permission to configure managed storage account and generate SAS definitions. User2 can do this from Azure PowerShell using the below cmdlets: The possible causes include: **Steps to run the performance benchmark test:** - 1. [Download](../storage/common/storage-use-azcopy-v10.md) azcopy + 1. [Download](../storage/common/storage-use-azcopy-v10.md) azcopy. 2. Look for the Appliance Storage Account in the Resource Group. The Storage Account has a name that resembles migratelsa\*\*\*\*\*\*\*\*\*\*. This is the value of parameter [account] in the above command. This error can be resolved in the following two ways: - If you had opted for **Automatically repair replication** by selecting "Yes" when you triggered replication of VM, the tool will try to repair it for you. Right-click on the VM, and select **Repair Replication**. - If you didn't opt for **Automatically repair replication** or the above step didn't work for you, then stop replication for the virtual machine, [reset changed block tracking](https://go.microsoft.com/fwlink/?linkid=2139203) on the virtual machine, and then reconfigure replication. -One such known issue that may cause a CBT reset of virtual machine on VMware vSphere 5.5 is described in [VMware KB 1020128: Changed Block Tracking](https://kb.vmware.com/s/article/1020128) is reset after a storage vMotion operation in vSphere 5.x. If you are on VMware vSphere 5.5, ensure that you apply the updates described in this KB. +One such known issue that might cause a CBT reset of virtual machine on VMware vSphere 5.5 is described in [VMware KB 1020128: Changed Block Tracking](https://kb.vmware.com/s/article/1020128) is reset after a storage vMotion operation in vSphere 5.x. If you are on VMware vSphere 5.5, ensure that you apply the updates described in this KB. Alternatively, you can reset VMware changed block tracking on a virtual machine using VMware PowerCLI. ## An internal error occurred -Sometimes you may hit an error that occurs due to issues in the VMware environment/API. We've identified the following set of errors as VMware environment-related errors. These errors have a fixed format. +Sometimes you might hit an error that occurs due to issues in the VMware environment/API. We've identified the following set of errors as VMware environment-related errors. These errors have a fixed format. _Error Message: An internal error occurred. [Error message]_ The issue is a known VMware issue and occurs in VDDK 6.7. 
You need to stop the g ### Error Message: An internal error occurred. ['An Invalid snapshot configuration was detected.'] -If you have a virtual machine with multiple disks, you may encounter this error if you remove a disk from the virtual machine. To remediate this problem, refer to the steps in [this VMware article](https://go.microsoft.com/fwlink/?linkid=2138890). +If you have a virtual machine with multiple disks, you might encounter this error if you remove a disk from the virtual machine. To remediate this problem, refer to the steps in [this VMware article](https://go.microsoft.com/fwlink/?linkid=2138890). ### Error Message: An internal error occurred. [Generate Snapshot Hung] This error occurs when there's a problem with the underlying datastore on which ### Error message: An error occurred while taking a snapshot: Unable to open the snapshot file. This error occurs when the size of the snapshot file created is larger than the available free space in the datastore where the VM is located. Follow the resolution given in this [document](https://go.microsoft.com/fwlink/?linkid=2166464). +## Protection Readiness Error ++**Error Message:** Cannot replicate this virtual machine with current VMware configuration. ++**Possible Causes:** ++- Change tracking cannot be enabled for the VM as snapshots are already present for the VM. + +**Recommendation:** ++- Delete the snapshots or enable change block tracking on the VM and retry. + ## Replication cycle failed **Error ID:** 181008 This error occurs when the size of the snapshot file created is larger than the **Possible Causes:** -This may happen if: +This might happen if: 1. The Azure Migrate appliance is unable to resolve the hostname of the vSphere host. 2. The Azure Migrate appliance is unable to connect to the vSphere host on port 902 (default port used by VMware vSphere Virtual Disk Development Kit), because TCP port 902 is being blocked on the vSphere host or by a network firewall. |
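The exact benchmark command referenced above isn't reproduced in this excerpt; as a rough sketch, `azcopy bench` can be pointed at the appliance storage account (the `migratelsa*` account) with a container SAS to gauge upload throughput from the appliance. The account name, container, SAS token, and sizes below are placeholders.

```bash
# Sketch only: replace the account, container, and SAS token with values from your appliance storage account.
azcopy bench "https://<migratelsa-account>.blob.core.windows.net/bench-test?<SAS-token>" \
  --file-count 32 \
  --size-per-file 256M
```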
mysql | Concepts Monitoring | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/mysql/flexible-server/concepts-monitoring.md | These metrics are available for Azure Database for MySQL: |Slow_queries|slow_queries|Count|The total count of slow queries on your server in the selected time range.| +## Storage Breakdown Metrics ++|Metric display name|Metric|Unit|Description| +||||| +|Storage Limit|storage_limit|Bytes|The maximum storage size configured for this server.| +|Storage Percent|storage_percent|Percent|The percentage of storage used out of the server's maximum storage available.| +|Storage Used|storage_used|Bytes|The amount of storage in use. The storage used by the service may include the database files, transaction logs, and the server logs.| +|Data Storage Used|data_storage_used|Bytes|The amount of storage used for storing database files.| +|ibdata1 Storage Used|ibdata1_storage_used|Bytes|The amount of storage used for storing system tablespace (ibdata1) file.| +|Binlog Storage Used|binlog_storage_used|Bytes|The amount of storage used for storing binary log files.| +|Other Storage Used|other_storage_used|Bytes| The amount of storage used for other components and metadata files.| +|Backup Storage Used|backup_storage_used|Bytes|The amount of backup storage used.| + ## Replication metrics |Metric display name|Metric|Unit|Description| These metrics are available for Azure Database for MySQL: |Innodb_buffer_pool_pages_dirty|Innodb_buffer_pool_pages_dirty|Count|The total count of pages in the InnoDB buffer pool containing dirty pages.| -## Storage Breakdown Metrics --|Metric display name|Metric|Unit|Description| -||||| -|Storage Limit|storage_limit|Bytes|The maximum storage size configured for this server.| -|Storage Percent|storage_percent|Percent|The percentage of storage used out of the server's maximum storage available.| -|Storage Used|storage_used|Bytes|The amount of storage in use. The storage used by the service may include the database files, transaction logs, and the server logs.| -|Data Storage Used|data_storage_used|Bytes|The amount of storage used for storing database files.| -|ibdata1 Storage Used|ibdata1_storage_used|Bytes|The amount of storage used for storing system tablespace (ibdata1) file.| -|Binlog Storage Used|binlog_storage_used|Bytes|The amount of storage used for storing binary log files.| -|Other Storage Used|other_storage_used|Bytes| The amount of storage used for other components and metadata files.| -|Backup Storage Used|backup_storage_used|Bytes|The amount of backup storage used.| -- ## Server logs |
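To read the storage breakdown metrics above outside the portal, here's a minimal sketch using Azure Monitor from the CLI; the resource group and server name are placeholders, and the metric names follow the table above.

```bash
# Sketch only: resource group and server name are placeholders.
SERVER_ID=$(az mysql flexible-server show \
  --resource-group my-rg --name my-mysql-server --query id --output tsv)

az monitor metrics list \
  --resource "$SERVER_ID" \
  --metric storage_percent storage_used binlog_storage_used \
  --interval PT1H \
  --aggregation Maximum \
  --output table
```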
mysql | Concepts Service Tiers Storage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/mysql/flexible-server/concepts-service-tiers-storage.md | The detailed specifications of the available server types are as follows: |Standard_E8ads_v5 | 8 | 64 | 18000 | 10923 | 300 | |Standard_E16ds_v4 | 16 | 128 | 28000 | 21845 | 600 | |Standard_E16ads_v5 | 16 | 128 | 28000 | 21845 | 600 |+|Standard_E20ds_v4 | 20 | 160 | 28000 | 27306 | 750 | +|Standard_E20ads_v5 | 20 | 160 | 28000 | 27306 | 750 | |Standard_E32ds_v4 | 32 | 256 | 38000 | 43691 | 1200 | |Standard_E32ads_v5 | 32 | 256 | 38000 | 43691 | 1200 | |Standard_E48ds_v4 | 48 | 384 | 48000 | 65536 | 1800 | The detailed specifications of the available server types are as follows: |Standard_E4ds_v5 | 4 | 32 | 10000 | 5461 | 150 | |Standard_E8ds_v5 | 8 | 64 | 18000 | 10923 | 300 | |Standard_E16ds_v5 | 16 | 128 | 28000 | 21845 | 600 |+|Standard_E20ds_v5 | 20 | 160 | 28000 | 27306 | 750 | |Standard_E32ds_v5 | 32 | 256 | 38000 | 43691 | 1200 | |Standard_E48ds_v5 | 48 | 384 | 48000 | 65536 | 1800 | |Standard_E64ds_v5 | 64 | 512 | 64000 | 87383 | 2400 | |
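A hedged sketch of moving an existing server onto one of the newly listed 20 vCore SKUs with the CLI; the resource group and server name are placeholders, and SKU availability varies by region.

```bash
# Sketch only: resource group and server name are placeholders; confirm regional availability first.
az mysql flexible-server update \
  --resource-group my-rg \
  --name my-mysql-server \
  --tier BusinessCritical \
  --sku-name Standard_E20ds_v5
```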
mysql | Whats New | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/mysql/flexible-server/whats-new.md | This article summarizes new releases and features in Azure Database for MySQL - ## October 2023 +- **Addition of New vCore Options in Azure Database for MySQL - Flexible Server** ++We are excited to inform you that we have introduced new 20 vCore options under the Business Critical Service tier for our Azure Database for MySQL - Flexible Server. Please find more information under [Compute Option for Azure Database for MySQL - Flexible Server](./concepts-service-tiers-storage.md#service-tiers-size-and-server-types). + - **Metrics computation for Azure Database for MySQL - Flexible Server**+ The "Host Memory Percent" metric will provide more accurate calculations of memory usage. It will now reflect the actual memory consumed by the server, excluding re-usable memory from the calculation. This improvement ensures that you have a more precise understanding of your server's memory utilization. After the completion of the [scheduled maintenance window](./concepts-maintenance.md), existing servers will benefit from this enhancement. - **Known Issues**+ When attempting to modify the User assigned managed identity and Key identifier in a single request while changing the CMK settings, the operation gets stuck. We are working on the upcoming deployment for the permanent solution to address this issue. In the meantime, please ensure that you perform the two operations of updating the User Assigned Managed Identity and Key identifier in separate requests. The sequence of these operations is not critical, as long as the user-assigned identities have the necessary access to both Key Vault ## September 2023 |
operator-nexus | Howto Use Mde Runtime Protection | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/operator-nexus/howto-use-mde-runtime-protection.md | + + Title: "Azure Operator Nexus: MDE Runtime Protection" +description: Learn how to use the MDE Runtime Protection. ++++ Last updated : 10/15/2023++++# Introduction to the Microsoft Defender for Endpoint runtime protection service ++The Microsoft Defender for Endpoint (MDE) runtime protection service provides the tools to configure and manage runtime protection for a Nexus cluster. ++The Azure CLI allows you to configure runtime protection ***Enforcement Level*** and the ability to trigger ***MDE Scan*** on all nodes. +This document provides the steps to execute those tasks. ++> [!NOTE] +> The MDE runtime protection service integrates with [Microsoft Defender for Endpoint](../defender-for-cloud/integration-defender-for-endpoint.md), which provides comprehensive Endpoint Detection and Response (EDR) capabilities. With Microsoft Defender for Endpoint integration, you can spot abnormalities and detect vulnerabilities. ++## Before you begin ++- Install the latest version of the [appropriate CLI extensions](./howto-install-cli-extensions.md). ++## Setting variables ++To help with configuring and triggering MDE scans, define these environment variables used by the various commands throughout this guide. ++> [!NOTE] +> These environment variable values do not reflect a real deployment and users MUST change them to match their environments. ++```bash +# SUBSCRIPTION_ID: Subscription of your cluster +export SUBSCRIPTION_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" +# RESOURCE_GROUP: Resource group of your cluster +export RESOURCE_GROUP="contoso-cluster-rg" +# MANAGED_RESOURCE_GROUP: Managed resource group managed by your cluster +export MANAGED_RESOURCE_GROUP="contoso-cluster-managed-rg" +# CLUSTER_NAME: Name of your cluster +export CLUSTER_NAME="contoso-cluster" +``` ++## Configuring enforcement level +The `az networkcloud cluster update` allows you to update of the settings for Cluster runtime protection *enforcement level* by using the argument `--runtime-protection-configuration enforcement-level="<enforcement level>"`. ++The following command configures the `enforcement level` for your Cluster. ++```bash +az networkcloud cluster update \ +--subscription ${SUBSCRIPTION_ID} \ +--resource-group ${RESOURCE_GROUP} \ +--cluster-name ${CLUSTER_NAME} \ +--runtime-protection-configuration enforcement-level="<enforcement level>" +``` ++Allowed values for `<enforcement level>`: `Audit`, `Disabled`, `OnDemand`, `Passive`, `RealTime`. ++## Enabling & Disabling MDE Service on All Nodes +By default the MDE service isn't active. You need to enable it before you can trigger an MDE scan. +To enable the MDE service, execute the following command. ++```bash +az networkcloud cluster update \ +--subscription ${SUBSCRIPTION_ID} \ +--resource-group ${RESOURCE_GROUP} \ +--cluster-name ${CLUSTER_NAME} \ +--runtime-protection-configuration enforcement-level="<enforcement level>" +``` ++where `<enforcement level>` value must be a value other than `Disabled`. ++> [!NOTE] +>As you have noted, the argument `--runtime-protection-configuration enforcement-level="<enforcement level>"` serves two purposes: enabling/disabling MDE service and updating the enforcement level. ++If you want to disable the MDE service across your Cluster, use an `<enforcement level>` of `Disabled`. 
++## Triggering MDE scan on all nodes +Once you have set an enforcement level for your Cluster, you can trigger an MDE scan with the following command: ++```bash +az networkcloud cluster scan-runtime \ +--subscription ${SUBSCRIPTION_ID} \ +--resource-group ${RESOURCE_GROUP} \ +--cluster-name ${CLUSTER_NAME} \ +--scan-activity Scan +``` ++## Retrieve MDE scan information from each node +This section provides the steps to retrieve MDE scan information. +First you need to retrieve the list of node names of your cluster. +The following command assigns the list of node names to an environment variable. ++```bash +nodes=$(az networkcloud baremetalmachine list \ +--subscription ${SUBSCRIPTION_ID} \ +--resource-group ${MANAGED_RESOURCE_GROUP} \ +| jq -r '.[].machineName') +``` ++With the list of node names, we can start the process to extract MDE agent information for each node of your Cluster. +The following command will prepare MDE agent information from each node. ++```bash +for node in $nodes +do + echo "Extracting MDE agent information for node ${node}" + az networkcloud baremetalmachine run-data-extract \ + --subscription ${SUBSCRIPTION_ID} \ + --resource-group ${MANAGED_RESOURCE_GROUP} \ + --name ${node} \ + --commands '[{"command":"mde-agent-information"}]' \ + --limit-time-seconds 600 +done +``` ++The result for the command will include a URL where you can download the detailed report of MDE scans. +See the following example for the result for the MDE agent information. ++```bash +Extracting MDE agent information for node rack1control01 +====Action Command Output==== +Executing mde-agent-information command +MDE agent is running, proceeding with data extract +Getting MDE agent information for rack1control01 +Writing to /hostfs/tmp/runcommand ++================================ +Script execution result can be found in storage account: + <url to download mde scan results> + ... +``` ++## Extracting MDE scan results +The extraction of MDE scan requires a few manual steps: To download the MDE scan report and extract the scan run information, and scan detailed result report. +This section will guide you on each of these steps. ++### Download the scan report +As indicated earlier the MDE agent information response provides the URL storing the detailed report data. ++Download the report from the returned URL `<url to download mde scan results>`, and open the file `mde-agent-information.json`. ++The `mde-agent-information.json` file contains lots of information about the scan and it can be overwhelming to analyze such long detailed report. +This guide provides a few examples of extracting some essential information that can help you decide if you need to analyze thoroughly the report. ++### Extracting the list of MDE scans +The `mde-agent-information.json` file contains a detailed scan report but you might want to focus first on a few details. +This section details the steps to extract the list of scans run providing the information such as start and end time for each scan, threats found, state (succeeded or failed), etc. ++The following command extracts this simplified report. ++```bash +cat <path to>/mde-agent-information.json| jq .scanList +``` ++The following example shows the extracted scan report from `mde-agent-information.json`. 
++```bash +[ + { + "endTime": "1697204632487", + "filesScanned": "1750", + "startTime": "1697204573732", + "state": "succeeded", + "threats": [], + "type": "quick" + }, + { + "endTime": "1697217162904", + "filesScanned": "1750", + "startTime": "1697217113457", + "state": "succeeded", + "threats": [], + "type": "quick" + } +] +``` ++You can use the Unix `date` command to convert the time in a more readable format. +For your convenience, see an example for converting Unix timestamp (in milliseconds) to year-month-day and hour:min:secs. ++For example: ++```bash +date -d @$(echo "1697204573732/1000" | bc) "+%Y-%m-%dT%H:%M:%S" ++2023-10-13T13:42:53 +``` ++### Extracting the MDE scan results +This section details the steps to extract the report about the list of threats identified during the MDE scans. +To extract the scan result report from `mde-agent-information.json` file, execute the following command. ++```bash +cat <path to>/mde-agent-information.json| jq .threatInformation +``` ++The following example shows the report of threats identified by the scan extracted from `mde-agent-information.json` file. ++```bash +{ + "list": { + "threats": { + "scans": [ + { + "type": "quick", + "start_time": 1697204573732, + "end_time": 1697204632487, + "files_scanned": 1750, + "threats": [], + "state": "succeeded" + }, + { + "type": "quick", + "start_time": 1697217113457, + "end_time": 1697217162904, + "files_scanned": 1750, + "threats": [], + "state": "succeeded" + } + ] + } + }, + "quarantineList": { + "type": "quarantined", + "threats": [] + } +} +``` |
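Building on the `jq` and `date` examples above, a small sketch that prints each scan with human-readable UTC timestamps in one pass, assuming the same `mde-agent-information.json` layout shown above.

```bash
# Sketch: start/end time (UTC), state, type, and threat count for each scan in the report.
jq -r '.scanList[] |
  [(.startTime | tonumber / 1000 | floor | todate),
   (.endTime   | tonumber / 1000 | floor | todate),
   .state, .type, (.threats | length)] | @tsv' mde-agent-information.json
```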
postgresql | Concepts Compute Storage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/postgresql/flexible-server/concepts-compute-storage.md | You can create an Azure Database for PostgreSQL server in one of three pricing t | Resource/Tier | Burstable | General Purpose | Memory Optimized | |:|:-|:--|:|-| VM-series | B-series | Ddsv4-series, <br> Dsv3-series | Edsv4-series, <br> Esv3-series | -| vCores | 1, 2, 4, 8, 12, 16, 20 | 2, 4, 8, 16, 32, 48, 64 | 2, 4, 8, 16, 20 (v4), 32, 48, 64 | +| VM-series | B-series | Ddsv5-series, <br> Dadsv5-series, <br> Ddsv4-series, <br> Dsv3-series | Edsv5-series, <br> Eadsv5-series, <br> Edsv4-series, <br> Esv3-series | +| vCores | 1, 2, 4, 8, 12, 16, 20 | 2, 4, 8, 16, 32, 48, 64, 96 | 2, 4, 8, 16, 20 (v4/v5), 32, 48, 64, 96 | | Memory per vCore | Variable | 4 GB | 6.75 GB to 8 GB | | Storage size | 32 GB to 32 TB | 32 GB to 32 TB | 32 GB to 32 TB | | Database backup retention period | 7 to 35 days | 7 to 35 days | 7 to 35 days | The detailed specifications of the available server types are as follows: | B12ms | 12 | 24 GiB | 3,800 | 50 MiB/sec | | B16ms | 16 | 32 GiB | 4,300 | 50 MiB/sec | | B20ms | 20 | 40 GiB | 5,000 | 50 MiB/sec |-| **General Purpose** | | | | | -| D2s_v3 / D2ds_v4 / D2ds_v5 | 2 | 8 GiB | 3,200 | 48 MiB/sec | -| D4s_v3 / D4ds_v4 / D4ds_v5 | 4 | 16 GiB | 6,400 | 96 MiB/sec | -| D8s_v3 / D8ds_v4 / D8ds_v5 | 8 | 32 GiB | 12,800 | 192 MiB/sec | -| D16s_v3 / D16ds_v4 / D16ds_v5 | 16 | 64 GiB | 20,000 | 384 MiB/sec | -| D32s_v3 / D32ds_v4 / D32ds_v5 | 32 | 128 GiB | 20,000 | 768 MiB/sec | -| D48s_v3 / D48ds_v4 / D48ds_v5 | 48 | 192 GiB | 20,000 | 900 MiB/sec | -| D64s_v3 / D64ds_v4 / D64ds_v5 | 64 | 256 GiB | 20,000 | 900 MiB/sec | -| D96ds_v5 | 96 | 384 GiB | 20,000 | 900 MiB/sec | -| **Memory Optimized** | | | | | -| E2s_v3 / E2ds_v4 / E2ds_v5 | 2 | 16 GiB | 3,200 | 48 MiB/sec | -| E4s_v3 / E4ds_v4 / E4ds_v5 | 4 | 32 GiB | 6,400 | 96 MiB/sec | -| E8s_v3 / E8ds_v4 / E8ds_v5 | 8 | 64 GiB | 12,800 | 192 MiB/sec | -| E16s_v3 / E16ds_v4 / E16ds_v5 | 16 | 128 GiB | 20,000 | 384 MiB/sec | -| E20ds_v4 / E20ds_v5 | 20 | 160 GiB | 20,000 | 480 MiB/sec | -| E32s_v3 / E32ds_v4 / E32ds_v5 | 32 | 256 GiB | 20,000 | 768 MiB/sec | -| E48s_v3 / E48ds_v4 / E48ds_v5 | 48 | 384 GiB | 20,000 | 900 MiB/sec | -| E64s_v3 / E64ds_v4 | 64 | 432 GiB | 20,000 | 900 MiB/sec | -| E64ds_v5 | 64 | 512 GiB | 20,000 | 900 MiB/sec | -| E96ds_v5 | 96 | 672 GiB | 20,000 | 900 MiB/sec | +| **General Purpose** | | | | | +| D2s_v3 / D2ds_v4 / D2ds_v5 / D2ads_v5 | 2 | 8 GiB | 3,200 | 48 MiB/sec | +| D4s_v3 / D4ds_v4 / D4ds_v5 / D4ads_v5 | 4 | 16 GiB | 6,400 | 96 MiB/sec | +| D8s_v3 / D8ds_v4 / D8ds_v5 / D8ads_v5 | 8 | 32 GiB | 12,800 | 192 MiB/sec | +| D16s_v3 / D16ds_v4 / D16ds_v5 / D16ds_v5 | 16 | 64 GiB | 20,000 | 384 MiB/sec | +| D32s_v3 / D32ds_v4 / D32ds_v5 / D32ads_v5 | 32 | 128 GiB | 20,000 | 768 MiB/sec | +| D48s_v3 / D48ds_v4 / D48ds_v5 / D48ads_v5 | 48 | 192 GiB | 20,000 | 900 MiB/sec | +| D64s_v3 / D64ds_v4 / D64ds_v5/ D64ads_v5 | 64 | 256 GiB | 20,000 | 900 MiB/sec | +| D96ds_v5 / D96ads_v5 | 96 | 384 GiB | 20,000 | 900 MiB/sec | +| **Memory Optimized** | | | | | +| E2s_v3 / E2ds_v4 / E2ds_v5 / E2ads_v5 | 2 | 16 GiB | 3,200 | 48 MiB/sec | +| E4s_v3 / E4ds_v4 / E4ds_v5 / E4ads_v5 | 4 | 32 GiB | 6,400 | 96 MiB/sec | +| E8s_v3 / E8ds_v4 / E8ds_v5 / E8ads_v5 | 8 | 64 GiB | 12,800 | 192 MiB/sec | +| E16s_v3 / E16ds_v4 / E16ds_v5 / E16ads_v5 | 16 | 128 GiB | 20,000 | 384 MiB/sec | +| E20ds_v4 / E20ds_v5 / E20ads_v5 | 20 | 160 GiB | 20,000 | 480 MiB/sec 
| +| E32s_v3 / E32ds_v4 / E32ds_v5 / E32ads_v5 | 32 | 256 GiB | 20,000 | 768 MiB/sec | +| E48s_v3 / E48ds_v4 / E48ds_v5 / E48ads_v5 | 48 | 384 GiB | 20,000 | 900 MiB/sec | +| E64s_v3 / E64ds_v4 | 64 | 432 GiB | 20,000 | 900 MiB/sec | +| E64ds_v5 / E64ads_v4 | 64 | 512 GiB | 20,000 | 900 MiB/sec | +| E96ds_v5 /E96ads_v5 | 96 | 672 GiB | 20,000 | 900 MiB/sec | ## Storage You can monitor your I/O consumption in the Azure portal or by using Azure CLI c ### Maximum IOPS for your configuration -|SKU name |Storage size in GiB |32 |64 |128 |256 |512 |1,024|2,048|4,096|8,192 |16,384|32,767 | -|||||-|-|--|--|--|--||| +|SKU name |Storage size in GiB |32 |64|128 |256|512|1,024|2,048|4,096|8,192|16,384|32,767| +|||||-|-|--|--|--|--|||-| | |Maximum IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | |**Burstable** | | | | | | | | | | | | |-|B1ms |640 IOPS |120|240|500 |640*|640* |640* |640* |640* |640* |640* |640* | -|B2s |1,280 IOPS |120|240|500 |1,100|1,280*|1,280*|1,280*|1,280*|1,280* |1,280* |1,280* | -|B2ms |1,280 IOPS |120|240|500 |1,100|1,700*|1,700*|1,700*|1,700*|1,700* |1,700* |1,700* | -|B4ms |1,280 IOPS |120|240|500 |1,100|2,300 |2,400*|2,400*|2,400*|2,400* |2,400* |2,400* | -|B8ms |1,280 IOPS |120|240|500 |1,100|2,300 |3,100*|3,100*|3,100*|3,100* |2,400* |2,400* | -|B12ms |1,280 IOPS |120|240|500 |1,100|2,300 |3,800*|3,800*|3,800*|3,800* |3,800* |3,800* | -|B16ms |1,280 IOPS |120|240|500 |1,100|2,300 |4,300*|4,300*|4,300*|4,300* |4,300* |4,300* | -|B20ms |1,280 IOPS |120|240|500 |1,100|2,300 |5,000 |5,000*|5,000*|5,000* |5,000* |5,000* | -|**General Purpose** | | | | | | | | | | | | -|D2s_v3 / D2ds_v4 |3,200 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | -|D2ds_v5 |3,750 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | -|D4s_v3 / D4ds_v4 / D4ds_v5 |6,400 IOPS |120|240|500 |1,100|2,300 |5,000 |6,400*|6,400*|6,400* |6,400* |6,400* | -|D8s_v3 / D8ds_v4 / D8ds_v5 |12,800 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |12,800*|12,800*|12,800*| -|D16s_v3 / D16ds_v4 / D16ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|D32s_v3 / D32ds_v4 / D32ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|D48s_v3 / D48ds_v4 / D48ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|D64s_v3 / D64ds_v4 / D64ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|D96ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|**Memory Optimized** | | | | | | | | | | | | | -|E2s_v3 / E2ds_v4 |3,200 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | -|E2ds_v5 |3,750 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | -|E4s_v3 / E4ds_v4 / E4ds_v5 |6,400 IOPS |120|240|500 |1,100|2,300 |5,000 |6,400*|6,400*|6,400* |6,400* |6,400* | -|E8s_v3 / E8ds_v4 / E8ds_v5 |12,800 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |12,800*|12,800*|12,800*| -|E16s_v3 / E16ds_v4 / E16ds_v5 |20,000 IOPS |120|240|500 |1100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|E20ds_v4/E20ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|E32s_v3 / E32ds_v4 / E32ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|E48s_v3 / E48ds_v4 / E48ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 
|18,000 |20,000 | -|E64s_v3 / E64ds_v4 / E64ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | -|E96ds_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|B1ms |640 IOPS |120|240|500 |640*|640* |640* |640* |640* |640* |640* |640* | +|B2s |1,280 IOPS |120|240|500 |1,100|1,280*|1,280*|1,280*|1,280*|1,280* |1,280* |1,280* | +|B2ms |1,280 IOPS |120|240|500 |1,100|1,700*|1,700*|1,700*|1,700*|1,700* |1,700* |1,700* | +|B4ms |1,280 IOPS |120|240|500 |1,100|2,300 |2,400*|2,400*|2,400*|2,400* |2,400* |2,400* | +|B8ms |1,280 IOPS |120|240|500 |1,100|2,300 |3,100*|3,100*|3,100*|3,100* |2,400* |2,400* | +|B12ms |1,280 IOPS |120|240|500 |1,100|2,300 |3,800*|3,800*|3,800*|3,800* |3,800* |3,800* | +|B16ms |1,280 IOPS |120|240|500 |1,100|2,300 |4,300*|4,300*|4,300*|4,300* |4,300* |4,300* | +|B20ms |1,280 IOPS |120|240|500 |1,100|2,300 |5,000 |5,000*|5,000*|5,000* |5,000* |5,000* | +|**General Purpose** | | | | | | | | | | | | | +|D2s_v3 / D2ds_v4 |3,200 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | +|D2ds_v5 / D2ads_v5 |3,750 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | +|D4s_v3 / D4ds_v4 / D4ds_v5 / D4ads_v5 |6,400 IOPS |120|240|500 |1,100|2,300 |5,000 |6,400*|6,400*|6,400* |6,400* |6,400* | +|D8s_v3 / D8ds_v4 / D8ds_v5 / D8ads_v5 |12,800 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |12,800*|12,800*|12,800*| +|D16s_v3 / D16ds_v4 / D16ds_v5 / D16ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|D32s_v3 / D32ds_v4 / D32ds_v5 / D32ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|D48s_v3 / D48ds_v4 / D48ds_v5 / D48ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|D64s_v3 / D64ds_v4 / D64ds_v5 / D64ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|D96ds_v5 / D96ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|**Memory Optimized** | | | | | | | | | | | | | +|E2s_v3 / E2ds_v4 |3,200 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | +|E2ds_v5 /E2ads_v5 |3,750 IOPS |120|240|500 |1,100|2,300 |3,200*|3,200*|3,200*|3,200* |3,200* |3,200* | +|E4s_v3 / E4ds_v4 / E4ds_v5 / E4ads_v5 |6,400 IOPS |120|240|500 |1,100|2,300 |5,000 |6,400*|6,400*|6,400* |6,400* |6,400* | +|E8s_v3 / E8ds_v4 / E8ds_v5 / E8ads_v5 |12,800 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |12,800*|12,800*|12,800*| +|E16s_v3 / E16ds_v4 / E16ds_v5 / E16ads_v5 |20,000 IOPS |120|240|500 |1100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|E20ds_v4 / E20ds_v5 / E20ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|E32s_v3 / E32ds_v4 / E32ds_v5 / E32ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|E48s_v3 / E48ds_v4 / E48ds_v5 / E48ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|E64s_v3 / E64ds_v4 / E64ds_v5 / E64ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | +|E96ds_v5 /|E96ads_v5 |20,000 IOPS |120|240|500 |1,100|2,300 |5,000 |7,500 |7,500 |16,000 |18,000 |20,000 | IOPS marked with an asterisk (\*) are limited by the VM type that you selected. Otherwise, the selected storage size limits the IOPS. 
IOPS marked with an asterisk (\*) are limited by the VM type that you selected. |B16ms |50 MiB/sec |25 |50 |50* |50* |50* |50* |50* |50* |50* |50* |50* | |B20ms |50 MiB/sec |25 |50 |50* |50* |50* |50* |50* |50* |50* |50* |50* | |**General Purpose** | | | | | | | | | | | | |-|D2s_v3 / D2ds_v4 |48 MiB/sec |25 |48* |48* |48* |48* |48* |48* |48* |48* |48* |48* | -|D2ds_v5 |85 MiB/sec |25 |50 |85* |85* |85* |85* |85* |85* |85* |85* |85* | -|D4s_v3 / D4ds_v4 |96 MiB/sec |25 |50 |96* |96* |96* |96* |96* |96* |96* |96* |96* | -|D4ds_v5 |145 MiB/sec |25 |50* |100* |125* 145* |145* |145* |145* |145* |145* |145* | -|D8s_v3 / D8ds_v4 |192 MiB/sec |25 |50 |100 |125 |150 |192* |192* |192* |192* |192* |192* | -|D8ds_v5 |290 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |290* |290* |290* | -|D16s_v3 / D16ds_v4 |384 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |384* |384* |384* | -|D16ds_v5 |600 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |600* |600* | -|D32s_v3 / D32ds_v4 |768 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | -|D32ds_v5 |865 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |865* | -|D48s_v3 / D48ds_v4 /D48ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | -|D64s_v3 / Dd64ds_v4 /D64ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | -|Dd96ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|D2s_v3 / D2ds_v4 |48 MiB/sec |25 |48* |48* |48* |48* |48* |48* |48* |48* |48* |48* | +|D2ds_v5 /D2ads_v5 |85 MiB/sec |25 |50 |85* |85* |85* |85* |85* |85* |85* |85* |85* | +|D4s_v3 / D4ds_v4 |96 MiB/sec |25 |50 |96* |96* |96* |96* |96* |96* |96* |96* |96* | +|D4ds_v5 / D4ads_v5 |145 MiB/sec |25 |50* |100* |125* 145* |145* |145* |145* |145* |145* |145* | +|D8s_v3 / D8ds_v4 |192 MiB/sec |25 |50 |100 |125 |150 |192* |192* |192* |192* |192* |192* | +|D8ds_v5 / D8ads_v5 |290 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |290* |290* |290* | +|D16s_v3 / D16ds_v4 |384 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |384* |384* |384* | +|D16ds_v5 / D16ads_v5 |600 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |600* |600* | +|D32s_v3 / D32ds_v4 |768 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|D32ds_v5 / D32ads_v5 |865 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |865* | +|D48s_v3 / D48ds_v4 / D48ds_v5 / D48ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|D64s_v3 / Dd64ds_v4 / D64ds_v5 / D64ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|Dd96ds_v5 / Dd96ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | |**Memory Optimized** | | | | | | | | | | | | |-|E2s_v3 / E2ds_v4 |48 MiB/sec |25 |48* |48* |48* |48* |48* |48* |48* |48* |48* |48* | -|E2ds_v5 |85 MiB/sec |25 |50 |85* |85* |85* |85* |85* |85* |85* |85* |85* | -|E4s_v3 / E4ds_v4 |96 MiB/sec |25 |50 |96* |96* |96* |96* |96* |96* |96* |96* |96* | -|E4ds_v5 |145 MiB/sec |25 |50* |100* |125* 145* |145* |145* |145* |145* |145* |145* | -|E8s_v3 / E8ds_v4 |192 MiB/sec |25 |50 |100 |125 |150 |192* |192* |192* |192* |192* |192* | -|E8ds_v5 |290 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |290* |290* |290* | -|E16s_v3 / E16ds_v4 |384 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |384* |384* |384* | -|E16ds_v5 |600 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |600* |600* | -|E20ds_v4 |480 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |480* |480* |480* | -|E20ds_v5 |750 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |750* | -|E32s_v3 / E32ds_v4 |750 
MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |750 | -|E32ds_v5 |865 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |865* | -|E48s_v3 / E48ds_v4 /E48ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | -|E64s_v3 / E64ds_v4 /E64ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | -|Ed96ds_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|E2s_v3 / E2ds_v4 |48 MiB/sec |25 |48* |48* |48* |48* |48* |48* |48* |48* |48* |48* | +|E2ds_v5 /E2ads_v5 |85 MiB/sec |25 |50 |85* |85* |85* |85* |85* |85* |85* |85* |85* | +|E4s_v3 / E4ds_v4 |96 MiB/sec |25 |50 |96* |96* |96* |96* |96* |96* |96* |96* |96* | +|E4ds_v5 / E4ads_v5 |145 MiB/sec |25 |50* |100* |125* 145* |145* |145* |145* |145* |145* |145* | +|E8s_v3 / E8ds_v4 |192 MiB/sec |25 |50 |100 |125 |150 |192* |192* |192* |192* |192* |192* | +|E8ds_v5 /E8ads_v5 |290 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |290* |290* |290* | +|E16s_v3 / E16ds_v4 |384 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |384* |384* |384* | +|E16ds_v5 / E16ads_v5 |600 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |600* |600* | +|E20ds_v4 |480 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |480* |480* |480* | +|E20ds_v5 / E20ads_v5 |750 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |750* | +|E32s_v3 / E32ds_v4 |750 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |750 | +|E32ds_v5 / E32ads_v5 |865 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |865* | +|E48s_v3 / E48ds_v4 /E48ds_v5 / E48ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|E64s_v3 / E64ds_v4 / E64ds_v5 / E64ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | +|Ed96ds_v5 / Ed96ads_v5 |900 MiB/sec |25 |50 |100 |125 |150 |200 |250 |250 |500 |750 |900 | I/O bandwidth marked with an asterisk (\*) is limited by the VM type that you selected. Otherwise, the selected storage size limits the I/O bandwidth. |
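A minimal sketch of provisioning a server on one of the newly listed v5/ads_v5 SKUs with the CLI; names, region, storage size, and engine version are placeholders, and you should confirm the SKU is offered in your region.

```bash
# Sketch only: names, region, storage size, and version are placeholders; confirm SKU availability in your region.
az postgres flexible-server create \
  --resource-group my-rg \
  --name my-pg-flex \
  --location eastus \
  --tier MemoryOptimized \
  --sku-name Standard_E20ads_v5 \
  --storage-size 256 \
  --version 15
```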
postgresql | Release Notes | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/postgresql/flexible-server/release-notes.md | Last updated 9/20/2023 This page provides the latest news and updates regarding feature additions, engine versions support, extensions, and any other announcements relevant to Flexible Server - PostgreSQL -## Release: September 2023 -+## Release: October 2023 +* Support for [minor versions](./concepts-supported-versions.md) 15.4, 14.9, 13.12, 12.16, 11.21 <sup>$</sup> +* General availability of [Grafana Monitoring Dashboard](https://grafana.com/grafana/dashboards/19556-azure-azure-postgresql-flexible-server-monitoring/) for Azure Database for PostgreSQL - Flexible Server. +* Public preview of Server Logs Download for Azure Database for PostgreSQL - Flexible Server. ## Release: August 2023 * Support for [minor versions](./concepts-supported-versions.md) 15.3, 14.8, 13.11, 12.15, 11.20 <sup>$</sup> |
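To confirm which minor version a server picked up after maintenance, a quick sketch using `psql` (the server, database, and user names are placeholders).

```bash
# Sketch only: server, database, and user names are placeholders.
psql "host=my-pg-flex.postgres.database.azure.com port=5432 dbname=postgres user=myadmin sslmode=require" \
  -c "SELECT version();"
```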
postgresql | Concepts Version Policy | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/postgresql/single-server/concepts-version-policy.md | Title: Versioning policy - Azure Database for PostgreSQL - Single Server and Flexible Server (Preview) -description: Describes the policy around Postgres major and minor versions in Azure Database for PostgreSQL - Single Server. + Title: Versioning policy - Azure Database for PostgreSQL - Single Server and Flexible Server +description: Describes the policy around Postgres major and minor versions in Azure Database for PostgreSQL - Single Server and Flexible Server. |
postgresql | Whats Happening To Postgresql Single Server | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/postgresql/single-server/whats-happening-to-postgresql-single-server.md | Learn how to migrate from Azure Database for PostgreSQL - Single Server to Azure **Q. Can I still create a new version 11 Azure Database for PostgreSQL - Single Server after the community EOL date in November 2023?** -**A.** Beginning November 9 2023, you'll no longer be able to create new single server instances for PostgreSQL version 11 through the Azure portal. However, you can still [make them via CLI until November 2024](https://azure.microsoft.com/updates/singlepg11-retirement/). We will continue to support single servers through our [versioning support policy.](/azure/postgresql/single-server/concepts-version-policy) It would be best to start migrating to Azure Database for PostgreSQL - Flexible Server immediately. +**A.** Beginning November 30 2023, you'll no longer be able to create new single server instances for PostgreSQL version 11 through the Azure portal. However, you can still [make them via CLI until November 2024](https://azure.microsoft.com/updates/singlepg11-retirement/). We will continue to support single servers through our [versioning support policy.](/azure/postgresql/single-server/concepts-version-policy) It would be best to start migrating to Azure Database for PostgreSQL - Flexible Server immediately. **Q. Can I continue running my Azure Database for PostgreSQL - Single Server beyond the sunset date of March 28, 2025?** |
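For completeness, a hedged sketch of the CLI route mentioned above for creating a version 11 single server while that option remains available; names, region, SKU, and credentials are placeholders.

```bash
# Sketch only: names, region, SKU, and credentials are placeholders; CLI creation remains available until November 2024.
az postgres server create \
  --resource-group my-rg \
  --name my-pg-single \
  --location eastus \
  --admin-user myadmin \
  --admin-password '<secure-password>' \
  --sku-name GP_Gen5_2 \
  --version 11
```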
private-5g-core | Configure Sim Policy Azure Portal | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/private-5g-core/configure-sim-policy-azure-portal.md | -*SIM policies* allow you to define different sets of policies and interoperability settings that can each be assigned to a group of SIMs. The SIM policy also defines the default Quality of Service settings for any services that use the policy. You'll need to assign a SIM policy to a SIM before the user equipment (UE) using that SIM can access the private mobile network. In this how-to-guide, you'll learn how to configure a SIM policy. +*SIM policies* allow you to define different sets of policies and interoperability settings that can each be assigned to a group of SIMs. The SIM policy also defines the default Quality of Service settings for any services that policy uses. You'll need to assign a SIM policy to a SIM before the user equipment (UE) using that SIM can access the private mobile network. In this how-to-guide, you'll learn how to configure a SIM policy. ## Prerequisites -SIM policies also define the default QoS settings for any services that use the policy. You can override the default SIM policy QoS settings on a per-service basis - see [Configure basic settings for the service](configure-service-azure-portal.md#configure-basic-settings-for-the-service). +SIM policies define the default QoS settings for services that use the policy. ++ > [!IMPORTANT] + > With the exception of **Maximum Bit Rate**, QoS settings are overridden by [service](configure-service-azure-portal.md#configure-basic-settings-for-the-service) or [data flow policy](configure-service-azure-portal.md#configure-data-flow-policy-rules-and-data-flow-templates) QoS settings. In addition, data flow policy rules override service policy rules. + 1. Select **Add**. :::image type="content" source="media/configure-sim-policy-azure-portal/add-a-network-scope.png" alt-text="Screenshot of the Azure portal. It shows the Add a network scope screen. The Add button is highlighted."::: |
sap | Tutorial | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/sap/automation/tutorial.md | For this example configuration, the resource group is `MGMT-NOEU-DEP00-INFRASTRU ```yaml bom_base_name: S4HANA_2021_FP01_v0001ms- kv_name: <Deployer KeyVault Name> + deployer_kv_name: <Deployer KeyVault Name> ``` For this example configuration, the resource group is `MGMT-NOEU-DEP00-INFRASTRU ```yaml bom_base_name: S4HANA_2021_FP01_v0001ms- kv_name: <Deployer KeyVault Name> + deployer_kv_name: <Deployer KeyVault Name> BOM_directory: ${HOME}/Azure_SAP_Automated_Deployment/samples/SAP ``` For this example configuration, the resource group is `MGMT-NOEU-DEP00-INFRASTRU --subscription "${ARM_SUBSCRIPTION_ID}" \ --spn_id "${ARM_CLIENT_ID}" \ --spn_secret "${ARM_CLIENT_SECRET}" \- --tenant_id "${ARM_TENANT_ID}" + --tenant_id "${ARM_TENANT_ID}" ``` ## Prepare the workload zone deployment |
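The configuration and command snippets above reference several `ARM_*` environment variables alongside the deployer key vault setting. As a minimal sketch (all values are placeholders), these can be exported in the shell before the deployment command shown above is run:

```bash
# Placeholder values only: export the service principal details referenced by
# the deployment command above before invoking it.
export ARM_SUBSCRIPTION_ID="<subscription-id>"
export ARM_CLIENT_ID="<service-principal-app-id>"
export ARM_CLIENT_SECRET="<service-principal-secret>"
export ARM_TENANT_ID="<tenant-id>"
```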
sap | High Availability Guide Rhel | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/sap/workloads/high-availability-guide-rhel.md | Title: Azure VMs high availability for SAP NW on RHEL | Microsoft Docs -description: Azure Virtual Machines high availability for SAP NetWeaver on Red Hat Enterprise Linux + Title: Azure Virtual Machines HA for SAP NW on RHEL | Microsoft Docs +description: This article describes Azure Virtual Machines high availability for SAP NetWeaver on Red Hat Enterprise Linux (RHEL). tags: azure-resource-manager-This article describes how to deploy the virtual machines, configure the virtual machines, install the cluster framework, and install a highly available SAP NetWeaver 7.50 system. -In the example configurations, installation commands etc. ASCS instance number 00, ERS instance number 02, and SAP System ID NW1 is used. The names of the resources (for example virtual machines, virtual networks) in the example assume that you have used the [ASCS/SCS template][template-multisid-xscs] with Resource Prefix NW1 to create the resources. +This article describes how to deploy virtual machines (VMs), configure the VMs, install the cluster framework, and install a highly available SAP NetWeaver 7.50 system. -Read the following SAP Notes and papers first +In the example configurations and installation commands, ASCS instance number 00, ERS instance number 02, and SAP System ID NW1 are used. The names of the resources (for example, VMs and virtual networks) in the example assume that you used the [ASCS/SCS template][template-multisid-xscs] with Resource Prefix NW1 to create the resources. ++## Prerequisites ++Read the following SAP Notes and papers first: * SAP Note [1928533], which has:- * List of Azure VM sizes that are supported for the deployment of SAP software - * Important capacity information for Azure VM sizes - * Supported SAP software, and operating system (OS) and database combinations - * Required SAP kernel version for Windows and Linux on Microsoft Azure + * A list of Azure VM sizes that are supported for the deployment of SAP software. + * Important capacity information for Azure VM sizes. + * Supported SAP software and operating system (OS) and database combinations. + * Required SAP kernel version for Windows and Linux on Microsoft Azure. * SAP Note [2015553] lists prerequisites for SAP-supported SAP software deployments in Azure.-* SAP Note [2002167] has recommended OS settings for Red Hat Enterprise Linux -* SAP Note [2009879] has SAP HANA Guidelines for Red Hat Enterprise Linux +* SAP Note [2002167] has recommended OS settings for Red Hat Enterprise Linux (RHEL). +* SAP Note [2009879] has SAP HANA Guidelines for Red Hat Enterprise Linux. * SAP Note [2178632] has detailed information about all monitoring metrics reported for SAP in Azure. * SAP Note [2191498] has the required SAP Host Agent version for Linux in Azure. * SAP Note [2243692] has information about SAP licensing on Linux in Azure.-* SAP Note [1999351] has additional troubleshooting information for the Azure Enhanced Monitoring Extension for SAP. +* SAP Note [1999351] has more troubleshooting information for the Azure Enhanced Monitoring Extension for SAP. * [SAP Community WIKI](https://wiki.scn.sap.com/wiki/display/HOME/SAPonLinuxNotes) has all required SAP Notes for Linux. 
* [Azure Virtual Machines planning and implementation for SAP on Linux][planning-guide] * [Azure Virtual Machines deployment for SAP on Linux][deployment-guide] * [Azure Virtual Machines DBMS deployment for SAP on Linux][dbms-guide] * [Product Documentation for Red Hat Gluster Storage](https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5)-* [SAP Netweaver in pacemaker cluster](https://access.redhat.com/articles/3150081) -* General RHEL documentation +* [SAP NetWeaver in Pacemaker cluster](https://access.redhat.com/articles/3150081) +* General RHEL documentation: * [High Availability Add-On Overview](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_overview/index) * [High Availability Add-On Administration](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/index) * [High Availability Add-On Reference](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/index)- * [Configuring ASCS/ERS for SAP Netweaver with standalone resources in RHEL 7.5](https://access.redhat.com/articles/3569681) + * [Configuring ASCS/ERS for SAP NetWeaver with Standalone Resources in RHEL 7.5](https://access.redhat.com/articles/3569681) * [Configure SAP S/4HANA ASCS/ERS with Standalone Enqueue Server 2 (ENSA2) in Pacemaker on RHEL ](https://access.redhat.com/articles/3974941)-* Azure specific RHEL documentation: +* Azure-specific RHEL documentation: * [Support Policies for RHEL High Availability Clusters - Microsoft Azure Virtual Machines as Cluster Members](https://access.redhat.com/articles/3131341) * [Installing and Configuring a Red Hat Enterprise Linux 7.4 (and later) High-Availability Cluster on Microsoft Azure](https://access.redhat.com/articles/3252491) ## Overview -To achieve high availability, SAP NetWeaver requires shared storage. GlusterFS is configured in a separate cluster and can be used by multiple SAP systems. +To achieve high availability, SAP NetWeaver requires shared storage. GlusterFS is configured in a separate cluster and multiple SAP systems can use it. -![SAP NetWeaver High Availability overview](./media/high-availability-guide-rhel/ha-rhel.png) +![Diagram that shows an overview of SAP NetWeaver high availability.](./media/high-availability-guide-rhel/ha-rhel.png) -SAP NetWeaver ASCS, SAP NetWeaver SCS, SAP NetWeaver ERS, and the SAP HANA database use virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. We recommend using [Standard load balancer](../../load-balancer/quickstart-load-balancer-standard-public-portal.md). The presented configuration shows a load balancer with: +SAP NetWeaver ASCS, SAP NetWeaver SCS, SAP NetWeaver ERS, and the SAP HANA database use virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. We recommend using [Standard Azure Load Balancer](../../load-balancer/quickstart-load-balancer-standard-public-portal.md). The configuration here shows a load balancer with: -* Frontend IP address 10.0.0.7 for ASCS -* Frontend IP address 10.0.0.8 for ERS +* Front-end IP address 10.0.0.7 for ASCS +* Front-end IP address 10.0.0.8 for ERS * Probe port 62000 for ASCS * Probe port 62101 for ERS -## Setting up GlusterFS +## Set up GlusterFS -SAP NetWeaver requires shared storage for the transport and profile directory. 
Read [GlusterFS on Azure VMs on Red Hat Enterprise Linux for SAP NetWeaver][glusterfs-ha] on how to set up GlusterFS for SAP NetWeaver. +SAP NetWeaver requires shared storage for the transport and profile directory. To see how to set up GlusterFS for SAP NetWeaver, see [GlusterFS on Azure VMs on Red Hat Enterprise Linux for SAP NetWeaver][glusterfs-ha]. -## Setting up (A)SCS +## Set up (A)SCS In this example, the resources were deployed manually via the [Azure portal](https://portal.azure.com/#home). -### Deploy Linux manually via Azure portal +### Deploy Linux manually via the Azure portal -This document assumes that you've already deployed a resource group, [Azure Virtual Network](../../virtual-network/virtual-networks-overview.md), and subnet. +This document assumes that you already deployed a resource group, an [Azure virtual network](../../virtual-network/virtual-networks-overview.md), and a subnet. -Deploy virtual machines for SAP ASCS, ERS, and application server instances. Choose a suitable RHEL image that is supported with your SAP system. You can deploy VM in any one of the availability options - scale set, availability zone or availability set. +Deploy VMs for SAP ASCS, ERS, and application server instances. Choose a suitable RHEL image that's supported with your SAP system. You can deploy VM in any one of the availability options: scale set, availability zone, or availability set. -### Deploy Azure Load Balancer via Azure portal +### Deploy Azure Load Balancer via the Azure portal -After you deploy the VMs for your SAP system, create a load balancer. Use VMs created for SAP ASCS/ERS instances in the backend pool. +After you deploy the VMs for your SAP system, create a load balancer. Use VMs created for SAP ASCS/ERS instances in the back-end pool. -1. Create load balancer (internal, standard): - 1. Create the frontend IP addresses - 1. IP address 10.0.0.7 for the ASCS - 1. Open the load balancer, select frontend IP pool, and click Add - 2. Enter the name of the new frontend IP pool (for example **nw1-ascs-frontend**) - 3. Set the Assignment to Static and enter the IP address (for example **10.0.0.7**) - 4. Click OK - 2. IP address 10.0.0.8 for the ASCS ERS - * Repeat the steps above to create an IP address for the ERS (for example **10.0.0.8** and **nw1-aers-frontend**) - 2. Create a single back-end pool: +1. Create a load balancer (internal, standard). + 1. Create the front-end IP addresses. + 1. IP address 10.0.0.7 for the ASCS: + 1. Open the load balancer, select the front-end IP pool, and select **Add**. + 1. Enter the name of the new front-end IP pool (for example, **nw1-ascs-frontend**). + 1. Set **Assignment** to **Static** and enter the IP address (for example, **10.0.0.7**). + 1. Select **OK**. + 1. IP address 10.0.0.8 for the ASCS ERS: + * Repeat the preceding steps to create an IP address for the ERS (for example, **10.0.0.8** and **nw1-aers-frontend**). + 1. Create a single back-end pool: 1. Open the load balancer, select **Backend pools**, and then select **Add**.- 2. Enter the name of the new back-end pool (for example, **nw1-backend**). - 3. Select **NIC** for Backend Pool Configuration. - 4. Select **Add a virtual machine**. - 5. Select the virtual machines of the ASCS cluster. - 6. Select **Add**. - 7. Select **Save**. - 3. Create the health probes - 1. Port 620**00** for ASCS - 1. Open the load balancer, select health probes, and click Add - 2. Enter the name of the new health probe (for example **nw1-ascs-hp**) - 3. 
Select TCP as protocol, port 620**00**, keep Interval 5 - 4. Click OK - 2. Port 621**02** for ASCS ERS - * Repeat the steps above to create a health probe for the ERS (for example 621**02** and **nw1-aers-hp**) - 4. Load-balancing rules - 1. Load-balancing rules for ASCS - 1. Open the load balancer, select load-balancing rules and click Add - 2. Enter the name of the new load balancer rule (for example **nw1-lb-ascs**) - 3. Select the frontend IP address, backend pool, and health probe you created earlier (for example **nw1-ascs-frontend**, **nw1-backend** and **nw1-ascs-hp**) - 4. Increase idle timeout to 30 minutes - 5. Select **HA ports** - 6. **Make sure to enable Floating IP** - 7. Click OK - * Repeat the steps above to create load balancing rules for ERS (for example **nw1-lb-ers**) + 1. Enter the name of the new back-end pool (for example, **nw1-backend**). + 1. Select **NIC** for **Backend Pool Configuration**. + 1. Select **Add a virtual machine**. + 1. Select the VMs of the ASCS cluster. + 1. Select **Add**. + 1. Select **Save**. + 1. Create the health probes. + 1. Port 620**00** for ASCS: + 1. Open the load balancer, select health probes, and select **Add**. + 1. Enter the name of the new health probe (for example, **nw1-ascs-hp**). + 1. Select **TCP** as the protocol, port 620**00**, and keep **Interval 5**. + 1. Select **OK**. + 1. Port 621**02** for ASCS ERS: + * Repeat the preceding steps to create a health probe for the ERS (for example, 621**02** and **nw1-aers-hp**). + 1. Create load-balancing rules for ASCS: + 1. Open the load balancer, select load-balancing rules, and select **Add**. + 1. Enter the name of the new load balancer rule (for example, **nw1-lb-ascs**). + 1. Select the front-end IP address, back-end pool, and health probe you created earlier (for example, **nw1-ascs-frontend**, **nw1-backend**, and **nw1-ascs-hp**). + 1. Increase the idle timeout to **30 minutes**. + 1. Select **HA ports**. + 1. Make sure to enable **Floating IP**. + 1. Select **OK**. + * Repeat the preceding steps to create load-balancing rules for ERS (for example, **nw1-lb-ers**). > [!IMPORTANT]-> Floating IP is not supported on a NIC secondary IP configuration in load-balancing scenarios. For details see [Azure Load balancer Limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need additional IP address for the VM, deploy a second NIC. +> Floating IP isn't supported on a NIC secondary IP configuration in load-balancing scenarios. For more information, see [Azure Load Balancer limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need another IP address for the VM, deploy a second NIC. > [!NOTE]-> When VMs without public IP addresses are placed in the backend pool of internal (no public IP address) Standard Azure load balancer, there will be no outbound internet connectivity, unless additional configuration is performed to allow routing to public end points. For details on how to achieve outbound connectivity see [Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). +> When VMs without public IP addresses are placed in the back-end pool of an internal (no public IP address) Standard Azure load balancer, there's no outbound internet connectivity unless more configuration is performed to allow routing to public endpoints. 
For more information on how to achieve outbound connectivity, see [Public endpoint connectivity for VMs using Azure Standard Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). > [!IMPORTANT]-> Do not enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps will cause the health probes to fail. Set parameter **net.ipv4.tcp_timestamps** to **0**. For details see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). +> Don't enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps causes the health probes to fail. Set the parameter **net.ipv4.tcp_timestamps** to **0**. For more information, see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). -### Create Pacemaker cluster +### Create a Pacemaker cluster -Follow the steps in [Setting up Pacemaker on Red Hat Enterprise Linux in Azure](high-availability-guide-rhel-pacemaker.md) to create a basic Pacemaker cluster for this (A)SCS server. +Follow the steps in [Set up Pacemaker on Red Hat Enterprise Linux in Azure](high-availability-guide-rhel-pacemaker.md) to create a basic Pacemaker cluster for this (A)SCS server. ### Prepare for the SAP NetWeaver installation -The following items are prefixed with either **[A]** - applicable to all nodes, **[1]** - only applicable to node 1 or **[2]** - only applicable to node 2. +The following items are prefixed with: ++- **[A]**: Applicable to all nodes +- **[1]**: Only applicable to node 1 +- **[2]**: Only applicable to node 2 -1. **[A]** Setup host name resolution +1. **[A]** Set up hostname resolution. - You can either use a DNS server or modify the /etc/hosts on all nodes. This example shows how to use the /etc/hosts file. - Replace the IP address and the hostname in the following commands + You can either use a DNS server or modify the `/etc/hosts` file on all nodes. This example shows how to use the `/etc/hosts` file. Replace the IP address and the hostname in the following commands: ```bash sudo vi /etc/hosts ``` - Insert the following lines to /etc/hosts. Change the IP address and hostname to match your environment + Insert the following lines to the `/etc/hosts` file. Change the IP address and hostname to match your environment. ```text # IP addresses of the GlusterFS nodes The following items are prefixed with either **[A]** - applicable to all nodes, 10.0.0.8 nw1-aers ``` -1. **[A]** Create the shared directories +1. **[A]** Create the shared directories. ```bash sudo mkdir -p /sapmnt/NW1 The following items are prefixed with either **[A]** - applicable to all nodes, sudo chattr +i /usr/sap/NW1/ERS02 ``` -1. **[A]** Install GlusterFS client and other requirements +1. **[A]** Install the GlusterFS client and other required packages. ```bash sudo yum -y install glusterfs-fuse resource-agents resource-agents-sap ``` -1. **[A]** Check version of resource-agents-sap +1. **[A]** Check the version of `resource-agents-sap`. - Make sure that the version of the installed resource-agents-sap package is at least 3.9.5-124.el7 + Make sure that the version of the installed `resource-agents-sap` package is at least 3.9.5-124.el7. ```bash sudo yum info resource-agents-sap The following items are prefixed with either **[A]** - applicable to all nodes, # : environment. ``` -1. **[A]** Add mount entries +1. **[A]** Add mount entries. 
```bash sudo vi /etc/fstab The following items are prefixed with either **[A]** - applicable to all nodes, glust-0:/NW1-sys /usr/sap/NW1/SYS glusterfs backup-volfile-servers=glust-1:glust-2 0 0 ``` - Mount the new shares + Mount the new shares. ```bash sudo mount -a ``` -1. **[A]** Configure SWAP file +1. **[A]** Configure the SWAP file. ```bash sudo vi /etc/waagent.conf The following items are prefixed with either **[A]** - applicable to all nodes, ResourceDisk.SwapSizeMB=2000 ``` - Restart the Agent to activate the change + Restart the agent to activate the change. ```bash sudo service waagent restart ``` -1. **[A]** RHEL configuration +1. **[A]** Configure RHEL. - Based on RHEL version, perform the configuration mentioned in SAP Note [2002167](https://launchpad.support.sap.com/#/notes/2002167), [2772999](https://launchpad.support.sap.com/#/notes/2772999), or [3108316](https://launchpad.support.sap.com/#/notes/2772999). + Based on the RHEL version, perform the configuration mentioned in SAP Note [2002167](https://launchpad.support.sap.com/#/notes/2002167), SAP Note [2772999](https://launchpad.support.sap.com/#/notes/2772999), or SAP Note [3108316](https://launchpad.support.sap.com/#/notes/2772999). -### Installing SAP NetWeaver ASCS/ERS +### Install SAP NetWeaver ASCS/ERS -1. **[1]** Configure cluster default properties +1. **[1]** Configure the cluster default properties. ```bash pcs resource defaults resource-stickiness=1 pcs resource defaults migration-threshold=3 ``` -1. **[1]** Create a virtual IP resource and health-probe for the ASCS instance +1. **[1]** Create a virtual IP resource and health probe for the ASCS instance. ```bash sudo pcs node standby nw1-cl-1 The following items are prefixed with either **[A]** - applicable to all nodes, --group g-NW1_ASCS ``` - Make sure that the cluster status is ok and that all resources are started. It is not important on which node the resources are running. + Make sure that the cluster status is okay and that all resources are started. Which node the resources are running on isn't important. ```bash sudo pcs status The following items are prefixed with either **[A]** - applicable to all nodes, # vip_NW1_ASCS (ocf::heartbeat:IPaddr2): Started nw1-cl-0 ``` -1. **[1]** Install SAP NetWeaver ASCS +1. **[1]** Install SAP NetWeaver ASCS. - Install SAP NetWeaver ASCS as root on the first node using a virtual hostname that maps to the IP address of the load balancer frontend configuration for the ASCS, for example **nw1-ascs**, **10.0.0.7** and the instance number that you used for the probe of the load balancer, for example **00**. + Install SAP NetWeaver ASCS as the root on the first node by using a virtual hostname that maps to the IP address of the load balancer front-end configuration for the ASCS, for example, **nw1-ascs** and **10.0.0.7**, and the instance number that you used for the probe of the load balancer, for example, **00**. - You can use the sapinst parameter SAPINST_REMOTE_ACCESS_USER to allow a non-root user to connect to sapinst. + You can use the `sapinst` parameter `SAPINST_REMOTE_ACCESS_USER` to allow a nonroot user to connect to `sapinst`. ```bash # Allow access to SWPM. This rule is not permanent. If you reboot the machine, you have to run the command again. The following items are prefixed with either **[A]** - applicable to all nodes, sudo chgrp sapsys /usr/sap/NW1/ASCS00 ``` -1. **[1]** Create a virtual IP resource and health-probe for the ERS instance +1. 
**[1]** Create a virtual IP resource and health probe for the ERS instance. ```bash sudo pcs node unstandby nw1-cl-1 The following items are prefixed with either **[A]** - applicable to all nodes, --group g-NW1_AERS ``` - Make sure that the cluster status is ok and that all resources are started. It is not important on which node the resources are running. + Make sure that the cluster status is okay and that all resources are started. Which node the resources are running on isn't important. ```bash sudo pcs status The following items are prefixed with either **[A]** - applicable to all nodes, # vip_NW1_AERS (ocf::heartbeat:IPaddr2): Started nw1-cl-1 ``` -1. **[2]** Install SAP NetWeaver ERS +1. **[2]** Install SAP NetWeaver ERS. - Install SAP NetWeaver ERS as root on the second node using a virtual hostname that maps to the IP address of the load balancer frontend configuration for the ERS, for example **nw1-aers**, **10.0.0.8** and the instance number that you used for the probe of the load balancer, for example **02**. + Install SAP NetWeaver ERS as the root on the second node by using a virtual hostname that maps to the IP address of the load balancer front-end configuration for the ERS, for example, **nw1-aers** and **10.0.0.8**, and the instance number that you used for the probe of the load balancer, for example, **02**. - You can use the sapinst parameter SAPINST_REMOTE_ACCESS_USER to allow a non-root user to connect to sapinst. + You can use the `sapinst` parameter `SAPINST_REMOTE_ACCESS_USER` to allow a nonroot user to connect to `sapinst`. ```bash # Allow access to SWPM. This rule is not permanent. If you reboot the machine, you have to run the command again. The following items are prefixed with either **[A]** - applicable to all nodes, sudo chgrp sapsys /usr/sap/NW1/ERS02 ``` -1. **[1]** Adapt the ASCS/SCS and ERS instance profiles +1. **[1]** Adapt the ASCS/SCS and ERS instance profiles. - * ASCS/SCS profile + * ASCS/SCS profile: ```bash sudo vi /sapmnt/NW1/profile/NW1_ASCS00_nw1-ascs The following items are prefixed with either **[A]** - applicable to all nodes, enque/encni/set_so_keepalive = true ``` - For both ENSA1 and ENSA2, make sure that the `keepalive` OS parameters are set as described in SAP note [1410736](https://launchpad.support.sap.com/#/notes/1410736). + For both ENSA1 and ENSA2, make sure that the `keepalive` OS parameters are set as described in SAP Note [1410736](https://launchpad.support.sap.com/#/notes/1410736). - * ERS profile + * ERS profile: ```bash sudo vi /sapmnt/NW1/profile/NW1_ERS02_nw1-aers The following items are prefixed with either **[A]** - applicable to all nodes, # Autostart = 1 ``` -1. **[A]** Configure Keep Alive +1. **[A]** Configure Keep Alive. - The communication between the SAP NetWeaver application server and the ASCS/SCS is routed through a software load balancer. The load balancer disconnects inactive connections after a configurable timeout. To prevent this, you need to set a parameter in the SAP NetWeaver ASCS/SCS profile, if using ENSA1, and change the Linux system `keepalive` settings on all SAP servers for both ENSA1/ENSA2. Read [SAP Note 1410736][1410736] for more information. + The communication between the SAP NetWeaver application server and the ASCS/SCS is routed through a software load balancer. The load balancer disconnects inactive connections after a configurable timeout. To prevent this action, set a parameter in the SAP NetWeaver ASCS/SCS profile, if you're using ENSA1. 
Change the Linux system `keepalive` settings on all SAP servers for both ENSA1 and ENSA2. For more information, see SAP Note [1410736][1410736]. ```bash # Change the Linux system configuration sudo sysctl net.ipv4.tcp_keepalive_time=300 ``` -1. **[A]** Update the /usr/sap/sapservices file +1. **[A]** Update the `/usr/sap/sapservices` file. - To prevent the start of the instances by the sapinit startup script, all instances managed by Pacemaker must be commented out from /usr/sap/sapservices file. + To prevent the start of the instances by the `sapinit` startup script, all instances managed by Pacemaker must be commented out from the `/usr/sap/sapservices` file. ```bash sudo vi /usr/sap/sapservices The following items are prefixed with either **[A]** - applicable to all nodes, 1. **[1]** Create the SAP cluster resources. - If using enqueue server 1 architecture (ENSA1), define the resources as follows: + If you use enqueue server 1 architecture (ENSA1), define the resources as shown here: ```bash sudo pcs property set maintenance-mode=true The following items are prefixed with either **[A]** - applicable to all nodes, sudo pcs property set maintenance-mode=false ``` - SAP introduced support for enqueue server 2, including replication, as of SAP NW 7.52. Starting with ABAP Platform 1809, enqueue server 2 is installed by default. See SAP note [2630416](https://launchpad.support.sap.com/#/notes/2630416) for enqueue server 2 support. If using enqueue server 2 architecture ([ENSA2](https://help.sap.com/viewer/cff8531bc1d9416d91bb6781e628d4e0/1709%20001/en-US/6d655c383abf4c129b0e5c8683e7ecd8.html)), install resource agent resource-agents-sap-4.1.1-12.el7.x86_64 or newer and define the resources as follows: + SAP introduced support for enqueue server 2, including replication, as of SAP NW 7.52. Starting with ABAP Platform 1809, enqueue server 2 is installed by default. See SAP Note [2630416](https://launchpad.support.sap.com/#/notes/2630416) for enqueue server 2 support. If you use enqueue server 2 architecture ([ENSA2](https://help.sap.com/viewer/cff8531bc1d9416d91bb6781e628d4e0/1709%20001/en-US/6d655c383abf4c129b0e5c8683e7ecd8.html)), install resource agent resource-agents-sap-4.1.1-12.el7.x86_64 or newer and define the resources as shown here: ```bash sudo pcs property set maintenance-mode=true The following items are prefixed with either **[A]** - applicable to all nodes, sudo pcs node unstandby nw1-cl-0 sudo pcs property set maintenance-mode=false ```+ + > [!NOTE] + > If you're upgrading from an older version and switching to enqueue server 2, see SAP Note [2641322](https://launchpad.support.sap.com/#/notes/2641322). - If you are upgrading from an older version and switching to enqueue server 2, see SAP note [2641322](https://launchpad.support.sap.com/#/notes/2641322). -- > [!NOTE] - > The timeouts in the above configuration are just examples and may need to be adapted to the specific SAP setup. + > [!NOTE] + > The timeouts in the preceding configuration are only examples and might need to be adapted to the specific SAP setup. - Make sure that the cluster status is ok and that all resources are started. It is not important on which node the resources are running. + Make sure that the cluster status is okay and that all resources are started. Which node the resources are running on isn't important. ```bash sudo pcs status The following items are prefixed with either **[A]** - applicable to all nodes, # rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-0 ``` -1. 
**[A]** Add firewall rules for ASCS and ERS on both nodes +1. **[A]** Add firewall rules for ASCS and ERS on both nodes. ```bash # Probe Port of ASCS The following items are prefixed with either **[A]** - applicable to all nodes, ## SAP NetWeaver application server preparation -Some databases require that the database instance installation is executed on an application server. Prepare the application server virtual machines to be able to use them in these cases. +Some databases require that the database instance installation runs on an application server. Prepare the application server VMs to be able to use them in these cases. -The steps bellow assume that you install the application server on a server different from the ASCS/SCS and HANA servers. Otherwise some of the steps below (like configuring host name resolution) are not needed. +The following steps assume that you install the application server on a server different from the ASCS/SCS and HANA servers. Otherwise, some of the steps (like configuring hostname resolution) aren't needed. -1. Setup host name resolution +1. Set up hostname resolution. - You can either use a DNS server or modify the /etc/hosts on all nodes. This example shows how to use the /etc/hosts file. - Replace the IP address and the hostname in the following commands + You can either use a DNS server or modify the `/etc/hosts` file on all nodes. This example shows how to use the `/etc/hosts` file. Replace the IP address and the hostname in the following commands: ```bash sudo vi /etc/hosts ``` - Insert the following lines to /etc/hosts. Change the IP address and hostname to match your environment + Insert the following lines to `/etc/hosts`. Change the IP address and hostname to match your environment. ```bash # IP addresses of the GlusterFS nodes The steps bellow assume that you install the application server on a server diff 10.0.0.13 nw1-db ``` -1. Create the sapmnt directory +1. Create the `sapmnt` directory. ```bash sudo mkdir -p /sapmnt/NW1 The steps bellow assume that you install the application server on a server diff sudo chattr +i /usr/sap/trans ``` -1. Install GlusterFS client and other requirements +1. Install the GlusterFS client and other requirements. ```bash sudo yum -y install glusterfs-fuse uuidd ``` -1. Add mount entries +1. Add mount entries. ```bash sudo vi /etc/fstab The steps bellow assume that you install the application server on a server diff glust-0:/NW1-trans /usr/sap/trans glusterfs backup-volfile-servers=glust-1:glust-2 0 0 ``` - Mount the new shares + Mount the new shares. ```bash sudo mount -a ``` -1. Configure SWAP file +1. Configure the SWAP file. ```bash sudo vi /etc/waagent.conf The steps bellow assume that you install the application server on a server diff ResourceDisk.SwapSizeMB=2000 ``` - Restart the Agent to activate the change + Restart the agent to activate the change. ```bash sudo service waagent restart ``` -## Install database +## Install the database -In this example, SAP NetWeaver is installed on SAP HANA. You can use every supported database for this installation. For more information on how to install SAP HANA in Azure, see [High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux][sap-hana-ha]. For a list of supported databases, see [SAP Note 1928533][1928533]. +In this example, SAP NetWeaver is installed on SAP HANA. You can use every supported database for this installation. 
For more information on how to install SAP HANA in Azure, see [High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux][sap-hana-ha]. For a list of supported databases, see SAP Note [1928533][1928533]. -1. Run the SAP database instance installation +1. Run the SAP database instance installation. - Install the SAP NetWeaver database instance as root using a virtual hostname that maps to the IP address of the load balancer frontend configuration for the database for example **nw1-db** and **10.0.0.13**. + Install the SAP NetWeaver database instance as the root by using a virtual hostname that maps to the IP address of the load balancer front-end configuration for the database, for example, **nw1-db** and **10.0.0.13**. - You can use the sapinst parameter SAPINST_REMOTE_ACCESS_USER to allow a non-root user to connect to sapinst. + You can use the `sapinst` parameter `SAPINST_REMOTE_ACCESS_USER` to allow a nonroot user to connect to `sapinst`. ```bash sudo <swpm>/sapinst SAPINST_REMOTE_ACCESS_USER=sapadmin In this example, SAP NetWeaver is installed on SAP HANA. You can use every suppo Follow these steps to install an SAP application server. -1. Prepare application server +1. Prepare the application server. - Follow the steps in the chapter [SAP NetWeaver application server preparation](#sap-netweaver-application-server-preparation) above to prepare the application server. + Follow the steps in the previous section [SAP NetWeaver application server preparation](#sap-netweaver-application-server-preparation) to prepare the application server. -1. Install SAP NetWeaver application server +1. Install the SAP NetWeaver application server. Install a primary or additional SAP NetWeaver applications server. - You can use the sapinst parameter SAPINST_REMOTE_ACCESS_USER to allow a non-root user to connect to sapinst. + You can use the `sapinst` parameter `SAPINST_REMOTE_ACCESS_USER` to allow a nonroot user to connect to `sapinst`. ```bash sudo <swpm>/sapinst SAPINST_REMOTE_ACCESS_USER=sapadmin ``` -1. Update SAP HANA secure store +1. Update the SAP HANA secure store. Update the SAP HANA secure store to point to the virtual name of the SAP HANA System Replication setup. - Run the following command to list the entries as \<sapsid>adm + Run the following command to list the entries as \<sapsid>adm: ```bash hdbuserstore List ``` - This should list all entries and should look similar to + All entries should be listed and look similar to: ```text DATA FILE : /home/nw1adm/.hdb/nw1-di-0/SSFS_HDB.DAT Follow these steps to install an SAP application server. DATABASE: NW1 ``` - The output shows that the IP address of the default entry is pointing to the virtual machine and not to the load balancer's IP address. This entry needs to be changed to point to the virtual hostname of the load balancer. Make sure to use the same port (**30313** in the output above) and database name (**HN1** in the output above)! + The output shows that the IP address of the default entry is pointing to the VM and not to the load balancer's IP address. This entry needs to be changed to point to the virtual hostname of the load balancer. Make sure to use the same port (**30313** in the preceding output) and database name (**HN1** in the preceding output). ```bash su - nw1adm Follow these steps to install an SAP application server. ## Test the cluster setup -1. Manually migrate the ASCS instance +1. Manually migrate the ASCS instance. 
Resource state before starting the test: Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-0 ``` -2. Simulate node crash +1. Simulate a node crash. Resource state before starting the test: Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-0 ``` - Run the following command as root on the node where the ASCS instance is running + Run the following command as root on the node where the ASCS instance is running. ```bash [root@nw1-cl-1 ~]# echo b > /proc/sysrq-trigger ``` - The status after the node is started again should look like this. + The status after the node is started again should look like: ```text Online: [ nw1-cl-0 nw1-cl-1 ] Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-1 ``` -3. Blocking network communication +1. Block network communication. Resource state before starting the test: Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-1 ``` - Execute firewall rule to block the communication on one of the nodes. + Run a firewall rule to block the communication on one of the nodes. ```bash # Execute iptable rule on nw1-cl-0 (10.0.0.7) to block the incoming and outgoing traffic to nw1-cl-1 (10.0.0.8) iptables -A INPUT -s 10.0.0.8 -j DROP; iptables -A OUTPUT -d 10.0.0.8 -j DROP ``` - When cluster nodes can't communicate to each other, there's a risk of a split-brain scenario. In such situations, cluster nodes will try to simultaneously fence each other, resulting in fence race. To avoid such situation, it's recommended to set [priority-fencing-delay](https://access.redhat.com/solutions/5110521) property in cluster configuration (applicable only for [pacemaker-2.0.4-6.el8](https://access.redhat.com/errata/RHEA-2020:4804) or higher). + When cluster nodes can't communicate with each other, there's a risk of a split-brain scenario. In such situations, cluster nodes try to simultaneously fence each other, which results in a fence race. To avoid this situation, we recommend that you set a [priority-fencing-delay](https://access.redhat.com/solutions/5110521) property in a cluster configuration (applicable only for [pacemaker-2.0.4-6.el8](https://access.redhat.com/errata/RHEA-2020:4804) or higher). - By enabling priority-fencing-delay property, the cluster introduces an additional delay in the fencing action specifically on the node hosting ASCS resource, allowing the node to win the fence race. + By enabling the `priority-fencing-delay` property, the cluster introduces a delay in the fencing action, specifically on the node hosting ASCS resource, allowing the node to win the fence race. - Execute below command to delete the firewall rule. + Run the following command to delete the firewall rule. ```bash # If the iptables rule set on the server gets reset after a reboot, the rules will be cleared out. In case they have not been reset, please proceed to remove the iptables rule using the following command. iptables -D INPUT -s 10.0.0.8 -j DROP; iptables -D OUTPUT -d 10.0.0.8 -j DROP ``` -4. Kill message server process +1. Kill the message server process. Resource state before starting the test: Follow these steps to install an SAP application server. [root@nw1-cl-0 ~]# pgrep -f ms.sapNW1 | xargs kill -9 ``` - If you only kill the message server once, it will be restarted by `sapstart`. 
If you kill it often enough, Pacemaker will eventually move the ASCS instance to the other node. Run the following commands as root to clean up the resource state of the ASCS and ERS instance after the test. + If you kill the message server only once, `sapstart` restarts it. If you kill it often enough, Pacemaker eventually moves the ASCS instance to the other node. Run the following commands as root to clean up the resource state of the ASCS and ERS instance after the test. ```bash [root@nw1-cl-0 ~]# pcs resource cleanup rsc_sap_NW1_ASCS00 Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-0 ``` -5. Kill enqueue server process +1. Kill the enqueue server process. Resource state before starting the test: Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-1 ``` -6. Kill enqueue replication server process +1. Kill the enqueue replication server process. Resource state before starting the test: Follow these steps to install an SAP application server. [root@nw1-cl-1 ~]# pgrep -f enqr.sapNW1 | xargs kill -9 ``` - If you only run the command once, `sapstart` will restart the process. If you run it often enough, `sapstart` will not restart the process and the resource will be in a stopped state. Run the following commands as root to clean up the resource state of the ERS instance after the test. + If you run the command only once, `sapstart` restarts the process. If you run it often enough, `sapstart` won't restart the process and the resource is in a stopped state. Run the following commands as root to clean up the resource state of the ERS instance after the test. ```bash [root@nw1-cl-0 ~]# pcs resource cleanup rsc_sap_NW1_ERS02 Follow these steps to install an SAP application server. rsc_sap_NW1_ERS02 (ocf::heartbeat:SAPInstance): Started nw1-cl-1 ``` -7. Kill enqueue sapstartsrv process +1. Kill the enqueue `sapstartsrv` process. Resource state before starting the test: Follow these steps to install an SAP application server. [root@nw1-cl-0 ~]# kill -9 59545 ``` - The sapstartsrv process should always be restarted by the Pacemaker resource agent as part of the monitoring. Resource state after the test: + The `sapstartsrv` process should always be restarted by the Pacemaker resource agent as part of the monitoring. Resource state after the test: ```text rsc_st_azure (stonith:fence_azure_arm): Started nw1-cl-0 Follow these steps to install an SAP application server. ## Next steps -* To deploy cost optimization scenario where PAS and AAS instance is deployed with SAP NetWeaver HA cluster on RHEL, see [Install SAP Dialog Instance with SAP ASCS/SCS high availability VMs on RHEL](high-availability-guide-rhel-with-dialog-instance.md) -* [HA for SAP NW on Azure VMs on RHEL for SAP applications multi-SID guide](./high-availability-guide-rhel-multi-sid.md) -* [Azure Virtual Machines planning and implementation for SAP][planning-guide] -* [Azure Virtual Machines deployment for SAP][deployment-guide] -* [Azure Virtual Machines DBMS deployment for SAP][dbms-guide] -* To learn how to establish high availability and plan for disaster recovery of SAP HANA on Azure (large instances), see [SAP HANA (large instances) high availability and disaster recovery on Azure](../../virtual-machines/workloads/sap/hana-overview-high-availability-disaster-recovery.md). 
-* To learn how to establish high availability and plan for disaster recovery of SAP HANA on Azure VMs, see [High Availability of SAP HANA on Azure Virtual Machines (VMs)][sap-hana-ha] +* To deploy a cost-optimization scenario where the PAS and AAS instance is deployed with SAP NetWeaver HA cluster on RHEL, see [Install SAP dialog instance with SAP ASCS/SCS high availability VMs on RHEL](high-availability-guide-rhel-with-dialog-instance.md). +* See [HA for SAP NW on Azure VMs on RHEL for SAP applications multi-SID guide](./high-availability-guide-rhel-multi-sid.md). +* See [Azure Virtual Machines planning and implementation for SAP][planning-guide]. +* See [Azure Virtual Machines deployment for SAP][deployment-guide]. +* See [Azure Virtual Machines DBMS deployment for SAP][dbms-guide]. +* To learn how to establish HA and plan for disaster recovery of SAP HANA on Azure (large instances), see [SAP HANA (large instances) high availability and disaster recovery on Azure](../../virtual-machines/workloads/sap/hana-overview-high-availability-disaster-recovery.md). +* To learn how to establish HA and plan for disaster recovery of SAP HANA on Azure VMs, see [High availability of SAP HANA on Azure Virtual Machines][sap-hana-ha]. |
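The RHEL high-availability guide above sets `net.ipv4.tcp_timestamps` to 0 (required for VMs behind Azure Load Balancer) and `net.ipv4.tcp_keepalive_time` to 300 for ENSA1/ENSA2. A minimal sketch for persisting both settings across reboots follows; the drop-in file name is an example, and the values mirror the article's recommendations.

```bash
# Minimal sketch: persist the kernel parameters called out in the guide above.
# The drop-in file name is an example; adjust to your environment's conventions.
sudo tee /etc/sysctl.d/91-sap-ha.conf > /dev/null <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_keepalive_time = 300
EOF
sudo sysctl --system   # apply the settings without waiting for a reboot
```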
sap | Sap Hana High Availability Netapp Files Red Hat | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/sap/workloads/sap-hana-high-availability-netapp-files-red-hat.md | Title: High availability of SAP HANA Scale-up with ANF on RHEL | Microsoft Docs -description: Establish high availability of SAP HANA with ANF on Azure virtual machines (VMs). + Title: High availability of SAP HANA scale-up with Azure NetApp Files on RHEL | Microsoft Docs +description: Establish high availability of SAP HANA with Azure NetApp Files on Azure Virtual Machines. Last updated 07/11/2023 -# High availability of SAP HANA Scale-up with Azure NetApp Files on Red Hat Enterprise Linux +# High availability of SAP HANA scale-up with Azure NetApp Files on RHEL [dbms-guide]:dbms-guide-general.md [deployment-guide]:deployment-guide.md [planning-guide]:planning-guide.md -This article describes how to configure SAP HANA System Replication in Scale-up deployment, when the HANA file systems are mounted via NFS, using Azure NetApp Files (ANF). In the example configurations and installation commands, instance number **03**, and HANA System ID **HN1** are used. SAP HANA Replication consists of one primary node and at least one secondary node. +This article describes how to configure SAP HANA System Replication in scale-up deployment, when the HANA file systems are mounted via NFS, by using Azure NetApp Files. In the example configurations and installation commands, instance number **03** and HANA System ID **HN1** are used. SAP HANA System Replication consists of one primary node and at least one secondary node. When steps in this document are marked with the following prefixes, the meaning is as follows: When steps in this document are marked with the following prefixes, the meaning - **[1]**: The step applies to node1 only - **[2]**: The step applies to node2 only +## Prerequisites + Read the following SAP Notes and papers first: - SAP Note [1928533](https://launchpad.support.sap.com/#/notes/1928533), which has:- - The list of Azure VM sizes that are supported for the deployment of SAP software. + - The list of Azure virtual machine (VM) sizes that are supported for the deployment of SAP software. - Important capacity information for Azure VM sizes.- - The supported SAP software, and operating system (OS) and database combinations. + - The supported SAP software and operating system (OS) and database combinations. - The required SAP kernel version for Windows and Linux on Microsoft Azure. - SAP Note [2015553](https://launchpad.support.sap.com/#/notes/2015553) lists prerequisites for SAP-supported SAP software deployments in Azure.-- SAP Note [405827](https://launchpad.support.sap.com/#/notes/405827) lists out recommended file system for HANA environment.+- SAP Note [405827](https://launchpad.support.sap.com/#/notes/405827) lists recommended file systems for HANA environments. - SAP Note [2002167](https://launchpad.support.sap.com/#/notes/2002167) has recommended OS settings for Red Hat Enterprise Linux. - SAP Note [2009879](https://launchpad.support.sap.com/#/notes/2009879) has SAP HANA Guidelines for Red Hat Enterprise Linux. - SAP Note [3108302](https://launchpad.support.sap.com/#/notes/3108302) has SAP HANA Guidelines for Red Hat Enterprise Linux 9.x. - SAP Note [2178632](https://launchpad.support.sap.com/#/notes/2178632) has detailed information about all monitoring metrics reported for SAP in Azure. 
- SAP Note [2191498](https://launchpad.support.sap.com/#/notes/2191498) has the required SAP Host Agent version for Linux in Azure. - SAP Note [2243692](https://launchpad.support.sap.com/#/notes/2243692) has information about SAP licensing on Linux in Azure.-- SAP Note [1999351](https://launchpad.support.sap.com/#/notes/1999351) has additional troubleshooting information for the Azure Enhanced Monitoring Extension for SAP.+- SAP Note [1999351](https://launchpad.support.sap.com/#/notes/1999351) has more troubleshooting information for the Azure Enhanced Monitoring Extension for SAP. - [SAP Community Wiki](https://wiki.scn.sap.com/wiki/display/HOME/SAPonLinuxNotes) has all required SAP Notes for Linux. - [Azure Virtual Machines planning and implementation for SAP on Linux][planning-guide] - [Azure Virtual Machines deployment for SAP on Linux][deployment-guide] - [Azure Virtual Machines DBMS deployment for SAP on Linux][dbms-guide]-- [SAP HANA system replication in pacemaker cluster.](https://access.redhat.com/articles/3004101)-- General RHEL documentation+- [SAP HANA system replication in Pacemaker cluster](https://access.redhat.com/articles/3004101) +- General Red Hat Enterprise Linux (RHEL) documentation: - [High Availability Add-On Overview](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_overview/index)- - [High Availability Add-On Administration.](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/index) - - [High Availability Add-On Reference.](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/index) - - [Configure SAP HANA System Replication in Scale-Up in a Pacemaker cluster when the HANA filesystems are on NFS shares](https://access.redhat.com/solutions/5156571) + - [High Availability Add-On Administration](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/index) + - [High Availability Add-On Reference](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/index) + - [Configure SAP HANA System Replication in Scale-Up in a Pacemaker cluster when the HANA file systems are on NFS shares](https://access.redhat.com/solutions/5156571) - Azure-specific RHEL documentation:- - [Support Policies for RHEL High Availability Clusters - Microsoft Azure Virtual Machines as Cluster Members.](https://access.redhat.com/articles/3131341) - - [Installing and Configuring a Red Hat Enterprise Linux 7.4 (and later) High-Availability Cluster on Microsoft Azure.](https://access.redhat.com/articles/3252491) - - [Configure SAP HANA scale-up system replication up Pacemaker cluster when the HANA file systems are on NFS shares](https://access.redhat.com/solutions/5156571) + - [Support Policies for RHEL High Availability Clusters - Microsoft Azure Virtual Machines as Cluster Members](https://access.redhat.com/articles/3131341) + - [Installing and Configuring a Red Hat Enterprise Linux 7.4 (and later) High-Availability Cluster on Microsoft Azure](https://access.redhat.com/articles/3252491) + - [Configure SAP HANA scale-up system replication in a Pacemaker cluster when the HANA file systems are on NFS shares](https://access.redhat.com/solutions/5156571) - [NetApp SAP Applications on Microsoft Azure using Azure NetApp Files](https://www.netapp.com/us/media/tr-4746.pdf) - [NFS v4.1 volumes on Azure NetApp Files 
for SAP HANA](./hana-vm-operations-netapp.md) ## Overview -Traditionally in scale-up environment all file systems for SAP HANA are mounted from local storage. Setting up High Availability of SAP HANA System Replication on Red Hat Enterprise Linux is published in guide [Set up SAP HANA System Replication on RHEL](./sap-hana-high-availability-rhel.md) +Traditionally in a scale-up environment, all file systems for SAP HANA are mounted from local storage. Setting up high availability (HA) of SAP HANA System Replication on Red Hat Enterprise Linux is published in [Set up SAP HANA System Replication on RHEL](./sap-hana-high-availability-rhel.md). -In order to achieve SAP HANA High Availability of scale-up system on [Azure NetApp Files](../../azure-netapp-files/index.yml) NFS shares, we need some additional resource configuration in the cluster, in order for HANA resources to recover, when one node loses access to the NFS shares on ANF. The cluster manages the NFS mounts, allowing it to monitor the health of the resources. The dependencies between the file system mounts and the SAP HANA resources are enforced. +To achieve SAP HANA HA of a scale-up system on [Azure NetApp Files](../../azure-netapp-files/index.yml) NFS shares, we need some more resource configuration in the cluster, in order for HANA resources to recover, when one node loses access to the NFS shares on Azure NetApp Files. The cluster manages the NFS mounts, allowing it to monitor the health of the resources. The dependencies between the file system mounts and the SAP HANA resources are enforced. -![SAP HANA HA Scale-up on ANF](./media/sap-hana-high-availability-rhel/sap-hana-scale-up-netapp-files-red-hat.png) +![Diagram that shows SAP HANA HA scale-up on Azure NetApp Files](./media/sap-hana-high-availability-rhel/sap-hana-scale-up-netapp-files-red-hat.png). -SAP HANA filesystems are mounted on NFS shares using Azure NetApp Files on each node. File systems /hana/data, /hana/log, and /hana/shared are unique to each node. +SAP HANA file systems are mounted on NFS shares by using Azure NetApp Files on each node. File systems `/hana/data`, `/hana/log`, and `/hana/shared` are unique to each node. -Mounted on node1 (**hanadb1**) +Mounted on node1 (**hanadb1**): - 10.32.2.4:/**hanadb1**-data-mnt00001 on /hana/data - 10.32.2.4:/**hanadb1**-log-mnt00001 on /hana/log - 10.32.2.4:/**hanadb1**-shared-mnt00001 on /hana/shared -Mounted on node2 (**hanadb2**) +Mounted on node2 (**hanadb2**): - 10.32.2.4:/**hanadb2**-data-mnt00001 on /hana/data - 10.32.2.4:/**hanadb2**-log-mnt00001 on /hana/log - 10.32.2.4:/**hanadb2**-shared-mnt00001 on /hana/shared > [!NOTE]-> File systems /hana/shared, /hana/data and /hana/log are not shared between the two nodes. Each cluster node has its own, separate file systems. +> File systems `/hana/shared`, `/hana/data`, and `/hana/log` aren't shared between the two nodes. Each cluster node has its own separate file systems. -The SAP HANA System Replication configuration uses a dedicated virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. The presented configuration shows a load balancer with: +The SAP HANA System Replication configuration uses a dedicated virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. 
The configuration shown here has a load balancer with: - Front-end IP address: 10.32.0.10 for hn1-db-- Probe Port: 62503+- Probe port: 62503 -## Set up the Azure NetApp File infrastructure +## Set up the Azure NetApp Files infrastructure -Before you proceed with the set up for Azure NetApp Files infrastructure, familiarize yourself with the Azure [NetApp Files documentation](../../azure-netapp-files/index.yml). +Before you proceed with the setup for Azure NetApp Files infrastructure, familiarize yourself with the Azure [NetApp Files documentation](../../azure-netapp-files/index.yml). Azure NetApp Files is available in several [Azure regions](https://azure.microsoft.com/global-infrastructure/services/?products=netapp). Check to see whether your selected Azure region offers Azure NetApp Files. -For information about the availability of Azure NetApp Files by Azure region, see [Azure NetApp Files Availability by Azure Region](https://azure.microsoft.com/global-infrastructure/services/?products=netapp®ions=all). +For information about the availability of Azure NetApp Files by Azure region, see [Azure NetApp Files availability by Azure region](https://azure.microsoft.com/global-infrastructure/services/?products=netapp®ions=all). ### Important considerations -As you're creating your Azure NetApp Files volumes for SAP HANA Scale-up systems, be aware of the important considerations documented in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#important-considerations). +As you're creating your Azure NetApp Files volumes for SAP HANA scale-up systems, be aware of the important considerations documented in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#important-considerations). ### Sizing of HANA database on Azure NetApp Files The throughput of an Azure NetApp Files volume is a function of the volume size and service level, as documented in [Service level for Azure NetApp Files](../../azure-netapp-files/azure-netapp-files-service-levels.md). -While designing the infrastructure for SAP HANA on Azure with Azure NetApp Files, be aware of the recommendations in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#sizing-for-hana-database-on-azure-netapp-files). +While you're designing the infrastructure for SAP HANA on Azure with Azure NetApp Files, be aware of the recommendations in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#sizing-for-hana-database-on-azure-netapp-files). -The configuration in this article is presented with simple Azure NetApp Files Volumes. +The configuration in this article is presented with simple Azure NetApp Files volumes. > [!IMPORTANT]-> For production systems, where performance is a key, we recommend to evaluate and consider using [Azure NetApp Files application volume group for SAP HANA](hana-vm-operations-netapp.md#deployment-through-azure-netapp-files-application-volume-group-for-sap-hana-avg). +> For production systems, where performance is a key, we recommend that you evaluate and consider using [Azure NetApp Files application volume group for SAP HANA](hana-vm-operations-netapp.md#deployment-through-azure-netapp-files-application-volume-group-for-sap-hana-avg). ### Deploy Azure NetApp Files resources -The following instructions assume that you've already deployed your [Azure virtual network](../../virtual-network/virtual-networks-overview.md). 
The Azure NetApp Files resources and VMs, where the Azure NetApp Files resources will be mounted, must be deployed in the same Azure virtual network or in peered Azure virtual networks. +The following instructions assume that you already deployed your [Azure virtual network](../../virtual-network/virtual-networks-overview.md). The Azure NetApp Files resources and VMs, where the Azure NetApp Files resources will be mounted, must be deployed in the same Azure virtual network or in peered Azure virtual networks. 1. Create a NetApp account in your selected Azure region by following the instructions in [Create a NetApp account](../../azure-netapp-files/azure-netapp-files-create-netapp-account.md). -2. Set up Azure NetApp Files capacity pool by following the instructions in [Set up an Azure NetApp Files capacity pool](../../azure-netapp-files/azure-netapp-files-set-up-capacity-pool.md). +1. Set up an Azure NetApp Files capacity pool by following the instructions in [Set up an Azure NetApp Files capacity pool](../../azure-netapp-files/azure-netapp-files-set-up-capacity-pool.md). - The HANA architecture presented in this article uses a single Azure NetApp Files capacity pool at the *Ultra* Service level. For HANA workloads on Azure, we recommend using an Azure NetApp Files *Ultra* or *Premium* [service Level](../../azure-netapp-files/azure-netapp-files-service-levels.md). + The HANA architecture shown in this article uses a single Azure NetApp Files capacity pool at the *Ultra* service level. For HANA workloads on Azure, we recommend using an Azure NetApp Files *Ultra* or *Premium* [service Level](../../azure-netapp-files/azure-netapp-files-service-levels.md). -3. Delegate a subnet to Azure NetApp Files, as described in the instructions in [Delegate a subnet to Azure NetApp Files](../../azure-netapp-files/azure-netapp-files-delegate-subnet.md). +1. Delegate a subnet to Azure NetApp Files, as described in the instructions in [Delegate a subnet to Azure NetApp Files](../../azure-netapp-files/azure-netapp-files-delegate-subnet.md). -4. Deploy Azure NetApp Files volumes by following the instructions in [Create an NFS volume for Azure NetApp Files](../../azure-netapp-files/azure-netapp-files-create-volumes.md). +1. Deploy Azure NetApp Files volumes by following the instructions in [Create an NFS volume for Azure NetApp Files](../../azure-netapp-files/azure-netapp-files-create-volumes.md). As you're deploying the volumes, be sure to select the NFSv4.1 version. Deploy the volumes in the designated Azure NetApp Files subnet. The IP addresses of the Azure NetApp volumes are assigned automatically. - Keep in mind that the Azure NetApp Files resources and the Azure VMs must be in the same Azure virtual network or in peered Azure virtual networks. For example, hanadb1-data-mnt00001, hanadb1-log-mnt00001, and so on, are the volume names and nfs://10.32.2.4/hanadb1-data-mnt00001, nfs://10.32.2.4/hanadb1-log-mnt00001, and so on, are the file paths for the Azure NetApp Files volumes. + Keep in mind that the Azure NetApp Files resources and the Azure VMs must be in the same Azure virtual network or in peered Azure virtual networks. For example, `hanadb1-data-mnt00001` and `hanadb1-log-mnt00001` are the volume names and `nfs://10.32.2.4/hanadb1-data-mnt00001` and `nfs://10.32.2.4/hanadb1-log-mnt00001` are the file paths for the Azure NetApp Files volumes. 
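   If you script the volume deployment instead of using the portal, each node-specific volume listed next can also be created with the Azure CLI. The following is only a minimal sketch: the resource group, NetApp account, capacity pool, virtual network, delegated subnet, and quota size are placeholders for your own environment, not a sizing recommendation.

   ```bash
   # Placeholder names - substitute your own resource group, account, pool, vnet, and delegated subnet.
   az netappfiles volume create \
     --resource-group my-rg \
     --account-name my-netapp-account \
     --pool-name my-capacity-pool \
     --name hanadb1-data-mnt00001 \
     --file-path hanadb1-data-mnt00001 \
     --vnet my-vnet \
     --subnet my-anf-delegated-subnet \
     --service-level Ultra \
     --protocol-types NFSv4.1 \
     --usage-threshold 512
   ```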
- On **hanadb1** + On **hanadb1**: - Volume hanadb1-data-mnt00001 (nfs://10.32.2.4:/hanadb1-data-mnt00001) - Volume hanadb1-log-mnt00001 (nfs://10.32.2.4:/hanadb1-log-mnt00001) - Volume hanadb1-shared-mnt00001 (nfs://10.32.2.4:/hanadb1-shared-mnt00001) - On **hanadb2** + On **hanadb2**: - Volume hanadb2-data-mnt00001 (nfs://10.32.2.4:/hanadb2-data-mnt00001) - Volume hanadb2-log-mnt00001 (nfs://10.32.2.4:/hanadb2-log-mnt00001) - Volume hanadb2-shared-mnt00001 (nfs://10.32.2.4:/hanadb2-shared-mnt00001) > [!NOTE]-> All commands to mount /hana/shared in this article are presented for NFSv4.1 /hana/shared volumes. -> If you deployed the /hana/shared volumes as NFSv3 volumes, don't forget to adjust the mount commands for /hana/shared for NFSv3. +> All commands to mount `/hana/shared` in this article are presented for NFSv4.1 `/hana/shared` volumes. +> If you deployed the `/hana/shared` volumes as NFSv3 volumes, don't forget to adjust the mount commands for `/hana/shared` for NFSv3. -## Deploy Linux virtual machine via Azure portal +## Deploy Linux virtual machine via the Azure portal -This document assumes that you've already deployed a resource group, [Azure Virtual Network](../../virtual-network/virtual-networks-overview.md), and subnet. +This document assumes that you already deployed a resource group, [Azure Virtual Network](../../virtual-network/virtual-networks-overview.md), and a subnet. -Deploy virtual machines for SAP HANA. Choose a suitable RHEL image that is supported for HANA system. You can deploy VM in any one of the availability options - scale set, availability zone or availability set. +Deploy VMs for SAP HANA. Choose a suitable RHEL image that's supported for a HANA system. You can deploy a VM in any one of the availability options: scale set, availability zone, or availability set. > [!IMPORTANT] > > Make sure that the OS you select is SAP certified for SAP HANA on the specific VM types that you plan to use in your deployment. You can look up SAP HANA-certified VM types and their OS releases in [SAP HANA Certified IaaS Platforms](https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/#/solutions?filters=v:deCertified;ve:24;iaas;v:125;v:105;v:99;v:120). Make sure that you look at the details of the VM type to get the complete list of SAP HANA-supported OS releases for the specific VM type. -During VM configuration, we won't be adding any disk as all our mount points are on NFS shares from Azure NetApp Files. Also, you have an option to create or select exiting load balancer in networking section. If you're creating a new load balancer, follow below steps - +During VM configuration, we won't add any disk because all our mount points are on NFS shares from Azure NetApp Files. Also, you can create or select an existing load balancer in the networking section. If you're creating a new load balancer, follow these steps: -1. To set up standard load balancer, follow these configuration steps: +1. To set up a Standard load balancer, follow these configuration steps: 1. First, create a front-end IP pool: 1. Open the load balancer, select **frontend IP pool**, and select **Add**.- 2. Enter the name of the new front-end IP pool (for example, **hana-frontend**). - 3. Set the **Assignment** to **Static** and enter the IP address (for example, **10.32.0.10**). - 4. Select **OK**. - 5. After the new front-end IP pool is created, note the pool IP address. - 2. Create a single back-end pool: + 1. Enter the name of the new front-end IP pool (for example, **hana-frontend**). + 1. 
Set **Assignment** to **Static** and enter the IP address (for example, **10.32.0.10**). + 1. Select **OK**. + 1. After the new front-end IP pool is created, note the pool IP address. + 1. Create a single back-end pool: 1. Open the load balancer, select **Backend pools**, and then select **Add**.- 2. Enter the name of the new back-end pool (for example, **hana-backend**). - 3. Select **NIC** for Backend Pool Configuration. - 4. Select **Add a virtual machine**. - 5. Select the virtual machines of the HANA cluster. - 6. Select **Add**. - 7. Select **Save**. - 3. Next, create a health probe: + 1. Enter the name of the new back-end pool (for example, **hana-backend**). + 1. Select **NIC** for **Backend Pool Configuration**. + 1. Select **Add a virtual machine**. + 1. Select the VMs of the HANA cluster. + 1. Select **Add**. + 1. Select **Save**. + 1. Next, create a health probe: 1. Open the load balancer, select **health probes**, and select **Add**.- 2. Enter the name of the new health probe (for example, **hana-hp**). - 3. Select TCP as the protocol and port 625**03**. Keep the **Interval** value set to 5. - 4. Select **OK**. - 4. Next, create the load-balancing rules: + 1. Enter the name of the new health probe (for example, **hana-hp**). + 1. Select **TCP** as the protocol and port 625**03**. Keep the **Interval** value set to **5**. + 1. Select **OK**. + 1. Next, create load-balancing rules: 1. Open the load balancer, select **load balancing rules**, and select **Add**.- 2. Enter the name of the new load balancer rule (for example, **hana-lb**). - 3. Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-frontend**, **hana-backend** and **hana-hp**). - 1. Increase idle timeout to 30 minutes - 4. Select **HA Ports**. - 5. Make sure to **enable Floating IP**. - 6. Select **OK**. + 1. Enter the name of the new load balancer rule (for example, **hana-lb**). + 1. Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-frontend**, **hana-backend**, and **hana-hp**). + 1. Increase the idle timeout to **30 minutes**. + 1. Select **HA Ports**. + 1. Make sure to enable **Floating IP**. + 1. Select **OK**. For more information about the required ports for SAP HANA, read the chapter [Connections to Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6/latest/en-US/7a9343c9f2a2436faa3cfdb5ca00c052.html) in the [SAP HANA Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6) guide or SAP Note [2388694](https://launchpad.support.sap.com/#/notes/2388694). > [!IMPORTANT]-> Floating IP is not supported on a NIC secondary IP configuration in load-balancing scenarios. For details see [Azure Load balancer Limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need additional IP address for the VM, deploy a second NIC. +> Floating IP isn't supported on a NIC secondary IP configuration in load-balancing scenarios. For more information, see [Azure Load Balancer limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need another IP address for the VM, deploy a second NIC. > [!NOTE]-> When VMs without public IP addresses are placed in the backend pool of internal (no public IP address) Standard Azure load balancer, there will be no outbound internet connectivity, unless additional configuration is performed to allow routing to public end points. 
For details on how to achieve outbound connectivity see [Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). +> When VMs without public IP addresses are placed in the back-end pool of an internal (no public IP address) instance of Standard Azure Load Balancer, there's no outbound internet connectivity, unless more configuration is performed to allow routing to public endpoints. For more information on how to achieve outbound connectivity, see [Public endpoint connectivity for virtual machines using Standard Azure Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). > [!IMPORTANT]-> Do not enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps will cause the health probes to fail. Set parameter **net.ipv4.tcp_timestamps** to **0**. For details see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). See also SAP note [2382421](https://launchpad.support.sap.com/#/notes/2382421). +> Don't enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps could cause the health probes to fail. Set the parameter **net.ipv4.tcp_timestamps** to **0**. For more information, see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). See also SAP Note [2382421](https://launchpad.support.sap.com/#/notes/2382421). ## Mount the Azure NetApp Files volume For more information about the required ports for SAP HANA, read the chapter [Co sudo mkdir -p /hana/shared ``` -2. **[A]** Verify the NFS domain setting. Make sure that the domain is configured as the default Azure NetApp Files domain, i.e. **defaultv4iddomain.com** and the mapping is set to **nobody**. +1. **[A]** Verify the NFS domain setting. Make sure that the domain is configured as the default Azure NetApp Files domain, that is, **defaultv4iddomain.com**, and the mapping is set to **nobody**. ```bash sudo cat /etc/idmapd.conf ``` - Example output + Example output: ```output [General] For more information about the required ports for SAP HANA, read the chapter [Co ``` > [!IMPORTANT]- > Make sure to set the NFS domain in /etc/idmapd.conf on the VM to match the default domain configuration on Azure NetApp Files: **defaultv4iddomain.com**. If there's a mismatch between the domain configuration on the NFS client (i.e. the VM) and the NFS server, i.e. the Azure NetApp configuration, then the permissions for files on Azure NetApp volumes that are mounted on the VMs will be displayed as nobody. + > Make sure to set the NFS domain in `/etc/idmapd.conf` on the VM to match the default domain configuration on Azure NetApp Files: **defaultv4iddomain.com**. If there's a mismatch between the domain configuration on the NFS client (that is, the VM) and the NFS server (that is, the Azure NetApp Files configuration), then the permissions for files on Azure NetApp Files volumes that are mounted on the VMs display as `nobody`. -3. **[1]** Mount the node-specific volumes on node1 (**hanadb1**) +1. **[1]** Mount the node-specific volumes on node1 (**hanadb1**). 
```bash sudo mount -o rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys 10.32.2.4:/hanadb1-shared-mnt00001 /hana/shared For more information about the required ports for SAP HANA, read the chapter [Co sudo mount -o rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys 10.32.2.4:/hanadb1-data-mnt00001 /hana/data ``` -4. **[2]** Mount the node-specific volumes on node2 (**hanadb2**) +1. **[2]** Mount the node-specific volumes on node2 (**hanadb2**). ```bash sudo mount -o rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys 10.32.2.4:/hanadb2-shared-mnt00001 /hana/shared For more information about the required ports for SAP HANA, read the chapter [Co sudo mount -o rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys 10.32.2.4:/hanadb2-data-mnt00001 /hana/data ``` -5. **[A]** Verify that all HANA volumes are mounted with NFS protocol version NFSv4. +1. **[A]** Verify that all HANA volumes are mounted with NFS protocol version NFSv4. ```bash sudo nfsstat -m ``` - Verify that flag vers is set to 4.1 - Example from hanadb1 + Verify that the flag `vers` is set to **4.1**. + Example from hanadb1: ```output /hana/log from 10.32.2.4:/hanadb1-log-mnt00001 For more information about the required ports for SAP HANA, read the chapter [Co Flags: rw,noatime,vers=4.1,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.32.0.4,local_lock=none,addr=10.32.2.4 ``` -6. **[A]** Verify **nfs4_disable_idmapping**. It should be set to **Y**. To create the directory structure where **nfs4_disable_idmapping** is located, execute the mount command. You won't be able to manually create the directory under /sys/modules, because access is reserved for the kernel / drivers. +1. **[A]** Verify **nfs4_disable_idmapping**. It should be set to **Y**. To create the directory structure where **nfs4_disable_idmapping** is located, run the mount command. You can't manually create the directory under `/sys/modules` because access is reserved for the kernel and drivers. - Check nfs4_disable_idmapping + Check `nfs4_disable_idmapping`. ```bash sudo cat /sys/module/nfs/parameters/nfs4_disable_idmapping ``` - If you need to set nfs4_disable_idmapping to + If you need to set `nfs4_disable_idmapping` to: ```bash sudo echo "Y" > /sys/module/nfs/parameters/nfs4_disable_idmapping ``` - Make the configuration permanent + Make the configuration permanent. ```bash sudo echo "options nfs nfs4_disable_idmapping=Y" >> /etc/modprobe.d/nfs.conf ``` - ΓÇïFor more information on how to change nfs_disable_idmapping parameter, see [https://access.redhat.com/solutions/1749883](https://access.redhat.com/solutions/1749883). + ΓÇïFor more information on how to change the `nfs_disable_idmapping` parameter, see the [Red Hat Knowledge Base](https://access.redhat.com/solutions/1749883). ## SAP HANA installation -1. **[A]** Set up host name resolution for all hosts. +1. **[A]** Set up hostname resolution for all hosts. - You can either use a DNS server or modify the /etc/hosts file on all nodes. This example shows you how to use the /etc/hosts file. Replace the IP address and the hostname in the following commands: + You can either use a DNS server or modify the `/etc/hosts` file on all nodes. This example shows you how to use the `/etc/hosts` file. 
Replace the IP address and the hostname in the following commands: ```bash sudo vi /etc/hosts ``` - Insert the following lines in the /etc/hosts file. Change the IP address and hostname to match your environment + Insert the following lines in the `/etc/hosts` file. Change the IP address and hostname to match your environment. ```output 10.32.0.4 hanadb1 10.32.0.5 hanadb2 ``` -2. **[A]** Prepare the OS for running SAP HANA on Azure NetApp with NFS, as described in SAP note [3024346 - Linux Kernel Settings for NetApp NFS](https://launchpad.support.sap.com/#/notes/3024346). Create configuration file */etc/sysctl.d/91-NetApp-HANA.conf* for the NetApp configuration settings. +1. **[A]** Prepare the OS for running SAP HANA on Azure NetApp with NFS, as described in SAP Note [3024346 - Linux Kernel Settings for NetApp NFS](https://launchpad.support.sap.com/#/notes/3024346). Create configuration file `/etc/sysctl.d/91-NetApp-HANA.conf` for the NetApp configuration settings. ```bash sudo vi /etc/sysctl.d/91-NetApp-HANA.conf ``` - Add the following entries in the configuration file + Add the following entries in the configuration file. ```output net.core.rmem_max = 16777216 For more information about the required ports for SAP HANA, read the chapter [Co net.ipv4.tcp_sack = 1 ``` -3. **[A]** Create configuration file */etc/sysctl.d/ms-az.conf* with additional optimization settings. +1. **[A]** Create the configuration file `/etc/sysctl.d/ms-az.conf` with more optimization settings. ```bash sudo vi /etc/sysctl.d/ms-az.conf ``` - Add the following entries in the configuration file + Add the following entries in the configuration file. ```output net.ipv6.conf.all.disable_ipv6 = 1 For more information about the required ports for SAP HANA, read the chapter [Co ``` > [!TIP]- > Avoid setting net.ipv4.ip_local_port_range and net.ipv4.ip_local_reserved_ports explicitly in the sysctl configuration files to allow SAP Host Agent to manage the port ranges. For more information, see SAP note [2382421](https://launchpad.support.sap.com/#/notes/2382421). + > Avoid setting `net.ipv4.ip_local_port_range` and `net.ipv4.ip_local_reserved_ports` explicitly in the `sysctl` configuration files to allow the SAP Host Agent to manage the port ranges. For more information, see SAP Note [2382421](https://launchpad.support.sap.com/#/notes/2382421). -4. **[A]** Adjust the sunrpc settings, as recommended in SAP note [3024346 - Linux Kernel Settings for NetApp NFS](https://launchpad.support.sap.com/#/notes/3024346). +1. **[A]** Adjust the `sunrpc` settings, as recommended in SAP Note [3024346 - Linux Kernel Settings for NetApp NFS](https://launchpad.support.sap.com/#/notes/3024346). ```bash sudo vi /etc/modprobe.d/sunrpc.conf For more information about the required ports for SAP HANA, read the chapter [Co options sunrpc tcp_max_slot_table_entries=128 ``` -5. **[A]** RHEL for HANA Configuration +1. **[A]** Perform RHEL OS configuration for HANA. 
- Configure RHEL as described in below SAP Note based on your RHEL version + Configure the OS as described in the following SAP Notes based on your RHEL version: - [2292690 - SAP HANA DB: Recommended OS settings for RHEL 7](https://launchpad.support.sap.com/#/notes/2292690) - [2777782 - SAP HANA DB: Recommended OS Settings for RHEL 8](https://launchpad.support.sap.com/#/notes/2777782) For more information about the required ports for SAP HANA, read the chapter [Co - [2593824 - Linux: Running SAP applications compiled with GCC 7.x](https://launchpad.support.sap.com/#/notes/2593824) - [2886607 - Linux: Running SAP applications compiled with GCC 9.x](https://launchpad.support.sap.com/#/notes/2886607) -6. **[A]** Install the SAP HANA +1. **[A]** Install the SAP HANA. - Started with HANA 2.0 SPS 01, MDC is the default option. When you install HANA system, SYSTEMDB and a tenant with same SID will be created together. In some case you don't want the default tenant. In case, if you donΓÇÖt want to create initial tenant along with the installation you can follow SAP Note [2629711](https://launchpad.support.sap.com/#/notes/2629711) + Starting with HANA 2.0 SPS 01, MDC is the default option. When you install the HANA system, SYSTEMDB and a tenant with the same SID are created together. In some cases, you don't want the default tenant. If you don't want to create an initial tenant along with the installation, you can follow SAP Note [2629711](https://launchpad.support.sap.com/#/notes/2629711). Run the **hdblcm** program from the HANA DVD. Enter the following values at the prompt: - Choose installation: Enter **1** (for install) - Select additional components for installation: Enter **1**. - Enter Installation Path [/hana/shared]: press Enter to accept the default - Enter Local Host Name [..]: Press Enter to accept the default - Do you want to add additional hosts to the system? (y/n) [n]: **n** - Enter SAP HANA System ID: Enter **HN1**. - Enter Instance Number [00]: Enter **03** - Select Database Mode / Enter Index [1]: press Enter to accept the default - Select System Usage / Enter Index [4]: enter **4** (for custom) - Enter Location of Data Volumes [/hana/data]: press Enter to accept the default - Enter Location of Log Volumes [/hana/log]: press Enter to accept the default - Restrict maximum memory allocation? [n]: press Enter to accept the default - Enter Certificate Host Name For Host '...' [...]: press Enter to accept the default - Enter SAP Host Agent User (sapadm) Password: Enter the host agent user password - Confirm SAP Host Agent User (sapadm) Password: Enter the host agent user password again to confirm - Enter System Administrator (hn1adm) Password: Enter the system administrator password - Confirm System Administrator (hn1adm) Password: Enter the system administrator password again to confirm - Enter System Administrator Home Directory [/usr/sap/HN1/home]: press Enter to accept the default - Enter System Administrator Login Shell [/bin/sh]: press Enter to accept the default - Enter System Administrator User ID [1001]: press Enter to accept the default - Enter ID of User Group (sapsys) [79]: press Enter to accept the default - Enter Database User (SYSTEM) Password: Enter the database user password - Confirm Database User (SYSTEM) Password: Enter the database user password again to confirm - Restart system after machine reboot? [n]: press Enter to accept the default - Do you want to continue? (y/n): Validate the summary. Enter **y** to continue --4. **[A]** Upgrade SAP Host Agent + 1. 
Choose installation: Enter **1** (for install).
 + 1. Select more components for installation: Enter **1**.
 + 1. Enter **Installation Path** [/hana/shared]: Select Enter to accept the default.
 + 1. Enter **Local Host Name** [..]: Select Enter to accept the default.
 + **Do you want to add additional hosts to the system? (y/n)** [n]: **n**.
 + 1. Enter **SAP HANA System ID**: Enter **HN1**.
 + 1. Enter **Instance Number** [00]: Enter **03**.
 + 1. Select **Database Mode / Enter Index** [1]: Select Enter to accept the default.
 + 1. Select **System Usage / Enter Index** [4]: Enter **4** (for custom).
 + 1. Enter **Location of Data Volumes** [/hana/data]: Select Enter to accept the default.
 + 1. Enter **Location of Log Volumes** [/hana/log]: Select Enter to accept the default.
 + 1. **Restrict maximum memory allocation?** [n]: Select Enter to accept the default.
 + 1. Enter **Certificate Host Name For Host '...'** [...]: Select Enter to accept the default.
 + 1. Enter **SAP Host Agent User (sapadm) Password**: Enter the host agent user password.
 + 1. Confirm **SAP Host Agent User (sapadm) Password**: Enter the host agent user password again to confirm.
 + 1. Enter **System Administrator (hn1adm) Password**: Enter the system administrator password.
 + 1. Confirm **System Administrator (hn1adm) Password**: Enter the system administrator password again to confirm.
 + 1. Enter **System Administrator Home Directory** [/usr/sap/HN1/home]: Select Enter to accept the default.
 + 1. Enter **System Administrator Login Shell** [/bin/sh]: Select Enter to accept the default.
 + 1. Enter **System Administrator User ID** [1001]: Select Enter to accept the default.
 + 1. Enter **ID of User Group (sapsys)** [79]: Select Enter to accept the default.
 + 1. Enter **Database User (SYSTEM) Password**: Enter the database user password.
 + 1. Confirm **Database User (SYSTEM) Password**: Enter the database user password again to confirm.
 + 1. **Restart system after machine reboot?** [n]: Select Enter to accept the default.
 + 1. **Do you want to continue? (y/n)**: Validate the summary. Enter **y** to continue.
++1. **[A]** Upgrade the SAP Host Agent.

   Download the latest SAP Host Agent archive from the [SAP Software Center](https://launchpad.support.sap.com/#/softwarecenter) and run the following command to upgrade the agent. Replace the path to the archive to point to the file that you downloaded:
For more information about the required ports for SAP HANA, read the chapter [Co
   sudo /usr/sap/hostctrl/exe/saphostexec -upgrade -archive <path to SAP Host Agent SAR>
   ```
-5. **[A]** Configure firewall
+1. **[A]** Configure a firewall.

-   Create the firewall rule for the Azure load balancer probe port.
+   Create the firewall rule for the Azure Load Balancer probe port.

   ```bash
   sudo firewall-cmd --zone=public --add-port=62503/tcp
   sudo firewall-cmd --zone=public --add-port=62503/tcp --permanent
   ```

-## Configure SAP HANA system replication
+## Configure SAP HANA System Replication

-Follow the steps in Set up [SAP HANA System Replication](./sap-hana-high-availability-rhel.md#configure-sap-hana-20-system-replication) to configure SAP HANA System Replication.
+Follow the steps in [Set up SAP HANA System Replication](./sap-hana-high-availability-rhel.md#configure-sap-hana-20-system-replication) to configure SAP HANA System Replication.

 ## Cluster configuration

-This section describes necessary steps required for cluster to operate seamlessly when SAP HANA is installed on NFS shares using Azure NetApp Files.
+This section describes the steps required for a cluster to operate seamlessly when SAP HANA is installed on NFS shares by using Azure NetApp Files. ### Create a Pacemaker cluster -Follow the steps in [Setting up Pacemaker on Red Hat Enterprise Linux](./high-availability-guide-rhel-pacemaker.md) in Azure to create a basic Pacemaker cluster for this HANA server. +Follow the steps in [Set up Pacemaker on Red Hat Enterprise Linux](./high-availability-guide-rhel-pacemaker.md) in Azure to create a basic Pacemaker cluster for this HANA server. ### Implement the Python system replication hook SAPHanaSR -This is important step to optimize the integration with the cluster and improve the detection when a cluster failover is needed. It's highly recommended to configure the SAPHanaSR Python hook. Follow the steps mentioned in [Implement the Python system replication hook SAPHanaSR](sap-hana-high-availability-rhel.md#implement-the-python-system-replication-hook-saphanasr) +This step is an important one to optimize the integration with the cluster and improve the detection when a cluster failover is needed. We highly recommend that you configure the SAPHanaSR Python hook. Follow the steps in [Implement the Python system replication hook SAPHanaSR](sap-hana-high-availability-rhel.md#implement-the-python-system-replication-hook-saphanasr). -### Configure filesystem resources +### Configure file system resources -In this example each cluster node has its own HANA NFS filesystems /hana/shared, /hana/data, and /hana/log. +In this example, each cluster node has its own HANA NFS file systems `/hana/shared`, `/hana/data`, and `/hana/log`. 1. **[1]** Put the cluster in maintenance mode. In this example each cluster node has its own HANA NFS filesystems /hana/shared, sudo pcs property set maintenance-mode=true ``` -2. **[1]** Create the Filesystem resources for the **hanadb1** mounts. +1. **[1]** Create the file system resources for the **hanadb1** mounts. ```bash sudo pcs resource create hana_data1 ocf:heartbeat:Filesystem device=10.32.2.4:/hanadb1-data-mnt00001 directory=/hana/data fstype=nfs options=rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys op monitor interval=20s on-fail=fence timeout=120s OCF_CHECK_LEVEL=20 --group hanadb1_nfs In this example each cluster node has its own HANA NFS filesystems /hana/shared, sudo pcs resource create hana_shared1 ocf:heartbeat:Filesystem device=10.32.2.4:/hanadb1-shared-mnt00001 directory=/hana/shared fstype=nfs options=rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys op monitor interval=20s on-fail=fence timeout=120s OCF_CHECK_LEVEL=20 --group hanadb1_nfs ``` -3. **[2]** Create the Filesystem resources for the **hanadb2** mounts. +1. **[2]** Create the file system resources for the **hanadb2** mounts. 
```bash sudo pcs resource create hana_data2 ocf:heartbeat:Filesystem device=10.32.2.4:/hanadb2-data-mnt00001 directory=/hana/data fstype=nfs options=rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys op monitor interval=20s on-fail=fence timeout=120s OCF_CHECK_LEVEL=20 --group hanadb2_nfs In this example each cluster node has its own HANA NFS filesystems /hana/shared, sudo pcs resource create hana_shared2 ocf:heartbeat:Filesystem device=10.32.2.4:/hanadb2-shared-mnt00001 directory=/hana/shared fstype=nfs options=rw,nfsvers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys op monitor interval=20s on-fail=fence timeout=120s OCF_CHECK_LEVEL=20 --group hanadb2_nfs ``` - `OCF_CHECK_LEVEL=20` attribute is added to the monitor operation so that each monitor performs a read/write test on the filesystem. Without this attribute, the monitor operation only verifies that the filesystem is mounted. This can be a problem because when connectivity is lost, the filesystem may remain mounted despite being inaccessible. + The `OCF_CHECK_LEVEL=20` attribute is added to the monitor operation so that each monitor performs a read/write test on the file system. Without this attribute, the monitor operation only verifies that the file system is mounted. This can be a problem because when connectivity is lost, the file system might remain mounted despite being inaccessible. ++ The `on-fail=fence` attribute is also added to the monitor operation. With this option, if the monitor operation fails on a node, that node is immediately fenced. Without this option, the default behavior is to stop all resources that depend on the failed resource, restart the failed resource, and then start all the resources that depend on the failed resource. - `on-fail=fence` attribute is also added to the monitor operation. With this option, if the monitor operation fails on a node, that node is immediately fenced. Without this option, the default behavior is to stop all resources that depend on the failed resource, then restart the failed resource, then start all the resources that depend on the failed resource. Not only can this behavior take a long time when an SAPHana resource depends on the failed resource, but it also can fail altogether. The SAPHana resource cannot stop successfully if the NFS server holding the HANA executables is inaccessible. + Not only can this behavior take a long time when an SAPHana resource depends on the failed resource, but it also can fail altogether. The SAPHana resource can't stop successfully if the NFS server holding the HANA executables is inaccessible. - The suggested timeouts values allow the cluster resources to withstand protocol-specific pause, related to NFSv4.1 lease renewals. For more information, see [NFS in NetApp Best practice](https://www.netapp.com/media/10720-tr-4067.pdf). The timeouts in the above configuration may need to be adapted to the specific SAP setup. + The suggested timeout values allow the cluster resources to withstand protocol-specific pause, related to NFSv4.1 lease renewals. For more information, see [NFS in NetApp Best practice](https://www.netapp.com/media/10720-tr-4067.pdf). The timeouts in the preceding configuration might need to be adapted to the specific SAP setup. - For workloads that require higher throughput consider using the `nconnect` mount option, as described in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#nconnect-mount-option). 
Check if `nconnect` is [supported by Azure NetApp Files](../../azure-netapp-files/performance-linux-mount-options.md#nconnect) on your Linux release. + For workloads that require higher throughput, consider using the `nconnect` mount option, as described in [NFS v4.1 volumes on Azure NetApp Files for SAP HANA](./hana-vm-operations-netapp.md#nconnect-mount-option). Check if `nconnect` is [supported by Azure NetApp Files](../../azure-netapp-files/performance-linux-mount-options.md#nconnect) on your Linux release. -4. **[1]** Configuring Location Constraints +1. **[1]** Configure location constraints. - Configure location constraints to ensure that the resources that manage hanadb1 unique mounts can never run on hanadb2, and vice-versa. + Configure location constraints to ensure that the resources that manage hanadb1 unique mounts can never run on hanadb2, and vice versa. ```bash sudo pcs constraint location hanadb1_nfs rule score=-INFINITY resource-discovery=never \#uname eq hanadb2 sudo pcs constraint location hanadb2_nfs rule score=-INFINITY resource-discovery=never \#uname eq hanadb1 ``` - The `resource-discovery=never` option is set because the unique mounts for each node share the same mount point. For example, `hana_data1` uses mount point `/hana/data`, and `hana_data2` also uses mount point `/hana/data`. This can cause a false positive for a probe operation, when resource state is checked at cluster startup, and this can in turn cause unnecessary recovery behavior. This can be avoided by setting `resource-discovery=never` + The `resource-discovery=never` option is set because the unique mounts for each node share the same mount point. For example, `hana_data1` uses mount point `/hana/data`, and `hana_data2` also uses mount point `/hana/data`. Sharing the same mount point can cause a false positive for a probe operation, when resource state is checked at cluster startup, and it can in turn cause unnecessary recovery behavior. To avoid this scenario, set `resource-discovery=never`. -5. **[1]** Configuring Attribute Resources +1. **[1]** Configure attribute resources. - Configure attribute resources. These attributes will be set to true if all of a node's NFS mounts (/hana/data, /hana/log, and /hana/data) are mounted and will be set to false otherwise. + Configure attribute resources. These attributes are set to true if all of a node's NFS mounts (`/hana/data`, `/hana/log`, and `/hana/data`) are mounted. Otherwise, they're set to false. ```bash sudo pcs resource create hana_nfs1_active ocf:pacemaker:attribute active_value=true inactive_value=false name=hana_nfs1_active sudo pcs resource create hana_nfs2_active ocf:pacemaker:attribute active_value=true inactive_value=false name=hana_nfs2_active ``` -6. **[1]** Configuring Location Constraints +1. **[1]** Configure location constraints. - Configure location constraints to ensure that hanadb1ΓÇÖs attribute resource never runs on hanadb2, and vice-versa. + Configure location constraints to ensure that hanadb1's attribute resource never runs on hanadb2, and vice versa. ```bash sudo pcs constraint location hana_nfs1_active avoids hanadb2 sudo pcs constraint location hana_nfs2_active avoids hanadb1 ``` -7. **[1]** Creating Ordering Constraints +1. **[1]** Create ordering constraints. Configure ordering constraints so that a node's attribute resources start only after all of the node's NFS mounts are mounted. 
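   A minimal sketch of those ordering constraints, assuming the mount groups and attribute resources created in the earlier steps (`hanadb1_nfs`, `hana_nfs1_active`, and their node 2 counterparts), typically looks like this:

   ```bash
   # Start each node's attribute resource only after that node's NFS mount group is up.
   sudo pcs constraint order hanadb1_nfs then hana_nfs1_active
   sudo pcs constraint order hanadb2_nfs then hana_nfs2_active
   ```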
In this example each cluster node has its own HANA NFS filesystems /hana/shared, ``` > [!TIP]- > If your configuration includes file systems, outside of group `hanadb1_nfs` or `hanadb2_nfs`, then include the `sequential=false` option, so that there are no ordering dependencies among the file systems. All file systems must start before `hana_nfs1_active`, but they do not need to start in any order relative to each other. For more information, see [How do I configure SAP HANA System Replication in Scale-Up in a Pacemaker cluster when the HANA filesystems are on NFS shares](https://access.redhat.com/solutions/5156571) + > If your configuration includes file systems, outside of group `hanadb1_nfs` or `hanadb2_nfs`, include the `sequential=false` option so that there are no ordering dependencies among the file systems. All file systems must start before `hana_nfs1_active`, but they don't need to start in any order relative to each other. For more information, see [How do I configure SAP HANA System Replication in Scale-Up in a Pacemaker cluster when the HANA file systems are on NFS shares](https://access.redhat.com/solutions/5156571) ### Configure SAP HANA cluster resources -1. Follow the steps in [Create SAP HANA cluster resources](./sap-hana-high-availability-rhel.md#create-sap-hana-cluster-resources) to create the SAP HANA Resources in the cluster. Once SAP HANA resources are created, we need to create a location rule constraint between SAP HANA resources and Filesystems (NFS Mounts) +1. Follow the steps in [Create SAP HANA cluster resources](./sap-hana-high-availability-rhel.md#create-sap-hana-cluster-resources) to create the SAP HANA resources in the cluster. After SAP HANA resources are created, you need to create a location rule constraint between SAP HANA resources and file systems (NFS mounts). -2. **[1]** Configure constraints between the SAP HANA resources and the NFS mounts +1. **[1]** Configure constraints between the SAP HANA resources and the NFS mounts. Location rule constraints are set so that the SAP HANA resources can run on a node only if all of the node's NFS mounts are mounted. In this example each cluster node has its own HANA NFS filesystems /hana/shared, sudo pcs constraint location SAPHanaTopology_HN1_03-clone rule score=-INFINITY hana_nfs1_active ne true and hana_nfs2_active ne true ``` - On RHEL 7.x + On RHEL 7.x: ```bash sudo pcs constraint location SAPHana_HN1_03-master rule score=-INFINITY hana_nfs1_active ne true and hana_nfs2_active ne true ``` - On RHEL 8.x/9.x + On RHEL 8.x/9.x: ```bash sudo pcs constraint location SAPHana_HN1_03-clone rule score=-INFINITY hana_nfs1_active ne true and hana_nfs2_active ne true ``` - Take the cluster out of maintenance mode + Take the cluster out of maintenance mode. ```bash sudo pcs property set maintenance-mode=false ``` - Check the status of cluster and all the resources + Check the status of the cluster and all the resources. + > [!NOTE]- > This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, weΓÇÖll remove it from this article. + > This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, we'll remove it from this article. 
```bash sudo pcs status ``` - Example output + Example output: ```output Online: [ hanadb1 hanadb2 ] In this example each cluster node has its own HANA NFS filesystems /hana/shared, vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hanadb1 ``` -## Configure HANA active/read enabled system replication in Pacemaker cluster +## Configure HANA active/read-enabled system replication in Pacemaker cluster ++Starting with SAP HANA 2.0 SPS 01, SAP allows active/read-enabled setups for SAP HANA System Replication, where the secondary systems of SAP HANA System Replication can be used actively for read-intense workloads. To support such a setup in a cluster, a second virtual IP address is required, which allows clients to access the secondary read-enabled SAP HANA database. -Starting with SAP HANA 2.0 SPS 01 SAP allows Active/Read-Enabled setups for SAP HANA System Replication, where the secondary systems of SAP HANA system replication can be used actively for read-intense workloads. To support such setup in a cluster a second virtual IP address is required which allows clients to access the secondary read-enabled SAP HANA database. To ensure that the secondary replication site can still be accessed after a takeover has occurred the cluster needs to move the virtual IP address around with the secondary of the SAPHana resource. +To ensure that the secondary replication site can still be accessed after a takeover has occurred, the cluster needs to move the virtual IP address around with the secondary of the SAPHana resource. -The additional configuration, required to manage HANA Active/Read enabled system replication in a Red Hat high availability cluster with second virtual IP is described in [Configure HANA Active/Read Enabled System Replication in Pacemaker cluster](./sap-hana-high-availability-rhel.md#configure-hana-activeread-enabled-system-replication-in-pacemaker-cluster). +The extra configuration, which is required to manage HANA active/read-enabled System Replication in a Red Hat HA cluster with a second virtual IP, is described in [Configure HANA Active/Read-Enabled System Replication in Pacemaker cluster](./sap-hana-high-availability-rhel.md#configure-hana-activeread-enabled-system-replication-in-pacemaker-cluster). -Before proceeding further, make sure you have fully configured Red Hat High Availability Cluster managing SAP HANA database as described in above segments of the documentation. +Before you proceed further, make sure you've fully configured Red Hat High Availability Cluster managing SAP HANA database as described in the preceding sections of the documentation. ## Test the cluster setup This section describes how you can test your setup. -1. Before you start a test, make sure that Pacemaker doesn't have any failed action (via pcs status), there are no unexpected location constraints (for example leftovers of a migration test) and that HANA system replication is sync state, for example with systemReplicationStatus: +1. Before you start a test, make sure that Pacemaker doesn't have any failed action (via pcs status), there are no unexpected location constraints (for example, leftovers of a migration test), and that HANA system replication is in sync state, for example, with `systemReplicationStatus`: ```bash sudo su - hn1adm -c "python /usr/sap/HN1/HDB03/exe/python_support/systemReplicationStatus.py" ``` -2. Verify the cluster configuration for a failure scenario when a node loses access to the NFS share (/hana/shared) +1. 
Verify the cluster configuration for a failure scenario when a node loses access to the NFS share (`/hana/shared`). - The SAP HANA resource agents depend on binaries, stored on `/hana/shared` to perform operations during failover. File system `/hana/shared` is mounted over NFS in the presented scenario. - It's difficult to simulate a failure, where one of the servers loses access to the NFS share. A test that can be performed is to remount the file system as read-only. - This approach validates that the cluster will be able to fail over, if access to `/hana/shared` is lost on the active node. + The SAP HANA resource agents depend on binaries stored on `/hana/shared` to perform operations during failover. File system `/hana/shared` is mounted over NFS in the presented scenario. + + It's difficult to simulate a failure where one of the servers loses access to the NFS share. As a test, you can remount the file system as read-only. This approach validates that the cluster can fail over, if access to `/hana/shared` is lost on the active node. - **Expected Result:** On making `/hana/shared` as read-only file system, the `OCF_CHECK_LEVEL` attribute of the resource `hana_shared1` which performs read/write operation on file system fails as it isn't able to write anything on the file system and will perform HANA resource failover. The same result is expected when your HANA node loses access to the NFS shares. + **Expected result:** On making `/hana/shared` as a read-only file system, the `OCF_CHECK_LEVEL` attribute of the resource `hana_shared1`, which performs read/write operations on file systems, fails. It isn't able to write anything on the file system and performs HANA resource failover. The same result is expected when your HANA node loses access to the NFS shares. Resource state before starting the test: This section describes how you can test your setup. sudo pcs status ``` - Example output + Example output: ```output Full list of resources: This section describes how you can test your setup. vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hanadb1 ``` - You can place /hana/shared in read-only mode on the active cluster node, using below command: + You can place `/hana/shared` in read-only mode on the active cluster node by using this command: ```bash sudo mount -o ro 10.32.2.4:/hanadb1-shared-mnt00001 /hana/shared ``` - hanadb1 will either reboot or poweroff based on the action set on stonith (`pcs property show stonith-action`). Once the server (hanadb1) is down, HANA resource move to hanadb2. You can check the status of cluster from hanadb2. + `hanadb` will either reboot or power off based on the action set on `stonith` (`pcs property show stonith-action`). Once the server (`hanadb1`) is down, the HANA resource moves to `hanadb2`. You can check the status of the cluster from `hanadb2`. ```bash sudo pcs status ``` - Example output + Example output: ```output Full list of resources: This section describes how you can test your setup. vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hanadb2 ``` - We recommend to thoroughly test the SAP HANA cluster configuration, by also performing the tests described in [Setup SAP HANA System Replication on RHEL](./sap-hana-high-availability-rhel.md#test-the-cluster-setup). + We recommend that you thoroughly test the SAP HANA cluster configuration by also performing the tests described in [Set up SAP HANA System Replication on RHEL](./sap-hana-high-availability-rhel.md#test-the-cluster-setup). ## Next steps |
sap | Sap Hana High Availability Rhel | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/sap/workloads/sap-hana-high-availability-rhel.md | -For on-premises development, you can use either HANA System Replication or use shared storage to establish high availability for SAP HANA. -On Azure virtual machines (VMs), HANA System Replication on Azure is currently the only supported high availability function. +For on-premises development, you can use either HANA System Replication or shared storage to establish high availability (HA) for SAP HANA. On Azure Virtual Machines, HANA System Replication on Azure is currently the only supported HA function. + SAP HANA Replication consists of one primary node and at least one secondary node. Changes to the data on the primary node are replicated to the secondary node synchronously or asynchronously. -This article describes how to deploy and configure the virtual machines, install the cluster framework, and install and configure SAP HANA System Replication. +This article describes how to deploy and configure virtual machines (VMs), install the cluster framework, and install and configure SAP HANA System Replication. + In the example configurations, installation commands, instance number **03**, and HANA System ID **HN1** are used. +## Prerequisites + Read the following SAP Notes and papers first: * SAP Note [1928533], which has: * The list of Azure VM sizes that are supported for the deployment of SAP software. * Important capacity information for Azure VM sizes.- * The supported SAP software, and operating system (OS) and database combinations. + * The supported SAP software and operating system (OS) and database combinations. * The required SAP kernel version for Windows and Linux on Microsoft Azure. * SAP Note [2015553] lists prerequisites for SAP-supported SAP software deployments in Azure.-* SAP Note [2002167] has recommended OS settings for Red Hat Enterprise Linux -* SAP Note [2009879] has SAP HANA Guidelines for Red Hat Enterprise Linux -* SAP Note [3108302] has SAP HANA Guidelines for Red Hat Enterprise Linux 9.x +* SAP Note [2002167] has recommended OS settings for Red Hat Enterprise Linux. +* SAP Note [2009879] has SAP HANA Guidelines for Red Hat Enterprise Linux. +* SAP Note [3108302] has SAP HANA Guidelines for Red Hat Enterprise Linux 9.x. * SAP Note [2178632] has detailed information about all monitoring metrics reported for SAP in Azure. * SAP Note [2191498] has the required SAP Host Agent version for Linux in Azure. * SAP Note [2243692] has information about SAP licensing on Linux in Azure.-* SAP Note [1999351] has additional troubleshooting information for the Azure Enhanced Monitoring Extension for SAP. +* SAP Note [1999351] has more troubleshooting information for the Azure Enhanced Monitoring Extension for SAP. * [SAP Community WIKI](https://wiki.scn.sap.com/wiki/display/HOME/SAPonLinuxNotes) has all required SAP Notes for Linux. 
* [Azure Virtual Machines planning and implementation for SAP on Linux][planning-guide] * [Azure Virtual Machines deployment for SAP on Linux (this article)][deployment-guide] * [Azure Virtual Machines DBMS deployment for SAP on Linux][dbms-guide]-* [SAP HANA system replication in pacemaker cluster](https://access.redhat.com/articles/3004101) -* General RHEL documentation +* [SAP HANA System Replication in a Pacemaker cluster](https://access.redhat.com/articles/3004101) +* General RHEL documentation: * [High Availability Add-On Overview](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_overview/index) * [High Availability Add-On Administration](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/index) * [High Availability Add-On Reference](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/index)- * [HANA scale-up system replication with RHEL HA add-on](https://access.redhat.com/articles/3004101) -* Azure specific RHEL documentation: + * [HANA Scale-Up System Replication with RHEL HA Add-On](https://access.redhat.com/articles/3004101) +* Azure-specific RHEL documentation: * [Support Policies for RHEL High Availability Clusters - Microsoft Azure Virtual Machines as Cluster Members](https://access.redhat.com/articles/3131341) * [Installing and Configuring a Red Hat Enterprise Linux 7.4 (and later) High-Availability Cluster on Microsoft Azure](https://access.redhat.com/articles/3252491) * [Install SAP HANA on Red Hat Enterprise Linux for Use in Microsoft Azure](https://access.redhat.com/solutions/3193782) ## Overview -To achieve high availability, SAP HANA is installed on two virtual machines. The data is replicated by using HANA System Replication. +To achieve HA, SAP HANA is installed on two VMs. The data is replicated by using HANA System Replication. -![SAP HANA high availability overview](./media/sap-hana-high-availability-rhel/ha-hana.png) +![Diagram that shows SAP HANA high availability overview.](./media/sap-hana-high-availability-rhel/ha-hana.png) -SAP HANA System Replication setup uses a dedicated virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. The presented configuration shows a load balancer with: +The SAP HANA System Replication setup uses a dedicated virtual hostname and virtual IP addresses. On Azure, a load balancer is required to use a virtual IP address. The presented configuration shows a load balancer with: * Front-end IP address: 10.0.0.13 for hn1-db-* Probe Port: 62503 +* Probe port: 62503 ## Deploy for Linux -The Azure Marketplace contains images qualified for SAP HANA with the High Availability add-on, which you can use to deploy new virtual machines using various versions of Red Hat. +Azure Marketplace contains images qualified for SAP HANA with the High Availability add-on, which you can use to deploy new VMs by using various versions of Red Hat. -### Deploy Linux VMs manually via Azure portal +### Deploy Linux VMs manually via the Azure portal -This document assumes that you've already deployed a resource group, [Azure Virtual Network](../../virtual-network/virtual-networks-overview.md), and subnet. +This document assumes that you've already deployed a resource group, an [Azure virtual network](../../virtual-network/virtual-networks-overview.md), and a subnet. -Deploy virtual machines for SAP HANA. 
Choose a suitable RHEL image that is supported for HANA system. You can deploy VM in any one of the availability options - scale set, availability zone or availability set. +Deploy VMs for SAP HANA. Choose a suitable RHEL image that's supported for the HANA system. You can deploy a VM in any one of the availability options: scale set, availability zone, or availability set. > [!IMPORTANT] > > Make sure that the OS you select is SAP certified for SAP HANA on the specific VM types that you plan to use in your deployment. You can look up SAP HANA-certified VM types and their OS releases in [SAP HANA Certified IaaS Platforms](https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/#/solutions?filters=v:deCertified;ve:24;iaas;v:125;v:105;v:99;v:120). Make sure that you look at the details of the VM type to get the complete list of SAP HANA-supported OS releases for the specific VM type. -During VM configuration, you can create or select exiting load balancer in networking section. If you're creating a new load balancer, follow below steps - +During VM configuration, you can create or select an existing load balancer in the networking section. If you're creating a new load balancer, follow these steps: - 1. First, create a front-end IP pool: + 1. Create a front-end IP pool: 1. Open the load balancer, select **frontend IP pool**, and select **Add**.- 2. Enter the name of the new front-end IP pool (for example, **hana-frontend**). - 3. Set the **Assignment** to **Static** and enter the IP address (for example, **10.0.0.13**). - 4. Select **OK**. - 5. After the new front-end IP pool is created, note the pool IP address. + 1. Enter the name of the new front-end IP pool (for example, **hana-frontend**). + 1. Set **Assignment** to **Static** and enter the IP address (for example, **10.0.0.13**). + 1. Select **OK**. + 1. After the new front-end IP pool is created, note the pool IP address. - 2. Create a single back-end pool: + 1. Create a single back-end pool: 1. Open the load balancer, select **Backend pools**, and then select **Add**.- 2. Enter the name of the new back-end pool (for example, **hana-backend**). - 3. Select **NIC** for Backend Pool Configuration. - 4. Select **Add a virtual machine**. - 5. Select the virtual machines of the HANA cluster. - 6. Select **Add**. - 7. Select **Save**. + 1. Enter the name of the new back-end pool (for example, **hana-backend**). + 1. Select **NIC** for **Backend Pool Configuration**. + 1. Select **Add a virtual machine**. + 1. Select the VMs of the HANA cluster. + 1. Select **Add**. + 1. Select **Save**. - 3. Next, create a health probe: + 1. Create a health probe: 1. Open the load balancer, select **health probes**, and select **Add**.- 2. Enter the name of the new health probe (for example, **hana-hp**). - 3. Select **TCP** as the protocol and port 625**03**. Keep the **Interval** value set to 5. - 4. Select **OK**. + 1. Enter the name of the new health probe (for example, **hana-hp**). + 1. Select **TCP** as the protocol and port 625**03**. Keep the **Interval** value set to **5**. + 1. Select **OK**. - 4. Next, create the load-balancing rules: + 1. Create the load-balancing rules: 1. Open the load balancer, select **load balancing rules**, and select **Add**.- 2. Enter the name of the new load balancer rule (for example, **hana-lb**). - 3. Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-frontend**, **hana-backend** and **hana-hp**). - 4. Increase idle timeout to 30 minutes - 5. 
Select **HA Ports**. - 6. Increase the **idle timeout** to 30 minutes. - 7. Make sure to **enable Floating IP**. - 8. Select **OK**. + 1. Enter the name of the new load balancer rule (for example, **hana-lb**). + 1. Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-frontend**, **hana-backend**, and **hana-hp**). + 1. Increase the idle timeout to **30 minutes**. + 1. Select **HA Ports**. + 1. Increase the idle timeout to **30 minutes**. + 1. Make sure to enable **Floating IP**. + 1. Select **OK**. -For more information about the required ports for SAP HANA, read the chapter [Connections to Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6/latest/en-US/7a9343c9f2a2436faa3cfdb5ca00c052.html) in the [SAP HANA Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6) guide or [SAP Note 2388694][2388694]. +For more information about the required ports for SAP HANA, read the chapter [Connections to Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6/latest/en-US/7a9343c9f2a2436faa3cfdb5ca00c052.html) in the [SAP HANA Tenant Databases](https://help.sap.com/viewer/78209c1d3a9b41cd8624338e42a12bf6) guide or SAP Note [2388694][2388694]. > [!IMPORTANT]-> Floating IP is not supported on a NIC secondary IP configuration in load-balancing scenarios. For details see [Azure Load balancer Limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need additional IP address for the VM, deploy a second NIC. +> Floating IP isn't supported on a NIC secondary IP configuration in load-balancing scenarios. For more information, see [Azure Load Balancer limitations](../../load-balancer/load-balancer-multivip-overview.md#limitations). If you need another IP address for the VM, deploy a second NIC. > [!NOTE]-> When VMs without public IP addresses are placed in the backend pool of internal (no public IP address) Standard Azure load balancer, there will be no outbound internet connectivity, unless additional configuration is performed to allow routing to public end points. For details on how to achieve outbound connectivity see [Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). +> When VMs without public IP addresses are placed in the back-end pool of an internal (no public IP address) instance of Standard Azure Load Balancer, there's no outbound internet connectivity unless more configuration is performed to allow routing to public endpoints. For more information on how to achieve outbound connectivity, see [Public endpoint connectivity for VMs using Azure Standard Load Balancer in SAP high-availability scenarios](./high-availability-guide-standard-load-balancer-outbound-connections.md). > [!IMPORTANT]-> Do not enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps will cause the health probes to fail. Set parameter **net.ipv4.tcp_timestamps** to **0**. For details see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). -> See also SAP note [2382421](https://launchpad.support.sap.com/#/notes/2382421). +> Don't enable TCP timestamps on Azure VMs placed behind Azure Load Balancer. Enabling TCP timestamps could cause the health probes to fail. Set the parameter **net.ipv4.tcp_timestamps** to **0**. 
For more information, see [Load Balancer health probes](../../load-balancer/load-balancer-custom-probe-overview.md). +> See also SAP Note [2382421](https://launchpad.support.sap.com/#/notes/2382421). ## Install SAP HANA The steps in this section use the following prefixes: 1. **[A]** Set up the disk layout: **Logical Volume Manager (LVM)**. - We recommend that you use LVM for volumes that store data and log files. The following example assumes that the virtual machines have four data disks attached that are used to create two volumes. + We recommend that you use LVM for volumes that store data and log files. The following example assumes that the VMs have four data disks attached that are used to create two volumes. - List all of the available disks: + List all the available disks: ```bash ls /dev/disk/azure/scsi1/lun* The steps in this section use the following prefixes: /dev/disk/azure/scsi1/lun0 /dev/disk/azure/scsi1/lun1 /dev/disk/azure/scsi1/lun2 /dev/disk/azure/scsi1/lun3 ``` - Create physical volumes for all of the disks that you want to use: + Create physical volumes for all the disks that you want to use: ```bash sudo pvcreate /dev/disk/azure/scsi1/lun0 The steps in this section use the following prefixes: sudo vgcreate vg_hana_shared_HN1 /dev/disk/azure/scsi1/lun3 ``` - Create the logical volumes. A linear volume is created when you use `lvcreate` without the `-i` switch. We suggest that you create a striped volume for better I/O performance, and align the stripe sizes to the values documented in [SAP HANA VM storage configurations](./hana-vm-operations-storage.md). The `-i` argument should be the number of the underlying physical volumes and the `-I` argument is the stripe size. In this document, two physical volumes are used for the data volume, so the `-i` switch argument is set to **2**. The stripe size for the data volume is **256KiB**. One physical volume is used for the log volume, so no `-i` or `-I` switches are explicitly used for the log volume commands. + Create the logical volumes. A linear volume is created when you use `lvcreate` without the `-i` switch. We suggest that you create a striped volume for better I/O performance. Align the stripe sizes to the values documented in [SAP HANA VM storage configurations](./hana-vm-operations-storage.md). The `-i` argument should be the number of the underlying physical volumes, and the `-I` argument is the stripe size. ++ In this document, two physical volumes are used for the data volume, so the `-i` switch argument is set to **2**. The stripe size for the data volume is **256KiB**. One physical volume is used for the log volume, so no `-i` or `-I` switches are explicitly used for the log volume commands. > [!IMPORTANT]- > Use the `-i` switch and set it to the number of the underlying physical volume when you use more than one physical volume for each data, log, or shared volumes. Use the `-I` switch to specify the stripe size, when creating a striped volume. - > See [SAP HANA VM storage configurations](./hana-vm-operations-storage.md) for recommended storage configurations, including stripe sizes and number of disks. The following layout examples do not necessarily meet the performance guidelines for a particular system size, they are for illustration only. + > Use the `-i` switch and set it to the number of the underlying physical volume when you use more than one physical volume for each data, log, or shared volumes. Use the `-I` switch to specify the stripe size when you're creating a striped volume. 
+ > See [SAP HANA VM storage configurations](./hana-vm-operations-storage.md) for recommended storage configurations, including stripe sizes and number of disks. The following layout examples don't necessarily meet the performance guidelines for a particular system size. They're for illustration only. ```bash sudo lvcreate -i 2 -I 256 -l 100%FREE -n hana_data vg_hana_data_HN1 The steps in this section use the following prefixes: sudo mkfs.xfs /dev/vg_hana_shared_HN1/hana_shared ``` - Don't mount the directories by issuing mount commands, rather enter the configurations into the fstab and issue a final `mount -a` to validate the syntax. Start by creating the mount directories for each volume: + Don't mount the directories by issuing mount commands. Instead, enter the configurations into the `fstab` and issue a final `mount -a` to validate the syntax. Start by creating the mount directories for each volume: ```bash sudo mkdir -p /hana/data The steps in this section use the following prefixes: sudo mkdir -p /hana/shared ``` - Next create `fstab` entries for the three logical volumes by inserting the following lines in the `/etc/fstab` file: + Next, create `fstab` entries for the three logical volumes by inserting the following lines in the `/etc/fstab` file: /dev/mapper/vg_hana_data_HN1-hana_data /hana/data xfs defaults,nofail 0 2 /dev/mapper/vg_hana_log_HN1-hana_log /hana/log xfs defaults,nofail 0 2 /dev/mapper/vg_hana_shared_HN1-hana_shared /hana/shared xfs defaults,nofail 0 2 - Finally mount the new volumes all at once: + Finally, mount the new volumes all at once: ```bash sudo mount -a ``` -1. **[A]** Set up host name resolution for all hosts. +1. **[A]** Set up hostname resolution for all hosts. - You can either use a DNS server or modify the /etc/hosts file on all nodes by creating entries for all nodes like this in `/etc/hosts` + You can either use a DNS server or modify the `/etc/hosts` file on all nodes by creating entries for all nodes like this in `/etc/hosts`: 10.0.0.5 hn1-db-0 10.0.0.6 hn1-db-1 -1. **[A]** RHEL for HANA configuration +1. **[A]** Perform RHEL for HANA configuration. Configure RHEL as described in the following notes: * [2447641 - Additional packages required for installing SAP HANA SPS 12 on RHEL 7.X](https://access.redhat.com/solutions/2447641) The steps in this section use the following prefixes: * [2593824 - Linux: Running SAP applications compiled with GCC 7.x](https://launchpad.support.sap.com/#/notes/2593824) * [2886607 - Linux: Running SAP applications compiled with GCC 9.x](https://launchpad.support.sap.com/#/notes/2886607) -1. **[A]** Install the SAP HANA +1. **[A]** Install the SAP HANA. To install SAP HANA System Replication, see [Automating SAP HANA Scale-Up System Replication using the RHEL HA Add-On](https://access.redhat.com/articles/3004101). - * Run the **hdblcm** program from the HANA DVD. Enter the following values at the prompt: - * Choose installation: Enter **1**. - * Select additional components for installation: Enter **1**. - * Enter Installation Path [/hana/shared]: Select Enter. - * Enter Local Host Name [..]: Select Enter. - * Do you want to add additional hosts to the system? (y/n) [n]: Select Enter. - * Enter SAP HANA System ID: Enter the SID of HANA, for example: **HN1**. - * Enter Instance Number [00]: Enter the HANA Instance number. Enter **03** if you used the Azure template or followed the manual deployment section of this article. - * Select Database Mode / Enter Index [1]: Select Enter. 
- * Select System Usage / Enter Index [4]: Select the system usage value. - * Enter Location of Data Volumes [/hana/data]: Select Enter. - * Enter Location of Log Volumes [/hana/log]: Select Enter. - * Restrict maximum memory allocation? [n]: Select Enter. - * Enter Certificate Host Name For Host '...' [...]: Select Enter. - * Enter SAP Host Agent User (sapadm) Password: Enter the host agent user password. - * Confirm SAP Host Agent User (sapadm) Password: Enter the host agent user password again to confirm. - * Enter System Administrator (hdbadm) Password: Enter the system administrator password. - * Confirm System Administrator (hdbadm) Password: Enter the system administrator password again to confirm. - * Enter System Administrator Home Directory [/usr/sap/HN1/home]: Select Enter. - * Enter System Administrator Login Shell [/bin/sh]: Select Enter. - * Enter System Administrator User ID [1001]: Select Enter. - * Enter ID of User Group (sapsys) [79]: Select Enter. - * Enter Database User (SYSTEM) Password: Enter the database user password. - * Confirm Database User (SYSTEM) Password: Enter the database user password again to confirm. - * Restart system after machine reboot? [n]: Select Enter. - * Do you want to continue? (y/n): Validate the summary. Enter **y** to continue. + Run the **hdblcm** program from the HANA DVD. Enter the following values at the prompt: + 1. Choose installation: Enter **1**. + 1. Select additional components for installation: Enter **1**. + 1. Enter **Installation Path** [/hana/shared]: Select Enter. + 1. Enter **Local Host Name [..]**: Select Enter. + 1. **Do you want to add additional hosts to the system? (y/n)** [n]: Select Enter. + 1. Enter **SAP HANA System ID**: Enter the SID of HANA, for example: **HN1**. + 1. Enter **Instance Number** [00]: Enter the HANA Instance number. Enter **03** if you used the Azure template or followed the manual deployment section of this article. + 1. Select **Database Mode / Enter Index** [1]: Select Enter. + 1. Select **System Usage / Enter Index** [4]: Select the system usage value. + 1. Enter **Location of Data Volumes** [/hana/data]: Select Enter. + 1. Enter **Location of Log Volumes** [/hana/log]: Select Enter. + 1. **Restrict maximum memory allocation?** [n]: Select Enter. + 1. Enter **Certificate Host Name For Host '...'** [...]: Select Enter. + 1. Enter **SAP Host Agent User (sapadm) Password**: Enter the host agent user password. + 1. Confirm **SAP Host Agent User (sapadm) Password**: Enter the host agent user password again to confirm. + 1. Enter **System Administrator (hdbadm) Password**: Enter the system administrator password. + 1. Confirm **System Administrator (hdbadm) Password**: Enter the system administrator password again to confirm. + 1. Enter **System Administrator Home Directory** [/usr/sap/HN1/home]: Select Enter. + 1. Enter **System Administrator Login Shell** [/bin/sh]: Select Enter. + 1. Enter **System Administrator User ID** [1001]: Select Enter. + 1. Enter **ID of User Group (sapsys)** [79]: Select Enter. + 1. Enter **Database User (SYSTEM) Password**: Enter the database user password. + 1. Confirm **Database User (SYSTEM) Password**: Enter the database user password again to confirm. + 1. **Restart system after machine reboot?** [n]: Select Enter. + 1. **Do you want to continue? (y/n)**: Validate the summary. Enter **y** to continue. 1. **[A]** Upgrade the SAP Host Agent. 
The steps in this section use the following prefixes: sudo /usr/sap/hostctrl/exe/saphostexec -upgrade -archive <path to SAP Host Agent>; ``` -1. **[A]** Configure firewall +1. **[A]** Configure the firewall. - Create the firewall rule for the Azure load balancer probe port. + Create the firewall rule for the Azure Load Balancer probe port. ```bash sudo firewall-cmd --zone=public --add-port=62503/tcp The steps in this section use the following prefixes: * **[1]**: The step applies to node 1 only. * **[2]**: The step applies to node 2 of the Pacemaker cluster only. -1. **[A]** Configure firewall +1. **[A]** Configure the firewall. - Create firewall rules to allow HANA System Replication and client traffic. The required ports are listed on [TCP/IP Ports of All SAP Products](https://help.sap.com/viewer/ports). The following commands are just an example to allow HANA 2.0 System Replication and client traffic to database SYSTEMDB, HN1 and NW1. + Create firewall rules to allow HANA System Replication and client traffic. The required ports are listed on [TCP/IP Ports of All SAP Products](https://help.sap.com/viewer/ports). The following commands are just an example to allow HANA 2.0 System Replication and client traffic to database SYSTEMDB, HN1, and NW1. ```bash sudo firewall-cmd --zone=public --add-port=40302/tcp --permanent The steps in this section use the following prefixes: If you're using SAP HANA 2.0 or MDC, create a tenant database for your SAP NetWeaver system. Replace **NW1** with the SID of your SAP system. - Execute as <hanasid\>adm the following command: + Run the following command as <hanasid\>adm: ```bash hdbsql -u SYSTEM -p "[passwd]" -i 03 -d SYSTEMDB 'CREATE DATABASE NW1 SYSTEM USER PASSWORD "<passwd>"' ``` -1. **[1]** Configure System Replication on the first node: +1. **[1]** Configure system replication on the first node. - Backup the databases as <hanasid\>adm: + Back up the databases as <hanasid\>adm: ```bash hdbsql -d SYSTEMDB -u SYSTEM -p "<passwd>" -i 03 "BACKUP DATA USING FILE ('initialbackupSYS')" The steps in this section use the following prefixes: hdbnsutil -sr_enable --name=SITE1 ``` -1. **[2]** Configure System Replication on the second node: +1. **[2]** Configure system replication on the second node. Register the second node to start the system replication. Run the following command as <hanasid\>adm: The steps in this section use the following prefixes: hdbnsutil -sr_register --remoteHost=hn1-db-0 --remoteInstance=03 --replicationMode=sync --name=SITE2 ``` -1. **[1]** Check replication status +1. **[1]** Check replication status. Check the replication status and wait until all databases are in sync. If the status remains UNKNOWN, check your firewall settings. The steps in this section use the following prefixes: * **[1]**: The step applies to node 1 only. * **[2]**: The step applies to node 2 of the Pacemaker cluster only. -1. **[A]** Configure firewall +1. **[A]** Configure the firewall. Create firewall rules to allow HANA System Replication and client traffic. The required ports are listed on [TCP/IP Ports of All SAP Products](https://help.sap.com/viewer/ports). The following commands are just an example to allow HANA 2.0 System Replication. Adapt it to your SAP HANA 1.0 installation. The steps in this section use the following prefixes: 1. **[1]** Create the required users. - Run the following command as root. Make sure to replace the values for HANA System ID (ex. 
**HN1**), instance number (**03**), and any user names, with the values of your SAP HANA installation: + Run the following command as root. Make sure to replace the values for HANA System ID (for example, **HN1**), instance number (**03**), and any usernames, with the values of your SAP HANA installation: ```bash PATH="$PATH:/usr/sap/HN1/HDB03/exe" The steps in this section use the following prefixes: hdbsql -d SYSTEMDB -u system -i 03 "BACKUP DATA USING FILE ('initialbackup')" ``` - If you use a multi-tenant installation, also back up the tenant database: + If you use a multitenant installation, also back up the tenant database: ```bash hdbsql -d HN1 -u system -i 03 "BACKUP DATA USING FILE ('initialbackup')" ``` -1. **[1]** Configure System Replication on the first node. +1. **[1]** Configure system replication on the first node. Create the primary site as <hanasid\>adm: The steps in this section use the following prefixes: hdbnsutil -sr_enable ΓÇô-name=SITE1 ``` -1. **[2]** Configure System Replication on the secondary node. +1. **[2]** Configure system replication on the secondary node. Register the secondary site as <hanasid\>adm: Follow the steps in [Setting up Pacemaker on Red Hat Enterprise Linux in Azure]( ## Implement the Python system replication hook SAPHanaSR -This is important step to optimize the integration with the cluster and improve the detection when a cluster failover is needed. It is highly recommended to configure the SAPHanaSR Python hook. +This important step optimizes the integration with the cluster and improves the detection when a cluster failover is needed. We highly recommend that you configure the SAPHanaSR Python hook. -1. **[A]** Install the SAP HANA resource agents on **all nodes**. Make sure to enable a repository that contains the package. You don't need to enable additional repositories, if using RHEL 8.x HA-enabled image. +1. **[A]** Install the SAP HANA resource agents on **all nodes**. Make sure to enable a repository that contains the package. You don't need to enable more repositories, if you're using an RHEL 8.x HA-enabled image. ```bash # Enable repository that contains SAP HANA resource agents This is important step to optimize the integration with the cluster and improve sudo yum install -y resource-agents-sap-hana ``` -2. **[A]** Install the HANA "system replication hook". The hook needs to be installed on both HANA DB nodes. +1. **[A]** Install the HANA `system replication hook`. The hook needs to be installed on both HANA DB nodes. > [!TIP] > The Python hook can only be implemented for HANA 2.0. - 1. Prepare the hook as `root`. + 1. Prepare the hook as `root`. ```bash mkdir -p /hana/shared/myHooks This is important step to optimize the integration with the cluster and improve chown -R hn1adm:sapsys /hana/shared/myHooks ``` - 1. Stop HANA on both nodes. Execute as <sid\>adm: + 1. Stop HANA on both nodes. Run as <sid\>adm. ```bash sapcontrol -nr 03 -function StopSystem ``` - 1. Adjust `global.ini` on each cluster node. + 1. Adjust `global.ini` on each cluster node. ```output [ha_dr_provider_SAPHanaSR] This is important step to optimize the integration with the cluster and improve ha_dr_saphanasr = info ``` -3. **[A]** The cluster requires sudoers configuration on each cluster node for <sid\>adm. In this example that is achieved by creating a new file. Use the `visudo` command to edit the 20-saphana dropin file, as `root`. +1. **[A]** The cluster requires `sudoers` configuration on each cluster node for <sid\>adm. 
In this example, that's achieved by creating a new file. Use the `visudo` command to edit the `20-saphana` drop-in file as `root`. ```bash sudo visudo -f /etc/sudoers.d/20-saphana ``` - Insert the following lines and then save + Insert the following lines and then save: ```output Cmnd_Alias SITE1_SOK = /usr/sbin/crm_attribute -n hana_hn1_site_srHook_SITE1 -v SOK -t crm_config -s SAPHanaSR This is important step to optimize the integration with the cluster and improve Defaults!SITE1_SOK, SITE1_SFAIL, SITE2_SOK, SITE2_SFAIL !requiretty ``` -4. **[A]** Start SAP HANA on both nodes. Execute as <sid\>adm. +1. **[A]** Start SAP HANA on both nodes. Run as <sid\>adm. ```bash sapcontrol -nr 03 -function StartSystem ``` -5. **[1]** Verify the hook installation. Execute as <sid\>adm on the active HANA system replication site. +1. **[1]** Verify the hook installation. Run as <sid\>adm on the active HANA system replication site. ```bash cdtrace This is important step to optimize the integration with the cluster and improve # 2021-04-12 21:37:04.898680 ha_dr_SAPHanaSR SOK ``` -For more details on the implementation of the SAP HANA system replication hook, see [Enable the SAP HA/DR provider hook](https://access.redhat.com/articles/3004101#enable-srhook). +For more information on the implementation of the SAP HANA System Replication hook, see [Enable the SAP HA/DR provider hook](https://access.redhat.com/articles/3004101#enable-srhook). ## Create SAP HANA cluster resources -Create the HANA topology. Run the following commands on one of the Pacemaker cluster nodes. Throughout these instructions, be sure to substitute your instance number, HANA system ID, IP addresses, and system names, where appropriate: +Create the HANA topology. Run the following commands on one of the Pacemaker cluster nodes. Throughout these instructions, be sure to substitute your instance number, HANA system ID, IP addresses, and system names, where appropriate. ```bash sudo pcs property set maintenance-mode=true sudo pcs resource create SAPHanaTopology_HN1_03 SAPHanaTopology SID=HN1 Instance Next, create the HANA resources. > [!NOTE]-> This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, weΓÇÖll remove it from this article. +> This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, we'll remove it from this article. -If building a cluster on **RHEL 7.x**, use the following commands: +If you're building a cluster on **RHEL 7.x**, use the following commands: ```bash sudo pcs resource create SAPHana_HN1_03 SAPHana SID=HN1 InstanceNumber=03 PREFER_SITE_TAKEOVER=true DUPLICATE_PRIMARY_TIMEOUT=7200 AUTOMATED_REGISTER=false \ sudo pcs resource defaults migration-threshold=5000 sudo pcs property set maintenance-mode=false ``` -If building a cluster on **RHEL 8.x/9.x**, use the following commands: +If you're building a cluster on **RHEL 8.x/9.x**, use the following commands: ```bash sudo pcs resource create SAPHana_HN1_03 SAPHana SID=HN1 InstanceNumber=03 PREFER_SITE_TAKEOVER=true DUPLICATE_PRIMARY_TIMEOUT=7200 AUTOMATED_REGISTER=false \ sudo pcs resource defaults update migration-threshold=5000 sudo pcs property set maintenance-mode=false ``` -To configure priority-fencing-delay for SAP HANA (applicable only as of pacemaker-2.0.4-6.el8 or higher), following commands needs to be executed. 
+To configure `priority-fencing-delay` for SAP HANA (applicable only as of pacemaker-2.0.4-6.el8 or higher), the following commands need to be executed. > [!NOTE]-> If you have two-node cluster, you have option to configure priority-fencing-delay cluster property. This property introduces additional delay in fencing a node that has higher total resource priority when a split-brain scenario occurs. For more information, see [Can Pacemaker fence the cluster node with the fewest running resources?](https://access.redhat.com/solutions/5110521). +> If you have a two-node cluster, you can configure the `priority-fencing-delay` cluster property. This property introduces a delay in fencing a node that has higher total resource priority when a split-brain scenario occurs. For more information, see [Can Pacemaker fence the cluster node with the fewest running resources?](https://access.redhat.com/solutions/5110521). >-> The property priority-fencing-delay is applicable for pacemaker-2.0.4-6.el8 version or higher. If you are setting up priority-fencing-delay on existing cluster, make sure to unset `pcmk_delay_max` option in fencing device. +> The property `priority-fencing-delay` is applicable for pacemaker-2.0.4-6.el8 version or higher. If you're setting up `priority-fencing-delay` on an existing cluster, make sure to unset the `pcmk_delay_max` option in the fencing device. ```bash sudo pcs property set maintenance-mode=true sudo pcs property set maintenance-mode=false ``` > [!IMPORTANT]-> It's a good idea to set `AUTOMATED_REGISTER` to `false`, while you're performing failover tests, to prevent a failed primary instance to automatically register as secondary. After testing, as a best practice, set `AUTOMATED_REGISTER` to `true`, so that after takeover, system replication can resume automatically. +> It's a good idea to set `AUTOMATED_REGISTER` to `false`, while you're performing failover tests, to prevent a failed primary instance to automatically register as secondary. After testing, as a best practice, set `AUTOMATED_REGISTER` to `true` so that after takeover, system replication can resume automatically. -Make sure that the cluster status is ok and that all of the resources are started. It's not important on which node the resources are running. +Make sure that the cluster status is okay and that all of the resources are started. Which node the resources are running on isn't important. > [!NOTE]-> The timeouts in the above configuration are just examples and may need to be adapted to the specific HANA setup. For instance, you may need to increase the start timeout, if it takes longer to start the SAP HANA database. +> The timeouts in the preceding configuration are only examples and might need to be adapted to the specific HANA setup. For instance, you might need to increase the start timeout, if it takes longer to start the SAP HANA database. Use the command `sudo pcs status` to check the state of the cluster resources created: Use the command `sudo pcs status` to check the state of the cluster resources cr # vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-0 ``` -## Configure HANA active/read enabled system replication in Pacemaker cluster +## Configure HANA active/read-enabled system replication in Pacemaker cluster -Starting with SAP HANA 2.0 SPS 01 SAP allows Active/Read-Enabled setups for SAP HANA System Replication, where the secondary systems of SAP HANA system replication can be used actively for read-intense workloads. 
To support such setup in a cluster a second virtual IP address is required which allows clients to access the secondary read-enabled SAP HANA database. To ensure that the secondary replication site can still be accessed after a takeover has occurred the cluster needs to move the virtual IP address around with the secondary of the SAPHana resource. +Starting with SAP HANA 2.0 SPS 01, SAP allows active/read-enabled setups for SAP HANA System Replication, where the secondary systems of SAP HANA System Replication can be used actively for read-intense workloads. -This section describes the additional steps that are required to manage HANA Active/Read enabled system replication in a Red Hat high availability cluster with second virtual IP. +To support such a setup in a cluster, a second virtual IP address is required, which allows clients to access the secondary read-enabled SAP HANA database. To ensure that the secondary replication site can still be accessed after a takeover has occurred, the cluster needs to move the virtual IP address around with the secondary SAPHana resource. -Before proceeding further, make sure you have fully configured Red Hat High Availability Cluster managing SAP HANA database as described in above segments of the documentation. +This section describes the other steps that are required to manage HANA active/read-enabled system replication in a Red Hat HA cluster with a second virtual IP. -![SAP HANA high availability with read-enabled secondary](./media/sap-hana-high-availability/ha-hana-read-enabled-secondary.png) +Before you proceed further, make sure that you've fully configured the Red Hat HA cluster managing an SAP HANA database, as described in preceding segments of the documentation. -### Additional setup in Azure load balancer for active/read-enabled setup +![Diagram that shows SAP HANA HA with read-enabled secondary.](./media/sap-hana-high-availability/ha-hana-read-enabled-secondary.png) -To proceed with additional steps on provisioning second virtual IP, make sure you have configured Azure Load Balancer as described in [Deploy Linux VMs manually via Azure portal](#deploy-linux-vms-manually-via-azure-portal) section. +### Additional setup in Azure Load Balancer for active/read-enabled setup -1. For **standard** load balancer, follow below additional steps on the same load balancer that you had created in earlier section. +To proceed with more steps on provisioning a second virtual IP, make sure that you've configured Azure Load Balancer as described in the [Deploy Linux VMs manually via Azure portal](#deploy-linux-vms-manually-via-the-azure-portal) section. ++1. For a **standard** load balancer, follow these steps on the same load balancer that you created in an earlier section. a. Create a second front-end IP pool: * Open the load balancer, select **frontend IP pool**, and select **Add**. * Enter the name of the second front-end IP pool (for example, **hana-secondaryIP**).- * Set the **Assignment** to **Static** and enter the IP address (for example, **10.0.0.14**). + * Set **Assignment** to **Static** and enter the IP address (for example, **10.0.0.14**). * Select **OK**. * After the new front-end IP pool is created, note the pool IP address. - b. Next, create a health probe: + b. Create a health probe: * Open the load balancer, select **health probes**, and select **Add**. * Enter the name of the new health probe (for example, **hana-secondaryhp**).- * Select **TCP** as the protocol and port **62603**. 
Keep the **Interval** value set to 5, and the **Unhealthy threshold** value set to 2. + * Select **TCP** as the protocol and port **62603**. Keep the **Interval** value set to **5** and the **Unhealthy threshold** value set to **2**. * Select **OK**. - c. Next, create the load-balancing rules: + c. Create the load-balancing rules: * Open the load balancer, select **load balancing rules**, and select **Add**. * Enter the name of the new load balancer rule (for example, **hana-secondarylb**).- * Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-secondaryIP**, **hana-backend** and **hana-secondaryhp**). + * Select the front-end IP address, the back-end pool, and the health probe that you created earlier (for example, **hana-secondaryIP**, **hana-backend**, and **hana-secondaryhp**). * Select **HA Ports**.- * Make sure to **enable Floating IP**. + * Make sure to enable **Floating IP**. * Select **OK**. -### Configure HANA active/read enabled system replication +### Configure HANA active/read-enabled system replication -The steps to configure HANA system replication are described in [Configure SAP HANA 2.0 System Replication](#configure-sap-hana-20-system-replication) section. If you're deploying read-enabled secondary scenario, while configuring system replication on the second node, execute following command as **hanasid**adm: +The steps to configure HANA System Replication are described in the [Configure SAP HANA 2.0 System Replication](#configure-sap-hana-20-system-replication) section. If you're deploying a read-enabled secondary scenario while you're configuring system replication on the second node, run the following command as **hanasid**adm: ```bash sapcontrol -nr 03 -function StopWait 600 10 sapcontrol -nr 03 -function StopWait 600 10 hdbnsutil -sr_register --remoteHost=hn1-db-0 --remoteInstance=03 --replicationMode=sync --name=SITE2 --operationMode=logreplay_readaccess ``` -### Adding a secondary virtual IP address resource for an active/read-enabled setup +### Add a secondary virtual IP address resource for an active/read-enabled setup The second virtual IP and the appropriate colocation constraint can be configured with the following commands: pcs constraint location g_secip_HN1_03 rule score=4000 hana_hn1_sync_state eq PR pcs property set maintenance-mode=false ``` -Make sure that the cluster status is ok and that all of the resources are started. The second virtual IP runs on the secondary site along with SAPHana secondary resource. +Make sure that the cluster status is okay and that all the resources are started. The second virtual IP runs on the secondary site along with the SAPHana secondary resource. ```output sudo pcs status sudo pcs status # secvip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-1 ``` -In next section, you can find the typical set of failover tests to execute. +In the next section, you can find the typical set of failover tests to run. -Be aware of the second virtual IP behavior, while testing a HANA cluster configured with read-enabled secondary: +Be aware of the second virtual IP behavior while you're testing a HANA cluster configured with read-enabled secondary: -1. When you migrate **SAPHana_HN1_03** cluster resource to secondary site **hn1-db-1**, the second virtual IP continues to run on the same site **hn1-db-1**. 
If you have set AUTOMATED_REGISTER="true" for the resource and HANA system replication is registered automatically on **hn1-db-0**, then your second virtual IP will also move to **hn1-db-0**. +1. When you migrate the **SAPHana_HN1_03** cluster resource to the secondary site **hn1-db-1**, the second virtual IP continues to run on the same site **hn1-db-1**. If you've set `AUTOMATED_REGISTER="true"` for the resource and HANA system replication is registered automatically on **hn1-db-0**, your second virtual IP also moves to **hn1-db-0**. -2. On testing server crash, second virtual IP resources (**secvip_HN1_03**) and Azure load balancer port resource (**secnc_HN1_03**) will run on primary server alongside the primary virtual IP resources. So, till the time secondary server is down, application that are connected to read-enabled HANA database connects to primary HANA database. The behavior is expected as you don't want applications that are connected to read-enabled HANA database to be inaccessible till the time secondary server is unavailable. - -3. During failover and fallback of second virtual IP address, it may happen that the existing connections on applications that use second virtual IP to connect to the HANA database may get interrupted. +1. On testing a server crash, the second virtual IP resources (**secvip_HN1_03**) and the Azure Load Balancer port resource (**secnc_HN1_03**) run on the primary server alongside the primary virtual IP resources. So, until the time that the secondary server is down, applications that are connected to the read-enabled HANA database connect to the primary HANA database. The behavior is expected because you don't want applications that are connected to the read-enabled HANA database to be inaccessible until the time the secondary server is unavailable. -The setup maximizes the time that the second virtual IP resource will be assigned to a node where a healthy SAP HANA instance is running. +1. During failover and fallback of the second virtual IP address, the existing connections on applications that use the second virtual IP to connect to the HANA database might get interrupted. ++The setup maximizes the time that the second virtual IP resource is assigned to a node where a healthy SAP HANA instance is running. ## Test the cluster setup -This section describes how you can test your setup. Before you start a test, make sure that Pacemaker doesn't have any failed action (via pcs status), there are no unexpected location constraints (for example leftovers of a migration test) and that HANA is sync state, for example with systemReplicationStatus: +This section describes how you can test your setup. Before you start a test, make sure that Pacemaker doesn't have any failed action (via pcs status), there are no unexpected location constraints (for example, leftovers of a migration test), and that HANA is in sync state, for example, with `systemReplicationStatus`. 
```bash sudo su - hn1adm -c "python /usr/sap/HN1/HDB03/exe/python_support/systemReplicationStatus.py" Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-0 ``` -You can migrate the SAP HANA master node by executing the following command as root: +You can migrate the SAP HANA master node by running the following command as root: ```bash # On RHEL 7.x pcs resource move SAPHana_HN1_03-master pcs resource move SAPHana_HN1_03-clone --master ``` -If you set `AUTOMATED_REGISTER="false"`, this command should migrate the SAP HANA master node and the group that contains the virtual IP address to hn1-db-1. +If you set `AUTOMATED_REGISTER="false"`, this command should migrate the SAP HANA master node and the group that contains the virtual IP address to `hn1-db-1`. -Once the migration is done, the 'sudo pcs status' output looks like this +After the migration is done, the `sudo pcs status` output looks like: ```output Clone Set: SAPHanaTopology_HN1_03-clone [SAPHanaTopology_HN1_03] Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-1 ``` -The SAP HANA resource on hn1-db-0 is stopped. In this case, configure the HANA instance as secondary by executing these commands, as **hn1adm**: +The SAP HANA resource on `hn1-db-0` is stopped. In this case, configure the HANA instance as secondary by running these commands, as **hn1adm**: ```bash sapcontrol -nr 03 -function StopWait 600 10 hdbnsutil -sr_register --remoteHost=hn1-db-1 --remoteInstance=03 --replicationMode=sync --name=SITE1 ``` -The migration creates location constraints that need to be deleted again. Do the following as root, or via sudo: +The migration creates location constraints that need to be deleted again. Run the following command as root, or via `sudo`: ```bash pcs resource clear SAPHana_HN1_03-master ``` -Monitor the state of the HANA resource using `pcs status`. Once HANA is started on hn1-db-0, the output should look like this +Monitor the state of the HANA resource by using `pcs status`. After HANA is started on `hn1-db-0`, the output should look like: ```output Clone Set: SAPHanaTopology_HN1_03-clone [SAPHanaTopology_HN1_03] Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-1 ``` -### Blocking network communication +### Block network communication Resource state before starting the test: Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-1 ``` -Execute firewall rule to block the communication on one of the nodes. +Run the firewall rule to block the communication on one of the nodes. ```bash # Execute iptable rule on hn1-db-1 (10.0.0.6) to block the incoming and outgoing traffic to hn1-db-0 (10.0.0.5) iptables -A INPUT -s 10.0.0.5 -j DROP; iptables -A OUTPUT -d 10.0.0.5 -j DROP ``` -When cluster nodes can't communicate to each other, there's a risk of a split-brain scenario. In such situations, cluster nodes will try to simultaneously fence each other, resulting in fence race. To avoid such situation, it's recommended to set [priority-fencing-delay](#create-sap-hana-cluster-resources) property in cluster configuration (applicable only for [pacemaker-2.0.4-6.el8](https://access.redhat.com/errata/RHEA-2020:4804) or higher). +When cluster nodes can't communicate with each other, there's a risk of a split-brain scenario. In such situations, cluster nodes try to simultaneously fence each other, resulting in a fence race. 
To avoid such a situation, we recommend that you set the [priority-fencing-delay](#create-sap-hana-cluster-resources) property in cluster configuration (applicable only for [pacemaker-2.0.4-6.el8](https://access.redhat.com/errata/RHEA-2020:4804) or higher). -By enabling priority-fencing-delay property, the cluster introduces an additional delay in the fencing action specifically on the node hosting HANA master resource, allowing the node to win the fence race. +By enabling the `priority-fencing-delay` property, the cluster introduces a delay in the fencing action specifically on the node hosting the HANA master resource, allowing the node to win the fence race. -Execute below command to delete the firewall rule. +Run the following command to delete the firewall rule: ```bash # If the iptables rule set on the server gets reset after a reboot, the rules will be cleared out. In case they have not been reset, please proceed to remove the iptables rule using the following command. iptables -D INPUT -s 10.0.0.5 -j DROP; iptables -D OUTPUT -d 10.0.0.5 -j DROP ### Test the Azure fencing agent > [!NOTE]-> This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, weΓÇÖll remove it from this article. +> This article contains references to a term that Microsoft no longer uses. When the term is removed from the software, we'll remove it from this article. Resource state before starting the test: Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-1 ``` -You can test the setup of the Azure fencing agent by disabling the network interface on the node where SAP HANA is running as Master. -See [Red Hat Knowledgebase article 79523](https://access.redhat.com/solutions/79523) for a description on how to simulate a network failure. In this example we use the net_breaker script, as root, to block all access to the network. +You can test the setup of the Azure fencing agent by disabling the network interface on the node where SAP HANA is running as Master. For a description on how to simulate a network failure, see [Red Hat Knowledge Base article 79523](https://access.redhat.com/solutions/79523). ++In this example, we use the `net_breaker` script as root to block all access to the network: ```bash sh ./net_breaker.sh BreakCommCmd 10.0.0.6 ``` -The virtual machine should now restart or stop depending on your cluster configuration. -If you set the `stonith-action` setting to off, the virtual machine is stopped and the resources are migrated to the running virtual machine. +The VM should now restart or stop depending on your cluster configuration. +If you set the `stonith-action` setting to `off`, the VM is stopped and the resources are migrated to the running VM. -After you start the virtual machine again, the SAP HANA resource fails to start as secondary if you set `AUTOMATED_REGISTER="false"`. In this case, configure the HANA instance as secondary by executing this command as the **hn1adm** user: +After you start the VM again, the SAP HANA resource fails to start as secondary if you set `AUTOMATED_REGISTER="false"`. 
In this case, configure the HANA instance as secondary by running this command as the **hn1adm** user: ```bash sapcontrol -nr 03 -function StopWait 600 10 hdbnsutil -sr_register --remoteHost=hn1-db-0 --remoteInstance=03 --replicationMode=sync --name=SITE2 ``` -Switch back to root and clean up the failed state +Switch back to root and clean up the failed state: ```bash # On RHEL 7.x Resource Group: g_ip_HN1_03 vip_HN1_03 (ocf::heartbeat:IPaddr2): Started hn1-db-0 ``` -You can test a manual failover by stopping the cluster on the hn1-db-0 node, as root: +You can test a manual failover by stopping the cluster on the `hn1-db-0` node, as root: ```bash pcs cluster stop ``` -After the failover, you can start the cluster again. If you set `AUTOMATED_REGISTER="false"`, the SAP HANA resource on the hn1-db-0 node fails to start as secondary. In this case, configure the HANA instance as secondary by executing this command as root: +After the failover, you can start the cluster again. If you set `AUTOMATED_REGISTER="false"`, the SAP HANA resource on the `hn1-db-0` node fails to start as secondary. In this case, configure the HANA instance as secondary by running this command as root: ```bash pcs cluster start ``` -Execute the following as **hn1adm** +Run the following as **hn1adm**: ```bash sapcontrol -nr 03 -function StopWait 600 10 hdbnsutil -sr_register --remoteHost=hn1-db-1 --remoteInstance=03 --replicationMode=sync --name=SITE1 ``` -Then as root - +Then as root: ```bash # On RHEL 7.x |
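The SAP HANA failover tests above each end with a verification step. As a consolidated reference, here's a short health-check sketch that strings together the verification commands already used in this section; it assumes the same example values as above (SID **HN1**, instance number **03**, administration user **hn1adm**) and doesn't replace the per-test instructions.

```bash
# Consolidated post-test health check, using the example values from this section
# (SID HN1, instance number 03, <hanasid>adm user hn1adm).

# Confirm that HANA System Replication reports all databases as in sync.
sudo su - hn1adm -c "python /usr/sap/HN1/HDB03/exe/python_support/systemReplicationStatus.py"

# Confirm that Pacemaker shows all resources started and no failed actions.
sudo pcs status

# List any location constraints left over from migration tests before clearing them.
sudo pcs constraint
```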
search | Retrieval Augmented Generation Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/retrieval-augmented-generation-overview.md | print("\n-\nPrompt:\n" + prompt) + [Use Azure AI Studio and "bring your own data"](/azure/ai-services/openai/concepts/use-your-data) to experiment with prompts on an existing search index. This step helps you decide what model to use, and shows you how well your existing index works in a RAG scenario. ++ ["Chat with your data" solution accelerator](https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator) to create your own RAG solution.+ + [Review the azure-search-openai-demo demo](https://github.com/Azure-Samples/azure-search-openai-demo) to see a working RAG solution that includes Cognitive Search, and to study the code that builds the experience. This demo uses a fictitious Northwind Health Plan for its data. Here's a [similar end-to-end demo](https://github.com/Azure-Samples/openai/blob/main/End_to_end_Solutions/AOAISearchDemo/README.md) from the Azure OpenAI team. This demo uses unstructured .pdf data consisting of publicly available documentation on Microsoft Surface devices. print("\n-\nPrompt:\n" + prompt) + [Review creating queries](search-query-create.md) to learn more about search request syntax and requirements. -<!-- + Use this accelerator to create your own RAG solution. --> - > [!NOTE] > Some Cognitive Search features are intended for human interaction and aren't useful in a RAG pattern. Specifically, you can skip autocomplete and suggestions. Other features like facets and orderby might be useful, but would be uncommon in a RAG scenario. |
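To make the query-creation guidance above more concrete, here's a minimal sketch of the kind of retrieval request a RAG orchestrator might send to Cognitive Search before assembling a prompt. The service name, index name, query key, field list, and search text are placeholders, and the API version shown is just one of the stable versions referenced elsewhere in this digest.

```bash
# Hypothetical placeholders: <service-name>, <index-name>, <query-key>, and the field names.
curl -X POST "https://<service-name>.search.windows.net/indexes/<index-name>/docs/search?api-version=2020-06-30" \
  -H "Content-Type: application/json" \
  -H "api-key: <query-key>" \
  -d '{
        "search": "which plans cover emergency care?",
        "select": "title,content",
        "top": 5
      }'
# The top results are typically concatenated into the prompt that's sent to the chat model.
```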
search | Search Dotnet Sdk Migration Version 11 | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-dotnet-sdk-migration-version-11.md | If you're using Newtonsoft.Json for JSON serialization, you can pass in global n ## Inside v11 -Each version of an Azure Cognitive Search client library targets a corresponding version of the REST API. The REST API is considered foundational to the service, with individual SDKs wrapping a version of the REST API. As a .NET developer, it can be helpful to review the more verbose [REST API documentation](/rest/api/searchservice/) for more in depth coverage of specific objects or operations. Version 11 targets the [2020-06-30 search service](https://github.com/Azure/azure-rest-api-specs/blob/master/specification/search/data-plane/Azure.Search/preview/2020-06-30-Preview/searchservice.json). +Each version of an Azure Cognitive Search client library targets a corresponding version of the REST API. The REST API is considered foundational to the service, with individual SDKs wrapping a version of the REST API. As a .NET developer, it can be helpful to review the more verbose [REST API documentation](/rest/api/searchservice/) for more in depth coverage of specific objects or operations. Version 11 targets the [2020-06-30 search service specification](https://github.com/Azure/azure-rest-api-specs/tree/main/specification/search/data-plane/Azure.Search/stable/2020-06-30). Version 11.0 fully supports the following objects and operations: |
search | Search Get Started Semantic | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-get-started-semantic.md | Title: 'Quickstart: semantic search' + Title: 'Quickstart: semantic ranking' -description: Change an existing index to use semantic search. +description: Change an existing index to use semantic ranking. -# Quickstart: Semantic search with .NET or Python +# Quickstart: Semantic ranking with .NET or Python > [!IMPORTANT] > Semantic search is in public preview under [supplemental terms of use](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). It's available through Azure portal, preview REST APIs, and beta SDKs. This feature is billable. See [Availability and pricing](semantic-search-overview.md#availability-and-pricing). |
search | Search Get Started Vector | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-get-started-vector.md | You're now ready to send the requests to your search service. For each request, Use the [Create or Update Index](/rest/api/searchservice/2023-10-01-preview/indexes/create-or-update) REST API for this request. -The index schema is organized around hotels content. Sample data consists of the names, descriptions, and locations of seven fictitious hotels. This schema includes fields for vector and traditional keyword search, with configurations for vector and semantic search. +The index schema is organized around hotels content. Sample data consists of the names, descriptions, and locations of seven fictitious hotels. This schema includes fields for vector and traditional keyword search, with configurations for vector and semantic ranking. The following example is a subset of the full index. We trimmed the definition so that you can focus on field definitions, vector configuration, and optional semantic configuration. |
search | Search Howto Connecting Azure Sql Database To Azure Search Using Indexers | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md | Yes. However, you need to allow your search service to connect to your database. **Q: Can I use Azure SQL indexer with SQL databases running on-premises?** -Not directly. We don't recommend or support a direct connection, as doing so would require you to open your databases to Internet traffic. Customers have succeeded with this scenario using bridge technologies like Azure Data Factory. For more information, see [Push data to an Azure Cognitive Search index using Azure Data Factory](../data-factory/v1/data-factory-azure-search-connector.md). +Not directly. We don't recommend or support a direct connection, as doing so would require you to open your databases to Internet traffic. Customers have succeeded with this scenario using bridge technologies like Azure Data Factory. For more information, see [Push data to an Azure Cognitive Search index using Azure Data Factory](../data-factory/connector-azure-search.md). **Q: Can I use a secondary replica in a [failover cluster](/azure/azure-sql/database/auto-failover-group-overview) as a data source?** |
search | Search Howto Dotnet Sdk | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-howto-dotnet-sdk.md | The client library defines classes like `SearchIndex`, `SearchField`, and `Searc + [`Azure.Search.Documents.Indexes.Models`](/dotnet/api/azure.search.documents.indexes.models) + [`Azure.Search.Documents.Models`](/dotnet/api/azure.search.documents.models) -Azure.Search.Documents (version 11) targets version [`2020-06-30` of the Azure Cognitive Search REST API](https://github.com/Azure/azure-rest-api-specs/tree/master/specification/search/data-plane/Azure.Search/preview/2020-06-30-Preview). +Azure.Search.Documents (version 11) targets the [2020-06-30 search service specification](https://github.com/Azure/azure-rest-api-specs/tree/main/specification/search/data-plane/Azure.Search/stable/2020-06-30). The client library doesn't provide [service management operations](/rest/api/searchmanagement/), such as creating and scaling search services and managing API keys. If you need to manage your search resources from a .NET application, use the [Microsoft.Azure.Management.Search](/dotnet/api/microsoft.azure.management.search) library in the Azure SDK for .NET. |
search | Search Index Azure Sql Managed Instance With Managed Identity | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/search-index-azure-sql-managed-instance-with-managed-identity.md | This article describes how to set up an Azure Cognitive Search indexer connectio You can use a system-assigned managed identity or a user-assigned managed identity (preview). Managed identities are Microsoft Entra logins and require Azure role assignments to access data in SQL Managed Instance. -Before learning more about this feature, it is recommended that you have an understanding of what an indexer is and how to set up an indexer for your data source. More information can be found at the following links: +Before learning more about this feature, it's recommended that you understand what an indexer is and how to set up an indexer for your data source. More information can be found at the following links: * [Indexer overview](search-indexer-overview.md) * [SQL Managed Instance indexer](search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md) Before learning more about this feature, it is recommended that you have an unde To assign read permissions on SQL Managed Instance, you must be an Azure Global Admin with a SQL Managed Instance. See [Configure and manage Microsoft Entra authentication with SQL Managed Instance](/azure/azure-sql/database/authentication-aad-configure) and follow the steps to provision a Microsoft Entra admin (SQL Managed Instance). -* [Configure a public endpoint and network security group in SQL Managed Instance](search-howto-connecting-azure-sql-mi-to-azure-search-using-indexers.md) to allow connections from Azure Cognitive Search. If your Azure SQL Managed Instance is configured for private connections, [create a shared private link](search-indexer-how-to-access-private-sql.md) in Cognitive Search to allow the connection. +* [Configure a public endpoint and network security group in SQL Managed Instance](search-howto-connecting-azure-sql-mi-to-azure-search-using-indexers.md) to allow connections from Azure Cognitive Search. Connecting through a Shared Private Link when using a managed identity isn't currently supported. ## 1 - Assign permissions to read the database Follow these steps to assign the search service system managed identity permissi :::image type="content" source="./media/search-index-azure-sql-managed-instance-with-managed-identity/sql-login.png" alt-text="Showing screenshot of the Connect to Server dialog."::: -3. From the left pane, locate the SQL database you will be using as data source for indexing and right-click it. Select **New Query**. +3. From the left pane, locate the SQL database you are using as data source for indexing and right-click it. Select **New Query**. :::image type="content" source="./media/search-index-azure-sql-managed-instance-with-managed-identity/new-sql-query.png" alt-text="Showing screenshot of new SQL query."::: DROP USER IF EXISTS [insert your search service name or user-assigned managed id ## 2 - Add a role assignment -In this step you will give your Azure Cognitive Search service permission to read data from your SQL Managed Instance. +In this step, you'll give your Azure Cognitive Search service permission to read data from your SQL Managed Instance. -1. In the Azure portal navigate to your SQL Managed Instance page. +1. In the Azure portal, navigate to your SQL Managed Instance page. 1. Select **Access control (IAM)**. 1. Select **Add** then **Add role assignment**. 
In this step you will give your Azure Cognitive Search service permission to rea 4. Select **Reader** role.-1. Leave **Assign access to** as **Microsoft Entra user, group or service principal**. +1. Leave **Assign access to** as **Microsoft Entra user, group, or service principal**. 1. If you're using a system-assigned managed identity, search for your search service, then select it. If you're using a user-assigned managed identity, search for the name of the user-assigned managed identity, then select it. Select **Save**. Example for SQL Managed Instance using a system-assigned managed identity: Create the data source and provide a system-assigned managed identity. The [REST API](/rest/api/searchservice/create-data-source), Azure portal, and the [.NET SDK](/dotnet/api/azure.search.documents.indexes.models.searchindexerdatasourceconnection) support system-assigned managed identity. -When you're connecting with a system-assigned managed identity, the only change to the data source definition is the format of the "credentials" property. You'll provide an Initial Catalog or Database name and a ResourceId that has no account key or password. The ResourceId must include the subscription ID of SQL Managed Instance, the resource group of SQL Managed instance, and the name of the SQL database. +When you're connecting with a system-assigned managed identity, the only change to the data source definition is the format of the "credentials" property. You'll provide an Initial Catalog or Database name and a `ResourceId` that has no account key or password. The `ResourceId` must include the subscription ID of SQL Managed Instance, the resource group of SQL Managed instance, and the name of the SQL database. -Here is an example of how to create a data source to index data from a storage account using the [Create Data Source](/rest/api/searchservice/create-data-source) REST API and a managed identity connection string. The managed identity connection string format is the same for the REST API, .NET SDK, and the Azure portal. +Here's an example of how to create a data source to index data from a storage account using the [Create Data Source](/rest/api/searchservice/create-data-source) REST API and a managed identity connection string. The managed identity connection string format is the same for the REST API, .NET SDK, and the Azure portal. ```http POST https://[service name].search.windows.net/datasources?api-version=2020-06-30 api-key: [admin key] An indexer connects a data source with a target search index, and provides a schedule to automate the data refresh. Once the index and data source have been created, you're ready to create the indexer. -Here's a [Create Indexer](/rest/api/searchservice/create-indexer) REST API call with an Azure SQL indexer definition. The indexer will run when you submit the request. +Here's a [Create Indexer](/rest/api/searchservice/create-indexer) REST API call with an Azure SQL indexer definition. The indexer runs when you submit the request. ```http POST https://[service name].search.windows.net/indexers?api-version=2020-06-30 |
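As a rough sketch of the data source definition described above (placeholder names throughout; the exact connection string syntax is an assumption based on the `ResourceId` description rather than a verbatim excerpt from the article), a request without any account key or password could look like this:

```http
POST https://[service name].search.windows.net/datasources?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]

{
    "name": "sql-mi-datasource",
    "type": "azuresql",
    "credentials": {
        "connectionString": "Database=my-database;ResourceId=/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-resource-group/providers/Microsoft.Sql/managedInstances/my-managed-instance;"
    },
    "container": { "name": "dbo.my-table" }
}
```

Because the connection carries no secret, the search service's system-assigned managed identity and the role assignment from the previous step are what authorize the indexer to read the database.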
search | Vector Search How To Create Index | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/vector-search-how-to-create-index.md | In the following REST API example, "title" and "content" contain textual content REST API version [**2023-07-01-Preview**](/rest/api/searchservice/index-preview) enables vector scenarios. This version adds: -In the following REST API example, "title" and "content" contain textual content used in full text search and semantic search, while "titleVector" and "contentVector" contain vector data. +In the following REST API example, "title" and "content" contain textual content used in full text search and semantic ranking, while "titleVector" and "contentVector" contain vector data. > [!TIP] > Updating an existing index to include vector fields? Make sure the `allowIndexDowntime` query parameter is set to `true`. |
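A minimal index sketch along these lines pairs each text field with its vector counterpart; the dimension count, algorithm configuration name, and HNSW choice are illustrative assumptions for the 2023-07-01-Preview shape, not values taken from the article:

```http
PUT https://[service name].search.windows.net/indexes/my-index?api-version=2023-07-01-Preview&allowIndexDowntime=true
Content-Type: application/json
api-key: [admin key]

{
    "name": "my-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "title", "type": "Edm.String", "searchable": true },
        { "name": "content", "type": "Edm.String", "searchable": true },
        { "name": "titleVector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchConfiguration": "my-vector-config" },
        { "name": "contentVector", "type": "Collection(Edm.Single)", "searchable": true, "dimensions": 1536, "vectorSearchConfiguration": "my-vector-config" }
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            { "name": "my-vector-config", "kind": "hnsw" }
        ]
    }
}
```

The `allowIndexDowntime=true` query parameter matters only when vector fields are added to an existing index, per the tip above.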
search | Vector Search Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/vector-search-overview.md | Within an index definition, you can specify one or more algorithms, and then for + [Create a vector index](vector-search-how-to-create-index.md) to specify an algorithm in the index and on fields. -+ For `exhaustiveKnn`, use [2023-10-01-Preview](/rest/api/searchservice/2023-10-01-preview/indexes/create-or-update) REST APIs or Azure SDK beta libraries that target the 2023-10-01-Preview version. ++ For exhaustive KNN, use [2023-10-01-Preview](/rest/api/searchservice/2023-10-01-preview/indexes/create-or-update) REST APIs or Azure SDK beta libraries that target the 2023-10-01-Preview version. -Algorithm parameters that are used to initialize the index during index creation are immutable and can't be changed after the index is built. Some parameters that affect the query-time characteristics might be modified. +Algorithm parameters that are used to initialize the index during index creation are immutable and can't be changed after the index is built. However, parameters that affect the query-time characteristics (`efSearch`) can be modified. -In addition, fields that specify HNSW algorithm also support exhaustive knn search using the [query request](vector-search-how-to-query.md) parameter `"exhaustive": true`. The opposite isn't true however. If a field is indexed for `exhaustiveKnn`, you can't use HNSW in the query because the additional data structures that enable efficient search don't exist. +In addition, fields that specify HNSW algorithm also support exhaustive KNN search using the [query request](vector-search-how-to-query.md) parameter `"exhaustive": true`. The opposite isn't true however. If a field is indexed for `exhaustiveKnn`, you can't use HNSW in the query because the additional data structures that enable efficient search don't exist. ### Approximate Nearest Neighbors |
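On the query side, forcing exhaustive search over an HNSW-indexed field might look like the sketch below; it assumes the 2023-10-01-Preview `vectorQueries` request shape, and the field name and shortened vector are placeholders:

```http
POST https://[service name].search.windows.net/indexes/my-index/docs/search?api-version=2023-10-01-Preview
Content-Type: application/json
api-key: [query key]

{
    "select": "title, content",
    "vectorQueries": [
        {
            "kind": "vector",
            "vector": [0.012, -0.034, 0.056],
            "fields": "contentVector",
            "k": 5,
            "exhaustive": true
        }
    ]
}
```

Setting `"exhaustive": true` bypasses the HNSW graph and scans all vectors in the field, trading query speed for exact results.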
search | Vector Search Ranking | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/vector-search-ranking.md | In vector query execution, the search engine looks for similar vectors to find t ## Determine relevance in vector search -The algorithms that determine relevance are exhaustive k-nearest neighbors (KNN) and Hierarchical Navigable Small World (HNSW). +The algorithms used in vector search are used to navigate the vector database and find matching vectors. Supported algorithms include exhaustive k-nearest neighbors (KNN) and Hierarchical Navigable Small World (HNSW). Exhaustive KNN performs a brute-force search that enables users to search the entire vector space for matches that are most similar to the query. It does this by calculating the distances between all pairs of data points and finding the exact `k` nearest neighbors for a query point. In the HNSW algorithm, a vector query search operation is executed by navigating ## Similarity metrics used to measure nearness -A similarity metric measures the distance between neighboring vectors. +The algorithm finds candidate vectors to evaluate similarity. To perform this task, a similarity metric calculation compares the candidate vector to the query vector and measures the similarity. The algorithm keeps track of the ordered set of most similar vectors that its found, which forms the ranked result set when the algorithm has reached completion. | Metric | Description | |--|-| A similarity metric measures the distance between neighboring vectors. Whenever results are ranked, **`@search.score`** property contains the value used to order the results. -| Search method | Parameter | Scoring algorithm | Range | +| Search method | Parameter | Scoring metric | Range | ||--|-|-|-| vector search | `@search.score` | HNSW or KNN algorithm, using the similarity metric specified in the algorithm configuration. | 0.333 - 1.00 (Cosine) | +| vector search | `@search.score` | Cosine | 0.333 - 1.00 | If you're using the `cosine` metric, it's important to note that the calculated `@search.score` isn't the cosine value between the query vector and the document vectors. Instead, Cognitive Search applies transformations such that the score function is monotonically decreasing, meaning score values will always decrease in value as the similarity becomes worse. This transformation ensures that search scores are usable for ranking purposes. |
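As an illustration of how such a transformation can produce the 0.333 to 1.00 range for cosine, one monotone mapping that fits the table above is sketched here; treat the exact formula as an assumption rather than a statement of the service's internal implementation:

```latex
% Illustrative assumption: map cosine similarity s to a score via the cosine distance d.
% With s in [-1, 1], d = 1 - s lies in [0, 2], so the score lands in [1/3, 1], that is, 0.333 to 1.00.
d = 1 - s, \qquad \text{score} = \frac{1}{1 + d}, \qquad \text{score} \in \left[\tfrac{1}{3},\, 1\right]
```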
search | Whats New | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/search/whats-new.md | Learn about the latest updates to Azure Cognitive Search functionality, docs, an | Item | Type | Description | |--||--|+| [**"Chat with your data" solution accelerator**](https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator) | Sample | End-to-end RAG pattern that uses Azure Cognitive Search as a retriever. It provides indexing, data chunking, orchestration and chat based on Azure OpenAI GPT. | | [**Exhaustive K-Nearest Neighbors (KNN)**](vector-search-overview.md#eknn) | Feature | Exhaustive K-Nearest Neighbors (KNN) is a new scoring algorithm for similarity search in vector space. It performs an exhaustive search for the nearest neighbors, useful for situations where high recall is more important than query performance. Available in the 2023-10-01-Preview REST API only. | | [**Prefilters in vector search**](vector-search-how-to-query.md) | Feature | Evaluates filter criteria before query execution, reducing the amount of content that needs to be searched. Available in the 2023-10-01-Preview REST API only, through a new `vectorFilterMode` property on the query that can be set to `preFilter` (default) or `postFilter`, depending on your requirements. | | [**2023-10-01-Preview Search REST API**](/rest/api/searchservice/search-service-api-versions#2023-10-01-Preview) | API | New preview version of the Search REST APIs that changes the definition for [vector fields](vector-search-how-to-create-index.md) and [vector queries](vector-search-how-to-query.md). This API version introduces breaking changes from **2023-07-01-Preview**, otherwise it's inclusive of all previous preview features. We recommend [creating new indexes](vector-search-how-to-create-index.md) for **2023-10-01-Preview**. You might encounter an HTTP 400 on some features on a migrated index, even if you migrated correctly.| |
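For the prefilter item above, a 2023-10-01-Preview query sketch could look like the following; the filter expression, field name, and shortened vector are placeholders, and the request shape is assumed from the preview conventions rather than quoted from the linked article:

```http
POST https://[service name].search.windows.net/indexes/my-index/docs/search?api-version=2023-10-01-Preview
Content-Type: application/json
api-key: [query key]

{
    "filter": "category eq 'docs'",
    "vectorFilterMode": "preFilter",
    "vectorQueries": [
        {
            "kind": "vector",
            "vector": [0.012, -0.034, 0.056],
            "fields": "contentVector",
            "k": 5
        }
    ]
}
```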
security | Threat Modeling Tool Communication Security | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/security/develop/threat-modeling-tool-communication-security.md | -| **References** |[Moving data between On-premises and Azure Data Factory](../../data-factory/v1/data-factory-move-data-between-onprem-and-cloud.md#create-gateway), [Data management gateway](../../data-factory/v1/data-factory-data-management-gateway.md) | +| **References** |[Moving data between On-premises and Azure Data Factory](../../data-factory/create-self-hosted-integration-runtime.md?tabs=data-factory) | | **Steps** | <p>The Data Management Gateway (DMG) tool is required to connect to data sources which are protected behind corpnet or a firewall.</p><ol><li>Locking down the machine isolates the DMG tool and prevents malfunctioning programs from damaging or snooping on the data source machine. (E.g. latest updates must be installed, enable minimum required ports, controlled accounts provisioning, auditing enabled, disk encryption enabled etc.)</li><li>Data Gateway key must be rotated at frequent intervals or whenever the DMG service account password renews</li><li>Data transits through Link Service must be encrypted</li></ol> | ## <a id="identity-https"></a>Ensure that all traffic to Identity Server is over HTTPS connection |
security | Best Practices And Patterns | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/security/fundamentals/best-practices-and-patterns.md | The articles below contain security best practices to use when you're designing, The best practices are intended to be a resource for IT pros. This might include designers, architects, developers, and testers who build and deploy secure Azure solutions. -* [Azure boundary security best practices](./network-best-practices.md#adopt-a-zero-trust-approach) * [Azure database security best practices](/azure/azure-sql/database/security-best-practice) * [Azure data security and encryption best practices](data-encryption-best-practices.md) * [Azure identity management and access control security best practices](identity-management-best-practices.md) The best practices are intended to be a resource for IT pros. This might include * [Securing PaaS databases in Azure](paas-applications-using-sql.md) * [Securing PaaS web and mobile applications using Azure App Service](paas-applications-using-app-services.md) * [Securing PaaS web and mobile applications using Azure Storage](paas-applications-using-storage.md)-* [Security best practices for IaaS workloads in Azure](iaas.md) ## Next steps |
service-connector | Concept Availability | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/concept-availability.md | -Service Connector supports Azure availability zones to help you achieve resiliency and reliability for your business-critical workloads. The goal of the high availability architecture in Service Connector is to guarantee that your service connections are up and running at least 99.9% of time, so that you don't have to worry about the effects of potential maintenance operations and outages. Service Connector is designed to provide high availability support for all types of applications you're running on Azure. +Service Connector supports Azure availability zones to help you achieve resiliency and reliability for your business-critical workloads. The goal of the high availability architecture in Service Connector is to guarantee that your service connections are up and running at least 99.9% of time, so that you don't have to worry about the effects of potential maintenance operations and outages. Service Connector is designed to provide high availability support for all types of applications you're running in Azure. Users can distribute Azure compute services across availability zones in many regions. Service Connector is an extension resource provider to these compute services. When you create a service connection in a compute service with availability zones enabled, Azure will also automatically set up the corresponding service connection availability zone for your service connection. Microsoft is responsible for setting up availability zones and disaster recovery for your service connections. ## Zone redundancy in Service Connector -Service Connector is an Azure extension resource provider. It extends Azure App Service, Azure Spring Apps and Azure Container Apps. When you create a new service connection in one of these compute services with Service Connector, there's a connection resource provisioned as part of your top-level parent compute service. +Service Connector is an Azure extension resource provider. It extends Azure App Service, Azure Spring Apps and Azure Container Apps. When you create a new service connection in one of these compute services with Service Connector, a connection resource is provisioned as part of your top-level parent compute service. To enable zone redundancy for your connection, you must enable zone redundancy for your compute service. Once the compute service has been configured with zone redundancy, your service connections will also automatically become zone-redundant. For example, if you have an app service with zone redundancy enabled, the platform automatically spreads your app service instances across three zones in the selected region. When you create a service connection in this app service with Service Connector, the service connection resource is also automatically created in the three corresponding zones in the selected region. Traffic is routed to all of your available connection resources. When a zone goes down, the platform detects the lost instances, automatically attempts to find new replacement instances, and spreads the traffic as needed. > [!NOTE]-> To create, update, validate and list service connections, Service Connector calls APIs from a compute service and a target service. As Service Connector relies on the responses from both the compute service and the target service, requests to Service Connector in a zone-down scenario may not succeed if the target service can't be reached. 
This limitation applies to App Service, Container Apps and Spring Apps. +> To create, update, validate and list service connections, Service Connector calls APIs from a compute service and a target service. As Service Connector relies on the responses from both the compute service and the target service, requests to Service Connector in a zone-down scenario might not succeed if the target service can't be reached. This limitation applies to App Service, Azure Container Apps and Azure Spring Apps. ## How to create a zone-redundant service connection with Service Connector -Follow the instructions below to create a zone-redundant Service Connection in App Service using the Azure CLI or the Azure portal. The same process can be used to create a zone-redundant connection for Spring Apps and Container Apps compute services. +Follow the instructions below to create a zone-redundant service connection in App Service using the Azure CLI or the Azure portal. The same process can be used to create a zone-redundant connection for Azure Spring Apps and Azure Container Apps compute services. ### [Azure CLI](#tab/azure-cli) -To enable zone redundancy for a service connection using the Azure CLI, you must first create a zone-redundant app service. +To enable zone redundancy for a service connection using the Azure CLI, start by creating a zone-redundant App Service. 1. Create an App Service plan and include a `--zone-redundant` parameter. Optionally include the `--number-of-workers` parameter to specify capacity. Learn more details in [How to deploy a zone-redundant App Service](../app-service/environment/overview-zone-redundancy.md). Disaster recovery is the process of restoring application functionality after a In the cloud, we acknowledge upfront that failures will certainly happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. If there's a disaster, Service Connector will fail over to the paired region. Customers don't need to do anything if the outage is decided/declared by the Service Connector team. -We'll use the terms RTO (Recovery Time Objective), to indicate the time between the beginning of an outage impacting Service Connector and the recovery to full availability. We'll use RPO (Recovery Point Objective), to indicate the time between the last operation correctly restored and the time of the start of the outage affecting Service Connector. Expected and maximum RPO is 24 hours and RTO is 24 hours. +We'll use the terms *RTO (Recovery Time Objective)*, to indicate the time between the beginning of an outage impacting Service Connector and the recovery to full availability. We'll use *RPO (Recovery Point Objective)*, to indicate the time between the last operation correctly restored and the time of the start of the outage affecting Service Connector. Expected and maximum RPO is 24 hours and RTO is 24 hours. -Operations against Service Connector may fail during the disaster time, before the failover happens. Once the failover is completed, data will be restored and the customer isn't required to take any action. +Operations against Service Connector might fail during the disaster time, before the failover happens. Once the failover is completed, data will be restored and the customer isn't required to take any action. Service connector handles business continuity and disaster recovery (BCRD) for storage and compute.
The platform strives to have as little impact as possible in case of issues in storage/compute, in any region. The data layer design prioritizes availability over latency in the event of a disaster, meaning that if a region goes down, Service Connector will attempt to serve the end-user request from its paired region. |
service-connector | Concept Permission | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/concept-permission.md | Title: Permission requirement for Service Connector description: Resource permission requirement--++ Last updated 08/04/2023 Last updated 08/04/2023 # Permission requirement for Service Connector -Service Connector creates connection between compute service and target service with the [On-Behalf-Of token](../active-directory/develop/v2-oauth2-on-behalf-of-flow.md). Creating a connection to a specific Azure resource requires its corresponding permissions. +Service Connector creates connections between Azure services using an [on-behalf-of token](../active-directory/develop/v2-oauth2-on-behalf-of-flow.md). Creating a connection to a specific Azure resource requires its corresponding permissions. -### App service +### App Service > [!div class="mx-tableFixed"] > | Action | Description | Service Connector creates connection between compute service and target service > |`microsoft.web/sites/slots/config/delete`|Delete Web Apps Slots Config.| > |`Microsoft.Web/sites/slots/config/list/Action`|List Web App Slot's security sensitive settings, such as publishing credentials, app settings and connection strings| -### Spring App +### Azure Spring App > [!div class="mx-tableFixed"] > | Action | Description | Service Connector creates connection between compute service and target service > |`Microsoft.AppPlatform/Spring/apps/deployments/*/write`|Create or update the deployment for a specific application| > |`Microsoft.AppPlatform/Spring/apps/deployments/*/delete`|Delete the deployment for a specific application| -### Container App +### Azure Container Apps > [!div class="mx-tableFixed"] > | Action | Description | Service Connector creates connection between compute service and target service > |`microsoft.app/locations/containerappoperationresults/read`|Get a Container App Long Running Operation Result| > |`microsoft.app/locations/managedenvironmentoperationresults/read`|Get a Managed Environment Long Running Operation Result| -### Dapr on container app +### Dapr in Azure Container Apps > [!div class="mx-tableFixed"] > | Action | Description | Service Connector creates connection between compute service and target service > |`Microsoft.DBforMySQL/flexibleServers/databases/read`|Returns the list of databases for a server or gets the properties for the specified database.| > |`Microsoft.DBforMySQL/flexibleServers/configurations/read`|Returns the list of MySQL server configurations or gets the configurations for the specified server.| -### App configuration +### Azure App Configuration > [!div class="mx-tableFixed"] > | Action | Description | Service Connector may need to grant permissions to your identity if a connection > |`Microsoft.Network/routeTables/join/action`|Joins a route table. Not Alertable.| > [!div class="nextstepaction"]-> [High availability](./concept-availability.md) +> [High availability](./concept-availability.md) |
service-connector | Concept Region Support | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/concept-region-support.md | -## Supported regions with regional endpoint +## Supported regions with regional endpoints If your compute service instance is located in one of the regions that Service Connector supports below, you can use Service Connector to create and manage service connections. In regions where Service Connector isn't supported, you will still find Service ## Next steps -Go to the concept article below to learn more about Service Connector. +Go to the articles below for more information about how Service Connector works, and learn about service availability. ++> [!div class="nextstepaction"] +> [Service internals](./concept-service-connector-internals.md) > [!div class="nextstepaction"] > [High availability](./concept-availability.md) |
service-connector | How To Troubleshoot Front End Error | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/how-to-troubleshoot-front-end-error.md | Title: Service Connector troubleshooting guidance -description: This article lists error messages and suggested actions of Service Connector to use for troubleshooting issues. +description: This article lists error messages and suggested actions to troubleshooting issues with Service Connector. Previously updated : 5/25/2022 Last updated : 10/19/2023 -# How to troubleshoot with Service Connector +# How to troubleshoot Service Connector This article lists error messages and suggestions to troubleshoot Service Connector. ## Error message and suggested actions from the Azure portal -| Error message | Suggested Action | -| | | -| Unknown resource type | Check source and target resource to verify whether the service types are supported by Service Connector. | -| | Check whether the specified source-target connection combination is supported by Service Connector. | -| | Check whether the target resource exists. | -| | Check the correctness of the target resource ID. | -| Unsupported resource | Check whether the authentication type is supported by the specified source-target connection combination. | +| Error message | Suggested Action | +|--|--| +| Unknown resource type | Check source and target resource to verify whether the service types are supported by Service Connector. | +| | Check whether the specified source-target connection combination is supported by Service Connector. | +| | Check whether the target resource exists. | +| | Check the correctness of the target resource ID. | +| Unsupported resource | Check whether the authentication type is supported by the specified source-target connection combination. | ## Error type,error message, and suggested actions using Azure CLI ### InvalidArgumentValueError -| Error message | Suggested Action | -| | | -| The source resource ID is invalid: `{SourceId}` | Check whether the source resource ID supported by Service Connector. | -| | Check the correctness of source resource ID. | -| Target resource ID is invalid: `{TargetId}` | Check whether the target service type is supported by Service Connector. | -| | Check the correctness of target resource ID. | -| Connection ID is invalid: `{ConnectionId}` | Check the correctness of the connection ID. | -+| Error message | Suggested Action | +|-|--| +| The source resource ID is invalid: `{SourceId}` | Check whether the source resource ID supported by Service Connector. | +| | Check the correctness of source resource ID. | +| Target resource ID is invalid: `{TargetId}` | Check whether the target service type is supported by Service Connector. | +| | Check the correctness of target resource ID. | +| Connection ID is invalid: `{ConnectionId}` | Check the correctness of the connection ID. | #### RequiredArgumentMissingError -| Error message | Suggested Action | -| | | -| `{Argument}` shouldn't be blank | User should provide argument value for interactive input | +| Error message | Suggested Action | +||-| +| `{Argument}` shouldn't be blank | User should provide argument value for interactive input | | Required keys missing for parameter `{Parameter}`. All possible keys are: `{Keys}` | Provide value for the authentication information parameter, usually in the form of `--param key1=val1 key2=val2`. |-| Required argument is missing, please provide the arguments: `{Arguments}` | Provide the required argument. 
| +| Required argument is missing, please provide the arguments: `{Arguments}` | Provide the required argument. | #### ValidationError -| Error message | Suggested Action | -| | | -| Only one auth info is needed | User can only provide one authentication information parameter. Check whether it isn't provided or multiple parameters are provided. | -| Auth info argument should be provided when updating the connection: `{ConnectionName}` | The authentication information should be provided when updating a secret type connection. This error occurs because a user's secret can't be accessed through the Azure Resource Manager API. | -| Either client type or auth info should be specified to update | Either client type or authentication information should be provided when updating a connection. | -| Usage error: `{} [KEY=VALUE ...]` | Check the available keys and provide values for the auth info parameter, usually in the form of `--param key1=val1 key2=val2`. | -| Unsupported Key `{Key}` is provided for parameter `{Parameter}`. All possible keys are: `{Keys}` | Check the available keys and provide values for the authentication information parameter, usually in the form of `--param key1=val1 key2=val2`. | -| Provision failed, please create the target resource manually and then create the connection. Error details: `{ErrorTrace}` | Retry. Create the target resource manually and then create the connection. | +| Error message | Suggested Action | +|-|--| +| Only one auth info is needed | User can only provide one authentication information parameter. Check whether it isn't provided or multiple parameters are provided. | +| Auth info argument should be provided when updating the connection: `{ConnectionName}` | The authentication information should be provided when updating a secret type connection. This error occurs because a user's secret can't be accessed through the Azure Resource Manager API. | +| Either client type or auth info should be specified to update | Either client type or authentication information should be provided when updating a connection. | +| Usage error: `{} [KEY=VALUE ...]` | Check the available keys and provide values for the auth info parameter, usually in the form of `--param key1=val1 key2=val2`. | +| Unsupported Key `{Key}` is provided for parameter `{Parameter}`. All possible keys are: `{Keys}` | Check the available keys and provide values for the authentication information parameter, usually in the form of `--param key1=val1 key2=val2`. | +| Provision failed, please create the target resource manually and then create the connection. Error details: `{ErrorTrace}` | Retry. Create the target resource manually and then create the connection. | ## Next steps Follow the tutorials listed below to learn more about Service Connector. > [!div class="nextstepaction"] > [Learn about Service Connector concepts](./concept-service-connector-internals.md)++> [!div class="nextstepaction"] +> [Known limitations](./known-limitations.md) |
service-connector | Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/overview.md | -Service Connector helps you connect Azure compute services to other backing services. This service configures the network settings and connection information (for example, generating environment variables) between compute services and target backing services in management plane. Developers use their preferred SDK or library that consumes the connection information to do data plane operations against the target backing service. +Service Connector helps you connect Azure compute services to other backing services. Service Connector configures the network settings and connection information (for example, generating environment variables) between compute services and target backing services in management plane. Developers use their preferred SDK or library that consumes the connection information to do data plane operations against the target backing service. This article provides an overview of Service Connector. Any application that runs on Azure compute services and requires a backing servi * **WebApp/Container Apps/Spring Apps + Storage:** Use Service Connector to connect to Azure Storage accounts and use your preferred storage products easily for any of your apps. * **WebApp/Container Apps/Spring Apps + Messaging -See [what services are supported in Service Connector](#what-services-are-supported-in-service-connector) to see more supported services and application patterns. +See [what services are supported in Service Connector](#what-services-are-supported-by-service-connector) to see more supported services and application patterns. ## What are the benefits to using Service Connector? Service Connector is designed for your ease of use. To create a connection, you' Once a service connection is created, developers can validate and check the health status of their connections. Service Connector can suggest some actions to take to fix broken connections. -## What services are supported in Service Connector? +## What services are supported by Service Connector? **Compute * Azure App Service+* Azure Functions * Azure Spring Apps * Azure Container Apps |
service-connector | Quickstart Portal App Service Connection | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/quickstart-portal-app-service-connection.md | Sign in to the Azure portal at [https://portal.azure.com/](https://portal.azure. | Setting | Example | Description | ||-|-| | **Service type** | Storage - Blob | The target service type. If you don't have a Microsoft Blob Storage, you can [create one](../storage/blobs/storage-quickstart-blobs-portal.md) or use another service type. |- | **Subscription** | My subscription | The subscription for your target service (the service you want to connect to). The default value is the subscription for this App Service resource. | | **Connection name** | *my_connection* | The connection name that identifies the connection between your App Service and target service. Use the connection name provided by Service Connector or choose your own connection name. |+ | **Subscription** | My subscription | The subscription for your target service (the service you want to connect to). The default value is the subscription for this App Service resource. | | **Storage account** | *my_storage_account* | The target storage account you want to connect to. Target service instances to choose from vary according to the selected service type. | | **Client type** | The same app stack on this App Service | The default value comes from the App Service runtime stack. Select the app stack that's on this App Service instance. | - :::image type="content" source="./media/app-service-quickstart/basics-tab.png" alt-text="Screenshot of the Azure portal, filling out the Basics tab."::: - 1. Select **Next: Authentication** to choose an authentication method. ### [System-assigned managed identity](#tab/SMI) Sign in to the Azure portal at [https://portal.azure.com/](https://portal.azure. 1. Select **Next: Networking** to configure the network access to your target service and select **Configure firewall rules to enable access to your target service**. -1. Select **Next: Review + Create** to review the provided information. Then select **Create** to create the service connection. This operation may take a minute to complete. +1. Select **Next: Review + Create** to review the provided information. Then select **Create** to create the service connection. This operation might take a minute to complete. > [!NOTE] > You need enough permissions to create connection successfully, for more details, see [Permission requirements](./concept-permission.md). ## View service connections in App Service -1. The **Service Connector** tab displays existing App Service connections. +1. Once the connection has successfully been created, the **Service Connector** page displays existing App Service connections. 1. Select the **>** button to expand the list and see the environment variables required by your application code. Select **Hidden value** to view the hidden value. :::image type="content" source="./media/app-service-quickstart/show-values.png" alt-text="Screenshot of the Azure portal, viewing connection details."::: -1. Select **Validate** to check your connection. You can see the connection validation details in the panel on the right. +1. Select **Validate** to check your connection. Select **Learn more** to see the connection validation details in the panel on the right. :::image type="content" source="./media/app-service-quickstart/validation.png" alt-text="Screenshot of the Azure portal, validating the connection."::: |
service-connector | Quickstart Portal Spring Cloud Connection | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/service-connector/quickstart-portal-spring-cloud-connection.md | You'll use Service Connector to create a new service connection in Azure Spring | Setting | Example | Description | ||-|-| | **Service type** | *Storage - Blob* | The type of service you're going to connect to your app. |- | **Subscription** | *my-subscription* | The subscription that contains your target service (the service you want to connect to). The default value is the subscription that contains the app deployed to Azure Spring Apps. | | **Connection name** | *storageblob_17d38* | The connection name that identifies the connection between your app and target service. Use the connection name provided by Service Connector or enter your own connection name. |+ | **Subscription** | *my-subscription* | The subscription that contains your target service (the service you want to connect to). The default value is the subscription that contains the app deployed to Azure Spring Apps. | | **Storage account** | *my-storage-account* | The target storage account you want to connect to. If you choose a different service type, select the corresponding target service instance. | | **Client type** | *SpringBoot* | The application stack that works with the target service you selected. Besides SpringBoot and Java, other stacks are also supported. | - :::image type="content" source="./media/azure-spring-apps-quickstart/basics.png" alt-text="Screenshot of the Azure portal, filling out the Basics tab."::: - 1. Select **Next: Authentication** to select the authentication type. Then select **Connection string** to use an access key to connect your storage account. :::image type="content" source="./media/azure-spring-apps-quickstart/authentication.png" alt-text="Screenshot of the Azure portal, filling out the Authentication tab."::: You'll use Service Connector to create a new service connection in Azure Spring 1. Select **Next: Review + Create** to review the provided information. Wait a few seconds for Service Connector to validate the information and select **Create** to create the service connection. - :::image type="content" source="./media/azure-spring-apps-quickstart/validation.png" alt-text="Screenshot of the Azure portal, validation tab."::: --## View service connection +## View service connections Azure Spring Apps connections are displayed under **Settings > Service Connector**. |
storage-mover | Endpoint Manage | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/storage-mover/endpoint-manage.md | The following steps describe the process of creating a source endpoint. :::image type="content" source="media/endpoint-manage/endpoint-source-create.png" alt-text="Screenshot of the Endpoint Overview page highlighting the location of the Create Endpoint link." lightbox="media/endpoint-manage/endpoint-source-create-lrg.png"::: - 1. Within the **Create source endpoint** pane, provide values for the required **Host name or IP** and **Share name** values. The host name or IP address value must be either an IPv4 address, or fully qualified domain or host name. You may also add an optional **Description** value of up to 1024 characters in length. Next, select **Protocol version** to expand the protocol selection menu and select the appropriate option for your source target. + 1. Within the **Create source endpoint** pane, provide values for the required **Host name or IP** and **Share name** values. The host name or IP address value must be either an IPv4 address, or fully qualified domain or host name. ++ > [!IMPORTANT] + > Depending on your DNS configuration, you may need to use your fully qualified domain name (FQDN) instead of your hostname. ++ You may also add an optional **Description** value of up to 1024 characters in length. Next, select **Protocol version** to expand the protocol selection menu and select the appropriate option for your source target. Storage mover agents use secrets stored within Key Vault to connect to SMB endpoints. When you create an SMB source endpoint, you need to provide both the name of the Key Vault containing the secrets and the names of the secrets themselves. The following steps describe the process of creating a source endpoint. > [!CAUTION] > Renaming endpoint resources is not supported. It's a good idea to ensure that you've named the project appropriately since you won't be able to change much of the endpoint name after it is provisioned. You may, however, choose to create a new endpoint with the same properties and a different name as shown in a later section. Refer to the [resource naming convention](../azure-resource-manager/management/resource-name-rules.md#microsoftstoragesync) to choose a supported name. - 1. It's always a good idea to create and use variables to store lengthy or potentially complex strings. Copy the sample code block and supply values for the required parameters. The `-Description` parameter is optional and is added in the [View and edit an endpoint's properties](#view-and-edit-an-endpoints-properties) section. + 1. It's always a good idea to create and use variables to store lengthy or potentially complex strings. ++ > [!IMPORTANT] + > Depending on your DNS configuration, you may need to use your fully qualified domain name (FQDN) instead of your hostname. ++ Copy the sample code block and supply values for the required parameters. The `-Description` parameter is optional and is added in the [View and edit an endpoint's properties](#view-and-edit-an-endpoints-properties) section. ```powershell |
storage-mover | Status Code | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/storage-mover/status-code.md | Each of these scopes further divides statuses into categories and subcategories. |Error Code |Error Message | Details/Troubleshooting steps/Mitigation | ||--||-| <a name="AZSM1001"></a>AZSM1001 |Failed to mount source path | Verify the provided server name or IP-address is valid, or the source location is correct. If using SMB, verify the provided username and password is correct. | -| <a name="AZSM1002"></a>AZSM1002 |Encountered an error while scanning the source | Retry or create a support ticket. | -| <a name="AZSM1003"></a>AZSM1003 |Failed to access source folder due to permission issues | Verify that the agent has been granted permissions to the source file share. | -| <a name="AZSM1004"></a>AZSM1004 |Source path provided is invalid | Create a new endpoint with a valid source share path and update the job definition and retry. | -| <a name="AZSM1020"></a>AZSM1020 |Miscellaneous error while accessing source | Retry or create a support ticket. | -| <a name="AZSM1021"></a>AZSM1021 |Failed to access target folder due to permission issues | Retry or create a support ticket. | -| <a name="AZSM1022"></a>AZSM1022 |Target path provided is invalid | Create a new endpoint with a valid target container and path and update the job definition and retry. | -| <a name="AZSM1023"></a>AZSM1023 |Lease expired for this agent on the target container | Retry or create a support ticket. | -| <a name="AZSM1024"></a>AZSM1024 |Authorization failure accessing the target location | The agent doesn't have sufficient permission to access the target location. RBAC (role-based access control) role assignments are performed automatically when resources are created using the Azure portal. If you're using the APIs, PowerShell cmdlets, or SDKs, manually create a role assignment for the agent's managed identity to access the target location. For NFS, use the *Storage Blob Data Contributor* role assignment. For SMB, use *Storage File Data Privileged Contributor*. The [Assign an Azure role for access to blob data](/azure/storage/blobs/assign-azure-role-data-access) article may help resolve this issue. | +| <a name="AZSM1001"></a>**AZSM1001** |Failed to mount source path | Verify the provided server name or IP-address is valid, or the source location is correct. If using SMB, verify the provided username and password is correct. | +| <a name="AZSM1002"></a>**AZSM1002** |Encountered an error while scanning the source | Retry or create a support ticket. | +| <a name="AZSM1003"></a>**AZSM1003** |Failed to access source folder due to permission issues | Verify that the agent has been granted permissions to the source file share. | +| <a name="AZSM1004"></a>**AZSM1004** |Source path provided is invalid | Create a new endpoint with a valid source share path and update the job definition and retry. | +| <a name="AZSM1020"></a>**AZSM1020** |Miscellaneous error while accessing source | Retry or create a support ticket. | +| <a name="AZSM1021"></a>**AZSM1021** |Failed to access target folder due to permission issues | Retry or create a support ticket. | +| <a name="AZSM1022"></a>**AZSM1022** |Target path provided is invalid | Create a new endpoint with a valid target container and path and update the job definition and retry. | +| <a name="AZSM1023"></a>**AZSM1023** |Lease expired for this agent on the target container | Retry or create a support ticket. 
| +| <a name="AZSM1024"></a>**AZSM1024** |Authorization failure accessing the target location | The agent doesn't have sufficient permission to access the target location. RBAC (role-based access control) role assignments are performed automatically when resources are created using the Azure portal. If you're using the APIs, PowerShell cmdlets, or SDKs, manually create a role assignment for the agent's managed identity to access the target location. For NFS, use the *Storage Blob Data Contributor* role assignment. For SMB, use *Storage File Data Privileged Contributor*. The [Assign an Azure role for access to blob data](/azure/storage/blobs/assign-azure-role-data-access) article may help resolve this issue. | | <a name="AZSM1025"></a>**AZSM1025** |Authentication failure accessing the source location | Verify that the agent has been granted permissions to the source location. | | <a name="AZSM1026"></a>**AZSM1026** |Target type is not supported by the agent | This target type is unsupported by the current Storage Mover agent. | | <a name="AZSM1027"></a>**AZSM1027** |The target location is busy | The agent can't access the target location because an existing lease is active. This error may be caused by another agent writing to the location. Ensure no other job is running against the target. Retry or create support ticket. | | <a name="AZSM1028"></a>**AZSM1028** |Key Vault access failure | Verify that the agent has been granted permissions to the relevant Key Vault. |+| <a name="AZSM1030"></a>**AZSM1030** |Cannot verify storage account type | Verify that the storage account and container exist, and that the agent has been granted permissions to them. | | <a name="AZSM1040"></a>**AZSM1040** |Miscellaneous error while accessing target | It's likely that this error is temporary. Retry the migration job again. If the issue persists, please create a support ticket for further assistance. | | <a name="AZSM1041"></a>**AZSM1041** |Failed to send job progress | It's likely that this error is temporary. Retry the migration job again. If the issue persists, please create a support ticket for further assistance. | | <a name="AZSM1042"></a>**AZSM1042** |Failed to create job | It's likely that this error is temporary. Retry the migration job again. If the issue persists, please create a support ticket for further assistance. | |
storage | Elastic San Best Practices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/storage/elastic-san/elastic-san-best-practices.md | + + Title: Best practices for configuring an Elastic SAN Preview +description: Elastic SAN best practices +++ Last updated : 10/19/2023++++# Elastic SAN Preview best practices ++This article provides some general guidance on getting optimal performance with an environment that uses an Azure Elastic SAN. ++## Client-side optimizations ++### General recommendations (Windows & Linux Virtual Machines) ++- For best performance, deploy your VMs and Elastic SAN in the same zone and the same region. ++- VM storage I/O to Elastic SAN volumes uses VM network bandwidth, so traditional disk throughput limits on a VM don't apply to Elastic SAN volumes. Choose a VM that can provide sufficient bandwidth for production/VM-to-VM I/O and iSCSI I/O to attached Elastic SAN volumes. Generally, you should use Gen 5 (D / E / M series) VMs for the best performance. ++- Enable "Accelerated Networking" on the VM during VM creation. To do this via Azure PowerShell or the Azure CLI, or to enable Accelerated Networking on existing VMs, see [Use Azure PowerShell to create a VM with Accelerated Networking](../../virtual-network/create-vm-accelerated-networking-powershell.md) +++- You must use 32 sessions to each target volume to achieve its maximum IOPS and/or throughput limits. Use Multipath I/O (MPIO) on the client to manage these multiple sessions to each volume for load balancing. Scripts are available for [Windows](elastic-san-connect-windows.md#connect-to-volumes), [Linux](elastic-san-connect-linux.md#connect-to-volumes), or on the Connect to volume page for your volumes in the Azure portal, which uses 32 sessions by default. The Windows software iSCSI initiator has a limit of 256 sessions. If you need to connect more than eight volumes to a Windows VM, reduce the number of sessions to each volume as needed. +++### MPIO ++#### Windows +Use the following commands to update your settings: ++```powershell +# Enable multipath support for iSCSI devices +Enable-MSDSMAutomaticClaim -BusType iSCSI ++# Set the default load balancing policy based on your requirements. In this example, we set it to round robin which should be optimal for most workloads. +Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR +# You can also use mpclaim.exe to set the policy to round robin +mpclaim -L -M 2 ++# Set disk time out to 30 seconds +Set-MPIOSetting -NewDiskTimeout 30 +``` ++For more information regarding MPIO cmdlets, see the [MPIO PowerShell reference](/powershell/module/mpio/?view=windowsserver2022-ps). ++#### Linux ++Update the /etc/multipath.conf file with the following: ++```config +defaults { + user_friendly_names yes # To create 'mpathn' names for multipath devices + path_grouping_policy multibus # To place all the paths in one priority group + path_selector "round-robin 0" # To use round robin algorithm to determine path for next I/O operation + failback immediate # For immediate failback to highest priority path group with active paths + no_path_retry 1 # To disable I/O queueing after retrying once when all paths are down +} +devices { + device { + vendor "MSFT" + product "Virtual HD" + } +} +``` ++### iSCSI ++#### Windows ++Update the following registry settings for the iSCSI initiator on Windows. ++1. Open Registry Editor: +1. Select Start, type regedit in the search box and press Enter. +1. 
Navigate to the following location: + [\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0004 (Microsoft iSCSI Initiator)\Parameters] +1. Update the following settings. Right-click on each setting and select **Modify**. Change **Base** to **Decimal**, update the value and select **OK**. ++|Description |Parameter and value | +||| +|Sets maximum data the initiator sends in an iSCSI PDU to the target to 256 KB |MaxTransferLength=262144 | +|Sets maximum SCSI payload that the initiator negotiates with the target to 256 KB |MaxBurstLength=262144 | +|Sets maximum unsolicited data the initiator can send in an iSCSI PDU to a target to 256 KB |FirstBurstLength=262144 | +|Sets maximum data the initiator can receive in an iSCSI PDU from the target to 256 KB |MaxRecvDataSegmentLength=262144 | +|Disables R2T flow control |InitialR2T=0 | +|Enables immediate data |ImmediateData=1 | +|Sets timeout value for WMI requests to 15 seconds |WMIRequestTimeout = 15 seconds | +++In cluster configurations, ensure iSCSI initiator names are unique across all nodes that are sharing volumes. In Windows, you can update them via the iSCSI Initiator app. ++1. Select **Start**, search for **iSCSI Initiator** in the search box. This opens the iSCSI Initiator. +1. Select **Configuration** to see the current initiator name. + + :::image type="content" source="media/elastic-san-best-practices/iscsi-initiator-config-widnows.png" alt-text="Screenshot of iSCSI Initiator configuration on Windows." lightbox="media/elastic-san-best-practices/iscsi-initiator-config-widnows.png"::: ++1. To modify it, select **Change**, enter the new initiator name, and select **OK**. + + :::image type="content" source="media/elastic-san-best-practices/update-iscsi-initiator-name-windows.png" alt-text="Screenshot of updating the iSCSI Initiator Name on Windows." lightbox="media/elastic-san-best-practices/update-iscsi-initiator-name-windows.png"::: +++#### Linux ++Update the /etc/iscsi/iscsid.conf file with the following values: ++|Description |Parameter and value | +||| +|# Set maximum data the initiator sends in an iSCSI PDU to the target to 256 KB |node.conn[0].iscsi.MaxXmitDataSegmentLength = 262144 | +|# Set maximum SCSI payload that the initiator negotiates with the target to 256 KB |node.session.iscsi.MaxBurstLength = 262144 | +|# Set maximum unsolicited data the initiator can send in an iSCSI PDU to a target to 256 KB |node.session.iscsi.FirstBurstLength = 262144 | +|# Set maximum data the initiator can receive in an iSCSI PDU from the target to 256 KB |node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144 | +|# Disable R2T flow control |node.session.iscsi.InitialR2T = No | +|# Enable immediate data |node.session.iscsi.ImmediateData = Yes | +|# Set timeout values for iSCSI login and logout to 15 seconds |node.conn[0].timeo.login_timeout = 15<br></br>node.conn[0].timeo.logout_timeout = 15 | +|# Enable CRC digest checking for header and data |node.conn[0].iscsi.HeaderDigest = CRC32C<br></br>node.conn[0].iscsi.DataDigest = CRC32C | +++In cluster configurations, ensure iSCSI initiator names are unique across all nodes that are sharing volumes. In Linux, you can modify /etc/iscsi/initiatorname.iscsi to update the initiator name. +++## Elastic SAN optimizations ++Before deploying an Elastic SAN, determining the optimal size of the Elastic SAN you deploy is necessary to achieve the right balance of performance for your workloads and cost.
Use the following steps to determine the best sizing for you: ++With your existing storage solution, select a time interval (day/week/quarter) to track performance. The best time interval is one that is a good snapshot of your applications/workloads. Over that time period, record the combined maximum IOPS and throughput for all workloads. If you use an interval higher than a minute, or if any of your workloads have bottlenecks with your current configuration, consider adding more base capacity to your Elastic SAN deployment. You should leave some headroom when determining your base capacity, to account for growth. The rest of your Elastic SAN's storage should use additional-capacity, to save on cost. ++For more information on performance, see [Elastic SAN Preview and virtual machine performance](elastic-san-performance.md). |
storage | Storage Quickstart Queues Python | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/storage/queues/storage-quickstart-queues-python.md | pip install azure-storage-queue azure-identity ```python import os, uuid from azure.identity import DefaultAzureCredential- from azure.storage.queue import QueueServiceClient, QueueClient, QueueMessage + from azure.storage.queue import QueueServiceClient, QueueClient, QueueMessage, BinaryBase64DecodePolicy, BinaryBase64EncodePolicy try: print("Azure Queue storage - Python quickstart sample") |
stream-analytics | Kafka Output | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/stream-analytics/kafka-output.md | You can use four types of security protocols to connect to your Kafka clusters: > Azure Stream Analytics integrates seamlessly with Azure Key vault to access stored secrets needed for authentication and encryption when using mTLS or SASL_SSL security protocols. Your Azure Stream Analytics job connects to Azure Key vault using managed identity to ensure a secure connection and avoid the exfiltration of secrets. -You can store the certificates as Key vault certificates or Key vault secrets. Private keys are in PEM format. +You'll need to use Azure CLI to upload the certificates as a secret into Key vault in PEM format. ### VNET integration When configuring your Azure Stream Analytics job to connect to your Kafka clusters, depending on your configuration, you may have to configure your job to access your Kafka clusters, which are behind a firewall or inside a virtual network. You can visit the Azure Stream Analytics VNET documentation to learn more about configuring private endpoints to access resources inside a virtual network or behind a firewall. |
synapse-analytics | Overview Database Templates | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/synapse-analytics/database-designer/overview-database-templates.md | Database templates are a set of business and technical data definitions that are These schema blueprints can be used by organizations to plan, architect, and design data solutions for data governance, reporting, business intelligence, and advanced analytics. The data models provide integrated business-wide information architectures that can help you implement, in a timely and predictable way, a proven industry data architecture. -For example, if you're building a product recommendation solution for your retail customers, you'll need to understand what the customer purchased and the transaction that led to the purchase. You may also need information about the store where the purchase was made, and whether the customer is part of a loyalty program. Just to accomplish this use case, consider the following schemas: +For example, if you're building a product recommendation solution for your retail customers, you'll need to understand what the customer purchased and the transaction that led to the purchase. You might also need information about the store where the purchase was made, and whether the customer is part of a loyalty program. Just to accomplish this use case, consider the following schemas: * Product * Transaction Currently, you can choose from the following database templates in Azure Synapse * **Pharmaceuticals** - For companies engaged in creating, manufacturing, and marketing pharmaceutical and bio-pharmaceutical products and medical devices. * **Property & Casualty Insurance** - For companies that provide insurance against risks to property and various forms of liability coverage. * **R&D and Clinical Trials** - For companies involved in research and development and clinical trials of pharmaceutical products or devices.+* **Restaurants** - For companies that prepare and serve food. * **Retail** - For sellers of consumer goods or services to customers through multiple channels. * **Travel Services** - For companies providing booking services for airlines, hotels, car rentals, cruises, and vacation packages. * **Utilities** - For gas, electric, and water utilities; power generators; and water desalinators. |
virtual-desktop | Multimedia Redirection Intro | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/virtual-desktop/multimedia-redirection-intro.md | The following sites work with video playback redirection: - Bleacher Report - Brightcove - CNBC- - Coursera + - Coursera\* - Daily Mail - Facebook - Fidelity The following sites work with video playback redirection: - Fox Weather - IMDB - Infosec Institute- - LinkedIn Learning - - Microsoft Learn + - LinkedIn Learning\* + - Microsoft Learn\* - Microsoft Stream - NBC Sports - The New York Times The following sites work with video playback redirection: - The Guardian - Twitch - Twitter- - Udemy + - Udemy\* - UMU - U.S. News - Vidazoo The following sites work with video playback redirection: :::column-end::: :::row-end::: +\* Multimedia redirection only supports non-DRM content on these sites. Any digital rights managed content can only be played with regular video playback without multimedia redirection. + ### Call redirection The following websites work with call redirection: |
virtual-desktop | Client Features Android Chrome Os | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/virtual-desktop/users/client-features-android-chrome-os.md | To remove an account you no longer want to use: You can set the orientation of the Remote Desktop client to landscape, portrait, or auto-adjust, where it will match the orientation of your device. Auto-adjust is supported when your remote session is running Windows 10 or later. The window will maintain the same scaling and update the resolution to match the new orientation. This setting applies to all workspaces. - > [!IMPORTANT] - > Support for Windows Server 2012 R2 ended on October 10, 2023. For more information, view [SQL Server 2012 and Windows Server 2012/2012 R2 end of support](/lifecycle/announcements/sql-server-2012-windows-server-2012-2012-r2-end-of-support). - To set the orientation: 1. Open the **RD Client** app on your device. |
virtual-desktop | Client Features Ios Ipados | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/virtual-desktop/users/client-features-ios-ipados.md | Learn how to set display preferences, such as orientation and resolution. You can set the orientation of the Remote Desktop client to landscape, portrait, or auto-adjust, where it will match the orientation of your device. Auto-adjust is supported when your remote session is running Windows 10 or later. The window will maintain the same scaling and update the resolution to match the new orientation. This setting applies to all workspaces. - > [!IMPORTANT] - > Support for Windows Server 2012 R2 ended on October 10, 2023. For more information, view [SQL Server 2012 and Windows Server 2012/2012 R2 end of support](/lifecycle/announcements/sql-server-2012-windows-server-2012-2012-r2-end-of-support). - To set the orientation: 1. Open the **RD Client** application on your device. |
virtual-machines | Bsv2 Series | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/virtual-machines/bsv2-series.md | Bsv2-series virtual machines offer a balance of compute, memory, and network res | Size | vCPU | RAM | Base CPU Performance of VM (%) | Initial Credits (#) | Credits banked/hour | Max Banked Credits (#) | Max uncached disk throughput: IOPS/MBps | Max burst uncached disk throughput: IOPS/MBps | Max Data Disks | Max Network Bandwidth (Gbps) | Max NICs | |-||--|--||||--|--|-||-|+| Standard_B2ts_v2 | 2 | 1 | 20% | 60 | 24 | 576 | 3750/85 | 10,000/960 | 4 | 6.25 | 2 | | Standard_B2ls_v2 | 2 | 4 | 30% | 60 | 36 | 864 | 3750/85 | 10,000/960 | 4 | 6.50 | 2 | | Standard_B2s_v2 | 2 | 8 | 40% | 60 | 48 | 1152 | 3750/85 | 10,000/960 | 4 | 6.50 | 2 | | Standard_B4ls_v2 | 4 | 8 | 30% | 120 | 72 | 1728 | 6,400/145 | 20,000/960 | 8 | 6.250 | 2 | |
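If it helps to see the new size in context, a Bsv2 VM is created the same way as any other size. The resource group, VM name, and image below are placeholder values; only the `--size` argument is specific to this series:

```azurecli
# Placeholder resource group, name, and image; deploy a burstable Bsv2-series VM
az vm create \
  --resource-group myResourceGroup \
  --name myBsv2Vm \
  --image Ubuntu2204 \
  --size Standard_B2s_v2 \
  --admin-username azureuser \
  --generate-ssh-keys
```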
virtual-machines | Disks Migrate Lrs Zrs | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/virtual-machines/disks-migrate-lrs-zrs.md | + + Title: Convert a disk from LRS to ZRS +description: Learn how to convert a managed disk from locally redundant storage to zone-redundant storage. +++ Last updated : 10/19/2023+++++# Convert a disk from LRS to ZRS ++This article explains how to convert an existing disk from locally redundant storage (LRS) to zone-redundant storage (ZRS). ++For conceptual information on ZRS, see [Zone-redundant storage for managed disks](disks-redundancy.md#zone-redundant-storage-for-managed-disks) ++## Limitations +++## Regional availability +++## Determine infrastructure redundancy ++Before you convert a disk from LRS to ZRS, you must first identify whether the disk is zonal or regional. How you migrate your disk changes based on whether it's zonal or regional. ++# [Portal](#tab/azure-portal) ++1. Sign in to the Azure portal +1. Navigate to your disk and look under **Availability zone** on the **Overview** tab. + 1. If the value is **No infrastructure redundancy required** you have a regional disk. Use [Regional migration](#regional-migration) to migrate your disk. + 1. If the value is anything else, you have a zonal disk. Use [Zonal migration](#zonal-migration) to migrate your disk. ++# [Azure PowerShell](#tab/azure-powershell) ++Run the following command and observe the output. If there's no value in **Zones**, you have a regional disk. Use [Regional migration](#regional-migration) to migrate your disk. Otherwise, you have a zonal disk. Use [Zonal migration](#zonal-migration) to migrate your disk. ++```azurepowershell +Get-AzDisk -DiskName your_disk_name_here -ResourceGroupName your_RG_name_here +``` ++# [Azure CLI](#tab/azure-cli) ++Run the following command and observe the output. If there's no value in **zones** you have a regional disk. Use [Regional migration](#regional-migration) to migrate your disk. Otherwise, you have a zonal disk. Use [Zonal migration](#zonal-migration) to migrate your disk. ++```azurecli +az disk show --name your_disk_name_here --resource-group your_RG_name_here +``` ++++## Regional migration ++You can directly change the disk type of regional disks with the following options. ++# [Portal](#tab/azure-portal) ++Follow these steps: ++1. Sign in to the [Azure portal](https://portal.azure.com). +1. Select the VM from the list of **Virtual machines**. +1. If the VM isn't stopped, select **Stop** at the top of the VM **Overview** pane, and wait for the VM to stop. +1. In the pane for the VM, select **Disks** from the menu. +1. Select the disk that you want to convert. +1. Select **Size + performance** from the menu. +1. Change the **Account type** from the original disk type to the desired disk type. +1. Select **Save**, and close the disk pane. ++The disk type conversion is instantaneous. You can start your VM after the conversion. 
++# [Azure PowerShell](#tab/azure-powershell) ++```azurepowershell-interactive ++$diskName = 'yourDiskName' +# resource group that contains the managed disk +$rgName = 'yourResourceGroupName' +# Choose between StandardSSD_ZRS or Premium_ZRS based on your scenario +$storageType = 'Premium_ZRS' +# Premium capable size +$size = 'Standard_DS2_v2' ++$disk = Get-AzDisk -DiskName $diskName -ResourceGroupName $rgName ++# Get parent VM resource +$vmResource = Get-AzResource -ResourceId $disk.ManagedBy ++# Stop and deallocate the VM before changing the storage type +Stop-AzVM -ResourceGroupName $vmResource.ResourceGroupName -Name $vmResource.Name -Force ++$vm = Get-AzVM -ResourceGroupName $vmResource.ResourceGroupName -Name $vmResource.Name ++# Change the VM size to a size that supports Premium storage +# Skip this step if converting storage from Premium to Standard +$vm.HardwareProfile.VmSize = $size +Update-AzVM -VM $vm -ResourceGroupName $rgName ++# Update the storage type +$disk.Sku = [Microsoft.Azure.Management.Compute.Models.DiskSku]::new($storageType) +$disk | Update-AzDisk ++Start-AzVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name +``` ++# [Azure CLI](#tab/azure-cli) +++ ```azurecli ++#resource group that contains the managed disk +rgName='yourResourceGroup' ++#Name of your managed disk +diskName='yourManagedDiskName' ++#Premium capable size +#Required only if converting from Standard to Premium +size='Standard_DS2_v2' ++#Choose between StandardSSD_ZRS or Premium_ZRS based on your scenario +sku='Premium_ZRS' ++#Get the parent VM Id +vmId=$(az disk show --name $diskName --resource-group $rgName --query managedBy --output tsv) ++#Deallocate the VM before changing the size of the VM +az vm deallocate --ids $vmId ++#Change the VM size to a size that supports Premium storage +#Skip this step if converting storage from Premium to Standard +az vm resize --ids $vmId --size $size ++# Update the SKU +az disk update --sku $sku --name $diskName --resource-group $rgName ++az vm start --ids $vmId +``` ++++## Zonal migration ++In this section, you migrate the data from your current managed disks to ZRS managed disks. ++If you have a zonal disk, you can't directly change its type. You must take a snapshot and use that snapshot to create a new ZRS disk. ++#### Step 1: Create your snapshot ++The easiest and cleanest way to create a snapshot is to take one while the VM is offline. See [Snapshots](../virtual-machines/backup-and-disaster-recovery-for-azure-iaas-disks.md#snapshots). If you choose this approach, some downtime should be expected. To create a snapshot of your VM using the Azure portal, PowerShell, or Azure CLI, see [Create a snapshot of a virtual hard disk](../virtual-machines/snapshot-copy-managed-disk.md) ++If you're taking a snapshot of a disk that's attached to a running VM, read the guidance in [Snapshots](backup-and-disaster-recovery-for-azure-iaas-disks.md#snapshots) before continuing. ++>[!NOTE] +> The source managed disks remain intact with their current configurations and you'll continue to be billed for them. To avoid these charges, you must manually delete the disks once you've finished your migration and confirmed the new disks are working. For more information, see [Find and delete unattached Azure managed and unmanaged disks](windows/find-unattached-disks.md). +++#### Step 2: Migrate the data on your managed disks ++Now that you have snapshots of your original disks, you can use them to create ZRS managed disks. ++##### Migrate your data to ZRS managed disks ++1. 
Create a ZRS managed disk from the source disk snapshot by using the following Azure CLI snippet: ++ ```azurecli + # Create a new ZRS managed disk using the snapshot ID and a ZRS-capable SKU + # Assumes $resourceGroupName, $diskName, $diskSize, and $snapshotId are already set (a sketch for capturing the snapshot ID follows this section) + storageType=Premium_ZRS + location=westus2 ++ az disk create --resource-group $resourceGroupName --name $diskName --sku $storageType --size-gb $diskSize --source $snapshotId + + ``` ++#### Step 3: Create a new VM with your new disks ++Now that you've migrated your data to ZRS managed disks, create a new VM with these new disks attached as the OS and data disks: ++```azurecli ++ # Replace the example disk names with the ZRS OS and data disks created from your snapshots + az vm create -g MyResourceGroup -n MyVm --attach-os-disk newZonalOSDiskCopy --attach-data-disks newZonalDataDiskCopy --os-type linux ++``` |
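The `az disk create` step above references `$snapshotId` without showing where it comes from. One way to capture it, assuming the snapshot name used in step 1 (shown here as a placeholder), is:

```azurecli
# Placeholder snapshot name; capture the resource ID of the snapshot created in step 1
snapshotId=$(az snapshot show \
  --name mySnapshot \
  --resource-group $resourceGroupName \
  --query id \
  --output tsv)
```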
vpn-gateway | Vpn Gateway About Vpn Devices | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/vpn-gateway/vpn-gateway-about-vpn-devices.md | A VPN device is required to configure a Site-to-Site (S2S) cross-premises VPN co ## <a name="devicetable"></a>Validated VPN devices and device configuration guides -In partnership with device vendors, we have validated a set of standard VPN devices. All of the devices in the device families in the following list should work with VPN gateways. +In partnership with device vendors, we have validated a set of standard VPN devices. All of the devices in the device families in the following list should work with VPN gateways. The recommended cryptographic algorithms for your device configuration are listed in the tables later in this article. -To help configure your VPN device, refer to the links that correspond to the appropriate device family. The links to configuration instructions are provided on a best-effort basis. For VPN device support, contact your device manufacturer. ++To help configure your VPN device, refer to the links that correspond to the appropriate device family. The links to configuration instructions are provided on a best-effort basis, and the defaults listed in a configuration guide might not use the strongest cryptographic algorithms. For VPN device support, contact your device manufacturer. |**Vendor** |**Device family** |**Minimum OS version** |**PolicyBased configuration instructions** |**RouteBased configuration instructions** | | | | | | | To help configure your VPN device, refer to the links that correspond to the app | Cisco | CSR | RouteBased: IOS-XE 16.10 | Not tested | [Configuration script](vpn-gateway-download-vpndevicescript.md) | | Cisco |ISR |PolicyBased: IOS 15.0<br>RouteBased*: IOS 15.1 |Supported |Supported | | Cisco |Meraki (MX) | MX v15.12 |Not compatible | [Configuration guide](https://documentation.meraki.com/MX/Site-to-site_VPN/Configuring_Site_to_Site_VPN_tunnels_to_Azure_VPN_Gateway) |-| Cisco | vEdge (Viptela OS) | 18.4.0 (Active/Passive Mode)<br><br>19.2 (Active/Active Mode) | Not compatible | [Manual configuration (Active/Passive)](https://community.cisco.com/t5/networking-documents/how-to-configure-ipsec-vpn-connection-between-cisco-vedge-and/ta-p/3841454)<br><br>[Cloud Onramp configuration (Active/Active)](https://www.cisco.com/c/en/us/td/docs/routers/sdwan/configuration/Network-Optimization-and-High-Availability/Network-Optimization-High-Availability-book/b_Network-Optimization-and-HA_chapter_00.html) | +| Cisco | vEdge (Viptela OS) | 18.4.0 (Active/Passive Mode) | Not compatible | [Manual configuration (Active/Passive)](https://community.cisco.com/t5/networking-documents/how-to-configure-ipsec-vpn-connection-between-cisco-vedge-and/ta-p/3841454) | | Citrix |NetScaler MPX, SDX, VPX |10.1 and later |[Configuration guide](https://docs.citrix.com/en-us/netscaler/11-1/system/cloudbridge-connector-introduction/cloudbridge-connector-azure.html) |Not compatible | | F5 |BIG-IP series |12.0 |[Configuration guide](https://community.f5.com/t5/technical-articles/connecting-to-windows-azure-with-the-big-ip/ta-p/282476) |[Configuration guide](https://community.f5.com/t5/technical-articles/big-ip-to-azure-dynamic-ipsec-tunneling/ta-p/282665) | | Fortinet |FortiGate |FortiOS 5.6 | Not tested |[Configuration guide](https://docs.fortinet.com/document/fortigate/5.6.0/cookbook/255100/ipsec-vpn-to-azure) | For certain devices, you can download configuration scripts directly from Azure. 
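As a sketch of that script-download flow with Azure PowerShell (the gateway, connection, and resource group names are placeholders, and the vendor, family, and firmware values must come from the output of the first cmdlet):

```azurepowershell
# Placeholder names; list the device vendors, families, and firmware versions with downloadable scripts
Get-AzVirtualNetworkGatewaySupportedVpnDevice -Name MyGateway -ResourceGroupName MyResourceGroup

# Generate the configuration script for an existing connection, using values returned above
Get-AzVirtualNetworkGatewayConnectionVpnDeviceConfigScript `
  -Name MyS2SConnection `
  -ResourceGroupName MyResourceGroup `
  -DeviceVendor "<vendor-from-output>" `
  -DeviceFamily "<device-family-from-output>" `
  -FirmwareVersion "<firmware-version-from-output>" | Out-File -FilePath .\vpn-device-config.txt
```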
## <a name="additionaldevices"></a>Nonvalidated VPN devices -If you donΓÇÖt see your device listed in the Validated VPN devices table, your device still may work with a Site-to-Site connection. Contact your device manufacturer for support and configuration instructions. +If you donΓÇÖt see your device listed in the Validated VPN devices table, your device still might work with a Site-to-Site connection. Contact your device manufacturer for support and configuration instructions. ## <a name="editing"></a>Editing device configuration samples The following table lists IPsec SA (IKE Quick Mode) Offers. Offers are listed th | 25|AES128 |SHA256 |14 | | 26|3DES |SHA1 |14 | -* You can specify IPsec ESP NULL encryption with RouteBased and HighPerformance VPN gateways. Null based encryption doesn't provide protection to data in transit, and should only be used when maximum throughput and minimum latency is required. Clients may choose to use this in VNet-to-VNet communication scenarios, or when encryption is being applied elsewhere in the solution. +* You can specify IPsec ESP NULL encryption with RouteBased and HighPerformance VPN gateways. Null based encryption doesn't provide protection to data in transit, and should only be used when maximum throughput and minimum latency is required. Clients might choose to use this in VNet-to-VNet communication scenarios, or when encryption is being applied elsewhere in the solution. * For cross-premises connectivity through the Internet, use the default Azure VPN gateway settings with encryption and hashing algorithms listed in the preceding tables to ensure security of your critical communication. ## <a name="known"></a>Known device compatibility issues |
web-application-firewall | Custom Waf Rules Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/web-application-firewall/ag/custom-waf-rules-overview.md | -The Azure Application Gateway Web Application Firewall (WAF) v2 comes with a pre-configured, platform-managed ruleset that offers protection from many different types of attacks. These attacks include cross site scripting, SQL injection, and others. If you're a WAF admin, you may want to write your own rules to augment the core rule set (CRS) rules. Your custom rules can either block, allow, or log requested traffic based on matching criteria. If the WAF policy is set to detection mode, and a custom block rule is triggered, the request is logged and no blocking action is taken. +The Azure Application Gateway Web Application Firewall (WAF) v2 comes with a preconfigured, platform-managed ruleset that offers protection from many different types of attacks. These attacks include cross site scripting, SQL injection, and others. If you're a WAF admin, you may want to write your own rules to augment the core rule set (CRS) rules. Your custom rules can either block, allow, or log requested traffic based on matching criteria. If the WAF policy is set to detection mode, and a custom block rule is triggered, the request is logged and no blocking action is taken. Custom rules allow you to create your own rules that are evaluated for each request that passes through the WAF. These rules hold a higher priority than the rest of the rules in the managed rule sets. The custom rules contain a rule name, rule priority, and an array of matching conditions. If these conditions are met, an action is taken (to allow, block, or log). If a custom rule is triggered, and an allow or block action is taken, no further custom or managed rules are evaluated. Custom rules can be enabled/disabled on demand. For example, you can block all requests from an IP address in the range 192.168.5.0/24. In this rule, the operator is *IPMatch*, the matchValues is the IP address range (192.168.5.0/24), and the action is to block the traffic. You also set the rule's name, priority, and enabled/disabled state. -Custom rules support using compounding logic to make more advanced rules that address your security needs. For example, ((Condition 1 **and** Condition 2) **or** Condition 3). This means that if Condition 1 **and** Condition 2 are met, **or** if Condition 3 is met, the WAF should take the action specified in the custom rule. +Custom rules support using compounding logic to make more advanced rules that address your security needs. For example, you can use two custom rules to create the following logic ((rule1:Condition 1 **and** rule1:Condition 2) **or** rule2:Condition 3). This logic means that if Condition 1 **and** Condition 2 are met, **or** if Condition 3 is met, the WAF should take the action specified in the custom rules. Different matching conditions within the same rule are always compounded using **and**. For example, block traffic from a specific IP address, and only if they're using a certain browser. If you want to use **or** between two different conditions, then the two conditions must be in different rules. For example, block traffic from a specific IP address or block traffic if they're using a specific browser. +Regular expressions are also supported in custom rules, just like in the CRS rulesets. For examples, see Examples 3 and 5 in [Create and use custom web application firewall rules](create-custom-waf-rules.md). 
+ > [!NOTE] > The maximum number of WAF custom rules is 100. For more information about Application Gateway limits, see [Azure subscription and service limits, quotas, and constraints](../../azure-resource-manager/management/azure-subscription-service-limits.md#application-gateway-limits). -Regular expressions are also supported in custom rules, just like in the CRS rulesets. For examples, see Examples 3 and 5 in [Create and use custom web application firewall rules](create-custom-waf-rules.md). - > [!CAUTION] > Any redirect rules applied at the application gateway level will bypass WAF custom rules. See [Application Gateway redirect overview](../../application-gateway/redirect-overview.md) for more information about redirect rules. Must be one of the variables: - RemoteAddr – IPv4 Address/Range of the remote computer connection - RequestMethod – HTTP Request method - QueryString – Variable in the URI- - PostArgs – Arguments sent in the POST body. Custom Rules using this match variable are only applied if the 'Content-Type' header is set to 'application/x-www-form-urlencoded' and 'multipart/form-data'. Additional content type of `application/json` is supported with CRS version 3.2 or greater, bot protection rule set, and geo-match custom rules. +- PostArgs – Arguments sent in the POST body. Custom Rules using this match variable are only applied if the 'Content-Type' header is set to 'application/x-www-form-urlencoded' and 'multipart/form-data.' Additional content type of `application/json` is supported with CRS version 3.2 or greater, bot protection rule set, and geo-match custom rules. - RequestUri – URI of the request - RequestHeaders – Headers of the request-- RequestBody – This contains the entire request body as a whole. Custom rules using this match variable are only applied if the 'Content-Type' header is set to `application/x-www-form-urlencoded` media type. Additional content types of `application/soap+xml, application/xml, text/xml` are supported with CRS version 3.2 or greater, bot protection rule set, and geo-match custom rules.+- RequestBody – This variable contains the entire request body as a whole. Custom rules using this match variable are only applied if the 'Content-Type' header is set to `application/x-www-form-urlencoded` media type. Additional content types of `application/soap+xml, application/xml, text/xml` are supported with CRS version 3.2 or greater, bot protection rule set, and geo-match custom rules. - RequestCookies – Cookies of the request ### Selector [optional] In WAF policy detection mode, if a custom rule is triggered, the action is alway - Detection mode - Logs the transaction based on SecDefaultAction after which evaluation is stopped. Any request after that meets the same conditions won't be evaluated and will just be logged. - Log – Lets the rule write to the log, but lets the rest of the rules run for evaluation. The other custom rules are evaluated in order of priority, followed by the managed rules. +## Copying and duplicating custom rules ++Custom rules can be duplicated within a given policy. When duplicating a rule, you need to specify a unique name for the rule and a unique priority value. Additionally, custom rules can be copied from one Application Gateway WAF policy to another as long as the policies are both in the same subscription. When copying a rule from one policy to another, you need to select the Application Gateway WAF policy you wish to copy the rule into. 
Once you select the WAF policy, you need to give the rule a unique name and assign a priority value. + ## Geomatch custom rules Custom rules let you create tailored rules to suit the exact needs of your applications and security policies. You can restrict access to your web applications by country/region. For more information, see [Geomatch custom rules](geomatch-custom-rules.md). |
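Tying the IP-range example from this entry together, a custom block rule for 192.168.5.0/24 can be sketched with Azure PowerShell as follows. The rule name and priority are placeholders, and the resulting rule object still has to be added to a WAF policy's custom rules collection when you create or update the policy:

```azurepowershell
# Placeholder rule name and priority; block traffic from 192.168.5.0/24
$matchVariable = New-AzApplicationGatewayFirewallMatchVariable -VariableName RemoteAddr

$condition = New-AzApplicationGatewayFirewallCondition `
  -MatchVariable $matchVariable `
  -Operator IPMatch `
  -MatchValue "192.168.5.0/24" `
  -NegationCondition $false

$blockRule = New-AzApplicationGatewayFirewallCustomRule `
  -Name BlockAddressRange `
  -Priority 10 `
  -RuleType MatchRule `
  -MatchCondition $condition `
  -Action Block
```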