Service | Microsoft Docs article | Related commit history on GitHub | Change details |
---|---|---|---|
ai-services | Copy Move Projects | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/custom-vision-service/copy-move-projects.md | -The **[ExportProject](/rest/api/customvision/training/projects/export?view=rest-customvision-training-v3.3&tabs=HTTP)** and **[ImportProject](/rest/api/customvision/training/projects/import?view=rest-customvision-training-v3.3&tabs=HTTP)** APIs enable this scenario by allowing you to copy projects from one Custom Vision account into others. This guide shows you how to use these REST APIs with cURL. You can also use an HTTP request service, like the [REST Client](https://marketplace.visualstudio.com/items?itemName=humao.rest-client) for Visual Studio Code, to issue the requests. +The **[ExportProject](/rest/api/customvision/projects/export)** and **[ImportProject](/rest/api/customvision/projects/import)** APIs enable this scenario by allowing you to copy projects from one Custom Vision account into others. This guide shows you how to use these REST APIs with cURL. You can also use an HTTP request service, like the [REST Client](https://marketplace.visualstudio.com/items?itemName=humao.rest-client) for Visual Studio Code, to issue the requests. > [!TIP] > For an example of this scenario using the Python client library, see the [Move Custom Vision Project](https://github.com/Azure-Samples/custom-vision-move-project/tree/master/) repository on GitHub. The process for copying a project consists of the following steps: ## Get the project ID -First call **[GetProjects](/rest/api/customvision/training/projects/get?view=rest-customvision-training-v3.3&tabs=HTTP)** to see a list of your existing Custom Vision projects and their IDs. Use the training key and endpoint of your source account. +First call **[GetProjects](/rest/api/customvision/projects/get)** to see a list of your existing Custom Vision projects and their IDs. Use the training key and endpoint of your source account. ```curl curl -v -X GET "{endpoint}/customvision/v3.3/Training/projects" You'll get a `200\OK` response with a list of projects and their metadata in the ## Export the project -Call **[ExportProject](/rest/api/customvision/training/projects/export?view=rest-customvision-training-v3.3&tabs=HTTP)** using the project ID and your source training key and endpoint. +Call **[ExportProject](/rest/api/customvision/projects/export)** using the project ID and your source training key and endpoint. ```curl curl -v -X GET "{endpoint}/customvision/v3.3/Training/projects/{projectId}/export" You'll get a `200/OK` response with metadata about the exported project and a re ## Import the project -Call **[ImportProject](/rest/api/customvision/training/projects/import?view=rest-customvision-training-v3.3&tabs=HTTP)** using your target training key and endpoint, along with the reference token. You can also give your project a name in its new account. +Call **[ImportProject](/rest/api/customvision/projects/import)** using your target training key and endpoint, along with the reference token. You can also give your project a name in its new account. ```curl curl -v -G -X POST "{endpoint}/customvision/v3.3/Training/projects/import" |
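To make the copy flow described in the change above concrete, here is a minimal sketch of the export/import sequence using Python's `requests` library. The endpoint, key, and project values are placeholders, and the `token` response field name is an assumption to verify against the ExportProject reference.

```python
import requests

source_endpoint = "https://<source-region>.api.cognitive.microsoft.com"
target_endpoint = "https://<target-region>.api.cognitive.microsoft.com"
source_key = "<source-training-key>"
target_key = "<target-training-key>"
project_id = "<source-project-id>"

# Export the project from the source account to get a reference token.
export = requests.get(
    f"{source_endpoint}/customvision/v3.3/Training/projects/{project_id}/export",
    headers={"Training-Key": source_key},
)
export.raise_for_status()
token = export.json()["token"]  # assumed field name; check the ExportProject response schema

# Import the project into the target account using the reference token.
imported = requests.post(
    f"{target_endpoint}/customvision/v3.3/Training/projects/import",
    headers={"Training-Key": target_key},
    params={"token": token, "name": "copied-project"},
)
imported.raise_for_status()
print("New project ID:", imported.json().get("id"))
```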
ai-services | Export Delete Data | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/custom-vision-service/export-delete-data.md | -Custom Vision collects user data to operate the service, but customers have full control to viewing and delete their data using the Custom Vision [Training APIs](https://go.microsoft.com/fwlink/?linkid=865446). +Custom Vision collects user data to operate the service, but customers have full control to viewing and delete their data using the Custom Vision [Training APIs](/rest/api/customvision/train-project). [!INCLUDE [GDPR-related guidance](~/reusable-content/ce-skilling/azure/includes/gdpr-intro-sentence.md)] To learn how to view or delete different kinds of user data in Custom Vision, se | Data | View operation | Delete operation | | - | - | - |-| Account info (Keys) | [GetAccountInfo](https://go.microsoft.com/fwlink/?linkid=865446) | Delete using Azure portal (for Azure Subscriptions). Or use **Delete Your Account** button in [CustomVision.ai](https://customvision.ai) settings page (for Microsoft Account Subscriptions) | -| Iteration details | [GetIteration](https://go.microsoft.com/fwlink/?linkid=865446) | [DeleteIteration](https://go.microsoft.com/fwlink/?linkid=865446) | -| Iteration performance details | [GetIterationPerformance](https://go.microsoft.com/fwlink/?linkid=865446) | [DeleteIteration](https://go.microsoft.com/fwlink/?linkid=865446) | -| List of iterations | [GetIterations](https://go.microsoft.com/fwlink/?linkid=865446) | [DeleteIteration](https://go.microsoft.com/fwlink/?linkid=865446) | -| Projects and project details | [GetProject](https://go.microsoft.com/fwlink/?linkid=865446) and [GetProjects](https://go.microsoft.com/fwlink/?linkid=865446) | [DeleteProject](https://go.microsoft.com/fwlink/?linkid=865446) | -| Image tags | [GetTag](https://go.microsoft.com/fwlink/?linkid=865446) and [GetTags](https://go.microsoft.com/fwlink/?linkid=865446) | [DeleteTag](https://go.microsoft.com/fwlink/?linkid=865446) | -| Images | [GetTaggedImages](https://go.microsoft.com/fwlink/?linkid=865446) (provides uri for image download) and [GetUntaggedImages](https://go.microsoft.com/fwlink/?linkid=865446) (provides uri for image download) | [DeleteImages](https://go.microsoft.com/fwlink/?linkid=865446) | -| Exported iterations | [GetExports](https://go.microsoft.com/fwlink/?linkid=865446) | Deleted upon account deletion | +| Account info (Keys) | [GetAccountInfo](/rest/api/aiservices/accountmanagement/accounts/get) | Delete using Azure portal (for Azure Subscriptions). 
Or use **Delete Your Account** button in [CustomVision.ai](https://customvision.ai) settings page (for Microsoft Account Subscriptions) | +| Iteration details | [GetIteration](/rest/api/customvision/get-iteration) | [DeleteIteration](/rest/api/customvision/delete-iteration) | +| Iteration performance details | [GetIterationPerformance](/rest/api/customvision/get-iteration-performance) | [DeleteIteration](/rest/api/customvision/delete-iteration) | +| List of iterations | [GetIterations](/rest/api/customvision/get-iterations) | [DeleteIteration](/rest/api/customvision/delete-iteration) | +| Projects and project details | [GetProject](/rest/api/customvision/get-project) and [GetProjects](/rest/api/customvision/get-projects) | [DeleteProject](/rest/api/customvision/delete-project) | +| Image tags | [GetTag](/rest/api/customvision/get-tag) and [GetTags](/rest/api/customvision/get-tags) | [DeleteTag](/rest/api/customvision/delete-tag) | +| Images | [GetTaggedImages](/rest/api/customvision/get-tagged-images) (provides uri for image download) and [GetUntaggedImages](/rest/api/customvision/get-untagged-images) (provides uri for image download) | [DeleteImages](/rest/api/customvision/delete-images) | +| Exported iterations | [GetExports](/rest/api/customvision/get-exports) | Deleted upon account deletion | |
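As an illustration of the view/delete operations mapped in the table above, the following hedged sketch retrieves a project's details and then deletes it over the v3.3 Training REST API; the endpoint, key, and project ID are placeholders.

```python
import requests

endpoint = "https://<your-region>.api.cognitive.microsoft.com"
training_key = "<your-training-key>"
project_id = "<project-id>"
headers = {"Training-Key": training_key}

# View: fetch the project's details (GetProject).
project = requests.get(
    f"{endpoint}/customvision/v3.3/Training/projects/{project_id}", headers=headers
)
project.raise_for_status()
print(project.json())

# Delete: remove the project and its associated data (DeleteProject).
deleted = requests.delete(
    f"{endpoint}/customvision/v3.3/Training/projects/{project_id}", headers=headers
)
deleted.raise_for_status()
```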
ai-services | Limits And Quotas | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/custom-vision-service/limits-and-quotas.md | There are two tiers of keys for the Custom Vision service. You can sign up for a |How long prediction images stored|30 days|30 days| |[Prediction](/rest/api/customvision/predictions) operations with storage (Transactions Per Second)|2|10| |[Prediction](/rest/api/customvision/predictions) operations without storage (Transactions Per Second)|2|20|-|[TrainProject](https://go.microsoft.com/fwlink/?linkid=865446) (API calls Per Second)|2|10| -|[Other API calls](https://go.microsoft.com/fwlink/?linkid=865446) (Transactions Per Second)|10|10| +|[TrainProject](/rest/api/customvision/train-project/train-project) (API calls Per Second)|2|10| +|[Other API calls](/rest/api/custom-vision) (Transactions Per Second)|10|10| |Accepted image types|jpg, png, bmp, gif|jpg, png, bmp, gif| |Min image height/width in pixels|256 (see note)|256 (see note)| |Max image height/width in pixels|10,240|10,240| |
ai-services | Select Domain | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/custom-vision-service/select-domain.md | -From the **settings** tab of your project on the Custom Vision web portal, you can select a model domain for your project. You'll want to choose the domain that's closest to your use case scenario. If you're accessing Custom Vision through a client library or REST API, you'll need to specify a domain ID when creating the project. You can get a list of domain IDs with [Get Domains](/rest/api/customvision/training/domains/list?view=rest-customvision-training-v3.3&tabs=HTTP). Or, use the table below. +From the **settings** tab of your project on the Custom Vision web portal, you can select a model domain for your project. You'll want to choose the domain that's closest to your use case scenario. If you're accessing Custom Vision through a client library or REST API, you'll need to specify a domain ID when creating the project. You can get a list of domain IDs with [Get Domains](/rest/api/customvision/get-domains). Or, use the table below. ## Image Classification domains |
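A small sketch of retrieving domain IDs with Get Domains, assuming the same v3.3 Training endpoint used in the other Custom Vision examples in this digest; the endpoint and key are placeholders, and the response field names should be checked against the reference.

```python
import requests

endpoint = "https://<your-region>.api.cognitive.microsoft.com"
training_key = "<your-training-key>"

resp = requests.get(
    f"{endpoint}/customvision/v3.3/Training/domains",
    headers={"Training-Key": training_key},
)
resp.raise_for_status()
for domain in resp.json():
    # Assumed fields on each domain object; verify against the Get Domains response schema.
    print(domain["id"], domain["name"], domain["type"])
```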
ai-services | Storage Integration | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/custom-vision-service/storage-integration.md | Now that you have the integration URLs, you can create a new Custom Vision proje #### [Create a new project](#tab/create) -When you call the [CreateProject](/rest/api/customvision/training/projects/create?view=rest-customvision-training-v3.3&tabs=HTTP) API, add the optional parameters _exportModelContainerUri_ and _notificationQueueUri_. Assign the URL values you got in the previous section. +When you call the [CreateProject](/rest/api/customvision/create-project) API, add the optional parameters _exportModelContainerUri_ and _notificationQueueUri_. Assign the URL values you got in the previous section. ```curl curl -v -X POST "{endpoint}/customvision/v3.3/Training/projects?exportModelContainerUri={inputUri}&notificationQueueUri={inputUri}&name={inputName}" If you receive a `200/OK` response, that means the URLs have been set up success #### [Update an existing project](#tab/update) -To update an existing project with Azure storage feature integration, call the [UpdateProject](/rest/api/customvision/training/projects/update?view=rest-customvision-training-v3.3&tabs=HTTP) API, using the ID of the project you want to update. +To update an existing project with Azure storage feature integration, call the [UpdateProject](/rest/api/customvision/update-project) API, using the ID of the project you want to update. ```curl curl -v -X PATCH "{endpoint}/customvision/v3.3/Training/projects/{projectId}" In your notification queue, you should see a test notification in the following ## Get event notifications -When you're ready, call the [TrainProject](/rest/api/customvision/training/projects/train?view=rest-customvision-training-v3.3&tabs=HTTP) API on your project to do an ordinary training operation. +When you're ready, call the [TrainProject](/rest/api/customvision/train-project) API on your project to do an ordinary training operation. In your Storage notification queue, you'll receive a notification once training finishes: The `"trainingStatus"` field may be either `"TrainingCompleted"` or `"TrainingFa ## Get model export backups -When you're ready, call the [ExportIteration](/rest/api/customvision/training/iterations/export?view=rest-customvision-training-v3.3&tabs=HTTP) API to export a trained model into a specified platform. +When you're ready, call the [ExportIteration](/rest/api/customvision/export-iteration) API to export a trained model into a specified platform. In your designated storage container, a backup copy of the exported model will appear. The blob name will have the format: The `"exportStatus"` field may be either `"ExportCompleted"` or `"ExportFailed"` ## Next steps In this guide, you learned how to copy and back up a project between Custom Vision resources. Next, explore the API reference docs to see what else you can do with Custom Vision.-* [REST API reference documentation (training)](/rest/api/customvision/training/operation-groups?view=rest-customvision-training-v3.3) +* [REST API reference documentation (training)](/rest/api/customvision/train-project) * [REST API reference documentation (prediction)](/rest/api/customvision/predictions) |
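For reference, the CreateProject call above with the storage integration parameters can be issued from Python as follows; this is a sketch with placeholder endpoint, key, and SAS URL values rather than a definitive implementation.

```python
import requests

endpoint = "https://<your-region>.api.cognitive.microsoft.com"
training_key = "<your-training-key>"

# Create a project with the optional storage integration query parameters.
resp = requests.post(
    f"{endpoint}/customvision/v3.3/Training/projects",
    headers={"Training-Key": training_key},
    params={
        "name": "storage-integrated-project",
        "exportModelContainerUri": "<blob-container-sas-url>",
        "notificationQueueUri": "<storage-queue-sas-url>",
    },
)
resp.raise_for_status()
print(resp.json())
```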
ai-services | Concept Read | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/document-intelligence/concept-read.md | description: Extract print and handwritten text from scanned and digital documen -- - ignite-2023 Last updated 08/07/2024 The searchable PDF capability enables you to convert an analog PDF, such as scan > [!IMPORTANT] > > * Currently, the searchable PDF capability is only supported by Read OCR model `prebuilt-read`. When using this feature, please specify the `modelId` as `prebuilt-read`, as other model types will return error for this preview version.- > * Searchable PDF is included with the 2024-07-31-preview `prebuilt-read` model with no usage cost for general PDF consumption. + > * Searchable PDF is included with the 2024-07-31-preview `prebuilt-read` model with no additional cost for generating a searchable PDF output. ### Use searchable PDF POST /documentModels/prebuilt-read:analyze?output=pdf 202 ``` -Once the `Analyze` operation is complete, make a `GET` request to retrieve the `Analyze` operation results. +Poll for completion of the `Analyze` operation. Once the operation is complete, issue a `GET` request to retrieve the PDF format of the `Analyze` operation results. Upon successful completion, the PDF can be retrieved and downloaded as `application/pdf`. This operation allows direct downloading of the embedded text form of PDF instead of Base64-encoded JSON. |
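The searchable PDF flow described above (start the analysis with `output=pdf`, poll, then download the PDF) can be sketched as follows. The endpoint, key, and document URL are placeholders, and the assumption that the PDF is served from the operation's `analyzeResults/.../pdf` route should be verified against the Document Intelligence 2024-07-31-preview reference.

```python
import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"
key = "<your-key>"
api_version = "2024-07-31-preview"

# Start the Analyze operation on prebuilt-read with PDF output enabled.
start = requests.post(
    f"{endpoint}/documentintelligence/documentModels/prebuilt-read:analyze",
    params={"api-version": api_version, "output": "pdf"},
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json={"urlSource": "https://<your-storage>/sample.pdf"},  # placeholder document URL
)
start.raise_for_status()
operation_url = start.headers["Operation-Location"]

# Poll until the Analyze operation completes.
while True:
    status = requests.get(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    ).json()
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(2)

# Retrieve the embedded-text PDF (returned as application/pdf).
# Assumes the result PDF lives at the analyzeResults .../pdf route derived from Operation-Location.
pdf_url = operation_url.split("?")[0] + "/pdf"
pdf = requests.get(
    pdf_url,
    params={"api-version": api_version},
    headers={"Ocp-Apim-Subscription-Key": key},
)
pdf.raise_for_status()
with open("searchable.pdf", "wb") as f:
    f.write(pdf.content)
```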
ai-services | Whats New | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/document-intelligence/whats-new.md | The Document Intelligence [**2024-07-31-preview**](/rest/api/aiservices/document * **West Europe** * **North Central US** -* [Read model](concept-read.md) now supports [PDF output](concept-read.md#searchable-pdf) to download PDFs with embedded text from extraction results, allowing for PDF to be utilized in scenarios such as search and large language model ingestion. -* [Layout model](concept-layout.md) now supports improved [figure detection](concept-layout.md#figures) where figures from documents can now be downloaded as an image file to be used for further figure understanding. -* [Custom extraction models](concept-custom.md#custom-extraction-models) - * Custom extraction models now support updating the model in-place. -* [🆕 Custom generative (Document field extraction) model](concept-custom-generative.md) - * Document Intelligence now offers new custom generative model that utilizes generative AI to extract fields from unstructured documents or structured forms with a wide variety of visual templates. +* [🆕 Document field extraction (custom generative) model](concept-custom-generative.md) + * Use **Generative AI** to extract fields from documents and forms. Document Intelligence now offers a new document field extraction model that utilizes large language models (LLMs) to extract fields from unstructured documents or structured forms with a wide variety of visual templates. With grounded values and confidence scores, the new Generative AI based extraction fits into your existing processes. +* [🆕 Model compose with custom classifiers](concept-composed-models.md) + * Document Intelligence now adds support for composing model with an explicit custom classification model. [Learn more about the benefits](concept-composed-models.md) of using the new compose capability. * [Custom classification model](concept-custom.md#custom-classification-model) * Custom classification model now supports updating the model in-place as well. * Custom classification model adds support for model copy operation to enable backup and disaster recovery. The Document Intelligence [**2024-07-31-preview**](/rest/api/aiservices/document * New prebuilt to extract account information including beginning and ending balances, transaction details from bank statements.​ * [🆕 US Tax model](concept-tax-document.md) * New unified US tax model that can extract from forms such as W-2, 1098, 1099, and 1040.+* 🆕 Searchable PDF. The [prebuilt read](concept-read.md) model now supports [PDF output](concept-read.md#searchable-pdf) to download PDFs with embedded text from extraction results, allowing for PDF to be utilized in scenarios such as search copy of contents. +* [Layout model](concept-layout.md) now supports improved [figure detection](concept-layout.md#figures) where figures from documents can now be downloaded as an image file to be used for further figure understanding. The layout model also features improvements to the OCR model for scanned text targeting improvements for single characters, boxed text, and dense text documents. +* [🆕 Batch API](concept-batch-analysis.md) + * Document Intelligence now adds support for batch analysis operation to support analyzing a set of documents to simplify developer experience and improve efficiency. 
* [Add-on capabilities](concept-add-on-capabilities.md) * [Query fields](concept-add-on-capabilities.md#query-fields) AI quality of extraction is improved with the latest model.-* [🆕 Batch API](concept-batch-analysis.md) - * Document Intelligence now adds support for batch analysis operation to support analyzing a set of documents to simplify developer experience and improve service efficiency. -* [🆕 Model compose with custom classifiers](concept-composed-models.md) - * Document Intelligence now adds support for composing model with an explicit custom classification model. ++ ## May 2024 |
ai-services | Model Retirements | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/concepts/model-retirements.md | These models are currently available for use in Azure OpenAI Service. | Model | Version | Retirement date | | - | - | - | | `gpt-35-turbo` | 0301 | No earlier than October 1, 2024 |-| `gpt-35-turbo`<br>`gpt-35-turbo-16k` | 0613 | October 1, 2024 | +| `gpt-35-turbo`<br>`gpt-35-turbo-16k` | 0613 | November 1, 2024 | | `gpt-35-turbo` | 1106 | No earlier than Nov 17, 2024 | | `gpt-35-turbo` | 0125 | No earlier than Feb 22, 2025 | | `gpt-4`<br>`gpt-4-32k` | 0314 | **Deprecation:** October 1, 2024 <br> **Retirement:** June 6, 2025 | If you're an existing customer looking for information about these models, see [ ## Retirement and deprecation history +### August 8, 2024 ++* Updated `gpt-35-turbo` & `gpt-35-turbo-16k` (0613) model's retirement date to November 1, 2024. + ### July 30, 2024 * Updated `gpt-4` preview model upgrade date to November 15, 2024 or later for the following versions: |
ai-services | Quotas Limits | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/openai/quotas-limits.md | Global Standard deployments use Azure's global infrastructure, dynamically routi The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer's usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant. +> [!NOTE] +> Usage tiers only apply to standard and global standard deployment types. Usage tiers do not apply to global batch deployments. + #### GPT-4o global standard & standard |Model| Usage Tiers per month | |-|-|-|`gpt-4o` |1.5 Billion tokens | +|`gpt-4o` | 8 Billion tokens | |`gpt-4o-mini` | 45 Billion tokens | +#### GPT-4 standard ++|Model| Usage Tiers per month| +||| +| `gpt-4` + `gpt-4-32k` (all versions) | 4 Billion | ++ ## Other offer types If your Azure subscription is linked to certain [offer types](https://azure.microsoft.com/support/legal/offer-details/) your max quota values are lower than the values indicated in the above tables. |
ai-services | Speech Container Overview | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-services/speech-service/speech-container-overview.md | The following table lists the Speech containers available in the Microsoft Conta | Container | Features | Supported versions and locales | |--|--|--|-| [Speech to text](speech-container-stt.md) | Transcribes continuous real-time speech or batch audio recordings with intermediate results. | Latest: 4.7.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/speech-to-text/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/speech-to-text/tags/list).| -| [Custom speech to text](speech-container-cstt.md) | Using a custom model from the [custom speech portal](https://speech.microsoft.com/customspeech), transcribes continuous real-time speech or batch audio recordings into text with intermediate results. | Latest: 4.7.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/custom-speech-to-text/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/speech-to-text/tags/list). | -| [Speech language identification](speech-container-lid.md)<sup>1, 2</sup> | Detects the language spoken in audio files. | Latest: 1.13.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/language-detection/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/language-detection/tags/list). | +| [Speech to text](speech-container-stt.md) | Transcribes continuous real-time speech or batch audio recordings with intermediate results. | Latest: 4.8.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/speech-to-text/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/speech-to-text/tags/list).| +| [Custom speech to text](speech-container-cstt.md) | Using a custom model from the [custom speech portal](https://speech.microsoft.com/customspeech), transcribes continuous real-time speech or batch audio recordings into text with intermediate results. | Latest: 4.8.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/custom-speech-to-text/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/speech-to-text/tags/list). | +| [Speech language identification](speech-container-lid.md)<sup>1, 2</sup> | Detects the language spoken in audio files. | Latest: 1.14.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/language-detection/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/language-detection/tags/list). | | [Neural text to speech](speech-container-ntts.md) | Converts text to natural-sounding speech by using deep neural network technology, which allows for more natural synthesized speech. 
| Latest: 3.3.0<br/><br/>For all supported versions and locales, see the [Microsoft Container Registry (MCR)](https://mcr.microsoft.com/product/azure-cognitive-services/speechservices/neural-text-to-speech/tags) and [JSON tags](https://mcr.microsoft.com/v2/azure-cognitive-services/speechservices/neural-text-to-speech/tags/list). | <sup>1</sup> The container is available in public preview. Containers in preview are still under development and don't meet Microsoft's stability and support requirements. |
ai-studio | Configure Managed Network | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/configure-managed-network.md | You need to configure following network isolation configurations. - Create private endpoint outbound rules to your private Azure resources. Private Azure AI Search isn't supported yet. - If you use Visual Studio Code integration with allow only approved outbound mode, create FQDN outbound rules described in the [use Visual Studio Code](#scenario-use-visual-studio-code) section. - If you use HuggingFace models in Models with allow only approved outbound mode, create FQDN outbound rules described in the [use HuggingFace models](#scenario-use-huggingface-models) section.+- If you use one of the open-source models with allow only approved outbound mode, create FQDN outbound rules described in the [curated by Azure AI](#scenario-curated-by-azure-ai) section. ## Network isolation architecture and isolation modes If you plan to use __HuggingFace models__ with the hub, add outbound _FQDN_ rule * cnd.auth0.com * cdn-lfs.huggingface.co +### Scenario: Curated by Azure AI ++These models involve dynamic installation of dependencies at runtime, and require outbound _FQDN_ rules to allow traffic to the following hosts: ++*.anaconda.org +*.anaconda.com +anaconda.com +pypi.org +*.pythonhosted.org +*.pytorch.org +pytorch.org + ## Private endpoints Private endpoints are currently supported for the following Azure |
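As a hedged illustration of adding the FQDN outbound rules listed above programmatically, the following sketch uses the Azure ML SDK v2 (`azure-ai-ml`), which also manages the hub's underlying workspace networking; the hub name, rule-naming scheme, and update pattern are assumptions to confirm against the SDK documentation.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import FqdnDestination
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<hub-name>",  # AI Studio hubs surface as workspaces in the SDK
)

hub = ml_client.workspaces.get("<hub-name>")
rules = list(hub.managed_network.outbound_rules or [])

# FQDNs required for models curated by Azure AI (dynamic dependency installs at runtime).
for host in ["*.anaconda.org", "*.anaconda.com", "anaconda.com", "pypi.org",
             "*.pythonhosted.org", "*.pytorch.org", "pytorch.org"]:
    rule_name = host.replace("*.", "wildcard-").replace(".", "-")  # assumed naming scheme
    rules.append(FqdnDestination(name=rule_name, destination=host))

hub.managed_network.outbound_rules = rules
ml_client.workspaces.begin_update(hub).result()  # long-running operation
```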
ai-studio | Deploy Jais Models | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-jais-models.md | - Title: How to deploy JAIS models with Azure AI Studio- -description: Learn how to deploy JAIS models with Azure AI Studio. --- Previously updated : 5/21/2024-------# How to deploy JAIS with Azure AI Studio ---In this article, you learn how to use Azure AI Studio to deploy the JAIS model as serverless APIs with pay-as-you-go token-based billing. --The JAIS model is available in [Azure AI Studio](https://ai.azure.com) with pay-as-you-go token based billing with Models as a Service. --You can find the JAIS model in the [Model Catalog](model-catalog.md) by filtering on the JAIS collection. --### Prerequisites --- An Azure subscription with a valid payment method. Free or trial Azure subscriptions will not work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [Azure AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for JAIS is only available with hubs created in these regions:-- * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 - * Sweden Central -- For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). -- An [AI Studio project](../how-to/create-projects.md) in Azure AI Studio.-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __Azure AI Developer role__ on the resource group. For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md).---### JAIS 30b Chat --JAIS 30b Chat is an auto-regressive bi-lingual LLM for **Arabic** & **English**. The tuned versions use supervised fine-tuning (SFT). The model is fine-tuned with both Arabic and English prompt-response pairs. The fine-tuning datasets included a wide range of instructional data across various domains. The model covers a wide range of common tasks including question answering, code generation, and reasoning over textual content. To enhance performance in Arabic, the Core42 team developed an in-house Arabic dataset as well as translating some open-source English instructions into Arabic. --*Context length:* JAIS supports a context length of 8K. --*Input:* Model input is text only. --*Output:* Model generates text only. --## Deploy as a serverless API ---Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription. --The previously mentioned JAIS 30b Chat model can be deployed as a service with pay-as-you-go billing and is offered by Core42 through the Microsoft Azure Marketplace. Core42 can change or update the terms of use and pricing of this model. ---### Create a new deployment --To create a deployment: --1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Select **Model catalog** from the left sidebar. -1. 
Search for *JAIS* and select the model _Jais-30b-chat_. -- :::image type="content" source="../media/deploy-monitor/jais/jais-search.png" alt-text="A screenshot showing a model in the model catalog." lightbox="../media/deploy-monitor/jais/jais-search.png"::: --2. Select **Deploy** to open a serverless API deployment window for the model. -- :::image type="content" source="../media/deploy-monitor/jais/jais-deploy-pay-as-you-go.png" alt-text="A screenshot showing how to deploy a model with the pay-as-you-go option." lightbox="../media/deploy-monitor/jais/jais-deploy-pay-as-you-go.png"::: --1. Select the project in which you want to deploy your model. To deploy the model your project must be in the East US 2 or Sweden Central region. -1. In the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. -1. Select the **Pricing and terms** tab to learn about pricing for the selected model. -1. Select the **Subscribe and Deploy** button. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering. This step requires that your account has the **Azure AI Developer role** permissions on the resource group, as listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering of the model, which allows you to control and monitor spending. Currently, you can have only one deployment for each model within a project. -- :::image type="content" source="../media/deploy-monitor/jais/jais-marketplace-terms.png" alt-text="A screenshot showing the terms and conditions of a given model." lightbox="../media/deploy-monitor/jais/jais-marketplace-terms.png"::: --1. Once you subscribe the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. If this scenario applies to you, there's a **Continue to deploy** option to select. -- :::image type="content" source="../media/deploy-monitor/jais/jais-existing-subscription.png" alt-text="A screenshot showing a project that is already subscribed to the offering." lightbox="../media/deploy-monitor/jais/jais-existing-subscription.png"::: --1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. -- :::image type="content" source="../media/deploy-monitor/jais/jais-deployment-name.png" alt-text="A screenshot showing how to indicate the name of the deployment you want to create." lightbox="../media/deploy-monitor/jais/jais-deployment-name.png"::: --1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Select **Open in playground** to start interacting with the model. -1. You can return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**. For more information on using the APIs, see the [reference](#chat-api-reference-for-jais-deployed-as-a-service) section. -1. You can always find the endpoint's details, URL, and access keys by navigating to your **Project overview** page. Then, from the left sidebar of your project, select **Components** > **Deployments**. 
--To learn about billing for the JAIS models deployed as a serverless API with pay-as-you-go token-based billing, see [Cost and quota considerations for JAIS models deployed as a service](#cost-and-quota-considerations-for-models-deployed-as-a-service) --### Consume the JAIS 30b Chat model as a service --These models can be consumed using the chat API. --1. From your **Project overview** page, go to the left sidebar and select **Components** > **Deployments**. --1. Find and select the deployment you created. --1. Copy the **Target** URL and the **Key** value. --For more information on using the APIs, see the [reference](#chat-api-reference-for-jais-deployed-as-a-service) section. --## Chat API reference for JAIS deployed as a service --### v1/chat/completions --#### Request -``` - POST /v1/chat/completions HTTP/1.1 - Host: <DEPLOYMENT_URI> - Authorization: Bearer <TOKEN> - Content-type: application/json -``` --#### v1/chat/completions request schema --JAIS 30b Chat accepts the following parameters for a `v1/chat/completions` response inference call: --| Property | Type | Default | Description | -| | | | | -| `messages` | `array` | `None` | Text input for the model to respond to. | -| `max_tokens` | `integer` | `None` | The maximum number of tokens the model generates as part of the response. Note: Setting a low value might result in incomplete generations. If not specified, generates tokens until end of sequence. | -| `temperature` | `float` | `0.3` | Controls randomness in the model. Lower values make the model more deterministic and higher values make the model more random. | -| `top_p` | `float` |`None`|The cumulative probability of parameter highest probability vocabulary tokens to keep for nucleus sampling, defaults to null.| -| `top_k` | `integer` |`None`|The number of highest probability vocabulary tokens to keep for top-k-filtering, defaults to null.| ---A System or User Message supports the following properties: --| Property | Type | Default | Description | -| | | | | -| `role` | `enum` | Required | `role=system` or `role=user`. | -|`content` |`string` |Required |Text input for the model to respond to. | ---An Assistant Message supports the following properties: --| Property | Type | Default | Description | -| | | | | -| `role` | `enum` | Required | `role=assistant`| -|`content` |`string` |Required |The contents of the assistant message. | ---#### v1/chat/completions response schema --The response payload is a dictionary with the following fields: --| Key | Type | Description | -| | | | -| `id` | `string` | A unique identifier for the completion. | -| `choices` | `array` | The list of completion choices the model generated for the input messages. | -| `created` | `integer` | The Unix timestamp (in seconds) of when the completion was created. | -| `model` | `string` | The model_id used for completion. | -| `object` | `string` | chat.completion. | -| `usage` | `object` | Usage statistics for the completion request. | --The `choices` object is a dictionary with the following fields: --| Key | Type | Description | -| | | | -| `index` | `integer` | Choice index. | -| `messages` or `delta` | `string` | Chat completion result in messages object. When streaming mode is used, delta key is used. | -| `finish_reason` | `string` | The reason the model stopped generating tokens. | --The `usage` object is a dictionary with the following fields: --| Key | Type | Description | -| | | | -| `prompt_tokens` | `integer` | Number of tokens in the prompt. 
| -| `completion_tokens` | `integer` | Number of tokens generated in the completion. | -| `total_tokens` | `integer` | Total tokens. | ---#### Examples --##### Arabic -Request: --```json - "messages": [ - { - "role": "user", - "content": "ما هي الأماكن الشهيرة التي يجب زيارتها في الإمارات؟" - } - ] -``` --Response: --```json - { - "id": "df23b9f7-e6bd-493f-9437-443c65d428a1", - "choices": [ - { - "index": 0, - "finish_reason": "stop", - "message": { - "role": "assistant", - "content": "هناك العديد من الأماكن المذهلة للزيارة في الإمارات! ومن أشهرها برج خليفة في دبي وهو أطول مبنى في العالم ، ومسجد الشيخ زايد الكبير في أبوظبي والذي يعد أحد أجمل المساجد في العالم ، وصحراء ليوا في الظفرة والتي تعد أكبر صحراء رملية في العالم وتجذب الكثير من السياح لتجربة ركوب الجمال والتخييم في الصحراء. كما يمكن للزوار الاستمتاع بالشواطئ الجميلة في دبي وأبوظبي والشارقة ورأس الخيمة، وزيارة متحف اللوفر أبوظبي للتعرف على تاريخ الفن والثقافة العالمية" - } - } - ], - "created": 1711734274, - "model": "jais-30b-chat", - "object": "chat.completion", - "usage": { - "prompt_tokens": 23, - "completion_tokens": 744, - "total_tokens": 767 - } - } -``` --##### English -Request: --```json - "messages": [ - { - "role": "user", - "content": "List the emirates of the UAE." - } - ] -``` --Response: --```json - { - "id": "df23b9f7-e6bd-493f-9437-443c65d428a1", - "choices": [ - { - "index": 0, - "finish_reason": "stop", - "message": { - "role": "assistant", - "content": "The seven emirates of the United Arab Emirates are: Abu Dhabi, Dubai, Sharjah, Ajman, Umm Al-Quwain, Fujairah, and Ras Al Khaimah." - } - } - ], - "created": 1711734274, - "model": "jais-30b-chat", - "object": "chat.completion", - "usage": { - "prompt_tokens": 23, - "completion_tokens": 60, - "total_tokens": 83 - } - } -``` --##### More inference examples --| **Sample Type** | **Sample Notebook** | -|-|-| -| CLI using CURL and Python web requests | [webrequests.ipynb](https://aka.ms/jais/webrequests-sample) | -| OpenAI SDK (experimental) | [openaisdk.ipynb](https://aka.ms/jais/openaisdk) | -| LiteLLM | [litellm.ipynb](https://aka.ms/jais/litellm-sample) | --## Cost and quotas --### Cost and quota considerations for models deployed as a service --JAIS 30b Chat is deployed as a service are offered by Core42 through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. --Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. --For more information on how to track costs, see [monitor costs for models offered throughout the Azure Marketplace](../how-to/costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). --Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. --## Content filtering --Models deployed as a service with pay-as-you-go billing are protected by [Azure AI Content Safety](../../ai-services/content-safety/overview.md). 
With Azure AI content safety, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about [content filtering here](../concepts/content-filtering.md). --## Next steps --- [What is Azure AI Studio?](../what-is-ai-studio.md)-- [Azure AI FAQ article](../faq.yml)-- [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) |
ai-studio | Deploy Models Cohere Command | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-cohere-command.md | Title: How to deploy Cohere Command models with Azure AI Studio + Title: How to use Cohere Command chat models with Azure AI Studio -description: Learn how to deploy Cohere Command models with Azure AI Studio. -+description: Learn how to use Cohere Command chat models with Azure AI Studio. + Previously updated : 5/21/2024 Last updated : 08/08/2024 +reviewer: shubhirajMsft -++zone_pivot_groups: azure-ai-model-catalog-samples-chat +++# How to use Cohere Command chat models ++In this article, you learn about Cohere Command chat models and how to use them. +The Cohere family of models includes various models optimized for different use cases, including chat completions, embeddings, and rerank. Cohere models are optimized for various use cases that include reasoning, summarization, and question answering. +++++## Cohere Command chat models ++The Cohere Command chat models include the following models: ++# [Cohere Command R+](#tab/cohere-command-r-plus) ++Command R+ is a generative large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++We recommend using Command R+ for those workflows that lean on complex retrieval augmented generation (RAG) functionality and multi-step tool use (agents). +++The following models are available: ++* [Cohere-command-r-plus](https://aka.ms/azureai/landing/Cohere-command-r-plus) +++# [Cohere Command R](#tab/cohere-command-r) ++Command R is a large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++Command R is great for simpler retrieval augmented generation (RAG) and single-step tool use tasks. It's also great for use in applications where price is a major consideration. 
+++The following models are available: ++* [Cohere-command-r](https://aka.ms/azureai/landing/Cohere-command-r) +++++> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Cohere Command chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Cohere Command chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Cohere Command chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```python +model_info = client.get_model_info() +``` ++The response is as follows: +++```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: Cohere-command-r-plus +Model type: chat-completions +Model provider name: Cohere +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```python +from azure.ai.inference.models import SystemMessage, UserMessage ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` ++The response is as follows, where you can see the model's usage statistics: +++```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Cohere-command-r-plus +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. + """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). 
++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Cohere Command chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant that always generate responses in JSON format, using." + " the following format: { ""answer"": ""response"" }."), + UserMessage(content="How many languages are in the world?"), + ], + response_format={ "type": ChatCompletionsResponseFormat.JSON_OBJECT } +) +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++### Use tools ++Cohere Command chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```python +from azure.ai.inference.models import FunctionDefinition, ChatCompletionsFunctionToolDefinition ++flight_info = ChatCompletionsFunctionToolDefinition( + function=FunctionDefinition( + name="get_flight_info", + description="Returns information about the next flight between two cities. 
This includes the name of the airline, flight number and the date and time of the next flight", + parameters={ + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates", + }, + "destination_city": { + "type": "string", + "description": "The flight destination city", + }, + }, + "required": ["origin_city", "destination_city"], + }, + ) +) ++tools = [flight_info] +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```python +def get_flight_info(loc_origin: str, loc_destination: str): + return { + "info": f"There are no flights available from {loc_origin} to {loc_destination}. You should take a train, specially if it helps to reduce CO2 emissions." + } +``` ++> [!NOTE] +> Cohere-command-r-plus and Cohere-command-r require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type *Tool*, ensure the response is a valid JSON string. ++Prompt the model to book flights with the help of this function: +++```python +messages = [ + SystemMessage( + content="You are a helpful assistant that help users to find information about traveling, how to get" + " to places and the different transportations options. You care about the environment and you" + " always have that in mind when answering inqueries.", + ), + UserMessage( + content="When is the next flight from Miami to Seattle?", + ), +] ++response = client.complete( + messages=messages, tools=tools, tool_choice="auto" +) +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```python +response_message = response.choices[0].message +tool_calls = response_message.tool_calls ++print("Finish reason:", response.choices[0].finish_reason) +print("Tool call:", tool_calls) +``` ++To continue, append this message to the chat history: +++```python +messages.append( + response_message +) +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```python +import json +from azure.ai.inference.models import ToolMessage ++for tool_call in tool_calls: ++ # Get the tool details: ++ function_name = tool_call.function.name + function_args = json.loads(tool_call.function.arguments.replace("\'", "\"")) + tool_call_id = tool_call.id ++ print(f"Calling function `{function_name}` with arguments {function_args}") ++ # Call the function defined above using `locals()`, which returns the list of all functions + # available in the scope as a dictionary. Notice that this is just done as a simple way to get + # the function callable from its string name. Then we can call it with the corresponding + # arguments. ++ callable_func = locals()[function_name] + function_response = callable_func(**function_args) ++ print("->", function_response) ++ # Once we have a response from the function and its arguments, we can append a new message to the chat + # history. 
Notice how we are telling to the model that this chat message came from a tool: ++ messages.append( + ToolMessage( + tool_call_id=tool_call_id, + content=json.dumps(function_response) + ) + ) +``` ++View the response from the model: +++```python +response = client.complete( + messages=messages, + tools=tools, +) +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Cohere Command chat models ++The Cohere Command chat models include the following models: ++# [Cohere Command R+](#tab/cohere-command-r-plus) ++Command R+ is a generative large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++We recommend using Command R+ for those workflows that lean on complex retrieval augmented generation (RAG) functionality and multi-step tool use (agents). +++The following models are available: ++* [Cohere-command-r-plus](https://aka.ms/azureai/landing/Cohere-command-r-plus) +++# [Cohere Command R](#tab/cohere-command-r) ++Command R is a large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. 
After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++Command R is great for simpler retrieval augmented generation (RAG) and single-step tool use tasks. It's also great for use in applications where price is a major consideration. +++The following models are available: ++* [Cohere-command-r](https://aka.ms/azureai/landing/Cohere-command-r) ++ -# How to deploy Cohere Command models with Azure AI Studio +> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Cohere Command chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Cohere Command chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Cohere Command chat models. 
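+The client-creation example in the next section authenticates with a key. If your deployment is configured for Microsoft Entra ID authentication instead, a minimal sketch of the setup looks like the following. It assumes the `@azure/identity` package is installed and that your endpoint accepts Microsoft Entra ID tokens; adjust it to your environment.
+
+```javascript
+import ModelClient from "@azure-rest/ai-inference";
+import { DefaultAzureCredential } from "@azure/identity";
+
+// Sketch only: token-based authentication instead of a key.
+// DefaultAzureCredential resolves credentials from the environment,
+// for example a signed-in Azure CLI session or a managed identity.
+const client = new ModelClient(
+  process.env.AZURE_INFERENCE_ENDPOINT,
+  new DefaultAzureCredential()
+);
+```
+
+Key-based and token-based clients expose the same `client.path(...)` surface, so the rest of the examples in this section work unchanged.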
++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: Cohere-command-r-plus +Model type: chat-completions +Model provider name: Cohere +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Cohere-command-r-plus +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" 
}, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. ++You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? ""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + presence_penalty: "0.1", + frequency_penalty: "0.8", + max_tokens: 2048, + stop: ["<|endoftext|>"], + temperature: 0, + top_p: 1, + response_format: { type: "text" }, + } +}); +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Cohere Command chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant that always generate responses in JSON format, using." + + " the following format: { \"answer\": \"response\" }." }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + response_format: { type: "json_object" } + } +}); +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. 
+++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + headers: { + "extra-parameters": "pass-through" + }, + body: { + messages: messages, + logprobs: true + } +}); +``` ++### Use tools ++Cohere Command chat models support the use of tools, which can be a powerful resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look up flight information for flights between two different cities. +++```javascript +const flight_info = { + name: "get_flight_info", + description: "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + parameters: { + type: "object", + properties: { + origin_city: { + type: "string", + description: "The name of the city where the flight originates", + }, + destination_city: { + type: "string", + description: "The flight destination city", + }, + }, + required: ["origin_city", "destination_city"], + }, +} ++const tools = [ + { + type: "function", + function: flight_info, + }, +]; +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```javascript +function get_flight_info(loc_origin, loc_destination) { + return { + info: "There are no flights available from " + loc_origin + " to " + loc_destination + ". You should take a train, especially if it helps to reduce CO2 emissions." + } +} +``` ++> [!NOTE] +> Cohere-command-r-plus and Cohere-command-r require a tool's response to be valid JSON content formatted as a string. When constructing messages of type *Tool*, ensure the response is a valid JSON string. ++Prompt the model to book flights with the help of this function: +++```javascript +var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + tools: tools, + tool_choice: "auto" + } +}); +``` ++You can inspect the response to find out whether a tool needs to be called. Inspect the finish reason to determine whether the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```javascript +const response_message = response.body.choices[0].message; +const tool_calls = response_message.tool_calls; ++console.log("Finish reason: " + response.body.choices[0].finish_reason); +console.log("Tool call: " + tool_calls); +``` ++To continue, append this message to the chat history: +++```javascript +messages.push(response_message); +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.
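+For reference, each entry in `tool_calls` has roughly the following shape. This is an illustrative sketch only (the identifier and argument values are made up), but it shows that `function.arguments` arrives as a JSON-encoded string that has to be parsed before use.
+
+```javascript
+// Illustrative sketch of a single tool call returned by the model; values are made up.
+const example_tool_call = {
+  id: "call_abc123",            // hypothetical identifier for this tool call
+  type: "function",             // the tool type demonstrated in this article
+  function: {
+    name: "get_flight_info",
+    // arguments is a JSON string, not an object, so parse it before use.
+    arguments: "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}"
+  }
+};
+```
+
+The handler below relies on exactly these fields: `id` to correlate the tool response, `function.name` to locate the function to call, and `function.arguments` to build its parameters.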
+++```javascript +function applyToolCall({ function: call, id }) { + // Get the tool details: + const tool_params = JSON.parse(call.arguments); + console.log("Calling function " + call.name + " with arguments " + tool_params); ++ // Call the function defined above using `window`, which returns the list of all functions + // available in the scope as a dictionary. Notice that this is just done as a simple way to get + // the function callable from its string name. Then we can call it with the corresponding + // arguments. + const function_response = tool_params.map(window[call.name]); + console.log("-> " + function_response); ++ return function_response +} ++for (const tool_call of tool_calls) { + var tool_response = tool_call.apply(applyToolCall); ++ messages.push( + { + role: "tool", + tool_call_id: tool_call.id, + content: tool_response + } + ); +} +``` ++View the response from the model: +++```javascript +var result = await client.path("/chat/completions").post({ + body: { + messages: messages, + tools: tools, + } +}); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```javascript +try { + var messages = [ + { role: "system", content: "You are an AI assistant that helps people find information." }, + { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." }, + ]; ++ var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Cohere Command chat models ++The Cohere Command chat models include the following models: ++# [Cohere Command R+](#tab/cohere-command-r-plus) ++Command R+ is a generative large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. 
+* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++We recommend using Command R+ for those workflows that lean on complex retrieval augmented generation (RAG) functionality and multi-step tool use (agents). +++The following models are available: ++* [Cohere-command-r-plus](https://aka.ms/azureai/landing/Cohere-command-r-plus) +++# [Cohere Command R](#tab/cohere-command-r) ++Command R is a large language model optimized for various use cases, including reasoning, summarization, and question answering. ++* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. ++Command R is great for simpler retrieval augmented generation (RAG) and single-step tool use tasks. It's also great for use in applications where price is a major consideration. +++The following models are available: ++* [Cohere-command-r](https://aka.ms/azureai/landing/Cohere-command-r) +++++> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Cohere Command chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Cohere Command chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. 
The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Cohere Command chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: Cohere-command-r-plus +Model type: chat-completions +Model provider name: Cohere +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. 
++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Cohere-command-r-plus +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. +You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -In this article, you learn how to use Azure AI Studio to deploy the Cohere Command models as serverless APIs with pay-as-you-go token-based billing. -Cohere offers two Command models in [Azure AI Studio](https://ai.azure.com). These models are available as serverless APIs with pay-as-you-go token-based billing. You can browse the Cohere family of models in the [model catalog](model-catalog.md) by filtering on the Cohere collection. +```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.") + }, + MaxTokens=4096 + }; -## Cohere Command models + StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); +} +``` -In this section, you learn about the two Cohere Command models that are available in the model catalog: +To stream completions, use `CompleteStreamingAsync` method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method. -* Cohere Command R -* Cohere Command R+ +To visualize the output, define an asynchronous method to print the stream in the console. 
-You can browse the Cohere family of models in the [Model Catalog](model-catalog-overview.md) by filtering on the Cohere collection. +```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) + { + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); + } + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } +} +``` -- **Model Architecture:** Both Command R and Command R+ are auto-regressive language models that use an optimized transformer architecture. After pretraining, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety.+You can visualize how streaming generates content: -- **Languages covered:** The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic. - Pre-training data additionally included the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, Persian. +```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` -- **Context length:** Command R and Command R+ support a context length of 128K.+#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` -- **Input:** Models input text only.+If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). -- **Output:** Models generate text only.+#### Create JSON outputs -## Deploy as a serverless API +Cohere Command chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. -Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription. 
-The previously mentioned Cohere models can be deployed as a service with pay-as-you-go billing and are offered by Cohere through the Microsoft Azure Marketplace. Cohere can change or update the terms of use and pricing of these models. +```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage( + "You are a helpful assistant that always generate responses in JSON format, " + + "using. the following format: { \"answer\": \"response\" }." + ), + new ChatRequestUserMessage( + "How many languages are in the world?" + ) + }, + ResponseFormat = new ChatCompletionsResponseFormatJSON() +}; -### Prerequisites +response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [Azure AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for Cohere Command is only available with hubs created in these regions:+### Pass extra parameters to the model - * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 - * Sweden Central - - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). +The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. -- An [AI Studio project](../how-to/create-projects.md) in Azure AI Studio.-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __Azure AI Developer role__ on the resource group. For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md).+Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. -### Create a new deployment +```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; -The following steps demonstrate the deployment of Cohere Command R, but you can use the same steps to deploy Cohere Command R+ by replacing the model name. +response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` -To create a deployment: +### Use tools -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Select **Model catalog** from the left sidebar. -1. Search for *Cohere*. -1. 
Select **Cohere-command-r** to open the Model Details page. +Cohere Command chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. - :::image type="content" source="../media/deploy-monitor/cohere-command/command-r-deploy-directly-from-catalog.png" alt-text="A screenshot showing how to access the model details page by going through the model catalog." lightbox="../media/deploy-monitor/cohere-command/command-r-deploy-directly-from-catalog.png"::: +The following code example creates a tool definition that is able to look from flight information from two different cities. -1. Select **Deploy** to open a serverless API deployment window for the model. -1. Alternatively, you can initiate a deployment by starting from your project in AI Studio. - 1. From the left sidebar of your project, select **Components** > **Deployments**. - 1. Select **+ Create deployment**. - 1. Search for and select **Cohere-command-r**. to open the Model Details page. - - :::image type="content" source="../media/deploy-monitor/cohere-command/command-r-deploy-start-from-project.png" alt-text="A screenshot showing how to access the model details page by going through the Deployments page in your project." lightbox="../media/deploy-monitor/cohere-command/command-r-deploy-start-from-project.png"::: +```csharp +FunctionDefinition flightInfoFunction = new FunctionDefinition("getFlightInfo") +{ + Description = "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + Parameters = BinaryData.FromObjectAsJson(new + { + Type = "object", + Properties = new + { + origin_city = new + { + Type = "string", + Description = "The name of the city where the flight originates" + }, + destination_city = new + { + Type = "string", + Description = "The flight destination city" + } + } + }, + new JsonSerializerOptions() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase } + ) +}; - 1. Select **Confirm** to open a serverless API deployment window for the model. +ChatCompletionsFunctionToolDefinition getFlightTool = new ChatCompletionsFunctionToolDefinition(flightInfoFunction); +``` - :::image type="content" source="../media/deploy-monitor/cohere-command/command-r-deploy-pay-as-you-go.png" alt-text="A screenshot showing how to deploy a model as a serverless API." lightbox="../media/deploy-monitor/cohere-command/command-r-deploy-pay-as-you-go.png"::: +In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. -1. Select the project in which you want to deploy your model. To deploy the model your project must be in the *EastUS2* or *Sweden Central* region. -1. In the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. -1. Select the **Pricing and terms** tab to learn about pricing for the selected model. -1. Select the **Subscribe and Deploy** button. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering. This step requires that your account has the **Azure AI Developer role** permissions on the resource group, as listed in the prerequisites. 
Each project has its own subscription to the particular Azure Marketplace offering of the model, which allows you to control and monitor spending. Currently, you can have only one deployment for each model within a project. -1. Once you subscribe the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. If this scenario applies to you, there's a **Continue to deploy** option to select. - :::image type="content" source="../media/deploy-monitor/cohere-command/command-r-existing-subscription.png" alt-text="A screenshot showing a project that is already subscribed to the offering." lightbox="../media/deploy-monitor/cohere-command/command-r-existing-subscription.png"::: +```csharp +static string getFlightInfo(string loc_origin, string loc_destination) +{ + return JsonSerializer.Serialize(new + { + info = $"There are no flights available from {loc_origin} to {loc_destination}. You " + + "should take a train, specially if it helps to reduce CO2 emissions." + }); +} +``` -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. +> [!NOTE] +> Cohere-command-r-plus and Cohere-command-r require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type *Tool*, ensure the response is a valid JSON string. - :::image type="content" source="../media/deploy-monitor/cohere-command/command-r-deployment-name.png" alt-text="A screenshot showing how to indicate the name of the deployment you want to create." lightbox="../media/deploy-monitor/cohere-command/command-r-deployment-name.png"::: +Prompt the model to book flights with the help of this function: -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Select **Open in playground** to start interacting with the model. -1. Return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**. For more information on using the APIs, see the [reference](#reference-for-cohere-models-deployed-as-a-service) section. -1. You can always find the endpoint's details, URL, and access keys by navigating to your **Project overview** page. Then, from the left sidebar of your project, select **Components** > **Deployments**. -To learn about billing for the Cohere models deployed as a serverless API with pay-as-you-go token-based billing, see [Cost and quota considerations for models deployed as a serverless API](#cost-and-quota-considerations-for-models-deployed-as-a-serverless-api). +```csharp +var chatHistory = new List<ChatRequestMessage>(){ + new ChatRequestSystemMessage( + "You are a helpful assistant that help users to find information about traveling, " + + "how to get to places and the different transportations options. You care about the" + + "environment and you always have that in mind when answering inqueries." + ), + new ChatRequestUserMessage("When is the next flight from Miami to Seattle?") + }; -### Consume the Cohere models as a service +requestOptions = new ChatCompletionsOptions(chatHistory); +requestOptions.Tools.Add(getFlightTool); +requestOptions.ToolChoice = ChatCompletionsToolChoice.Auto; -These models can be consumed using the chat API. +response = client.Complete(requestOptions); +``` -1. From your **Project overview** page, go to the left sidebar and select **Components** > **Deployments**. 
+You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. -1. Find and select the deployment you created. -1. Copy the **Target** URL and the **Key** value. +```csharp +var responseMenssage = response.Value.Choices[0].Message; +var toolsCall = responseMenssage.ToolCalls; -2. Cohere exposes two routes for inference with the Command R and Command R+ models. The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` and the native [Cohere API](#cohere-chat-api). +Console.WriteLine($"Finish reason: {response.Value.Choices[0].FinishReason}"); +Console.WriteLine($"Tool call: {toolsCall[0].Id}"); +``` -For more information on using the APIs, see the [reference](#reference-for-cohere-models-deployed-as-a-service) section. +To continue, append this message to the chat history: -## Reference for Cohere models deployed as a service -Cohere Command R and Command R+ models accept both the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` and the native [Cohere Chat API](#cohere-chat-api) on `/v1/chat`. +```csharp +requestOptions.Messages.Add(new ChatRequestAssistantMessage(response.Value.Choices[0].Message)); +``` -### Azure AI Model Inference API +Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. -The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) schema can be found in the [reference for Chat Completions](../reference/reference-model-inference-chat-completions.md) article and an [OpenAPI specification can be obtained from the endpoint itself](../reference/reference-model-inference-api.md?tabs=rest#getting-started). -### Cohere Chat API +```csharp +foreach (ChatCompletionsToolCall tool in toolsCall) +{ + if (tool is ChatCompletionsFunctionToolCall functionTool) + { + // Get the tool details: + string callId = functionTool.Id; + string toolName = functionTool.Name; + string toolArgumentsString = functionTool.Arguments; + Dictionary<string, object> toolArguments = JsonSerializer.Deserialize<Dictionary<string, object>>(toolArgumentsString); ++ // Here you have to call the function defined. In this particular example we use + // reflection to find the method we definied before in an static class called + // `ChatCompletionsExamples`. Using reflection allows us to call a function + // by string name. Notice that this is just done for demonstration purposes as a + // simple way to get the function callable from its string name. Then we can call + // it with the corresponding arguments. ++ var flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static; + string toolResponse = (string)typeof(ChatCompletionsExamples).GetMethod(toolName, flags).Invoke(null, toolArguments.Values.Cast<object>().ToArray()); ++ Console.WriteLine("->", toolResponse); + requestOptions.Messages.Add(new ChatRequestToolMessage(toolResponse, callId)); + } + else + throw new Exception("Unsupported tool type"); +} +``` -The following contains details about Cohere Chat API. 
+View the response from the model: -#### Request +```csharp +response = client.Complete(requestOptions); ```- POST /v1/chat HTTP/1.1 - Host: <DEPLOYMENT_URI> - Authorization: Bearer <TOKEN> - Content-type: application/json ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```csharp +try +{ + requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are an AI assistant that helps people find information."), + new ChatRequestUserMessage( + "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + ), + }, + }; ++ response = client.Complete(requestOptions); + Console.WriteLine(response.Value.Choices[0].Message.Content); +} +catch (RequestFailedException ex) +{ + if (ex.ErrorCode == "content_filter") + { + Console.WriteLine($"Your query has trigger Azure Content Safeaty: {ex.Message}"); + } + else + { + throw; + } +} ``` -#### v1/chat request schema +> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Cohere Command chat models ++The Cohere Command chat models include the following models: ++# [Cohere Command R+](#tab/cohere-command-r-plus) -Cohere Command R and Command R+ accept the following parameters for a `v1/chat` response inference call: +Command R+ is a generative large language model optimized for various use cases, including reasoning, summarization, and question answering. -|Key |Type |Default |Description | -||||| -|`message` |`string` |Required |Text input for the model to respond to. | -|`chat_history` |`array of messages` |`None` |A list of previous messages between the user and the model, meant to give the model conversational context for responding to the user's message. | -|`documents` |`array` |`None ` |A list of relevant documents that the model can cite to generate a more accurate reply. Each document is a string-string dictionary. Keys and values from each document are serialized to a string and passed to the model. The resulting generation includes citations that reference some of these documents. Some suggested keys are "text", "author", and "date". For better generation quality, it's recommended to keep the total word count of the strings in the dictionary to under 300 words. An `_excludes` field (array of strings) can be optionally supplied to omit some key-value pairs from being shown to the model. The omitted fields still show up in the citation object. The "_excludes" field aren't passed to the model. See [Document Mode](https://docs.cohere.com/docs/retrieval-augmented-generation-rag#document-mode) guide from Cohere docs. 
| -|`search_queries_only` |`boolean` |`false` |When `true`, the response only contains a list of generated search queries, but no search takes place, and no reply from the model to the user's `message` is generated.| -|`stream` |`boolean` |`false` |When `true`, the response is a JSON stream of events. The final event contains the complete response, and has an `event_type` of `"stream-end"`. Streaming is beneficial for user interfaces that render the contents of the response piece by piece, as it gets generated.| -|`max_tokens` |`integer` |None |The maximum number of tokens the model generates as part of the response. Note: Setting a low value might result in incomplete generations. If not specified, generates tokens until end of sequence.| -|`temperature` |`float` |`0.3` |Use a lower value to decrease randomness in the response. Randomness can be further maximized by increasing the value of the `p` parameter. Min value is 0, and max is 2. | -|`p` |`float` |`0.75` |Use a lower value to ignore less probable options. Set to 0 or 1.0 to disable. If both p and k are enabled, p acts after k. min value of 0.01, max value of 0.99.| -|`k` |`float` |`0` |Specify the number of token choices the model uses to generate the next token. If both p and k are enabled, p acts after k. Min value is 0, max value is 500.| -|`prompt_truncation` |`enum string` |`OFF` |Accepts `AUTO_PRESERVE_ORDER`, `AUTO`, `OFF`. Dictates how the prompt is constructed. With `prompt_truncation` set to `AUTO_PRESERVE_ORDER`, some elements from `chat_history` and `documents` are dropped to construct a prompt that fits within the model's context length limit. During this process, the order of the documents and chat history are preserved. With `prompt_truncation` set to "OFF", no elements are dropped.| -|`stop_sequences` |`array of strings` |`None` |The generated text is cut at the end of the earliest occurrence of a stop sequence. The sequence is included the text. | -|`frequency_penalty` |`float` |`0` |Used to reduce repetitiveness of generated tokens. The higher the value, the stronger a penalty is applied to previously present tokens, proportional to how many times they have already appeared in the prompt or prior generation. Min value of 0.0, max value of 1.0.| -|`presence_penalty` |`float` |`0` |Used to reduce repetitiveness of generated tokens. Similar to `frequency_penalty`, except that this penalty is applied equally to all tokens that have already appeared, regardless of their exact frequencies. Min value of 0.0, max value of 1.0.| -|`seed` |`integer` |`None` |If specified, the backend makes a best effort to sample tokens deterministically, such that repeated requests with the same seed and parameters should return the same result. However, determinism can't be guaranteed.| -|`return_prompt` |`boolean ` |`false ` |Returns the full prompt that was sent to the model when `true`. | -|`tools` |`array of objects` |`None` |_Field is subject to changes._ A list of available tools (functions) that the model might suggest invoking before producing a text response. When `tools` is passed (without `tool_results`), the `text` field in the response is `""` and the `tool_calls` field in the response is populated with a list of tool calls that need to be made. If no calls need to be made, the `tool_calls` array is empty.| -|`tool_results` |`array of objects` |`None` |_Field is subject to changes._ A list of results from invoking tools recommended by the model in the previous chat turn. 
Results are used to produce a text response and is referenced in citations. When using `tool_results`, `tools` must be passed as well. Each tool_result contains information about how it was invoked, and a list of outputs in the form of dictionaries. Cohere's unique fine-grained citation logic requires the output to be a list. In case the output is just one item, for example, `{"status": 200}`, still wrap it inside a list. | +* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. +* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic. +* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. +* **Context length:** Command R and Command R+ support a context length of 128 K. -The `chat_history` object requires the following fields: +We recommend using Command R+ for those workflows that lean on complex retrieval augmented generation (RAG) functionality and multi-step tool use (agents). -|Key |Type |Description | -|||| -|`role` |`enum string` |Takes `USER`, `SYSTEM`, or `CHATBOT`. | -|`message` |`string` |Text contents of the message. | -The `documents` object has the following optional fields: +The following models are available: -|Key |Type |Default| Description | -||||| -|`id` |`string` |`None` |Can be supplied to identify the document in the citations. This field isn't passed to the model. | -|`_excludes` |`array of strings` |`None`| Can be optionally supplied to omit some key-value pairs from being shown to the model. The omitted fields still show up in the citation object. The `_excludes` field isn't passed to the model. | +* [Cohere-command-r-plus](https://aka.ms/azureai/landing/Cohere-command-r-plus) -#### v1/chat response schema -Response fields are fully documented on [Cohere's Chat API reference](https://docs.cohere.com/reference/chat). The response object always contains: +# [Cohere Command R](#tab/cohere-command-r) -|Key |Type |Description | -|||| -|`response_id` |`string` |Unique identifier for chat completion. | -|`generation_id` |`string` |Unique identifier for chat completion, used with Feedback endpoint on Cohere's platform. | -|`text` |`string` |Model's response to chat message input. | -|`finish_reason` |`enum string` |Why the generation was completed. Can be any of the following values: `COMPLETE`, `ERROR`, `ERROR_TOXIC`, `ERROR_LIMIT`, `USER_CANCEL` or `MAX_TOKENS` | -|`token_count` |`integer` |Count of tokens used. | -|`meta` |`string` |API usage data, including current version and billable tokens. | +Command R is a large language model optimized for various use cases, including reasoning, summarization, and question answering. -<br/> +* **Model Architecture**: Both Command R and Command R+ are autoregressive language models that use an optimized transformer architecture. After pre-training, the models use supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. 
+* **Languages covered**: The models are optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, simplified Chinese, and Arabic.
+* **Pre-training data also included the following 13 languages:** Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
+* **Context length:** Command R and Command R+ support a context length of 128K.
-#### Documents
-If `documents` are specified in the request, there are two other fields in the response:
+Command R is great for simpler retrieval augmented generation (RAG) and single-step tool use tasks. It's also great for use in applications where price is a major consideration.
-|Key |Type |Description |
-||||
-|`documents` |`array of objects` |Lists the documents that were cited in the response. |
-|`citations` |`array of objects` |Specifies which part of the answer was found in a given document. |
-`citations` is an array of objects with the following required fields:
+The following models are available:
-|Key |Type |Description |
-||||
-|`start` |`integer` |The index of text that the citation starts at, counting from zero. For example, a generation of `Hello, world!` with a citation on `world` would have a start value of `7`. This is because the citation starts at `w`, which is the seventh character. |
-|`end` |`integer` |The index of text that the citation ends after, counting from zero. For example, a generation of `Hello, world!` with a citation on `world` would have an end value of `11`. This is because the citation ends after `d`, which is the eleventh character. |
-|`text` |`string` |The text of the citation. For example, a generation of `Hello, world!` with a citation of `world` would have a text value of `world`. |
-|`document_ids` |`array of strings` |Identifiers of documents cited by this section of the generated reply. |
+* [Cohere-command-r](https://aka.ms/azureai/landing/Cohere-command-r)
-#### Tools
-If `tools` are specified and invoked by the model, there's another field in the response:
-|Key |Type |Description |
-||||
-|`tool_calls` |`array of objects` |Contains the tool calls generated by the model. Use it to invoke your tools. |
+++> [!TIP]
+> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check the [Cohere documentation](https://docs.cohere.com/reference/about) or see the [inference examples](#more-inference-examples) section for code examples.
++## Prerequisites
++To use Cohere Command chat models with Azure AI Studio, you need the following prerequisites:
++### A model deployment
++**Deployment to serverless APIs**
++Cohere Command chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.
++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). 
-`tool_calls` is an array of objects with the following fields: +> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) -|Key |Type |Description | -|||| -|`name` |`string` |Name of the tool to call. | -|`parameters` |`object` |The name and value of the parameters to use when invoking a tool. | +### A REST client -#### Search_queries_only -If `search_queries_only=TRUE` is specified in the request, there are two other fields in the response: +Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: -|Key |Type |Description | -|||| -|`is_search_required` |`boolean` |Instructs the model to generate a search query. | -|`search_queries` |`array of objects` |Object that contains a list of search queries. | +* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name`` is your unique model deployment host name and `your-azure-region`` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. -`search_queries` is an array of objects with the following fields: +## Work with chat completions -|Key |Type |Description | -|||| -|`text` |`string` |The text of the search query. | -|`generation_id` |`string` |Unique identifier for the generated search query. Useful for submitting feedback. | +In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. -#### Examples +> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Cohere Command chat models. -##### Chat - Completions -The following example is a sample request call to get chat completions from the Cohere Command model. Use when generating a chat completion. +### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` ++The response is as follows: -Request: ```json- { - "chat_history": [ - {"role":"USER", "message": "What is an interesting new role in AI if I don't have an ML background"}, - {"role":"CHATBOT", "message": "You could explore being a prompt engineer!"} - ], - "message": "What are some skills I should have" - } +{ + "model_name": "Cohere-command-r-plus", + "model_type": "chat-completions", + "model_provider_name": "Cohere" +} ``` -Response: +### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ```json- { - "response_id": "09613f65-c603-41e6-94b3-a7484571ac30", - "text": "Writing skills are very important for prompt engineering. 
Some other key skills are:\n- Creativity\n- Awareness of biases\n- Knowledge of how NLP models work\n- Debugging skills\n\nYou can also have some fun with it and try to create some interesting, innovative prompts to train an AI model that can then be used to create various applications.", - "generation_id": "6d31a57f-4d94-4b05-874d-36d0d78c9549", - "finish_reason": "COMPLETE", - "token_count": { - "prompt_tokens": 99, - "response_tokens": 70, - "total_tokens": 169, - "billed_tokens": 151 +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." },- "meta": { - "api_version": { - "version": "1" + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null },- "billed_units": { - "input_tokens": 81, - "output_tokens": 70 - } + "finish_reason": "stop", + "logprobs": null }+ ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 }+} ``` -##### Chat - Grounded generation and RAG capabilities +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content -Command R and Command R+ are trained for RAG via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template. We introduce that prompt template via the `documents` parameter. The document snippets should be chunks, rather than long documents, typically around 100-400 words per chunk. Document snippets consist of key-value pairs. The keys should be short descriptive strings. The values can be text or semi-structured. +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -Request: ```json- { - "message": "Where do the tallest penguins live?", - "documents": [ +{ + "messages": [ {- "title": "Tall penguins", - "snippet": "Emperor penguins are the tallest." + "role": "system", + "content": "You are a helpful assistant." }, {- "title": "Penguin habitats", - "snippet": "Emperor penguins only live in Antarctica." + "role": "user", + "content": "How many languages are in the world?" 
+ } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` ++You can visualize how streaming generates content: +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null } ]- } +} ``` -Response: +The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. + ```json- { - "response_id": "d7e72d2e-06c0-469f-8072-a3aa6bd2e3b2", - "text": "Emperor penguins are the tallest species of penguin and they live in Antarctica.", - "generation_id": "b5685d8d-00b4-48f1-b32f-baebabb563d8", - "finish_reason": "COMPLETE", - "token_count": { - "prompt_tokens": 615, - "response_tokens": 15, - "total_tokens": 630, - "billed_tokens": 22 - }, - "meta": { - "api_version": { - "version": "1" +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "delta": { + "content": "" },- "billed_units": { - "input_tokens": 7, - "output_tokens": 15 - } + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." },- "citations": [ - { - "start": 0, - "end": 16, - "text": "Emperor penguins", - "document_ids": [ - "doc_0" - ] - }, - { - "start": 69, - "end": 80, - "text": "Antarctica.", - "document_ids": [ - "doc_1" - ] - } - ], - "documents": [ - { - "id": "doc_0", - "snippet": "Emperor penguins are the tallest.", - "title": "Tall penguins" + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null },- { - "id": "doc_1", - "snippet": "Emperor penguins only live in Antarctica.", - "title": "Penguin habitats" - } - ] + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 }+} ``` -##### Chat - Tool Use +If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). 
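As a quick illustration of issuing one of these requests from code rather than a raw HTTP client, the following is a minimal sketch that posts the parameterized payload above to the endpoint's `/chat/completions` route with Python's `requests` library. It assumes the endpoint URL and key are stored in the `AZURE_INFERENCE_ENDPOINT` and `AZURE_INFERENCE_CREDENTIAL` environment variables and that the `requests` package is installed; adapt it to your own deployment.

```python
import os

import requests  # assumes the requests package is installed

# Endpoint and key are assumed to be stored in environment variables,
# matching the convention used elsewhere in this article.
endpoint = os.environ["AZURE_INFERENCE_ENDPOINT"]
key = os.environ["AZURE_INFERENCE_CREDENTIAL"]

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many languages are in the world?"},
    ],
    "presence_penalty": 0.1,
    "frequency_penalty": 0.8,
    "max_tokens": 2048,
    "stop": ["<|endoftext|>"],
    "temperature": 0,
    "top_p": 1,
    "response_format": {"type": "text"},
}

# Post the request to the chat completions route of the serverless endpoint.
response = requests.post(
    f"{endpoint}/chat/completions",
    headers={
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    },
    json=payload,
)
response.raise_for_status()

# Print the assistant's reply from the first choice.
print(response.json()["choices"][0]["message"]["content"])
```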
-If invoking tools or generating a response based on tool results, use the following parameters. +#### Create JSON outputs ++Cohere Command chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. -Request: ```json- { - "message":"I'd like 4 apples and a fish please", - "tools":[ - { - "name":"personal_shopper", - "description":"Returns items and requested volumes to purchase", - "parameter_definitions":{ - "item":{ - "description":"the item requested to be purchased, in all caps eg. Bananas should be BANANAS", - "type": "str", - "required": true - }, - "quantity":{ - "description": "how many of the items should be purchased", - "type": "int", - "required": true - } - } - } - ], - - "tool_results": [ +{ + "messages": [ {- "call": { - "name": "personal_shopper", - "parameters": { - "item": "Apples", - "quantity": 4 - }, - "generation_id": "cb3a6e8b-6448-4642-b3cd-b1cc08f7360d" - }, - "outputs": [ - { - "response": "Sale completed" - } - ] + "role": "system", + "content": "You are a helpful assistant that always generate responses in JSON format, using the following format: { \"answer\": \"response\" }" }, {- "call": { - "name": "personal_shopper", - "parameters": { - "item": "Fish", - "quantity": 1 + "role": "user", + "content": "How many languages are in the world?" + } + ], + "response_format": { "type": "json_object" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718727522, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "{\"answer\": \"There are approximately 7,117 living languages in the world today, according to the latest estimates. However, this number can vary as some languages become extinct and others are newly discovered or classified.\"}", + "tool_calls": null },- "generation_id": "cb3a6e8b-6448-4642-b3cd-b1cc08f7360d" - }, - "outputs": [ - { - "response": "Sale not completed" - } - ] + "finish_reason": "stop", + "logprobs": null }- ] + ], + "usage": { + "prompt_tokens": 39, + "total_tokens": 87, + "completion_tokens": 48 }+} +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. ++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through ``` -Response: ```json- { - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "text": "I've completed the sale for 4 apples. 
\n\nHowever, there was an error regarding the fish; it appears that there is currently no stock.", - "generation_id": "f567e78c-9172-4cfa-beba-ee3c330f781a", - "chat_history": [ - { - "message": "I'd like 4 apples and a fish please", - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "generation_id": "a4c5da95-b370-47a4-9ad3-cbf304749c04", - "role": "User" - }, - { - "message": "I've completed the sale for 4 apples. \n\nHowever, there was an error regarding the fish; it appears that there is currently no stock.", - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "generation_id": "f567e78c-9172-4cfa-beba-ee3c330f781a", - "role": "Chatbot" - } - ], - "finish_reason": "COMPLETE", - "token_count": { - "prompt_tokens": 644, - "response_tokens": 31, - "total_tokens": 675, - "billed_tokens": 41 - }, - "meta": { - "api_version": { - "version": "1" - }, - "billed_units": { - "input_tokens": 10, - "output_tokens": 31 - } +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." },- "citations": [ - { - "start": 5, - "end": 23, - "text": "completed the sale", - "document_ids": [ - "" - ] + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` ++### Use tools ++Cohere Command chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```json +{ + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + "parameters": { + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } },- { - "start": 113, - "end": 132, - "text": "currently no stock.", - "document_ids": [ - "" - ] - } - ], - "documents": [ - { - "response": "Sale completed" - } - ] + "required": [ + "origin_city", + "destination_city" + ] + } }+} ``` -Once you run your function and received tool outputs, you can pass them back to the model to generate a response for the user. +In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. ++> [!NOTE] +> Cohere-command-r-plus and Cohere-command-r require a tool's responses to be a valid JSON content formatted as a string. When constructing messages of type *Tool*, ensure the response is a valid JSON string. ++Prompt the model to book flights with the help of this function: -Request: ```json- { - "message":"I'd like 4 apples and a fish please", - "tools":[ - { - "name":"personal_shopper", - "description":"Returns items and requested volumes to purchase", - "parameter_definitions":{ - "item":{ - "description":"the item requested to be purchased, in all caps eg. 
Bananas should be BANANAS", - "type": "str", - "required": true - }, - "quantity":{ - "description": "how many of the items should be purchased", - "type": "int", - "required": true - } - } - } - ], - - "tool_results": [ +{ + "messages": [ {- "call": { - "name": "personal_shopper", - "parameters": { - "item": "Apples", - "quantity": 4 - }, - "generation_id": "cb3a6e8b-6448-4642-b3cd-b1cc08f7360d" - }, - "outputs": [ - { - "response": "Sale completed" - } - ] + "role": "system", + "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries" }, {- "call": { - "name": "personal_shopper", - "parameters": { - "item": "Fish", - "quantity": 1 - }, - "generation_id": "cb3a6e8b-6448-4642-b3cd-b1cc08f7360d" - }, - "outputs": [ - { - "response": "Sale not completed" + "role": "user", + "content": "When is the next flight from Miami to Seattle?" + } + ], + "tool_choice": "auto", + "tools": [ + { + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + "parameters": { + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } + }, + "required": [ + "origin_city", + "destination_city" + ] + } }- ] } ]- } +} ``` -Response: +You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. + ```json- { - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "text": "I've completed the sale for 4 apples. \n\nHowever, there was an error regarding the fish; it appears that there is currently no stock.", - "generation_id": "f567e78c-9172-4cfa-beba-ee3c330f781a", - "chat_history": [ - { - "message": "I'd like 4 apples and a fish please", - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "generation_id": "a4c5da95-b370-47a4-9ad3-cbf304749c04", - "role": "User" +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726007, + "model": "Cohere-command-r-plus", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "", + "tool_calls": [ + { + "id": "abc0dF1gh", + "type": "function", + "function": { + "name": "get_flight_info", + "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}", + "call_id": null + } + } + ] },- { - "message": "I've completed the sale for 4 apples. 
\n\nHowever, there was an error regarding the fish; it appears that there is currently no stock.", - "response_id": "fa634da2-ccd1-4b56-8308-058a35daa100", - "generation_id": "f567e78c-9172-4cfa-beba-ee3c330f781a", - "role": "Chatbot" - } - ], - "finish_reason": "COMPLETE", - "token_count": { - "prompt_tokens": 644, - "response_tokens": 31, - "total_tokens": 675, - "billed_tokens": 41 + "finish_reason": "tool_calls", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 190, + "total_tokens": 226, + "completion_tokens": 36 + } +} +``` ++To continue, append this message to the chat history: ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. ++View the response from the model: +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries" },- "meta": { - "api_version": { - "version": "1" - }, - "billed_units": { - "input_tokens": 10, - "output_tokens": 31 - } + { + "role": "user", + "content": "When is the next flight from Miami to Seattle?" },- "citations": [ - { - "start": 5, - "end": 23, - "text": "completed the sale", - "document_ids": [ - "" - ] - }, - { - "start": 113, - "end": 132, - "text": "currently no stock.", - "document_ids": [ - "" - ] + { + "role": "assistant", + "content": "", + "tool_calls": [ + { + "id": "abc0DeFgH", + "type": "function", + "function": { + "name": "get_flight_info", + "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}", + "call_id": null + } + } + ] + }, + { + "role": "tool", + "content": "{ \"info\": \"There are no flights available from Miami to Seattle. You should take a train, specially if it helps to reduce CO2 emissions.\" }", + "tool_call_id": "abc0DeFgH" + } + ], + "tool_choice": "auto", + "tools": [ + { + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + "parameters":{ + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } + }, + "required": ["origin_city", "destination_city"] }- ], - "documents": [ - { - "response": "Sale completed" }- ] - } + } + ] +} ``` -##### Chat - Search queries -If you're building a RAG agent, you can also use Cohere's Chat API to get search queries from Command. Specify `search_queries_only=TRUE` in your request. +### Apply content safety +The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. 
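When a prompt is blocked, the service responds with an HTTP 400 error whose `code` is `content_filter`, as the example that follows shows. A minimal sketch of detecting that condition with Python's `requests` library, assuming the same `AZURE_INFERENCE_ENDPOINT` and `AZURE_INFERENCE_CREDENTIAL` environment variables used earlier, might look like this:

```python
import os

import requests  # assumes the requests package is installed

endpoint = os.environ["AZURE_INFERENCE_ENDPOINT"]
key = os.environ["AZURE_INFERENCE_CREDENTIAL"]

# Build the messages payload as shown in the request example that follows.
payload = {
    "messages": [
        {"role": "system", "content": "You are an AI assistant that helps people find information."},
        {"role": "user", "content": "<your user prompt>"},
    ]
}

response = requests.post(
    f"{endpoint}/chat/completions",
    headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    json=payload,
)

# A filtered request returns HTTP 400 with an error code of "content_filter".
error = response.json().get("error", {}) if response.status_code == 400 else {}
if error.get("code") == "content_filter":
    print("The request was filtered by Azure AI content safety. Modify the prompt and retry.")
else:
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
```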
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. -Request: ```json- { - "message": "Which lego set has the greatest number of pieces?", - "search_queries_only": true - } +{ + "messages": [ + { + "role": "system", + "content": "You are an AI assistant that helps people find information." + }, + { + "role": "user", + "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + } + ] +} ``` -Response: ```json- { - "response_id": "5e795fe5-24b7-47b4-a8bc-b58a68c7c676", - "text": "", - "finish_reason": "COMPLETE", - "meta": { - "api_version": { - "version": "1" - } - }, - "is_search_required": true, - "search_queries": [ - { - "text": "lego set with most pieces", - "generation_id": "a086696b-ad8e-4d15-92e2-1c57a3526e1c" - } - ] +{ + "error": { + "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.", + "type": null, + "param": "prompt", + "code": "content_filter", + "status": 400 }+} ``` -##### More inference examples +> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). -| **Package** | **Sample Notebook** | -|-|-| -| CLI using CURL and Python web requests - Command R | [command-r.ipynb](https://aka.ms/samples/cohere-command-r/webrequests)| -| CLI using CURL and Python web requests - Command R+ | [command-r-plus.ipynb](https://aka.ms/samples/cohere-command-r-plus/webrequests)| -| OpenAI SDK (experimental) | [openaisdk.ipynb](https://aka.ms/samples/cohere-command/openaisdk) | -| LangChain | [langchain.ipynb](https://aka.ms/samples/cohere/langchain) | -| Cohere SDK | [cohere-sdk.ipynb](https://aka.ms/samples/cohere-python-sdk) | -| LiteLLM SDK | [litellm.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/litellm.ipynb) | -##### Retrieval Augmented Generation (RAG) and tool use samples -**Description** | **Package** | **Sample Notebook** |--|---Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain|`langchain`, `langchain_cohere`|[cohere_faiss_langchain_embed.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere_faiss_langchain_embed.ipynb) -Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain|`langchain`, `langchain_cohere`|[command_faiss_langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_faiss_langchain.ipynb) -Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain|`langchain`, `langchain_cohere`|[cohere-aisearch-langchain-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-langchain-rag.ipynb) -Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK| `cohere`, `azure_search_documents`|[cohere-aisearch-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-rag.ipynb) -Command R+ tool/function calling, using LangChain|`cohere`, `langchain`, 
`langchain_cohere`|[command_tools-langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_tools-langchain.ipynb) +## More inference examples -## Cost and quotas +For more examples of how to use Cohere, see the following examples and tutorials: -### Cost and quota considerations for models deployed as a serverless API +| Description | Language | Sample | +|-|-|--| +| Web requests | Bash | [Command-R](https://aka.ms/samples/cohere-command-r/webrequests) - [Command-R+](https://aka.ms/samples/cohere-command-r-plus/webrequests) | +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | +| OpenAI SDK (experimental) | Python | [Link](https://aka.ms/samples/cohere-command/openaisdk) | +| LangChain | Python | [Link](https://aka.ms/samples/cohere/langchain) | +| Cohere SDK | Python | [Link](https://aka.ms/samples/cohere-python-sdk) | +| LiteLLM SDK | Python | [Link](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/litellm.ipynb) | -Cohere models deployed as a serverless API with pay-as-you-go billing are offered by Cohere through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. +#### Retrieval Augmented Generation (RAG) and tool use samples -Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. +| Description | Packages | Sample | +|-||--| +| Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain | `langchain`, `langchain_cohere` | [cohere_faiss_langchain_embed.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere_faiss_langchain_embed.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain |`langchain`, `langchain_cohere` | [command_faiss_langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_faiss_langchain.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain | `langchain`, `langchain_cohere` | [cohere-aisearch-langchain-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-langchain-rag.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK | `cohere`, `azure_search_documents` | [cohere-aisearch-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-rag.ipynb) | +| Command R+ tool/function calling, using LangChain | `cohere`, `langchain`, `langchain_cohere` | [command_tools-langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_tools-langchain.ipynb) | -For more information on how to track costs, see [monitor costs for models offered throughout the Azure Marketplace](./costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). -Quota is managed per deployment. 
Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. +## Cost and quota considerations for Cohere family of models deployed as serverless API endpoints -## Content filtering +Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. -Models deployed as a serverless API with pay-as-you-go billing are protected by [Azure AI Content Safety](../../ai-services/content-safety/overview.md). With Azure AI content safety, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about [content filtering here](../concepts/content-filtering.md). +Cohere models deployed as a serverless API are offered by Cohere through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. ++Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. ++For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). ## Related content -- [What is Azure AI Studio?](../what-is-ai-studio.md)-- [Azure AI FAQ article](../faq.yml)-- [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md)++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Cohere Embed | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-cohere-embed.md | Title: How to deploy Cohere Embed models with Azure AI Studio + Title: How to use Cohere Embed V3 models with Azure AI Studio -description: Learn how to deploy Cohere Embed models with Azure AI Studio. -+description: Learn how to use Cohere Embed V3 models with Azure AI Studio. + Previously updated : 5/21/2024 Last updated : 08/08/2024 +reviewer: shubhirajMsft -++zone_pivot_groups: azure-ai-model-catalog-samples-embeddings -# How to deploy Cohere Embed models with Azure AI Studio +# How to use Cohere Embed V3 models with Azure AI Studio +In this article, you learn about Cohere Embed V3 models and how to use them with Azure AI Studio. +The Cohere family of models includes various models optimized for different use cases, including chat completions, embeddings, and rerank. Cohere models are optimized for various use cases that include reasoning, summarization, and question answering. -In this article, you learn how to use Azure AI Studio to deploy the Cohere Embed models as serverless APIs with pay-as-you-go token-based billing. -Cohere offers two Embed models in [Azure AI Studio](https://ai.azure.com). These models are available as serverless APIs with pay-as-you-go token-based billing. You can browse the Cohere family of models in the [Model Catalog](model-catalog.md) by filtering on the Cohere collection. -## Cohere Embed models -In this section, you learn about the two Cohere Embed models that are available in the model catalog: +## Cohere embedding models -* Cohere Embed v3 - English -* Cohere Embed v3 - Multilingual +The Cohere family of models for embeddings includes the following models: -You can browse the Cohere family of models in the [Model Catalog](model-catalog-overview.md) by filtering on the Cohere collection. +# [Cohere Embed v3 - English](#tab/cohere-embed-v3-english) -### Cohere Embed v3 - English -Cohere Embed English is the market's leading text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English has top performance on the HuggingFace MTEB benchmark and performs well on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes: +Cohere Embed English is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes: * Embed English has 1,024 dimensions. * Context window of the model is 512 tokens -### Cohere Embed v3 - Multilingual -Cohere Embed Multilingual is the market's leading text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed Multilingual supports 100+ languages and can be used to search within a language (for example, search with a French query on French documents) and across languages (for example, search with an English query on Chinese documents). Embed multilingual has state-of-the-art performance on multilingual benchmarks such as Miracl. 
Embed Multilingual also has the following attributes: ++# [Cohere Embed v3 - Multilingual](#tab/cohere-embed-v3-multilingual) ++Cohere Embed Multilingual is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed Multilingual supports more than 100 languages and can be used to search within a language (for example, to search with a French query on French documents) and across languages (for example, to search with an English query on Chinese documents). Embed multilingual performs well on multilingual benchmarks such as Miracl. Embed Multilingual also has the following attributes: * Embed Multilingual has 1,024 dimensions. * Context window of the model is 512 tokens -## Deploy as a serverless API -Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription. +++## Prerequisites ++To use Cohere Embed V3 models with Azure AI Studio, you need the following prerequisites: ++### A model deployment -The previously mentioned Cohere models can be deployed as a service with pay-as-you-go billing and are offered by Cohere through the Microsoft Azure Marketplace. Cohere can change or update the terms of use and pricing of these models. +**Deployment to serverless APIs** -### Prerequisites +Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for Cohere Embed is only available with hubs created in these regions:+Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). - * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 - * Sweden Central - - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). +> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) -- An [AI Studio project](../how-to/create-projects.md) in Azure AI Studio.-- Azure role-based access controls are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __Azure AI Developer role__ on the resource group. 
For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md).+### The inference package installed +You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: -### Create a new deployment +* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: -The following steps demonstrate the deployment of Cohere Embed v3 - English, but you can use the same steps to deploy Cohere Embed v3 - Multilingual by replacing the model name. +```bash +pip install azure-ai-inference +``` -To create a deployment: +Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Select **Model catalog** from the left sidebar. -1. Search for *Cohere*. -1. Select **Cohere-embed-v3-english** to open the Model Details page. +> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about). - :::image type="content" source="../media/deploy-monitor/cohere-embed/embed-english-deploy-directly-from-catalog.png" alt-text="A screenshot showing how to access the model details page by going through the model catalog." lightbox="../media/deploy-monitor/cohere-embed/embed-english-deploy-directly-from-catalog.png"::: +## Work with embeddings -1. Select **Deploy** to open a serverless API deployment window for the model. -1. Alternatively, you can initiate a deployment by starting from your project in AI Studio. +In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with an embeddings model. - 1. From the left sidebar of your project, select **Components** > **Deployments**. - 1. Select **+ Create deployment**. - 1. Search for and select **Cohere-embed-v3-english**. to open the Model Details page. +### Create a client to consume the model - :::image type="content" source="../media/deploy-monitor/cohere-embed/embed-english-deploy-start-from-project.png" alt-text="A screenshot showing how to access the model details page by going through the Deployments page in your project." lightbox="../media/deploy-monitor/cohere-embed/embed-english-deploy-start-from-project.png"::: +First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. - 1. Select **Confirm** to open a serverless API deployment window for the model. - :::image type="content" source="../media/deploy-monitor/cohere-embed/embed-english-deploy-pay-as-you-go.png" alt-text="A screenshot showing how to deploy a model with the pay-as-you-go option." 
lightbox="../media/deploy-monitor/cohere-embed/embed-english-deploy-pay-as-you-go.png"::: +```python +import os +from azure.ai.inference import EmbeddingsClient +from azure.core.credentials import AzureKeyCredential -1. Select the project in which you want to deploy your model. To deploy the model your project must be in the *EastUS2* or *Sweden Central* region. -1. In the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. -1. Select the **Pricing and terms** tab to learn about pricing for the selected model. -1. Select the **Subscribe and Deploy** button. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering. This step requires that your account has the **Azure AI Developer role** permissions on the resource group, as listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering of the model, which allows you to control and monitor spending. Currently, you can have only one deployment for each model within a project. -1. Once you subscribe the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. If this scenario applies to you, there's a **Continue to deploy** option to select. +model = EmbeddingsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` - :::image type="content" source="../media/deploy-monitor/cohere-embed/embed-english-existing-subscription.png" alt-text="A screenshot showing a project that is already subscribed to the offering." lightbox="../media/deploy-monitor/cohere-embed/embed-english-existing-subscription.png"::: +### Get the model's capabilities -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. +The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: - :::image type="content" source="../media/deploy-monitor/cohere-embed/embed-english-deployment-name.png" alt-text="A screenshot showing how to indicate the name of the deployment you want to create." lightbox="../media/deploy-monitor/cohere-embed/embed-english-deployment-name.png"::: -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Select **Open in playground** to start interacting with the model. -1. Return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**. For more information on using the APIs, see the [reference](#embed-api-reference-for-cohere-embed-models-deployed-as-a-service) section. -1. You can always find the endpoint's details, URL, and access keys by navigating to your **Project overview** page. Then, from the left sidebar of your project, select **Components** > **Deployments**. +```python +model_info = model.get_model_info() +``` -To learn about billing for the Cohere models deployed as a serverless API with pay-as-you-go token-based billing, see [Cost and quota considerations for Cohere models deployed as a service](#cost-and-quota-considerations-for-models-deployed-as-a-service). +The response is as follows: -### Consume the Cohere Embed models as a service -These models can be consumed using the embed API. 
+```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` -1. From your **Project overview** page, go to the left sidebar and select **Components** > **Deployments**. +```console +Model name: Cohere-embed-v3-english +Model type": embeddings +Model provider name": Cohere +``` -1. Find and select the deployment you created. +### Create embeddings -1. Copy the **Target** URL and the **Key** value. +Create an embedding request to see the output of the model. -1. Cohere exposes two routes for inference with the Embed v3 - English and Embed v3 - Multilingual models. `v1/embeddings` adheres to the Azure AI Generative Messages API schema, and `v1/embed` supports Cohere's native API schema. +```python +response = model.embed( + input=["The ultimate answer to the question of life"], +) +``` - For more information on using the APIs, see the [reference](#embed-api-reference-for-cohere-embed-models-deployed-as-a-service) section. +> [!TIP] +> The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings. -## Embed API reference for Cohere Embed models deployed as a service +The response is as follows, where you can see the model's usage statistics: -Cohere Embed v3 - English and Embed v3 - Multilingual accept both the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/embeddings` and the native [Cohere Embed v3 API](#cohere-embed-v3) on `/embed`. -### Azure AI Model Inference API +```python +import numpy as np -The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) schema can be found in the [reference for Embeddings](../reference/reference-model-inference-embeddings.md) article and an [OpenAPI specification can be obtained from the endpoint itself](../reference/reference-model-inference-api.md?tabs=rest#getting-started). +for embed in response.data: + print("Embeding of size:", np.asarray(embed.embedding).shape) -### Cohere Embed v3 +print("Model:", response.model) +print("Usage:", response.usage) +``` -The following contains details about Cohere Embed v3 API. +It can be useful to compute embeddings in input batches. The parameter `inputs` can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position. -#### Request +```python +response = model.embed( + input=[ + "The ultimate answer to the question of life", + "The largest planet in our solar system is Jupiter", + ], +) ```- POST /v1/embed HTTP/1.1 - Host: <DEPLOYMENT_URI> - Authorization: Bearer <TOKEN> - Content-type: application/json ++The response is as follows, where you can see the model's usage statistics: +++```python +import numpy as np ++for embed in response.data: + print("Embeding of size:", np.asarray(embed.embedding).shape) ++print("Model:", response.model) +print("Usage:", response.usage) ``` -#### v1/embed request schema +> [!TIP] +> Cohere Embed V3 models can take batches of 1024 at a time. When creating batches, make sure that you don't exceed this limit. ++#### Create different types of embeddings -Cohere Embed v3 - English and Embed v3 - Multilingual accept the following parameters for a `v1/embed` API call: +Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. 
This capability allows you to retrieve more accurate embeddings for RAG patterns. -|Key |Type |Default |Description | -||||| -|`texts` |`array of strings` |Required |An array of strings for the model to embed. Maximum number of texts per call is 96. We recommend reducing the length of each text to be under 512 tokens for optimal quality. | -|`input_type` |`enum string` |Required |Prepends special tokens to differentiate each type from one another. You shouldn't mix different types together, except when mixing types for search and retrieval. In this case, embed your corpus with the `search_document` type and embedded queries with type `search_query` type. <br/> `search_document` ΓÇô In search use-cases, use search_document when you encode documents for embeddings that you store in a vector database. <br/> `search_query` ΓÇô Use search_query when querying your vector database to find relevant documents. <br/> `classification` ΓÇô Use classification when using embeddings as an input to a text classifier. <br/> `clustering` ΓÇô Use clustering to cluster the embeddings.| -|`truncate` |`enum string` |`NONE` |`NONE` ΓÇô Returns an error when the input exceeds the maximum input token length. <br/> `START` ΓÇô Discards the start of the input. <br/> `END` ΓÇô Discards the end of the input. | -|`embedding_types` |`array of strings` |`float` |Specifies the types of embeddings you want to get back. Can be one or more of the following types. `float`, `int8`, `uint8`, `binary`, `ubinary` | +The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database: -#### v1/embed response schema -Cohere Embed v3 - English and Embed v3 - Multilingual include the following fields in the response: +```python +from azure.ai.inference.models import EmbeddingInputType -|Key |Type |Description | -|||| -|`response_type` |`enum` |The response type. Returns `embeddings_floats` when `embedding_types` isn't specified, or returns `embeddings_by_type` when `embeddings_types` is specified. | -|`id` |`integer` |An identifier for the response. | -|`embeddings` |`array` or `array of objects` |An array of embeddings, where each embedding is an array of floats with 1,024 elements. The length of the embeddings array is the same as the length of the original texts array.| -|`texts` |`array of strings` |The text entries for which embeddings were returned. | -|`meta` |`string` |API usage data, including current version and billable tokens. | +response = model.embed( + input=["The answer to the ultimate question of life, the universe, and everything is 42"], + input_type=EmbeddingInputType.DOCUMENT, +) +``` -For more information, see [https://docs.cohere.com/reference/embed](https://docs.cohere.com/reference/embed). +When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance. -### v1/embed examples -#### embeddings_floats Response +```python +from azure.ai.inference.models import EmbeddingInputType -Request: +response = model.embed( + input=["What's the ultimate meaning of life?"], + input_type=EmbeddingInputType.QUERY, +) +``` -```json - { - "input_type": "clustering", - "truncate": "START", - "texts":["hi", "hello"] +Cohere Embed V3 models can optimize the embeddings based on its use case. 
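To illustrate how the `DOCUMENT` and `QUERY` input types work together in a simple retrieval flow, the following is a minimal sketch (not part of the official samples) that ranks a couple of example documents against a query by cosine similarity. It reuses the `model` client and `EmbeddingInputType` import shown above; the document texts and the in-memory ranking are stand-ins for a real vector database.

```python
import numpy as np
from azure.ai.inference.models import EmbeddingInputType

# Hypothetical documents to index; in a real application these would come
# from your own data and typically be stored in a vector database.
documents = [
    "The answer to the ultimate question of life, the universe, and everything is 42",
    "The largest planet in our solar system is Jupiter",
]

# Embed the corpus with the DOCUMENT input type.
doc_response = model.embed(input=documents, input_type=EmbeddingInputType.DOCUMENT)
doc_vectors = np.asarray([item.embedding for item in doc_response.data])

# Embed the query with the QUERY input type to maximize retrieval quality.
query = "What's the ultimate meaning of life?"
query_response = model.embed(input=[query], input_type=EmbeddingInputType.QUERY)
query_vector = np.asarray(query_response.data[0].embedding)

# Rank the documents by cosine similarity to the query.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
for score, text in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {text}")
```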
+++++## Cohere embedding models ++The Cohere family of models for embeddings includes the following models: ++# [Cohere Embed v3 - English](#tab/cohere-embed-v3-english) ++Cohere Embed English is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes: ++* Embed English has 1,024 dimensions. +* Context window of the model is 512 tokens +++# [Cohere Embed v3 - Multilingual](#tab/cohere-embed-v3-multilingual) ++Cohere Embed Multilingual is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed Multilingual supports more than 100 languages and can be used to search within a language (for example, to search with a French query on French documents) and across languages (for example, to search with an English query on Chinese documents). Embed multilingual performs well on multilingual benchmarks such as Miracl. Embed Multilingual also has the following attributes: ++* Embed Multilingual has 1,024 dimensions. +* Context window of the model is 512 tokens +++++## Prerequisites ++To use Cohere Embed V3 models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about). ++## Work with embeddings ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with an embeddings model. 
++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name); +console.log("Model type: ", model_info.body.model_type); +console.log("Model provider name: ", model_info.body.model_provider_name); +``` ++```console +Model name: Cohere-embed-v3-english +Model type": embeddings +Model provider name": Cohere +``` ++### Create embeddings ++Create an embedding request to see the output of the model. ++```javascript +var response = await client.path("/embeddings").post({ + body: { + input: ["The ultimate answer to the question of life"], }+}); ``` -Response: +> [!TIP] +> The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings. -```json - { - "id": "da7a104c-e504-4349-bcd4-4d69dfa02077", - "texts": [ - "hi", - "hello" - ], - "embeddings": [ - [ - ... - ], - [ - ... - ] +The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log(response.embedding); +console.log(response.body.model); +console.log(response.body.usage); +``` ++It can be useful to compute embeddings in input batches. The parameter `inputs` can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position. +++```javascript +var response = await client.path("/embeddings").post({ + body: { + input: [ + "The ultimate answer to the question of life", + "The largest planet in our solar system is Jupiter", ],- "meta": { - "api_version": { - "version": "1" - }, - "billed_units": { - "input_tokens": 2 - } - }, - "response_type": "embeddings_floats" }+}); ``` -#### Embeddings_by_types response +The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log(response.embedding); +console.log(response.body.model); +console.log(response.body.usage); +``` ++> [!TIP] +> Cohere Embed V3 models can take batches of 1024 at a time. When creating batches, make sure that you don't exceed this limit. ++#### Create different types of embeddings ++Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns. 
++The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database: +++```javascript +var response = await client.path("/embeddings").post({ + body: { + input: ["The answer to the ultimate question of life, the universe, and everything is 42"], + input_type: "document", + } +}); +``` ++When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance. +++```javascript +var response = await client.path("/embeddings").post({ + body: { + input: ["What's the ultimate meaning of life?"], + input_type: "query", + } +}); +``` ++Cohere Embed V3 models can optimize the embeddings based on its use case. +++++## Cohere embedding models ++The Cohere family of models for embeddings includes the following models: ++# [Cohere Embed v3 - English](#tab/cohere-embed-v3-english) ++Cohere Embed English is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed English performs well on the HuggingFace (massive text embed) MTEB benchmark and on use-cases for various industries, such as Finance, Legal, and General-Purpose Corpora. Embed English also has the following attributes: ++* Embed English has 1,024 dimensions. +* Context window of the model is 512 tokens +++# [Cohere Embed v3 - Multilingual](#tab/cohere-embed-v3-multilingual) ++Cohere Embed Multilingual is a text representation model used for semantic search, retrieval-augmented generation (RAG), classification, and clustering. Embed Multilingual supports more than 100 languages and can be used to search within a language (for example, to search with a French query on French documents) and across languages (for example, to search with an English query on Chinese documents). Embed multilingual performs well on multilingual benchmarks such as Miracl. Embed Multilingual also has the following attributes: ++* Embed Multilingual has 1,024 dimensions. +* Context window of the model is 512 tokens +++++## Prerequisites ++To use Cohere Embed V3 models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Cohere Embed V3 models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### A REST client ++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: ++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). 
+* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++> [!TIP] +> Additionally, Cohere supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [Cohere documentation](https://docs.cohere.com/reference/about). ++## Work with embeddings ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with an embeddings model. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` ++The response is as follows: -Request: ```json- { - "input_type": "clustering", - "embedding_types": ["int8", "binary"], - "truncate": "START", - "texts":["hi", "hello"] +{ + "model_name": "Cohere-embed-v3-english", + "model_type": "embeddings", + "model_provider_name": "Cohere" +} +``` ++### Create embeddings ++Create an embedding request to see the output of the model. ++```json +{ + "input": [ + "The ultimate answer to the question of life" + ] +} +``` ++> [!TIP] +> The context window for Cohere Embed V3 models is 512. Make sure that you don't exceed this limit when creating embeddings. ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456", + "object": "list", + "data": [ + { + "index": 0, + "object": "embedding", + "embedding": [ + 0.017196655, + // ... + -0.000687122, + -0.025054932, + -0.015777588 + ] + } + ], + "model": "Cohere-embed-v3-english", + "usage": { + "prompt_tokens": 9, + "completion_tokens": 0, + "total_tokens": 9 }+} ``` -Response: +It can be useful to compute embeddings in input batches. The parameter `inputs` can be a list of strings, where each string is a different input. In turn the response is a list of embeddings, where each embedding corresponds to the input in the same position. + ```json- { - "id": "b604881a-a5e1-4283-8c0d-acbd715bf144", - "texts": [ - "hi", - "hello" - ], - "embeddings": { - "binary": [ - [ - ... - ], - [ - ... - ] - ], - "int8": [ - [ - ... - ], - [ - ... - ] +{ + "input": [ + "The ultimate answer to the question of life", + "The largest planet in our solar system is Jupiter" + ] +} +``` ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0ab1234c-d5e6-7fgh-i890-j1234k123456", + "object": "list", + "data": [ + { + "index": 0, + "object": "embedding", + "embedding": [ + 0.017196655, + // ... + -0.000687122, + -0.025054932, + -0.015777588 ] },- "meta": { - "api_version": { - "version": "1" - }, - "billed_units": { - "input_tokens": 2 - } - }, - "response_type": "embeddings_by_type" + { + "index": 1, + "object": "embedding", + "embedding": [ + 0.017196655, + // ... + -0.000687122, + -0.025054932, + -0.015777588 + ] + } + ], + "model": "Cohere-embed-v3-english", + "usage": { + "prompt_tokens": 19, + "completion_tokens": 0, + "total_tokens": 19 }+} ``` -#### More inference examples +> [!TIP] +> Cohere Embed V3 models can take batches of 1024 at a time. 
When creating batches, make sure that you don't exceed this limit. ++#### Create different types of embeddings ++Cohere Embed V3 models can generate multiple embeddings for the same input depending on how you plan to use them. This capability allows you to retrieve more accurate embeddings for RAG patterns. -| **Package** | **Sample Notebook** | -|-|-| -| CLI using CURL and Python web requests | [cohere-embed.ipynb](https://aka.ms/samples/embed-v3/webrequests)| -| OpenAI SDK (experimental) | [openaisdk.ipynb](https://aka.ms/samples/cohere-embed/openaisdk) | -| LangChain | [langchain.ipynb](https://aka.ms/samples/cohere-embed/langchain) | -| Cohere SDK | [cohere-sdk.ipynb](https://aka.ms/samples/cohere-embed/cohere-python-sdk) | -| LiteLLM SDK | [litellm.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/litellm.ipynb) | +The following example shows how to create embeddings that are used to create an embedding for a document that will be stored in a vector database: -##### Retrieval Augmented Generation (RAG) and tool-use samples -**Description** | **Package** | **Sample Notebook** |--|---Create a local Facebook AI Similarity Search (FAISS) vector index, using Cohere embeddings - Langchain|`langchain`, `langchain_cohere`|[cohere_faiss_langchain_embed.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere_faiss_langchain_embed.ipynb) -Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain|`langchain`, `langchain_cohere`|[command_faiss_langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_faiss_langchain.ipynb) -Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain|`langchain`, `langchain_cohere`|[cohere-aisearch-langchain-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-langchain-rag.ipynb) -Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK| `cohere`, `azure_search_documents`|[cohere-aisearch-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-rag.ipynb) -Command R+ tool/function calling, using LangChain|`cohere`, `langchain`, `langchain_cohere`|[command_tools-langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_tools-langchain.ipynb) -## Cost and quotas +```json +{ + "input": [ + "The answer to the ultimate question of life, the universe, and everything is 42" + ], + "input_type": "document" +} +``` -### Cost and quota considerations for models deployed as a service +When you work on a query to retrieve such a document, you can use the following code snippet to create the embeddings for the query and maximize the retrieval performance. -Cohere models deployed as a serverless API with pay-as-you-go billing are offered by Cohere through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. -Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. +```json +{ + "input": [ + "What's the ultimate meaning of life?" 
+ ], + "input_type": "query" +} +``` -For more information on how to track costs, see [monitor costs for models offered throughout the Azure Marketplace](./costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). +Cohere Embed V3 models can optimize the embeddings based on its use case. -Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. -## Content filtering +## More inference examples -Models deployed as a serverless API are protected by [Azure AI Content Safety](../../ai-services/content-safety/overview.md). With Azure AI content safety, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about [content filtering here](../concepts/content-filtering.md). +| Description | Language | Sample | +|-|-|--| +| Web requests | Bash | [Command-R](https://aka.ms/samples/cohere-command-r/webrequests) - [Command-R+](https://aka.ms/samples/cohere-command-r-plus/webrequests) | +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | +| OpenAI SDK (experimental) | Python | [Link](https://aka.ms/samples/cohere-command/openaisdk) | +| LangChain | Python | [Link](https://aka.ms/samples/cohere/langchain) | +| Cohere SDK | Python | [Link](https://aka.ms/samples/cohere-python-sdk) | +| LiteLLM SDK | Python | [Link](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/litellm.ipynb) | ++#### Retrieval Augmented Generation (RAG) and tool use samples ++| Description | Packages | Sample | +|-||--| +| Create a local Facebook AI similarity search (FAISS) vector index, using Cohere embeddings - Langchain | `langchain`, `langchain_cohere` | [cohere_faiss_langchain_embed.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere_faiss_langchain_embed.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in local FAISS vector index - Langchain |`langchain`, `langchain_cohere` | [command_faiss_langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_faiss_langchain.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in AI search vector index - Langchain | `langchain`, `langchain_cohere` | [cohere-aisearch-langchain-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-langchain-rag.ipynb) | +| Use Cohere Command R/R+ to answer questions from data in AI search vector index - Cohere SDK | `cohere`, `azure_search_documents` | [cohere-aisearch-rag.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/cohere-aisearch-rag.ipynb) | +| Command R+ tool/function calling, using LangChain | `cohere`, `langchain`, `langchain_cohere` | 
[command_tools-langchain.ipynb](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/cohere/command_tools-langchain.ipynb) | +++## Cost and quota considerations for the Cohere family of models deployed as serverless API endpoints ++Cohere models deployed as a serverless API are offered by Cohere through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. ++Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. ++For more information on how to track costs, see [monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). ++Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. ## Related content -- [What is Azure AI Studio?](../what-is-ai-studio.md)-- [Azure AI FAQ article](../faq.yml)-- [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md)++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Cohere Rerank | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-cohere-rerank.md | To create a deployment: 1. Alternatively, you can initiate a deployment by starting from your project in AI Studio. 1. From the left sidebar of your project, select **Components** > **Deployments**.- 1. Select **+ Create deployment**. + 1. Select **+ Deploy model**. 1. Search for and select **Cohere-rerank-3-english** to open the Model Details page. 1. Select **Confirm** to open a serverless API deployment window for the model. |
ai-studio | Deploy Models Jais | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-jais.md | + + Title: How to use Jais chat models with Azure AI Studio ++description: Learn how to use Jais chat models with Azure AI Studio. +++ Last updated : 08/08/2024++reviewer: hazemelh ++++zone_pivot_groups: azure-ai-model-catalog-samples-chat +++# How to use Jais chat models ++In this article, you learn about Jais chat models and how to use them. +JAIS 30b Chat is an autoregressive bi-lingual LLM for **Arabic** & **English**. The tuned versions use supervised fine-tuning (SFT). The model is fine-tuned with both Arabic and English prompt-response pairs. The fine-tuning datasets included a wide range of instructional data across various domains. The model covers a wide range of common tasks including question answering, code generation, and reasoning over textual content. To enhance performance in Arabic, the Core42 team developed an in-house Arabic dataset and translated some open-source English instructions into Arabic. ++* **Context length:** JAIS supports a context length of 8K. +* **Input:** Model input is text only. +* **Output:** Model generates text only. +++++++You can learn more about the models in their respective model card: ++* [jais-30b-chat](https://aka.ms/azureai/landing/jais-30b-chat) +++## Prerequisites ++To use Jais chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jais chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. 
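The snippets in the rest of this section read the endpoint URL and key from environment variables. As a quick illustration (the values below are placeholders rather than real credentials, and this setup isn't part of the original article), you could set them from Python before running the examples:

```python
import os

# Placeholder values for illustration only; substitute your deployment's endpoint and key.
os.environ.setdefault(
    "AZURE_INFERENCE_ENDPOINT",
    "https://<your-host-name>.<your-azure-region>.inference.ai.azure.com",
)
os.environ.setdefault("AZURE_INFERENCE_CREDENTIAL", "<your-32-character-key>")
```

In practice, you would normally set these variables in your shell or deployment configuration rather than in code.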
++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jais chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```python +model_info = client.get_model_info() +``` ++The response is as follows: +++```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: jais-30b-chat +Model type: chat-completions +Model provider name: G42 +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```python +from azure.ai.inference.models import SystemMessage, UserMessage ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` ++The response is as follows, where you can see the model's usage statistics: +++```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: jais-30b-chat +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. 
+++```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. + """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++> [!WARNING] +> Jais doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. 
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++++You can learn more about the models in their respective model card: ++* [jais-30b-chat](https://aka.ms/azureai/landing/jais-30b-chat) +++## Prerequisites ++To use Jais chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jais chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jais chat models. ++### Create a client to consume the model ++First, create the client to consume the model. 
The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: jais-30b-chat +Model type: chat-completions +Model provider name: G42 +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: jais-30b-chat +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" 
}, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. ++You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? ""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + presence_penalty: "0.1", + frequency_penalty: "0.8", + max_tokens: 2048, + stop: ["<|endoftext|>"], + temperature: 0, + top_p: 1, + response_format: { type: "text" }, + } +}); +``` ++> [!WARNING] +> Jais doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + headers: { + "extra-params": "pass-through" + }, + body: { + messages: messages, + logprobs: true + } +}); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. 
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```javascript +try { + var messages = [ + { role: "system", content: "You are an AI assistant that helps people find information." }, + { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." }, + ]; ++ var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++++You can learn more about the models in their respective model card: ++* [jais-30b-chat](https://aka.ms/azureai/landing/jais-30b-chat) +++## Prerequisites ++To use Jais chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jais chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). 
To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jais chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: jais-30b-chat +Model type: chat-completions +Model provider name: G42 +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: jais-30b-chat +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. 
++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.") + }, + MaxTokens=4096 + }; ++ StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); +} +``` ++To stream completions, use `CompleteStreamingAsync` method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. ++```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) + { + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); + } + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } +} +``` ++You can visualize how streaming generates content: +++```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++> [!WARNING] +> Jais doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. 
++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```csharp +try +{ + requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are an AI assistant that helps people find information."), + new ChatRequestUserMessage( + "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + ), + }, + }; ++ response = client.Complete(requestOptions); + Console.WriteLine(response.Value.Choices[0].Message.Content); +} +catch (RequestFailedException ex) +{ + if (ex.ErrorCode == "content_filter") + { + Console.WriteLine($"Your query has trigger Azure Content Safeaty: {ex.Message}"); + } + else + { + throw; + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++++You can learn more about the models in their respective model card: ++* [jais-30b-chat](https://aka.ms/azureai/landing/jais-30b-chat) +++## Prerequisites ++To use Jais chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jais chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). 
++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### A REST client ++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: ++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name`` is your unique model deployment host name and `your-azure-region`` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jais chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` ++The response is as follows: +++```json +{ + "model_name": "jais-30b-chat", + "model_type": "chat-completions", + "model_provider_name": "G42" +} +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "jais-30b-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. 
Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` ++You can visualize how streaming generates content: +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "jais-30b-chat", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` ++The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "jais-30b-chat", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "jais-30b-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++> [!WARNING] +> Jais doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. 
++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. ++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are an AI assistant that helps people find information." + }, + { + "role": "user", + "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + } + ] +} +``` +++```json +{ + "error": { + "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.", + "type": null, + "param": "prompt", + "code": "content_filter", + "status": 400 + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++## More inference examples ++For more examples of how to use Jais, see the following examples and tutorials: ++| Description | Language | Sample | +|-|-|--| +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | ++## Cost and quota considerations for Jais family of models deployed as serverless API endpoints ++Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. ++Jais models deployed as a serverless API are offered by G42 through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. ++Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. 
The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. ++For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). ++## Related content +++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Jamba | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-jamba.md | Title: How to deploy AI21's Jamba-Instruct model with Azure AI Studio + Title: How to use Jamba-Instruct chat models with Azure AI Studio -description: How to deploy AI21's Jamba-Instruct model with Azure AI Studio +description: Learn how to use Jamba-Instruct chat models with Azure AI Studio. + - Previously updated : 06/19/2024- Last updated : 08/08/2024 reviewer: tgokal-++++zone_pivot_groups: azure-ai-model-catalog-samples-chat -# How to deploy AI21's Jamba-Instruct model with Azure AI Studio +# How to use Jamba-Instruct chat models ++In this article, you learn about Jamba-Instruct chat models and how to use them. +The Jamba-Instruct model is AI21's production-grade Mamba-based large language model (LLM) which uses AI21's hybrid Mamba-Transformer architecture. It's an instruction-tuned version of AI21's hybrid structured state space model (SSM) transformer Jamba model. The Jamba-Instruct model is built for reliable commercial use with respect to quality and performance. ++> [!TIP] +> See our announcements of AI21's Jamba-Instruct model available now on Azure AI Model Catalog through [AI21's blog](https://aka.ms/ai21-jamba-instruct-blog) and [Microsoft Tech Community Blog](https://aka.ms/ai21-jamba-instruct-announcement). +++++++You can learn more about the models in their respective model card: +* [AI21-Jamba-Instruct](https://aka.ms/azureai/landing/AI21-Jamba-Instruct) -In this article, you learn how to use Azure AI Studio to deploy AI21's Jamba-Instruct model as a serverless API with pay-as-you-go billing. -The Jamba Instruct model is AI21's production-grade Mamba-based large language model (LLM) which leverages AI21's hybrid Mamba-Transformer architecture. It's an instruction-tuned version of AI21's hybrid structured state space model (SSM) transformer Jamba model. The Jamba Instruct model is built for reliable commercial use with respect to quality and performance. +## Prerequisites -## Deploy the Jamba Instruct model as a serverless API +To use Jamba-Instruct chat models with Azure AI Studio, you need the following prerequisites: -Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing, providing a way to consume them as an API without hosting them on your subscription, while keeping the enterprise security and compliance organizations need. This deployment option doesn't require quota from your subscription. +### A model deployment -The [AI21-Jamba-Instruct model](https://aka.ms/aistudio/landing/ai21-labs-jamba-instruct) deployed as a serverless API with pay-as-you-go billing is [offered by AI21 through Microsoft Azure Marketplace](https://aka.ms/azure-marketplace-offer-ai21-jamba-instruct). AI21 can change or update the terms of use and pricing of this model. +**Deployment to serverless APIs** -To get started with Jamba Instruct deployed as a serverless API, explore our integrations with [LangChain](https://aka.ms/ai21-jamba-instruct-langchain-sample), [LiteLLM](https://aka.ms/ai21-jamba-instruct-litellm-sample), [OpenAI](https://aka.ms/ai21-jamba-instruct-openai-sample) and the [Azure API](https://aka.ms/ai21-jamba-instruct-azure-api-sample). +Jamba-Instruct chat models can be deployed to serverless API endpoints with pay-as-you-go billing. 
This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. > [!TIP]-> See our announcements of AI21's Jamba-Instruct model available now on Azure AI Model Catalog through [AI21's blog](https://aka.ms/ai21-jamba-instruct-blog) and [Microsoft Tech Community Blog](https://aka.ms/ai21-jamba-instruct-announcement). +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jamba-Instruct chat models. +### Create a client to consume the model -### Prerequisites +First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for Jamba Instruct is only available with hubs created in these regions: - * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 - * Sweden Central - - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). -- An Azure [AI Studio project](../how-to/create-projects.md).-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. 
To perform the steps in this article, your user account must be assigned the __owner__ or __contributor__ role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:+```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential - - On the Azure subscription—to subscribe the AI Studio project to the Azure Marketplace offering, once for each project, per offering: - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/read` - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action` - - `Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.SaaS/register/action` - - - On the resource group—to create and use the SaaS resource: - - `Microsoft.SaaS/resources/read` - - `Microsoft.SaaS/resources/write` - - - On the AI Studio project—to deploy endpoints (the Azure AI Developer role contains these permissions already): - - `Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*` - - `Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*` +client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` - For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md). +### Get the model's capabilities -### Create a new deployment +The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: -These steps demonstrate the deployment of AI21-Jamba-Instruct. To create a deployment: -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Select **Model catalog** from the left sidebar. -1. Search for and select **AI21-Jamba-Instruct** to open its Details page. -1. Select **Deploy** to open a serverless API deployment window for the model. -1. Alternatively, you can initiate a deployment by starting from your project in AI Studio. - 1. From the left sidebar of your project, select **Components** > **Deployments**. - 1. Select **+ Create deployment**. - 1. Search for and select **AI21-Jamba-Instruct**. to open the Model's Details page. - 1. Select **Confirm** to open a serverless API deployment window for the model. -1. Select the project in which you want to deploy your model. To deploy the AI21-Jamba-Instruct model, your project must be in one of the regions listed in the [Prerequisites](#prerequisites) section. -1. In the deployment wizard, select the link to **Azure Marketplace Terms**, to learn more about the terms of use. -1. Select the **Pricing and terms** tab to learn about pricing for the selected model. -1. Select the **Subscribe and Deploy** button. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the [Prerequisites](#prerequisites). Each project has its own subscription to the particular Azure Marketplace offering of the model, which allows you to control and monitor spending. Currently, you can have only one deployment for each model within a project. -1. 
Once you subscribe the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. If this scenario applies to you, there's a **Continue to deploy** option to select. -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**. For more information on using the APIs, see the [Reference](#reference-for-jamba-instruct-deployed-as-a-serverless-api) section. -1. You can always find the endpoint's details, URL, and access keys by navigating to your **Project overview** page. Then, from the left sidebar of your project, select **Components** > **Deployments**. +```python +model_info = client.get_model_info() +``` ++The response is as follows: -To learn about billing for the AI21-Jamba-Instruct model deployed as a serverless API with pay-as-you-go token-based billing, see [Cost and quota considerations for Jamba Instruct deployed as a serverless API](#cost-and-quota-considerations-for-jamba-instruct-deployed-as-a-serverless-api). -### Consume Jamba Instruct as a serverless API +```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` -You can consume Jamba Instruct models as follows: +```console +Model name: AI21-Jamba-Instruct +Model type: chat-completions +Model provider name: AI21 +``` -1. From your **Project overview** page, go to the left sidebar and select **Components** > **Deployments**. +### Create a chat completion request -1. Find and select the deployment you created. +The following example shows how you can create a basic chat completions request to the model. -1. Copy the **Target** URL and the **Key** value. +```python +from azure.ai.inference.models import SystemMessage, UserMessage -1. Make an API request. +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` -For more information on using the APIs, see the [reference](#reference-for-jamba-instruct-deployed-as-a-serverless-api) section. +The response is as follows, where you can see the model's usage statistics: -## Reference for Jamba Instruct deployed as a serverless API -Jamba Instruct models accept both of these APIs: +```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` -- The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` for multi-turn chat or single-turn question-answering. This API is supported because Jamba Instruct is fine-tuned for chat completion.-- [AI21's Azure Client](https://docs.ai21.com/reference/jamba-instruct-api). For more information about the REST endpoint being called, visit [AI21's REST documentation](https://docs.ai21.com/reference/jamba-instruct-api).+```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. 
However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: AI21-Jamba-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` -### Azure AI model inference API +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. -The [Azure AI model inference API](../reference/reference-model-inference-api.md) schema can be found in the [reference for Chat Completions](../reference/reference-model-inference-chat-completions.md) article and an [OpenAPI specification can be obtained from the endpoint itself](../reference/reference-model-inference-api.md?tabs=rest#getting-started). +#### Stream content -Single-turn and multi-turn chat have the same request and response format, except that question answering (single-turn) involves only a single user message in the request, while multi-turn chat requires that you send the entire chat message history in each request. +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. -In a multi-turn chat, the message thread has the following attributes: +You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -- Includes all messages from the user and the model, ordered from oldest to newest.-- Messages alternate between `user` and `assistant` role messages-- Optionally, the message thread starts with a system message to provide context. -The following pseudocode is an example of the message stack for the fourth call in a chat request that includes an initial system message. +```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` -```json -[ - {"role": "system", "message": "Some contextual information here"}, - {"role": "user", "message": "User message 1"}, - {"role": "assistant", "message": "System response 1"}, - {"role": "user", "message": "User message 2"}, - {"role": "assistant"; "message": "System response 2"}, - {"role": "user", "message": "User message 3"}, - {"role": "assistant", "message": "System response 3"}, - {"role": "user", "message": "User message 4"} -] +To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. 
+ """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) ``` -### AI21's Azure client +You can visualize how streaming generates content: -Use the method `POST` to send the request to the `/v1/chat/completions` route: -__Request__ +```python +print_stream(result) +``` -```HTTP/1.1 -POST /v1/chat/completions HTTP/1.1 -Host: <DEPLOYMENT_URI> -Authorization: Bearer <TOKEN> -Content-type: application/json +#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++> [!WARNING] +> Jamba doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. 
+++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++++You can learn more about the models in their respective model card: ++* [AI21-Jamba-Instruct](https://aka.ms/azureai/landing/AI21-Jamba-Instruct) +++## Prerequisites ++To use Jamba-Instruct chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jamba-Instruct chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jamba-Instruct chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. 
+++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: AI21-Jamba-Instruct +Model type: chat-completions +Model provider name: AI21 +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: AI21-Jamba-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 ``` -#### Request schema +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. -Payload is a JSON formatted string containing the following parameters: +#### Stream content -| Key | Type | Required/Default | Allowed values | Description | -| - | -- | :--:| -- | | -| `model` | `string` | Y | Must be `jamba-instruct` | -| `messages` | `list[object]` | Y | A list of objects, one per message, from oldest to newest. The oldest message can be role `system`. All later messages must alternate between user and assistant roles. See the message object definition below. | -| `max_tokens` | `integer` | N <br>`4096` | 0 – 4096 | The maximum number of tokens to allow for each generated response message. Typically the best way to limit output length is by providing a length limit in the system prompt (for example, "limit your answers to three sentences") | -| `temperature` | `float` | N <br>`1` | 0.0 – 2.0 | How much variation to provide in each answer. Setting this value to 0 guarantees the same response to the same question every time. Setting a higher value encourages more variation. 
Modifies the distribution from which tokens are sampled. We recommend altering this or `top_p`, but not both. | -| `top_p` | `float` | N <br>`1` | 0 < _value_ <=1.0 | Limit the pool of next tokens in each step to the top N percentile of possible tokens, where 1.0 means the pool of all possible tokens, and 0.01 means the pool of only the most likely next tokens. | -| `stop` | `string` OR `list[string]` | N <br> | "" | String or list of strings containing the word(s) where the API should stop generating output. Newlines are allowed as "\n". The returned text won't contain the stop sequence. | -| `n` | `integer` | N <br>`1` | 1 – 16 | How many responses to generate for each prompt. With Azure AI Studio's Playground, `n=1` as we work on multi-response Playground. | -| `stream` | `boolean` | N <br>`False` | `True` OR `False` | Whether to enable streaming. If true, results are returned one token at a time. If set to true, `n` must be 1, which is automatically set. | +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` -The `messages` object has the following fields: - - `role`: [_string, required_] The author or purpose of the message. One of the following values: - - `user`: Input provided by the user. Any instructions given here that conflict with instructions given in the `system` prompt take precedence over the `system` prompt instructions. - - `assistant`: A response generated by the model. - - `system`: Initial instructions to provide general guidance on the tone and voice of the generated message. An initial system message is optional, but recommended to provide guidance on the tone of the chat. For example, "You are a helpful chatbot with a background in earth sciences and a charming French accent." - - `content`: [_string, required_] The content of the message. +To stream completions, use `.asNodeStream()` when you call the model. +You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? ""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. 
For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference).
+
+```javascript
+var messages = [
+    { role: "system", content: "You are a helpful assistant" },
+    { role: "user", content: "How many languages are in the world?" },
+];
+
+var response = await client.path("/chat/completions").post({
+    body: {
+        messages: messages,
+        presence_penalty: "0.1",
+        frequency_penalty: "0.8",
+        max_tokens: 2048,
+        stop: ["<|endoftext|>"],
+        temperature: 0,
+        top_p: 1,
+        response_format: { type: "text" },
+    }
+});
+```
+
+> [!WARNING]
+> Jamba doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.
+
+If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model).
+
+### Pass extra parameters to the model
+
+The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model.
+
+Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.
+
+
+```javascript
+var messages = [
+    { role: "system", content: "You are a helpful assistant" },
+    { role: "user", content: "How many languages are in the world?" },
+];
+
+var response = await client.path("/chat/completions").post({
+    headers: {
+        "extra-parameters": "pass-through"
+    },
+    body: {
+        messages: messages,
+        logprobs: true
+    }
+});
+```
+
+### Apply content safety
+
+The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.
+
+The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.
+
+
+```javascript
+try {
+    var messages = [
+        { role: "system", content: "You are an AI assistant that helps people find information." },
+        { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
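+        // Innocuous-sounding prompts that mention cutting or knives can still trip the content filter; the catch block below handles that error.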
}, + ]; ++ var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++++You can learn more about the models in their respective model card: ++* [AI21-Jamba-Instruct](https://aka.ms/azureai/landing/AI21-Jamba-Instruct) +++## Prerequisites ++To use Jamba-Instruct chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Jamba-Instruct chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` -#### Request example +## Work with chat completions -__Single-turn example__ +In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. 
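The examples that follow create the client with key-based authentication. If your deployment uses Microsoft Entra ID instead, a token credential from the `Azure.Identity` package installed earlier can be swapped in. The following is a minimal sketch under that assumption; the environment variable name is illustrative.

```csharp
using Azure.AI.Inference;
using Azure.Identity;

// Sketch: authenticate with Microsoft Entra ID instead of a key
// (assumes Entra ID authentication is enabled on the deployment).
ChatCompletionsClient client = new ChatCompletionsClient(
    new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")),
    new DefaultAzureCredential()
);
```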
-```JSON +> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jamba-Instruct chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: AI21-Jamba-Instruct +Model type: chat-completions +Model provider name: AI21 +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() {- "model": "jamba-instruct", - "messages": [ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: AI21-Jamba-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. 
+++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() {- "role":"user", - "content":"Who was the first emperor of rome?"} - ], - "temperature": 0.8, - "max_tokens": 512 + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.") + }, + MaxTokens=4096 + }; ++ StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); } ``` -__Chat example (fourth request containing third user response)__ +To stream completions, use `CompleteStreamingAsync` method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. -```JSON +```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) {- "model": "jamba-instruct", - "messages": [ - {"role": "system", - "content": "You are a helpful genie just released from a bottle. You start the conversation with 'Thank you for freeing me! I grant you one wish.'"}, - {"role":"user", - "content":"I want a new car"}, - {"role":"assistant", - "content":"🚗 Great choice, I can definitely help you with that! Before I grant your wish, can you tell me what kind of car you're looking for?"}, - {"role":"user", - "content":"A corvette"}, - {"role":"assistant", - "content":"Great choice! What color and year?"}, - {"role":"user", - "content":"1963 black split window Corvette"} - ], - "n":3 + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) + { + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); + } + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } } ``` -#### Response schema +You can visualize how streaming generates content: -The response depends slightly on whether the result is streamed or not. -In a _non-streamed result_, all responses are delivered together in a single response, which also includes a `usage` property. +```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` -In a _streamed result_, +#### Explore more parameters supported by the inference client -* Each response includes a single token in the `choices` field. -* The `choices` object structure is different. -* Only the last response includes a `usage` object. -* The entire response is wrapped in a `data` object. -* The final response object is `data: [DONE]`. +Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). -The response payload is a dictionary with the following fields. 
+```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` -| Key | Type | Description | -| | | - | -| `id` | `string` | A unique identifier for the request. | -| `model` | `string` | Name of the model used. | -| `choices` | `list[object`]|The model-generated response text. For a non-streaming response it is a list with `n` items. For a streaming response, it is a single object containing a single token. See the object description below. | -| `created` | `integer` | The Unix timestamp (in seconds) of when the completion was created. | -| `object` | `string` | The object type, which is always `chat.completion`. | -| `usage` | `object` | Usage statistics for the completion request. See below for details. | +> [!WARNING] +> Jamba doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. -The `choices` response object contains the model-generated response. The object has the following fields: +If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). -| Key | Type | Description | -| | | | -| `index` | `integer` | Zero-based index of the message in the list of messages. Might not correspond to the position in the list. For streamed messages this is always zero. | -| `message` OR `delta` | `object` | The generated message (or token in a streaming response). Same object type as described in the request with two changes:<br> - In a non-streaming response, this object is called `message`. <br>- In a streaming response, it is called `delta`, and contains either `message` or `role` but never both. | -| `finish_reason` | `string` | The reason the model stopped generating tokens: <br>- `stop`: The model reached a natural stop point, or a provided stop sequence. <br>- `length`: Max number of tokens have been reached. <br>- `content_filter`: The generated response violated a responsible AI policy. <br>- `null`: Streaming only. In a streaming response, all responses except the last will be `null`. | +### Pass extra parameters to the model -The `usage` response object contains the following fields. +The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. -| Key | Type | Value | -| - | | -- | -| `prompt_tokens` | `integer` | Number of tokens in the prompt. Note that the prompt token count includes extra tokens added by the system to format the prompt list into a single string as required by the model. The number of extra tokens is typically proportional to the number of messages in the thread, and should be relatively small. | -| `completion_tokens` | `integer` | Number of tokens generated in the completion. | -| `total_tokens` | `integer` | Total tokens. 
+Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.
-#### Non-streaming response example
-```JSON
+```csharp
+requestOptions = new ChatCompletionsOptions()
{-    "id":"cmpl-524c73beb8714d878e18c3b5abd09f2a",
-    "choices":[
+    Messages = {
+        new ChatRequestSystemMessage("You are a helpful assistant."),
+        new ChatRequestUserMessage("How many languages are in the world?")
+    },
+    AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } },
+};
++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough);
+Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}");
+```
++### Apply content safety
++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.
+++```csharp
+try
+{
+    requestOptions = new ChatCompletionsOptions()
+    {
+        Messages = {
+            new ChatRequestSystemMessage("You are an AI assistant that helps people find information."),
+            new ChatRequestUserMessage(
+                "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."
+            ),
+        },
+    };
++    response = client.Complete(requestOptions);
+    Console.WriteLine(response.Value.Choices[0].Message.Content);
+}
+catch (RequestFailedException ex)
+{
+    if (ex.ErrorCode == "content_filter")
+    {
+        Console.WriteLine($"Your query has triggered Azure Content Safety: {ex.Message}");
+    }
+    else
    {-        "index":0,
-        "message":{
-            "role":"assistant",
-            "content":"The human nose can detect over 1 trillion different scents, making it one of the most sensitive smell organs in the animal kingdom."
-        },
-        "finishReason":"stop"
+        throw;
    }-    ],
-    "created": 1717487036,
-    "usage":{
-        "promptTokens":116,
-        "completionTokens":30,
-        "totalTokens":146
-    }
}
```-#### Streaming response example
-```JSON
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"role": "assistant"}, "created": 1717487336, "finish_reason": null}]}
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": ""}, "created": 1717487336, "finish_reason": null}]}
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": " The"}, "created": 1717487336, "finish_reason": null}]}
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": " first e"}, "created": 1717487336, "finish_reason": null}]}
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "mpe"}, "created": 1717487336, "finish_reason": null}]}
-...
115 responses omitted for sanity ...
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "me"}, "created": 1717487336, "finish_reason": null}]}
-data: {"id": "cmpl-8e8b2f6556f94714b0cd5cfe3eeb45fc", "choices": [{"index": 0, "delta": {"content": "."}, "created": 1717487336,"finish_reason": "stop"}], "usage": {"prompt_tokens": 107, "completion_tokens": 121, "total_tokens": 228}}
-data: [DONE]
+> [!TIP]
+> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety).
+++++++You can learn more about the models in their respective model card:
++* [AI21-Jamba-Instruct](https://aka.ms/azureai/landing/AI21-Jamba-Instruct)
+++## Prerequisites
++To use Jamba-Instruct chat models with Azure AI Studio, you need the following prerequisites:
++### A model deployment
++**Deployment to serverless APIs**
++Jamba-Instruct chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need.
++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md).
++> [!div class="nextstepaction"]
+> [Deploy the model to serverless API endpoints](deploy-models-serverless.md)
++### A REST client
++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites:
++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2).
+* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.
++## Work with chat completions
++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat.
++> [!TIP]
+> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Jamba-Instruct chat models.
++### Create a client to consume the model
++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.
++### Get the model's capabilities
++The `/info` route returns information about the model that is deployed to the endpoint.
Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json ``` -## Cost and quotas +The response is as follows: -### Cost and quota considerations for Jamba Instruct deployed as a serverless API -The Jamba Instruct model is deployed as a serverless API and is offered by AI21 through Azure Marketplace and integrated with Azure AI studio for use. You can find Azure Marketplace pricing when deploying or fine-tuning models. +```json +{ + "model_name": "AI21-Jamba-Instruct", + "model_type": "chat-completions", + "model_provider_name": "AI21" +} +``` -Each time a workspace subscribes to a given model offering from Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference and fine-tuning; however, multiple meters are available to track each scenario independently. +### Create a chat completion request -For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](./costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). +The following example shows how you can create a basic chat completions request to the model. ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "AI21-Jamba-Instruct", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" 
+ } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` ++You can visualize how streaming generates content: +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "AI21-Jamba-Instruct", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` ++The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "AI21-Jamba-Instruct", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "AI21-Jamba-Instruct", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++> [!WARNING] +> Jamba doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. 
++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are an AI assistant that helps people find information." + }, + { + "role": "user", + "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + } + ] +} +``` +++```json +{ + "error": { + "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.", + "type": null, + "param": "prompt", + "code": "content_filter", + "status": 400 + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++## More inference examples ++For more examples of how to use Jamba, see the following examples and tutorials: ++| Description | Language | Sample | +|-|-|--| +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | ++## Cost and quota considerations for Jamba family of models deployed as serverless API endpoints Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. -## Content filtering +Jamba models deployed as a serverless API are offered by AI21 through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. -Models deployed as a serverless API are protected by Azure AI content safety. With Azure AI content safety enabled, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about [Azure AI Content Safety](/azure/ai-services/content-safety/overview). +Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. 
The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. ++For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). ## Related content -- [What is Azure AI Studio?](../what-is-ai-studio.md)-- [Azure AI FAQ article](../faq.yml)-- [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md)++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Llama | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-llama.md | Title: How to deploy Meta Llama 3.1 models with Azure AI Studio + Title: How to use Meta Llama chat models with Azure AI Studio -description: Learn how to deploy Meta Llama 3.1 models with Azure AI Studio. -+description: Learn how to use Meta Llama chat models with Azure AI Studio. + Previously updated : 7/21/2024 Last updated : 08/08/2024 reviewer: shubhirajMsft -++zone_pivot_groups: azure-ai-model-catalog-samples-chat -# How to deploy Meta Llama 3.1 models with Azure AI Studio +# How to use Meta Llama chat models +In this article, you learn about Meta Llama chat models and how to use them. +Meta Llama 2 and 3 models and tools are a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. The model family also includes fine-tuned versions optimized for dialogue use cases with reinforcement learning from human feedback (RLHF). -In this article, you learn about the Meta Llama model family. You also learn how to use Azure AI Studio to deploy models from this set either to serverless APIs with pay-as you go billing or to managed compute. - > [!IMPORTANT] - > Read more about the announcement of Meta Llama 3.1 405B Instruct and other Llama 3.1 models available now on Azure AI Model Catalog: [Microsoft Tech Community Blog](https://aka.ms/meta-llama-3.1-release-on-azure) and from [Meta Announcement Blog](https://aka.ms/meta-llama-3.1-release-announcement). -Now available on Azure AI Models-as-a-Service: -- `Meta-Llama-3.1-405B-Instruct`-- `Meta-Llama-3.1-70B-Instruct`-- `Meta-Llama-3.1-8B-Instruct` -The Meta Llama 3.1 family of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). All models support long context length (128k) and are optimized for inference with support for grouped query attention (GQA). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. +## Meta Llama chat models -See the following GitHub samples to explore integrations with [LangChain](https://aka.ms/meta-llama-3.1-405B-instruct-langchain), [LiteLLM](https://aka.ms/meta-llama-3.1-405B-instruct-litellm), [OpenAI](https://aka.ms/meta-llama-3.1-405B-instruct-openai) and the [Azure API](https://aka.ms/meta-llama-3.1-405B-instruct-webrequests). +The Meta Llama chat models include the following models: -## Deploy Meta Llama 3.1 405B Instruct as a serverless API +# [Meta Llama-3.1](#tab/meta-llama-3-1) -Meta Llama 3.1 models - like `Meta Llama 3.1 405B Instruct` - can be deployed as a serverless API with pay-as-you-go, providing a way to consume them as an API without hosting them on your subscription while keeping the enterprise security and compliance organizations need. This deployment option doesn't require quota from your subscription. Meta Llama 3.1 models are deployed as a serverless API with pay-as-you-go billing through Microsoft Azure Marketplace, and they might add more terms of use and pricing. +The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). 
The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks.
-### Azure Marketplace model offerings
-# [Meta Llama 3.1](#tab/llama-three)
+The following models are available:
-The following models are available in Azure Marketplace for Llama 3.1 and Llama 3 when deployed as a service with pay-as-you-go:
+* [Meta-Llama-3.1-405B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-405B-Instruct)
+* [Meta-Llama-3.1-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-70B-Instruct)
+* [Meta-Llama-3.1-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-8B-Instruct)
-* [Meta-Llama-3.1-405B-Instruct (preview)](https://aka.ms/aistudio/landing/meta-llama-3.1-405B-instruct)
-* [Meta-Llama-3.1-70B-Instruct (preview)](https://aka.ms/aistudio/landing/meta-llama-3.1-70B-instruct)
-* [Meta Llama-3.1-8B-Instruct (preview)](https://aka.ms/aistudio/landing/meta-llama-3.1-8B-instruct)
-* [Meta-Llama-3-70B-Instruct (preview)](https://aka.ms/aistudio/landing/meta-llama-3-70b-chat)
-* [Meta-Llama-3-8B-Instruct (preview)](https://aka.ms/aistudio/landing/meta-llama-3-8b-chat)
-# [Meta Llama 2](#tab/llama-two)
+# [Meta Llama-3](#tab/meta-llama-3)
++Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety.
+++The following models are available:
++* [Meta-Llama-3-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-70B-Instruct)
+* [Meta-Llama-3-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-8B-Instruct)
+++# [Meta Llama-2](#tab/meta-llama-2)
++Meta has developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama-2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
+++The following models are available:
++* [Llama-2-70b-chat](https://aka.ms/azureai/landing/Llama-2-70b-chat)
+* [Llama-2-13b-chat](https://aka.ms/azureai/landing/Llama-2-13b-chat)
+* [Llama-2-7b-chat](https://aka.ms/azureai/landing/Llama-2-7b-chat)
+
-The following models are available in Azure Marketplace for Llama 2 when deployed as a serverless API:
-* Meta Llama-2-7B (preview)
-* Meta Llama 2 7B-Chat (preview)
-* Meta Llama-2-13B (preview)
-* Meta Llama 2 13B-Chat (preview)
-* Meta Llama-2-70B (preview)
-* Meta Llama 2 70B-Chat (preview)
-
-If you need to deploy a different model, [deploy it to managed compute](#deploy-meta-llama-models-to-managed-compute) instead.
+## Prerequisites -### Prerequisites +To use Meta Llama chat models with Azure AI Studio, you need the following prerequisites: -# [Meta Llama 3](#tab/llama-three) +### A model deployment -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for Meta Llama 3.1 and Llama 3 is only available with hubs created in these regions:+**Deployment to serverless APIs** - * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 - * Sweden Central - - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). -- An [AI Studio project](../how-to/create-projects.md) in Azure AI Studio.-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __owner__ or __contributor__ role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:-- - On the Azure subscriptionΓÇöto subscribe the AI Studio project to the Azure Marketplace offering, once for each project, per offering: - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/read` - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action` - - `Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.SaaS/register/action` - - - On the resource groupΓÇöto create and use the SaaS resource: - - `Microsoft.SaaS/resources/read` - - `Microsoft.SaaS/resources/write` - - - On the AI Studio projectΓÇöto deploy endpoints (the Azure AI Developer role contains these permissions already): - - `Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*` - - `Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*` -- For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md). - -# [Meta Llama 2](#tab/llama-two) +Meta Llama chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++**Deployment to a self-hosted managed compute** ++Meta Llama chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. 
If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for Meta Llama 2 is only available with hubs created in these regions:+### The inference package installed - * East US - * East US 2 - * North Central US - * South Central US - * West US - * West US 3 +You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). -- An [AI Studio project](../how-to/create-projects.md) in Azure AI Studio.-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __owner__ or __contributor__ role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:-- - On the Azure subscriptionΓÇöto subscribe the AI Studio project to the Azure Marketplace offering, once for each project, per offering: - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/read` - - `Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action` - - `Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read` - - `Microsoft.SaaS/register/action` - - - On the resource groupΓÇöto create and use the SaaS resource: - - `Microsoft.SaaS/resources/read` - - `Microsoft.SaaS/resources/write` - - - On the AI Studio projectΓÇöto deploy endpoints (the Azure AI Developer role contains these permissions already): - - `Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*` - - `Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*` -- For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md). 
+Once you have these prerequisites, install the Azure AI inference package with the following command: -+```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Meta Llama chat models. -### Create a new deployment +### Create a client to consume the model -# [Meta Llama 3](#tab/llama-three) +First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. -To create a deployment: -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Choose `Meta-Llama-3.1-405B-Instruct` deploy from the Azure AI Studio [model catalog](https://ai.azure.com/explore/models). +```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential - Alternatively, you can initiate deployment by starting from your project in AI Studio. Select a project and then select **Deployments** > **+ Create**. +client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` -1. On the **Details** page for `Meta-Llama-3.1-405B-Instruct`, select **Deploy** and then select **Serverless API with Azure AI Content Safety**. +When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. -1. Select the project in which you want to deploy your models. To use the pay-as-you-go model deployment offering, your workspace must belong to the **East US 2** or **Sweden Central** region. -1. On the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. You can also select the **Marketplace offer details** tab to learn about pricing for the selected model. -1. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering (for example, `Meta-Llama-3.1-405B-Instruct`) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering, which allows you to control and monitor spending. Select **Subscribe and Deploy**. - > [!NOTE] - > Subscribing a project to a particular Azure Marketplace offering (in this case, Meta-Llama-3-70B) requires that your account has **Contributor** or **Owner** access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the [prerequisites](#prerequisites). +```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.identity import DefaultAzureCredential -1. Once you sign up the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. 
Therefore, you don't need to have the subscription-level permissions for subsequent deployments. If this scenario applies to you, select **Continue to deploy**. +client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=DefaultAzureCredential(), +) +``` -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. +> [!NOTE] +> Currently, serverless API endpoints do not support using Microsoft Entra ID for authentication. -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. +### Get the model's capabilities -1. Select **Open in playground** to start interacting with the model. +The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: -1. You can return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**, which you can use to call the deployment and generate completions. -1. You can always find the endpoint's details, URL, and access keys by navigating to the project page and selecting **Deployments** from the left menu. +```python +model_info = client.get_model_info() +``` -To learn about billing for Meta Llama models deployed with pay-as-you-go, see [Cost and quota considerations for Llama 3 models deployed as a service](#cost-and-quota-considerations-for-meta-llama-31-models-deployed-as-a-service). +The response is as follows: -# [Meta Llama 2](#tab/llama-two) -To create a deployment: +```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: Meta-Llama-3.1-405B-Instruct +Model type: chat-completions +Model provider name: Meta +``` -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Choose the model you want to deploy from the Azure AI Studio [model catalog](https://ai.azure.com/explore/models). +### Create a chat completion request - Alternatively, you can initiate deployment by starting from your project in AI Studio. Select a project and then select **Deployments** > **+ Create**. +The following example shows how you can create a basic chat completions request to the model. -1. On the model's **Details** page, select **Deploy** and then select **Serverless API with Azure AI Content Safety**. +```python +from azure.ai.inference.models import SystemMessage, UserMessage - :::image type="content" source="../media/deploy-monitor/llama/deploy-pay-as-you-go.png" alt-text="A screenshot showing how to deploy a model with the pay-as-you-go option." lightbox="../media/deploy-monitor/llama/deploy-pay-as-you-go.png"::: +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` -1. Select the project in which you want to deploy your models. To use the pay-as-you-go model deployment offering, your workspace must belong to the **East US 2** or **West US 3** region. -1. On the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. You can also select the **Marketplace offer details** tab to learn about pricing for the selected model. -1. 
If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering (for example, Meta-Llama-2-7B) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering, which allows you to control and monitor spending. Select **Subscribe and Deploy**. +The response is as follows, where you can see the model's usage statistics: - > [!NOTE] - > Subscribing a project to a particular Azure Marketplace offering (in this case, Meta-Llama-2-7B) requires that your account has **Contributor** or **Owner** access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the [prerequisites](#prerequisites). - :::image type="content" source="../media/deploy-monitor/llama/deploy-marketplace-terms.png" alt-text="A screenshot showing the terms and conditions of a given model." lightbox="../media/deploy-monitor/llama/deploy-marketplace-terms.png"::: +```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` -1. Once you sign up the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. Therefore, you don't need to have the subscription-level permissions for subsequent deployments. If this scenario applies to you, select **Continue to deploy**. +```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Meta-Llama-3.1-405B-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` - :::image type="content" source="../media/deploy-monitor/llama/deploy-pay-as-you-go-project.png" alt-text="A screenshot showing a project that is already subscribed to the offering." lightbox="../media/deploy-monitor/llama/deploy-pay-as-you-go-project.png"::: +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. +#### Stream content - :::image type="content" source="../media/deploy-monitor/llama/deployment-name.png" alt-text="A screenshot showing how to indicate the name of the deployment you want to create." lightbox="../media/deploy-monitor/llama/deployment-name.png"::: +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Select **Open in playground** to start interacting with the model. -1. 
You can return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**, which you can use to call the deployment and generate completions. -1. You can always find the endpoint's details, URL, and access keys by navigating to your project and selecting **Deployments** from the left menu. +You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -To learn about billing for Llama models deployed with pay-as-you-go, see [Cost and quota considerations for Llama 3 models deployed as a service](#cost-and-quota-considerations-for-meta-llama-31-models-deployed-as-a-service). -+```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. + """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++> [!WARNING] +> Meta Llama doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). -### Consume Meta Llama models as a service +### Pass extra parameters to the model -# [Meta Llama 3](#tab/llama-three) +The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. -Models deployed as a service can be consumed using either the chat or the completions API, depending on the type of model you deployed. +Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. 
When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.
-1. Select your project or hub and then select **Deployments** from the left menu.
-1. Find and select the `Meta-Llama-3.1-405B-Instruct` deployment you created.
+```python
+response = client.complete(
+    messages=[
+        SystemMessage(content="You are a helpful assistant."),
+        UserMessage(content="How many languages are in the world?"),
+    ],
+    model_extras={
+        "logprobs": True
+    }
+)
+```
++The following extra parameters can be passed to Meta Llama chat models:
++| Name | Description | Type |
+| --- | --- | --- |
+| `n` | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` |
+| `best_of` | Generates best_of completions server-side and returns the *best* (the one with the highest log probability per token). Results can't be streamed. When used with `n`, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than `n`. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` |
+| `logprobs` | A number indicating to include the log probabilities on the logprobs most likely tokens and the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API will always return the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. | `integer` |
+| `ignore_eos` | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. | `boolean` |
+| `use_beam_search` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than 1 and temperature must be 0. | `boolean` |
+| `stop_token_ids` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. | `array` |
+| `skip_special_tokens` | Whether to skip special tokens in the output. | `boolean` |
+++### Apply content safety
++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions.
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled.
+++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). ++> [!NOTE] +> Azure AI content safety is only available for models deployed as serverless API endpoints. +++ -1. Select **Open in playground**. +## Meta Llama chat models -1. Select **View code** and copy the **Endpoint** URL and the **Key** value. +The Meta Llama chat models include the following models: -1. Make an API request based on the type of model you deployed. +# [Meta Llama-3.1](#tab/meta-llama-3-1) - - For completions models, such as `Meta-Llama-3-8B`, use the [`/completions`](#completions-api) API. - - For chat models, such as `Meta-Llama-3.1-405B-Instruct`, use the [`/chat/completions`](#chat-api) API. +The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks. - For more information on using the APIs, see the [reference](#reference-for-meta-llama-31-models-deployed-as-a-service) section. -# [Meta Llama 2](#tab/llama-two) +The following models are available: +* [Meta-Llama-3.1-405B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-405B-Instruct) +* [Meta-Llama-3.1-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-70B-Instruct) +* [Meta-Llama-3.1-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-8B-Instruct) -Models deployed as a service can be consumed using either the chat or the completions API, depending on the type of model you deployed. -1. Select your project or hub and then select **Deployments** from the left menu. +# [Meta Llama-3](#tab/meta-llama-3) -1. Find and select the deployment you created. +Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B, and 405B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety. -1. Select **Open in playground**. -1. Select **View code** and copy the **Endpoint** URL and the **Key** value. +The following models are available: -1. Make an API request based on the type of model you deployed. 
+* [Meta-Llama-3-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-70B-Instruct) +* [Meta-Llama-3-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-8B-Instruct) - - For completions models, such as `Meta-Llama-2-7B`, use the [`/v1/completions`](#completions-api) API or the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/completions`. - - For chat models, such as `Meta-Llama-2-7B-Chat`, use the [`/v1/chat/completions`](#chat-api) API or the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions`. - For more information on using the APIs, see the [reference](#reference-for-meta-llama-31-models-deployed-as-a-service) section. +# [Meta Llama-2](#tab/meta-llama-2) ++Meta has developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama-2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. +++The following models are available: ++* [Llama-2-70b-chat](https://aka.ms/azureai/landing/Llama-2-70b-chat) +* [Llama-2-13b-chat](https://aka.ms/azureai/landing/Llama-2-13b-chat) +* [Llama-2-7b-chat](https://aka.ms/azureai/landing/Llama-2-7b-chat) + -### Reference for Meta Llama 3.1 models deployed as a service +## Prerequisites -Llama models accept both the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` or a [Llama Chat API](#chat-api) on `/v1/chat/completions`. In the same way, text completions can be generated using the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/completions` or a [Llama Completions API](#completions-api) on `/v1/completions` +To use Meta Llama chat models with Azure AI Studio, you need the following prerequisites: -The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) schema can be found in the [reference for Chat Completions](../reference/reference-model-inference-chat-completions.md) article and an [OpenAPI specification can be obtained from the endpoint itself](../reference/reference-model-inference-api.md?tabs=rest#getting-started). +### A model deployment -#### Completions API +**Deployment to serverless APIs** -Use the method `POST` to send the request to the `/v1/completions` route: +Meta Llama chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. -__Request__ +Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). 
-```rest -POST /v1/completions HTTP/1.1 -Host: <DEPLOYMENT_URI> -Authorization: Bearer <TOKEN> -Content-type: application/json +> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++**Deployment to a self-hosted managed compute** ++Meta Llama chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference ``` -#### Request schema +## Work with chat completions -Payload is a JSON formatted string containing the following parameters: +In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. -| Key | Type | Default | Description | -||--||-| -| `prompt` | `string` | No default. This value must be specified. | The prompt to send to the model. | -| `stream` | `boolean` | `False` | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. | -| `max_tokens` | `integer` | `16` | The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` can't exceed the model's context length. | -| `top_p` | `float` | `1` | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering `top_p` or `temperature`, but not both. | -| `temperature` | `float` | `1` | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly the distribution of tokens. Zero means greedy sampling. We recommend altering this or `top_p`, but not both. | -| `n` | `integer` | `1` | How many completions to generate for each prompt. <br>Note: Because this parameter generates many completions, it can quickly consume your token quota. | -| `stop` | `array` | `null` | String or a list of strings containing the word where the API stops generating further tokens. The returned text won't contain the stop sequence. 
| -| `best_of` | `integer` | `1` | Generates `best_of` completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with `n`, `best_of` controls the number of candidate completions and `n` specifies how many to returnΓÇô`best_of` must be greater than `n`. <br>Note: Because this parameter generates many completions, it can quickly consume your token quota.| -| `logprobs` | `integer` | `null` | A number indicating to include the log probabilities on the `logprobs` most likely tokens and the chosen tokens. For example, if `logprobs` is 10, the API returns a list of the 10 most likely tokens. the API always returns the logprob of the sampled token, so there might be up to `logprobs`+1 elements in the response. | -| `presence_penalty` | `float` | `null` | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | -| `ignore_eos` | `boolean` | `True` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | -| `use_beam_search` | `boolean` | `False` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than `1` and `temperature` must be `0`. | -| `stop_token_ids` | `array` | `null` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. | -| `skip_special_tokens` | `boolean` | `null` | Whether to skip special tokens in the output. | +> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Meta Llama chat models. -#### Example +### Create a client to consume the model -__Body__ +First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. -```json -{ - "prompt": "What's the distance to the moon?", - "temperature": 0.8, - "max_tokens": 512 ++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { DefaultAzureCredential } from "@azure/identity"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new DefaultAzureCredential() +); +``` ++> [!NOTE] +> Currently, serverless API endpoints do not support using Microsoft Entra ID for authentication. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: Meta-Llama-3.1-405B-Instruct +Model type: chat-completions +Model provider name: Meta +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Meta-Llama-3.1-405B-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. 
+
+You can visualize how streaming generates content:
+
+
+```javascript
+// createSseStream comes from the @azure/core-sse package.
+import { createSseStream } from "@azure/core-sse";
+
+var stream = response.body;
+if (!stream) {
+    throw new Error(`Failed to get chat completions with status: ${response.status}`);
+}
+
+if (response.status !== "200") {
+    throw new Error(`Failed to get chat completions: ${response.body.error}`);
+}
+
+var sses = createSseStream(stream);
+
+for await (const event of sses) {
+    if (event.data === "[DONE]") {
+        return;
+    }
+    for (const choice of (JSON.parse(event.data)).choices) {
+        console.log(choice.delta?.content ?? "");
+    }
 }
 ```
 
-#### Explore more parameters supported by the inference client
+
+Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference).
+
+```javascript
+var messages = [
+    { role: "system", content: "You are a helpful assistant" },
+    { role: "user", content: "How many languages are in the world?" },
+];
+
+var response = await client.path("/chat/completions").post({
+    body: {
+        messages: messages,
+        presence_penalty: 0.1,
+        frequency_penalty: 0.8,
+        max_tokens: 2048,
+        stop: ["<|endoftext|>"],
+        temperature: 0,
+        top_p: 1,
+        response_format: { type: "text" },
+    }
+});
+```
+
+> [!WARNING]
+> Meta Llama doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.
+
+If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model).
+
+### Pass extra parameters to the model
+
+The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model.
+
+Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.
+
+
+```javascript
+var messages = [
+    { role: "system", content: "You are a helpful assistant" },
+    { role: "user", content: "How many languages are in the world?" },
+];
+
+var response = await client.path("/chat/completions").post({
+    headers: {
+        "extra-parameters": "pass-through"
+    },
+    body: {
+        messages: messages,
+        logprobs: true
+    }
+});
+```
+
+The following extra parameters can be passed to Meta Llama chat models:
+
+| Name | Description | Type |
+| -- | | |
+| `n` | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` |
+| `best_of` | Generates best_of completions server-side and returns the *best* (the one with the lowest log probability per token). Results can't be streamed. When used with `n`, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than `n`. 
Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` | +| `logprobs` | A number indicating to include the log probabilities on the logprobs most likely tokens and the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. the API will always return the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. | `integer` | +| `ignore_eos` | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. | `boolean` | +| `use_beam_search` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than 1 and temperature must be 0. | `boolean` | +| `stop_token_ids` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. | `array` | +| `skip_special_tokens` | Whether to skip special tokens in the output. | `boolean` | +++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. + -The response payload is a dictionary with the following fields. +```javascript +try { + var messages = [ + { role: "system", content: "You are an AI assistant that helps people find information." }, + { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." }, + ]; -| Key | Type | Description | -|--|--|--| -| `id` | `string` | A unique identifier for the completion. | -| `choices` | `array` | The list of completion choices the model generated for the input prompt. | -| `created` | `integer` | The Unix timestamp (in seconds) of when the completion was created. | -| `model` | `string` | The model_id used for completion. | -| `object` | `string` | The object type, which is always `text_completion`. | -| `usage` | `object` | Usage statistics for the completion request. | + var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` > [!TIP]-> In the streaming mode, for each chunk of response, `finish_reason` is always `null`, except from the last one which is terminated by a payload `[DONE]`. +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +> [!NOTE] +> Azure AI content safety is only available for models deployed as serverless API endpoints. -The `choices` object is a dictionary with the following fields. -| Key | Type | Description | -||--|| -| `index` | `integer` | Choice index. 
When `best_of` > 1, the index in this array might not be in order and might not be 0 to n-1. | -| `text` | `string` | Completion result. | -| `finish_reason` | `string` | The reason the model stopped generating tokens: <br>- `stop`: model hit a natural stop point, or a provided stop sequence. <br>- `length`: if max number of tokens have been reached. <br>- `content_filter`: When RAI moderates and CMP forces moderation. <br>- `content_filter_error`: an error during moderation and wasn't able to make decision on the response. <br>- `null`: API response still in progress or incomplete. | -| `logprobs` | `object` | The log probabilities of the generated tokens in the output text. | -The `usage` object is a dictionary with the following fields. -| Key | Type | Value | -||--|--| -| `prompt_tokens` | `integer` | Number of tokens in the prompt. | -| `completion_tokens` | `integer` | Number of tokens generated in the completion. | -| `total_tokens` | `integer` | Total tokens. | +## Meta Llama chat models -The `logprobs` object is a dictionary with the following fields: +The Meta Llama chat models include the following models: -| Key | Type | Value | -||-|-| -| `text_offsets` | `array` of `integers` | The position or index of each token in the completion output. | -| `token_logprobs` | `array` of `float` | Selected `logprobs` from dictionary in `top_logprobs` array. | -| `tokens` | `array` of `string` | Selected tokens. | -| `top_logprobs` | `array` of `dictionary` | Array of dictionary. In each dictionary, the key is the token and the value is the prob. | +# [Meta Llama-3.1](#tab/meta-llama-3-1) -#### Example +The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks. -```json ++The following models are available: ++* [Meta-Llama-3.1-405B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-405B-Instruct) +* [Meta-Llama-3.1-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-70B-Instruct) +* [Meta-Llama-3.1-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-8B-Instruct) +++# [Meta Llama-3](#tab/meta-llama-3) ++Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B, and 405B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety. +++The following models are available: ++* [Meta-Llama-3-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-70B-Instruct) +* [Meta-Llama-3-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-8B-Instruct) +++# [Meta Llama-2](#tab/meta-llama-2) ++Meta has developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. 
Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama-2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. +++The following models are available: ++* [Llama-2-70b-chat](https://aka.ms/azureai/landing/Llama-2-70b-chat) +* [Llama-2-13b-chat](https://aka.ms/azureai/landing/Llama-2-13b-chat) +* [Llama-2-7b-chat](https://aka.ms/azureai/landing/Llama-2-7b-chat) +++++## Prerequisites ++To use Meta Llama chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Meta Llama chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++**Deployment to a self-hosted managed compute** ++Meta Llama chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). 
To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Meta Llama chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. +++```csharp +client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new DefaultAzureCredential(includeInteractiveCredentials: true) +); +``` ++> [!NOTE] +> Currently, serverless API endpoints do not support using Microsoft Entra ID for authentication. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: Meta-Llama-3.1-405B-Instruct +Model type: chat-completions +Model provider name: Meta +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. 
++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() {- "id": "12345678-1234-1234-1234-abcdefghijkl", - "object": "text_completion", - "created": 217877, - "choices": [ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Meta-Llama-3.1-405B-Instruct +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.") + }, + MaxTokens=4096 + }; ++ StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); +} +``` ++To stream completions, use `CompleteStreamingAsync` method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. 
++```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) {- "index": 0, - "text": "The Moon is an average of 238,855 miles away from Earth, which is about 30 Earths away.", - "logprobs": null, - "finish_reason": "stop" + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); }- ], - "usage": { - "prompt_tokens": 7, - "total_tokens": 23, - "completion_tokens": 16 + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } +} +``` ++You can visualize how streaming generates content: +++```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++> [!WARNING] +> Meta Llama doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++The following extra parameters can be passed to Meta Llama chat models: ++| Name | Description | Type | +| -- | | | +| `n` | How many completions to generate for each prompt. 
Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` | +| `best_of` | Generates best_of completions server-side and returns the *best* (the one with the lowest log probability per token). Results can't be streamed. When used with `n`, best_of controls the number of candidate completions and n specifies how many to returnΓÇöbest_of must be greater than `n`. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` | +| `logprobs` | A number indicating to include the log probabilities on the logprobs most likely tokens and the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. the API will always return the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. | `integer` | +| `ignore_eos` | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. | `boolean` | +| `use_beam_search` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than 1 and temperature must be 0. | `boolean` | +| `stop_token_ids` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. | `array` | +| `skip_special_tokens` | Whether to skip special tokens in the output. | `boolean` | +++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```csharp +try +{ + requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are an AI assistant that helps people find information."), + new ChatRequestUserMessage( + "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + ), + }, + }; ++ response = client.Complete(requestOptions); + Console.WriteLine(response.Value.Choices[0].Message.Content); +} +catch (RequestFailedException ex) +{ + if (ex.ErrorCode == "content_filter") + { + Console.WriteLine($"Your query has trigger Azure Content Safeaty: {ex.Message}"); + } + else + { + throw; } } ``` -#### Chat API +> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). ++> [!NOTE] +> Azure AI content safety is only available for models deployed as serverless API endpoints. +++++## Meta Llama chat models ++The Meta Llama chat models include the following models: ++# [Meta Llama-3.1](#tab/meta-llama-3-1) ++The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). 
The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open-source and closed chat models on common industry benchmarks. +++The following models are available: -Use the method `POST` to send the request to the `/v1/chat/completions` route: +* [Meta-Llama-3.1-405B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-405B-Instruct) +* [Meta-Llama-3.1-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-70B-Instruct) +* [Meta-Llama-3.1-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3.1-8B-Instruct) -__Request__ -```rest -POST /v1/chat/completions HTTP/1.1 -Host: <DEPLOYMENT_URI> +# [Meta Llama-3](#tab/meta-llama-3) ++Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B, and 405B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety. +++The following models are available: ++* [Meta-Llama-3-70B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-70B-Instruct) +* [Meta-Llama-3-8B-Instruct](https://aka.ms/azureai/landing/Meta-Llama-3-8B-Instruct) +++# [Meta Llama-2](#tab/meta-llama-2) ++Meta has developed and publicly released the Llama 2 family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama-2-Chat, are optimized for dialogue use cases. Llama-2-Chat models outperform open-source chat models on most benchmarks we tested, and in our human evaluations for helpfulness and safety, are on par with some popular closed-source models like ChatGPT and PaLM. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama-2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. +++The following models are available: ++* [Llama-2-70b-chat](https://aka.ms/azureai/landing/Llama-2-70b-chat) +* [Llama-2-13b-chat](https://aka.ms/azureai/landing/Llama-2-13b-chat) +* [Llama-2-7b-chat](https://aka.ms/azureai/landing/Llama-2-7b-chat) +++++## Prerequisites ++To use Meta Llama chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Meta Llama chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++**Deployment to a self-hosted managed compute** ++Meta Llama chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. 
++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### A REST client ++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: ++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name`` is your unique model deployment host name and `your-azure-region`` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Meta Llama chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. ++> [!NOTE] +> Currently, serverless API endpoints do not support using Microsoft Entra ID for authentication. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> Authorization: Bearer <TOKEN>-Content-type: application/json +Content-Type: application/json ``` -#### Request schema +The response is as follows: -Payload is a JSON formatted string containing the following parameters: -| Key | Type | Default | Description | -|--|--|--|--| -| `messages` | `string` | No default. This value must be specified. | The message or history of messages to use to prompt the model. | -| `stream` | `boolean` | `False` | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. | -| `max_tokens` | `integer` | `16` | The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` can't exceed the model's context length. | -| `top_p` | `float` | `1` | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering `top_p` or `temperature`, but not both. | -| `temperature` | `float` | `1` | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly the distribution of tokens. Zero means greedy sampling. 
We recommend altering this or `top_p`, but not both. | -| `n` | `integer` | `1` | How many completions to generate for each prompt. <br>Note: Because this parameter generates many completions, it can quickly consume your token quota. | -| `stop` | `array` | `null` | String or a list of strings containing the word where the API stops generating further tokens. The returned text won't contain the stop sequence. | -| `best_of` | `integer` | `1` | Generates `best_of` completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with `n`, `best_of` controls the number of candidate completions and `n` specifies how many to returnΓÇö`best_of` must be greater than `n`. <br>Note: Because this parameter generates many completions, it can quickly consume your token quota.| -| `logprobs` | `integer` | `null` | A number indicating to include the log probabilities on the `logprobs` most likely tokens and the chosen tokens. For example, if `logprobs` is 10, the API returns a list of the 10 most likely tokens. the API will always return the logprob of the sampled token, so there might be up to `logprobs`+1 elements in the response. | -| `presence_penalty` | `float` | `null` | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | -| `ignore_eos` | `boolean` | `True` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | -| `use_beam_search` | `boolean` | `False` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than `1` and `temperature` must be `0`. | -| `stop_token_ids` | `array` | `null` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens.| -| `skip_special_tokens` | `boolean` | `null` | Whether to skip special tokens in the output. | +```json +{ + "model_name": "Meta-Llama-3.1-405B-Instruct", + "model_type": "chat-completions", + "model_provider_name": "Meta" +} +``` -The `messages` object has the following fields: +### Create a chat completion request -| Key | Type | Value | -|--|--|| -| `content` | `string` | The contents of the message. Content is required for all messages. | -| `role` | `string` | The role of the message's author. One of `system`, `user`, or `assistant`. | +The following example shows how you can create a basic chat completions request to the model. +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` -#### Example +The response is as follows, where you can see the model's usage statistics: -__Body__ ```json {- "messages": - [ - { - "role": "system", - "content": "You are a helpful assistant that translates English to Italian."}, + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Meta-Llama-3.1-405B-Instruct", + "choices": [ {- "role": "user", - "content": "Translate the following sentence from English to Italian: I love programming." + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. 
It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null } ],- "temperature": 0.8, - "max_tokens": 512, + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } } ``` -#### Response schema +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. -The response payload is a dictionary with the following fields. +#### Stream content -| Key | Type | Description | -|--|--|-| -| `id` | `string` | A unique identifier for the completion. | -| `choices` | `array` | The list of completion choices the model generated for the input messages. | -| `created` | `integer` | The Unix timestamp (in seconds) of when the completion was created. | -| `model` | `string` | The model_id used for completion. | -| `object` | `string` | The object type, which is always `chat.completion`. | -| `usage` | `object` | Usage statistics for the completion request. | +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. -> [!TIP] -> In the streaming mode, for each chunk of response, `finish_reason` is always `null`, except from the last one which is terminated by a payload `[DONE]`. In each `choices` object, the key for `messages` is changed by `delta`. +You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -The `choices` object is a dictionary with the following fields. +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` -| Key | Type | Description | -||--|--| -| `index` | `integer` | Choice index. When `best_of` > 1, the index in this array might not be in order and might not be `0` to `n-1`. | -| `messages` or `delta` | `string` | Chat completion result in `messages` object. When streaming mode is used, `delta` key is used. | -| `finish_reason` | `string` | The reason the model stopped generating tokens: <br>- `stop`: model hit a natural stop point or a provided stop sequence. <br>- `length`: if max number of tokens have been reached. <br>- `content_filter`: When RAI moderates and CMP forces moderation <br>- `content_filter_error`: an error during moderation and wasn't able to make decision on the response <br>- `null`: API response still in progress or incomplete.| -| `logprobs` | `object` | The log probabilities of the generated tokens in the output text. | +You can visualize how streaming generates content: -The `usage` object is a dictionary with the following fields. 
+```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Meta-Llama-3.1-405B-Instruct", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` ++The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. + -| Key | Type | Value | -||--|--| -| `prompt_tokens` | `integer` | Number of tokens in the prompt. | -| `completion_tokens` | `integer` | Number of tokens generated in the completion. | -| `total_tokens` | `integer` | Total tokens. | +```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Meta-Llama-3.1-405B-Instruct", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` -The `logprobs` object is a dictionary with the following fields: +#### Explore more parameters supported by the inference client -| Key | Type | Value | -||-|| -| `text_offsets` | `array` of `integers` | The position or index of each token in the completion output. | -| `token_logprobs` | `array` of `float` | Selected `logprobs` from dictionary in `top_logprobs` array. | -| `tokens` | `array` of `string` | Selected tokens. | -| `top_logprobs` | `array` of `dictionary` | Array of dictionary. In each dictionary, the key is the token and the value is the prob. | +Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). -#### Example +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` -The following is an example response: ```json {- "id": "12345678-1234-1234-1234-abcdefghijkl", + "id": "0a1234b5de6789f01gh2i345j6789klm", "object": "chat.completion",- "created": 2012359, - "model": "", + "created": 1718726686, + "model": "Meta-Llama-3.1-405B-Instruct", "choices": [ { "index": 0,- "finish_reason": "stop", "message": { "role": "assistant",- "content": "Sure, I\'d be happy to help! The translation of ""I love programming"" from English to Italian is:\n\n""Amo la programmazione.""\n\nHere\'s a breakdown of the translation:\n\n* ""I love"" in English becomes ""Amo"" in Italian.\n* ""programming"" in English becomes ""la programmazione"" in Italian.\n\nI hope that helps! Let me know if you have any other sentences you\'d like me to translate." - } + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. 
It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null } ], "usage": {- "prompt_tokens": 10, - "total_tokens": 40, - "completion_tokens": 30 + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 } } ```- -## Deploy Meta Llama models to managed compute -Apart from deploying with the pay-as-you-go managed service, you can also deploy Meta Llama 3.1 models to managed compute in AI Studio. When deployed to managed compute, you can select all the details about the infrastructure running the model, including the virtual machines to use and the number of instances to handle the load you're expecting. Models deployed to managed compute consume quota from your subscription. The following models from the 3.1 release wave are available on managed compute: -- `Meta-Llama-3.1-8B-Instruct` (FT supported)-- `Meta-Llama-3.1-70B-Instruct` (FT supported)-- `Meta-Llama-3.1-8B` (FT supported)-- `Meta-Llama-3.1-70B` (FT supported)-- `Llama Guard 3 8B`-- `Prompt Guard`+> [!WARNING] +> Meta Llama doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). -Follow these steps to deploy a model such as `Meta-Llama-3.1-70B-Instruct ` to a managed compute in [Azure AI Studio](https://ai.azure.com). +### Pass extra parameters to the model -1. Choose the model you want to deploy from the Azure AI Studio [model catalog](https://ai.azure.com/explore/models). +The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. - Alternatively, you can initiate deployment by starting from your project in AI Studio. Select your project and then select **Deployments** > **+ Create**. +Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. -1. On the model's **Details** page, select **Deploy** next to the **View license** button. +```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` - :::image type="content" source="../media/deploy-monitor/llama/deploy-real-time-endpoint.png" alt-text="A screenshot showing how to deploy a model with the managed compute option." lightbox="../media/deploy-monitor/llama/deploy-real-time-endpoint.png"::: -1. On the **Deploy with Azure AI Content Safety (preview)** page, select **Skip Azure AI Content Safety** so that you can continue to deploy the model using the UI. +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." 
+ }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` - > [!TIP] - > In general, we recommend that you select **Enable Azure AI Content Safety (Recommended)** for deployment of the Llama model. This deployment option is currently only supported using the Python SDK and it happens in a notebook. +The following extra parameters can be passed to Meta Llama chat models: -1. Select **Proceed**. -1. Select the project where you want to create a deployment. +| Name | Description | Type | +| -- | | | +| `n` | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` | +| `best_of` | Generates best_of completions server-side and returns the *best* (the one with the lowest log probability per token). Results can't be streamed. When used with `n`, best_of controls the number of candidate completions and n specifies how many to returnΓÇöbest_of must be greater than `n`. Note: Because this parameter generates many completions, it can quickly consume your token quota. | `integer` | +| `logprobs` | A number indicating to include the log probabilities on the logprobs most likely tokens and the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. the API will always return the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. | `integer` | +| `ignore_eos` | Whether to ignore the `EOS` token and continue generating tokens after the `EOS` token is generated. | `boolean` | +| `use_beam_search` | Whether to use beam search instead of sampling. In such case, `best_of` must be greater than 1 and temperature must be 0. | `boolean` | +| `stop_token_ids` | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. | `array` | +| `skip_special_tokens` | Whether to skip special tokens in the output. | `boolean` | - > [!TIP] - > If you don't have enough quota available in the selected project, you can use the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours**. - -1. Select the **Virtual machine** and the **Instance count** that you want to assign to the deployment. -1. Select if you want to create this deployment as part of a new endpoint or an existing one. Endpoints can host multiple deployments while keeping resource configuration exclusive for each of them. Deployments under the same endpoint share the endpoint URI and its access keys. - -1. Indicate if you want to enable **Inferencing data collection (preview)**. +### Apply content safety -1. Select **Deploy**. After a few moments, the endpoint's **Details** page opens up. +The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. -1. Wait for the endpoint creation and deployment to finish. This step can take a few minutes. +The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. -1. 
Select the **Consume** tab of the deployment to obtain code samples that can be used to consume the deployed model in your application. -### Consume Llama 2 models deployed to managed compute +```json +{ + "messages": [ + { + "role": "system", + "content": "You are an AI assistant that helps people find information." + }, + { + "role": "user", + "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + } + ] +} +``` +++```json +{ + "error": { + "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.", + "type": null, + "param": "prompt", + "code": "content_filter", + "status": 400 + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). -For reference about how to invoke Llama models deployed to managed compute, see the model's card in the Azure AI Studio [model catalog](../how-to/model-catalog-overview.md). Each model's card has an overview page that includes a description of the model, samples for code-based inferencing, fine-tuning, and model evaluation. +> [!NOTE] +> Azure AI content safety is only available for models deployed as serverless API endpoints. -##### More inference examples -| **Package** | **Sample Notebook** | -|-|-| -| CLI using CURL and Python web requests | [webrequests.ipynb](https://aka.ms/meta-llama-3.1-405B-instruct-webrequests)| -| OpenAI SDK (experimental) | [openaisdk.ipynb](https://aka.ms/meta-llama-3.1-405B-instruct-openai)| -| LangChain | [langchain.ipynb](https://aka.ms/meta-llama-3.1-405B-instruct-langchain)| -| LiteLLM SDK | [litellm.ipynb](https://aka.ms/meta-llama-3.1-405B-instruct-litellm) | +## More inference examples -## Cost and quotas +For more examples of how to use Meta Llama, see the following examples and tutorials: -### Cost and quota considerations for Meta Llama 3.1 models deployed as a service +| Description | Language | Sample | +|-|-|- | +| CURL request | Bash | [Link](https://aka.ms/meta-llama-3.1-405B-instruct-webrequests) | +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | +| Python web requests | Python | [Link](https://aka.ms/meta-llama-3.1-405B-instruct-webrequests) | +| OpenAI SDK (experimental) | Python | [Link](https://aka.ms/meta-llama-3.1-405B-instruct-openai) | +| LangChain | Python | [Link](https://aka.ms/meta-llama-3.1-405B-instruct-langchain) | +| LiteLLM | Python | [Link](https://aka.ms/meta-llama-3.1-405B-instruct-litellm) | -Meta Llama 3.1 models deployed as a service are offered by Meta through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying or [fine-tuning the models](./fine-tune-model-llama.md). +## Cost and quota considerations for Meta Llama family of models deployed as serverless API endpoints -Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference and fine-tuning; however, multiple meters are available to track each scenario independently. +Quota is managed per deployment. 
Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. -For more information on how to track costs, see [monitor costs for models offered throughout the Azure Marketplace](./costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). +Meta Llama models deployed as a serverless API are offered by Meta through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. +Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. -Quota is managed per deployment. Each deployment has a rate limit of 400,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. +For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). -### Cost and quota considerations for Meta Llama 3.1 models deployed as managed compute +## Cost and quota considerations for Meta Llama family of models deployed to managed compute -For deployment and inferencing of Meta Llama 3.1 models with managed compute, you consume virtual machine (VM) core quota that is assigned to your subscription on a per-region basis. When you sign up for Azure AI Studio, you receive a default VM quota for several VM families available in the region. You can continue to create deployments until you reach your quota limit. Once you reach this limit, you can request a quota increase. +Meta Llama models deployed to managed compute are billed based on core hours of the associated compute instance. The cost of the compute instance is determined by the size of the instance, the number of instances running, and the run duration. -## Content filtering +It is a good practice to start with a low number of instances and scale up as needed. You can monitor the cost of the compute instance in the Azure portal. -Models deployed as a serverless API with pay-as-you-go are protected by Azure AI Content Safety. When deployed to managed compute, you can opt out of this capability. With Azure AI content safety enabled, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about [Azure AI Content Safety](../concepts/content-filtering.md). 
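The managed-compute billing described above (instance size, instance count, and run duration) can be estimated with simple arithmetic. The following sketch uses a hypothetical hourly VM rate; the actual rate depends on the VM size and region, so check the Azure pricing page for real figures.

```javascript
// Rough cost estimate for a managed compute deployment.
// hourlyRatePerInstance is a hypothetical placeholder; substitute the actual
// price for your VM size and region from the Azure pricing page.
function estimateManagedComputeCost(instanceCount, hoursRunning, hourlyRatePerInstance) {
  return instanceCount * hoursRunning * hourlyRatePerInstance;
}

// Example: 2 instances running for 24 hours at a hypothetical $1.20/hour each.
console.log(estimateManagedComputeCost(2, 24, 1.2)); // 57.6 (USD)
```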
+## Related content -## Next steps -- [What is Azure AI Studio?](../what-is-ai-studio.md)-- [Fine-tune a Meta Llama 3.1 models in Azure AI Studio](fine-tune-model-llama.md)-- [Azure AI FAQ article](../faq.yml)-- [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md)+* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Mistral Nemo | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-mistral-nemo.md | + + Title: How to use Mistral Nemo chat model with Azure AI Studio ++description: Learn how to use Mistral Nemo chat model with Azure AI Studio. +++ Last updated : 08/08/2024++reviewer: fkriti ++++zone_pivot_groups: azure-ai-model-catalog-samples-chat +++# How to use Mistral Nemo chat model ++In this article, you learn about Mistral Nemo chat model and how to use them. +Mistral AI offers two categories of models. Premium models including [Mistral Large and Mistral Small](deploy-models-mistral.md), available as serverless APIs with pay-as-you-go token-based billing. Open models including [Mistral Nemo](deploy-models-mistral-nemo.md), [Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01](deploy-models-mistral-open.md); available to also download and run on self-hosted managed endpoints. ++++## Mistral Nemo chat model ++Mistral Nemo is a cutting-edge Language Model (LLM) boasting state-of-the-art reasoning, world knowledge, and coding capabilities within its size category. ++Mistral Nemo is a 12B model, making it a powerful drop-in replacement for any system using Mistral 7B, which it supersedes. It supports a context length of 128K, and it accepts only text inputs and generates text outputs. ++Additionally, Mistral Nemo is: ++* **Jointly developed with Nvidia**. This collaboration has resulted in a powerful 12B model that pushes the boundaries of language understanding and generation. +* **Multilingual proficient**. Mistral Nemo is equipped with a tokenizer called Tekken, which is designed for multilingual applications. It supports over 100 languages, such as English, French, German, and Spanish. Tekken is more efficient than the Llama 3 tokenizer in compressing text for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages. +* **Agent-centric**. Mistral Nemo possesses top-tier agentic capabilities, including native function calling and JSON outputting. +* **Advanced in reasoning**. Mistral Nemo demonstrates state-of-the-art mathematical and reasoning capabilities within its size category. +++You can learn more about the models in their respective model card: ++* [Mistral-Nemo](https://aka.ms/azureai/landing/Mistral-Nemo) +++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral Nemo chat model with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral Nemo chat model can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). 
++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral Nemo chat model. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```python +model_info = client.get_model_info() +``` ++The response is as follows: +++```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: Mistral-Nemo +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```python +from azure.ai.inference.models import SystemMessage, UserMessage ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` ++The response is as follows, where you can see the model's usage statistics: +++```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. 
However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Mistral-Nemo +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. + """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral Nemo chat model can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant that always generate responses in JSON format, using." 
+ " the following format: { ""answer"": ""response"" }."), + UserMessage(content="How many languages are in the world?"), + ], + response_format={ "type": ChatCompletionsResponseFormat.JSON_OBJECT } +) +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++The following extra parameters can be passed to Mistral Nemo chat model: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral Nemo chat model support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "safe_mode": True + } +) +``` ++### Use tools ++Mistral Nemo chat model support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```python +from azure.ai.inference.models import FunctionDefinition, ChatCompletionsFunctionToolDefinition ++flight_info = ChatCompletionsFunctionToolDefinition( + function=FunctionDefinition( + name="get_flight_info", + description="Returns information about the next flight between two cities. 
This includes the name of the airline, flight number and the date and time of the next flight", + parameters={ + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates", + }, + "destination_city": { + "type": "string", + "description": "The flight destination city", + }, + }, + "required": ["origin_city", "destination_city"], + }, + ) +) ++tools = [flight_info] +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```python +def get_flight_info(loc_origin: str, loc_destination: str): + return { + "info": f"There are no flights available from {loc_origin} to {loc_destination}. You should take a train, specially if it helps to reduce CO2 emissions." + } +``` ++Prompt the model to book flights with the help of this function: +++```python +messages = [ + SystemMessage( + content="You are a helpful assistant that help users to find information about traveling, how to get" + " to places and the different transportations options. You care about the environment and you" + " always have that in mind when answering inqueries.", + ), + UserMessage( + content="When is the next flight from Miami to Seattle?", + ), +] ++response = client.complete( + messages=messages, tools=tools, tool_choice="auto" +) +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```python +response_message = response.choices[0].message +tool_calls = response_message.tool_calls ++print("Finish reason:", response.choices[0].finish_reason) +print("Tool call:", tool_calls) +``` ++To continue, append this message to the chat history: +++```python +messages.append( + response_message +) +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```python +import json +from azure.ai.inference.models import ToolMessage ++for tool_call in tool_calls: ++ # Get the tool details: ++ function_name = tool_call.function.name + function_args = json.loads(tool_call.function.arguments.replace("\'", "\"")) + tool_call_id = tool_call.id ++ print(f"Calling function `{function_name}` with arguments {function_args}") ++ # Call the function defined above using `locals()`, which returns the list of all functions + # available in the scope as a dictionary. Notice that this is just done as a simple way to get + # the function callable from its string name. Then we can call it with the corresponding + # arguments. ++ callable_func = locals()[function_name] + function_response = callable_func(**function_args) ++ print("->", function_response) ++ # Once we have a response from the function and its arguments, we can append a new message to the chat + # history. 
Notice how we are telling to the model that this chat message came from a tool: ++ messages.append( + ToolMessage( + tool_call_id=tool_call_id, + content=json.dumps(function_response) + ) + ) +``` ++View the response from the model: +++```python +response = client.complete( + messages=messages, + tools=tools, +) +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Mistral Nemo chat model ++Mistral Nemo is a cutting-edge Language Model (LLM) boasting state-of-the-art reasoning, world knowledge, and coding capabilities within its size category. ++Mistral Nemo is a 12B model, making it a powerful drop-in replacement for any system using Mistral 7B, which it supersedes. It supports a context length of 128K, and it accepts only text inputs and generates text outputs. ++Additionally, Mistral Nemo is: ++* **Jointly developed with Nvidia**. This collaboration has resulted in a powerful 12B model that pushes the boundaries of language understanding and generation. +* **Multilingual proficient**. Mistral Nemo is equipped with a tokenizer called Tekken, which is designed for multilingual applications. It supports over 100 languages, such as English, French, German, and Spanish. Tekken is more efficient than the Llama 3 tokenizer in compressing text for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages. +* **Agent-centric**. Mistral Nemo possesses top-tier agentic capabilities, including native function calling and JSON outputting. +* **Advanced in reasoning**. Mistral Nemo demonstrates state-of-the-art mathematical and reasoning capabilities within its size category. +++You can learn more about the models in their respective model card: ++* [Mistral-Nemo](https://aka.ms/azureai/landing/Mistral-Nemo) +++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. 
To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral Nemo chat model with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral Nemo chat model can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral Nemo chat model. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: Mistral-Nemo +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Mistral-Nemo +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. 
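The streaming snippet that follows parses the response with a `createSseStream` helper. That helper isn't part of `@azure-rest/ai-inference` itself; the sketch below assumes it comes from the `@azure/core-sse` package, so install and import it alongside the inference package:

```javascript
// Assumption: the server-sent events helper comes from @azure/core-sse.
// Install it next to @azure-rest/ai-inference, for example:
//   npm install @azure/core-sse
import { createSseStream } from "@azure/core-sse";
```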
++You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? ""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + presence_penalty: "0.1", + frequency_penalty: "0.8", + max_tokens: 2048, + stop: ["<|endoftext|>"], + temperature: 0, + top_p: 1, + response_format: { type: "text" }, + } +}); +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral Nemo chat model can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant that always generate responses in JSON format, using." + + " the following format: { \"answer\": \"response\" }." }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + response_format: { type: "json_object" } + } +}); +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" 
},
+];
++var response = await client.path("/chat/completions").post({
+ headers: {
+ "extra-parameters": "pass-through"
+ },
+ body: {
+ messages: messages,
+ logprobs: true
+ }
+});
+```
++The following extra parameters can be passed to the Mistral Nemo chat model:
++| Name | Description | Type |
+| -- | | |
+| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` |
+| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` |
+++### Safe mode
++The Mistral Nemo chat model supports the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt:
++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
++The Azure AI Model Inference API allows you to pass this extra parameter as follows:
+++```javascript
+var messages = [
+ { role: "system", content: "You are a helpful assistant" },
+ { role: "user", content: "How many languages are in the world?" },
+];
++var response = await client.path("/chat/completions").post({
+ headers: {
+ "extra-parameters": "pass-through"
+ },
+ body: {
+ messages: messages,
+ safe_mode: true
+ }
+});
+```
++### Use tools
++The Mistral Nemo chat model supports the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way.
++The following code example creates a tool definition that can look up flight information for flights between two different cities.
+++```javascript
+const flight_info = {
+ name: "get_flight_info",
+ description: "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight",
+ parameters: {
+ type: "object",
+ properties: {
+ origin_city: {
+ type: "string",
+ description: "The name of the city where the flight originates",
+ },
+ destination_city: {
+ type: "string",
+ description: "The flight destination city",
+ },
+ },
+ required: ["origin_city", "destination_city"],
+ },
+}
++const tools = [
+ {
+ type: "function",
+ function: flight_info,
+ },
+];
+```
++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train.
+++```javascript
+function get_flight_info(loc_origin, loc_destination) {
+ return {
+ info: "There are no flights available from " + loc_origin + " to " + loc_destination + ". You should take a train, especially if it helps to reduce CO2 emissions."
+ }
+}
+```
++Prompt the model to book flights with the help of this function:
+++```javascript
+var result = await client.path("/chat/completions").post({
+ body: {
+ messages: messages,
+ tools: tools,
+ tool_choice: "auto"
+ }
+});
+```
++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. 
+++```javascript +const response_message = response.body.choices[0].message; +const tool_calls = response_message.tool_calls; ++console.log("Finish reason: " + response.body.choices[0].finish_reason); +console.log("Tool call: " + tool_calls); +``` ++To continue, append this message to the chat history: +++```javascript +messages.push(response_message); +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```javascript +function applyToolCall({ function: call, id }) { + // Get the tool details: + const tool_params = JSON.parse(call.arguments); + console.log("Calling function " + call.name + " with arguments " + tool_params); ++ // Call the function defined above using `window`, which returns the list of all functions + // available in the scope as a dictionary. Notice that this is just done as a simple way to get + // the function callable from its string name. Then we can call it with the corresponding + // arguments. + const function_response = tool_params.map(window[call.name]); + console.log("-> " + function_response); ++ return function_response +} ++for (const tool_call of tool_calls) { + var tool_response = tool_call.apply(applyToolCall); ++ messages.push( + { + role: "tool", + tool_call_id: tool_call.id, + content: tool_response + } + ); +} +``` ++View the response from the model: +++```javascript +var result = await client.path("/chat/completions").post({ + body: { + messages: messages, + tools: tools, + } +}); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```javascript +try { + var messages = [ + { role: "system", content: "You are an AI assistant that helps people find information." }, + { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." }, + ]; ++ var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Mistral Nemo chat model ++Mistral Nemo is a cutting-edge Language Model (LLM) boasting state-of-the-art reasoning, world knowledge, and coding capabilities within its size category. ++Mistral Nemo is a 12B model, making it a powerful drop-in replacement for any system using Mistral 7B, which it supersedes. 
It supports a context length of 128K, and it accepts only text inputs and generates text outputs. ++Additionally, Mistral Nemo is: ++* **Jointly developed with Nvidia**. This collaboration has resulted in a powerful 12B model that pushes the boundaries of language understanding and generation. +* **Multilingual proficient**. Mistral Nemo is equipped with a tokenizer called Tekken, which is designed for multilingual applications. It supports over 100 languages, such as English, French, German, and Spanish. Tekken is more efficient than the Llama 3 tokenizer in compressing text for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages. +* **Agent-centric**. Mistral Nemo possesses top-tier agentic capabilities, including native function calling and JSON outputting. +* **Advanced in reasoning**. Mistral Nemo demonstrates state-of-the-art mathematical and reasoning capabilities within its size category. +++You can learn more about the models in their respective model card: ++* [Mistral-Nemo](https://aka.ms/azureai/landing/Mistral-Nemo) +++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral Nemo chat model with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral Nemo chat model can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). 
To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral Nemo chat model. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: Mistral-Nemo +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Mistral-Nemo +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. 
++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.") + }, + MaxTokens=4096 + }; ++ StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); +} +``` ++To stream completions, use `CompleteStreamingAsync` method when you call the model. Notice that in this example we the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. ++```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) + { + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); + } + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } +} +``` ++You can visualize how streaming generates content: +++```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral Nemo chat model can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. 
+++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage( + "You are a helpful assistant that always generate responses in JSON format, " + + "using. the following format: { \"answer\": \"response\" }." + ), + new ChatRequestUserMessage( + "How many languages are in the world?" + ) + }, + ResponseFormat = new ChatCompletionsResponseFormatJSON() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++The following extra parameters can be passed to Mistral Nemo chat model: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral Nemo chat model support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "safe_mode", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++### Use tools ++Mistral Nemo chat model support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. 
+++```csharp +FunctionDefinition flightInfoFunction = new FunctionDefinition("getFlightInfo") +{ + Description = "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + Parameters = BinaryData.FromObjectAsJson(new + { + Type = "object", + Properties = new + { + origin_city = new + { + Type = "string", + Description = "The name of the city where the flight originates" + }, + destination_city = new + { + Type = "string", + Description = "The flight destination city" + } + } + }, + new JsonSerializerOptions() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase } + ) +}; ++ChatCompletionsFunctionToolDefinition getFlightTool = new ChatCompletionsFunctionToolDefinition(flightInfoFunction); +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```csharp +static string getFlightInfo(string loc_origin, string loc_destination) +{ + return JsonSerializer.Serialize(new + { + info = $"There are no flights available from {loc_origin} to {loc_destination}. You " + + "should take a train, specially if it helps to reduce CO2 emissions." + }); +} +``` ++Prompt the model to book flights with the help of this function: +++```csharp +var chatHistory = new List<ChatRequestMessage>(){ + new ChatRequestSystemMessage( + "You are a helpful assistant that help users to find information about traveling, " + + "how to get to places and the different transportations options. You care about the" + + "environment and you always have that in mind when answering inqueries." + ), + new ChatRequestUserMessage("When is the next flight from Miami to Seattle?") + }; ++requestOptions = new ChatCompletionsOptions(chatHistory); +requestOptions.Tools.Add(getFlightTool); +requestOptions.ToolChoice = ChatCompletionsToolChoice.Auto; ++response = client.Complete(requestOptions); +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```csharp +var responseMenssage = response.Value.Choices[0].Message; +var toolsCall = responseMenssage.ToolCalls; ++Console.WriteLine($"Finish reason: {response.Value.Choices[0].FinishReason}"); +Console.WriteLine($"Tool call: {toolsCall[0].Id}"); +``` ++To continue, append this message to the chat history: +++```csharp +requestOptions.Messages.Add(new ChatRequestAssistantMessage(response.Value.Choices[0].Message)); +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```csharp +foreach (ChatCompletionsToolCall tool in toolsCall) +{ + if (tool is ChatCompletionsFunctionToolCall functionTool) + { + // Get the tool details: + string callId = functionTool.Id; + string toolName = functionTool.Name; + string toolArgumentsString = functionTool.Arguments; + Dictionary<string, object> toolArguments = JsonSerializer.Deserialize<Dictionary<string, object>>(toolArgumentsString); ++ // Here you have to call the function defined. In this particular example we use + // reflection to find the method we definied before in an static class called + // `ChatCompletionsExamples`. 
Using reflection allows us to call a function + // by string name. Notice that this is just done for demonstration purposes as a + // simple way to get the function callable from its string name. Then we can call + // it with the corresponding arguments. ++ var flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static; + string toolResponse = (string)typeof(ChatCompletionsExamples).GetMethod(toolName, flags).Invoke(null, toolArguments.Values.Cast<object>().ToArray()); ++ Console.WriteLine($"-> {toolResponse}"); + requestOptions.Messages.Add(new ChatRequestToolMessage(toolResponse, callId)); + } + else + throw new Exception("Unsupported tool type"); +} +``` ++View the response from the model: +++```csharp +response = client.Complete(requestOptions); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```csharp +try +{ + requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are an AI assistant that helps people find information."), + new ChatRequestUserMessage( + "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + ), + }, + }; ++ response = client.Complete(requestOptions); + Console.WriteLine(response.Value.Choices[0].Message.Content); +} +catch (RequestFailedException ex) +{ + if (ex.ErrorCode == "content_filter") + { + Console.WriteLine($"Your query has triggered Azure AI Content Safety: {ex.Message}"); + } + else + { + throw; + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Mistral Nemo chat model ++Mistral Nemo is a cutting-edge Language Model (LLM) boasting state-of-the-art reasoning, world knowledge, and coding capabilities within its size category. ++Mistral Nemo is a 12B model, making it a powerful drop-in replacement for any system using Mistral 7B, which it supersedes. It supports a context length of 128K, and it accepts only text inputs and generates text outputs. ++Additionally, Mistral Nemo is: ++* **Jointly developed with Nvidia**. This collaboration has resulted in a powerful 12B model that pushes the boundaries of language understanding and generation. +* **Multilingual proficient**. Mistral Nemo is equipped with a tokenizer called Tekken, which is designed for multilingual applications. It supports over 100 languages, such as English, French, German, and Spanish. Tekken is more efficient than the Llama 3 tokenizer in compressing text for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages. +* **Agent-centric**. Mistral Nemo possesses top-tier agentic capabilities, including native function calling and JSON outputting. +* **Advanced in reasoning**.
Mistral Nemo demonstrates state-of-the-art mathematical and reasoning capabilities within its size category. +++You can learn more about the models in their respective model card: ++* [Mistral-Nemo](https://aka.ms/azureai/landing/Mistral-Nemo) +++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral Nemo chat model with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral Nemo chat model can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### A REST client ++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: ++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name`` is your unique model deployment host name and `your-azure-region`` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral Nemo chat model. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` ++The response is as follows: +++```json +{ + "model_name": "Mistral-Nemo", + "model_type": "chat-completions", + "model_provider_name": "MistralAI" +} +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" 
+ } + ] +} +``` ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` ++You can visualize how streaming generates content: +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` ++The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" 
+ } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral Nemo chat model can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant that always generate responses in JSON format, using the following format: { \"answer\": \"response\" }" + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "response_format": { "type": "json_object" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718727522, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "{\"answer\": \"There are approximately 7,117 living languages in the world today, according to the latest estimates. However, this number can vary as some languages become extinct and others are newly discovered or classified.\"}", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 39, + "total_tokens": 87, + "completion_tokens": 48 + } +} +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. 
++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` ++The following extra parameters can be passed to Mistral Nemo chat model: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral Nemo chat model support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: ++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "safemode": true +} +``` ++### Use tools ++Mistral Nemo chat model support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```json +{ + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + "parameters": { + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } + }, + "required": [ + "origin_city", + "destination_city" + ] + } + } +} +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. ++Prompt the model to book flights with the help of this function: +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries" + }, + { + "role": "user", + "content": "When is the next flight from Miami to Seattle?" + } + ], + "tool_choice": "auto", + "tools": [ + { + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. 
This includes the name of the airline, flight number and the date and time of the next flight", + "parameters": { + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } + }, + "required": [ + "origin_city", + "destination_city" + ] + } + } + } + ] +} +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine if the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726007, + "model": "Mistral-Nemo", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "", + "tool_calls": [ + { + "id": "abc0dF1gh", + "type": "function", + "function": { + "name": "get_flight_info", + "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}", + "call_id": null + } + } + ] + }, + "finish_reason": "tool_calls", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 190, + "total_tokens": 226, + "completion_tokens": 36 + } +} +``` ++To continue, append this message to the chat history: ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. ++View the response from the model: +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant that help users to find information about traveling, how to get to places and the different transportations options. You care about the environment and you always have that in mind when answering inqueries" + }, + { + "role": "user", + "content": "When is the next flight from Miami to Seattle?" + }, + { + "role": "assistant", + "content": "", + "tool_calls": [ + { + "id": "abc0DeFgH", + "type": "function", + "function": { + "name": "get_flight_info", + "arguments": "{\"origin_city\": \"Miami\", \"destination_city\": \"Seattle\"}", + "call_id": null + } + } + ] + }, + { + "role": "tool", + "content": "{ \"info\": \"There are no flights available from Miami to Seattle. You should take a train, specially if it helps to reduce CO2 emissions.\" }", + "tool_call_id": "abc0DeFgH" + } + ], + "tool_choice": "auto", + "tools": [ + { + "type": "function", + "function": { + "name": "get_flight_info", + "description": "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + "parameters":{ + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates" + }, + "destination_city": { + "type": "string", + "description": "The flight destination city" + } + }, + "required": ["origin_city", "destination_city"] + } + } + } + ] +} +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. 
The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are an AI assistant that helps people find information." + }, + { + "role": "user", + "content": "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + } + ] +} +``` +++```json +{ + "error": { + "message": "The response was filtered due to the prompt triggering Microsoft's content management policy. Please modify your prompt and retry.", + "type": null, + "param": "prompt", + "code": "content_filter", + "status": 400 + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++## More inference examples ++For more examples of how to use Mistral, see the following examples and tutorials: ++| Description | Language | Sample | +|-|-|--| +| CURL request | Bash | [Link](https://aka.ms/mistral-large/webrequests-sample) | +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | +| Python web requests | Python | [Link](https://aka.ms/mistral-large/webrequests-sample) | +| OpenAI SDK (experimental) | Python | [Link](https://aka.ms/mistral-large/openaisdk) | +| LangChain | Python | [Link](https://aka.ms/mistral-large/langchain-sample) | +| Mistral AI | Python | [Link](https://aka.ms/mistral-large/mistralai-sample) | +| LiteLLM | Python | [Link](https://aka.ms/mistral-large/litellm-sample) | +++## Cost and quota considerations for Mistral family of models deployed as serverless API endpoints ++Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios. ++Mistral models deployed as a serverless API are offered by MistralAI through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying the model. ++Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference; however, multiple meters are available to track each scenario independently. ++For more information on how to track costs, see [Monitor costs for models offered through the Azure Marketplace](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace). 
++## Related content +++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Mistral Open | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-mistral-open.md | + + Title: How to use Mistral-7B and Mixtral chat models with Azure AI Studio ++description: Learn how to use Mistral-7B and Mixtral chat models with Azure AI Studio. +++ Last updated : 08/08/2024++reviewer: fkriti ++++zone_pivot_groups: azure-ai-model-catalog-samples-chat +++# How to use Mistral-7B and Mixtral chat models ++In this article, you learn about Mistral-7B and Mixtral chat models and how to use them. +Mistral AI offers two categories of models: premium models, including [Mistral Large and Mistral Small](deploy-models-mistral.md), which are available as serverless APIs with pay-as-you-go token-based billing; and open models, including [Mistral Nemo](deploy-models-mistral-nemo.md) and [Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01](deploy-models-mistral-open.md), which you can also download and run on self-hosted managed endpoints. ++++## Mistral-7B and Mixtral chat models ++The Mistral-7B and Mixtral chat models include the following models: ++# [Mistral-7B-Instruct](#tab/mistral-7b-instruct) ++The Mistral-7B-Instruct Large Language Model (LLM) is an instruct, fine-tuned version of the Mistral-7B, a transformer model with the following architecture choices: ++* Grouped-Query Attention +* Sliding-Window Attention +* Byte-fallback BPE tokenizer +++The following models are available: ++* [mistralai-Mistral-7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v01) +* [mistralai-Mistral-7B-Instruct-v02](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v02) +++# [Mixtral-8x7B-Instruct](#tab/mistral-8x7B-instruct) ++The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks with 6x faster inference. ++Mixtral-8x7B-v0.1 is a decoder-only model with eight distinct groups of parameters, or "experts". At every layer, for every token, a router network chooses two of these experts to process the token and combines their output additively. Mixtral has 46.7B total parameters but only uses 12.9B parameters per token with this technique; therefore, the model can perform with the same speed and cost as a 12.9B model. +++The following models are available: ++* [mistralai-Mixtral-8x7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mixtral-8x7B-Instruct-v01) +++# [Mixtral-8x22B-Instruct](#tab/mistral-8x22b-instruct) ++The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct, fine-tuned version of the Mixtral-8x22B-v0.1. The model is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. ++Mixtral 8x22B comes with the following strengths: ++* Fluent in English, French, Italian, German, and Spanish +* Strong mathematics and coding capabilities +* Natively capable of function calling; along with the constrained output mode implemented on la Plateforme, this enables application development and tech stack modernization at scale +* Precise information recall from large documents, due to its 64K-token context window +++The following models are available: ++* [mistralai-Mixtral-8x22B-Instruct-v0-1](https://aka.ms/azureai/landing/mistralai-Mixtral-8x22B-Instruct-v0-1) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model.
To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral-7B and Mixtral chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to a self-hosted managed compute** ++Mistral-7B and Mixtral chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral-7B and Mixtral chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.identity import DefaultAzureCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=DefaultAzureCredential(), +) +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```python +model_info = client.get_model_info() +``` ++The response is as follows: +++```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: mistralai-Mistral-7B-Instruct-v01 +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```python +from azure.ai.inference.models import SystemMessage, UserMessage ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` ++> [!NOTE] +> mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02 and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (`role="system"`). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence. ++The response is as follows, where you can see the model's usage statistics: +++```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: mistralai-Mistral-7B-Instruct-v01 +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. 
+ """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++> [!WARNING] +> Mistral doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++The following extra parameters can be passed to Mistral-7B and Mixtral chat models: ++| Name | Description | Type | +| -- | | | +| `logit_bias` | Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. | `float` | +| `logprobs` | Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`. | `int` | +| `top_logprobs` | An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used. | `float` | +| `n` | How many chat completion choices to generate for each input message. 
Note that you will be charged based on the number of generated tokens across all of the choices. | `int` | ++++++## Mistral-7B and Mixtral chat models ++The Mistral-7B and Mixtral chat models include the following models: ++# [Mistral-7B-Instruct](#tab/mistral-7b-instruct) ++The Mistral-7B-Instruct Large Language Model (LLM) is an instruct, fine-tuned version of the Mistral-7B, a transformer model with the following architecture choices: ++* Grouped-Query Attention +* Sliding-Window Attention +* Byte-fallback BPE tokenizer +++The following models are available: ++* [mistralai-Mistral-7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v01) +* [mistralai-Mistral-7B-Instruct-v02](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v02) +++# [Mixtral-8x7B-Instruct](#tab/mistral-8x7B-instruct) ++The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks with 6x faster inference. ++Mixtral-8x7B-v0.1 is a decoder-only model with eight distinct groups or the "experts". At every layer, for every token, a router network chooses two of these experts to process the token and combine their output additively. Mixtral has 46.7B total parameters but only uses 12.9B parameters per token with this technique; therefore, the model can perform with the same speed and cost as a 12.9B model. +++The following models are available: ++* [mistralai-Mixtral-8x7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mixtral-8x7B-Instruct-v01) +++# [Mixtral-8x22B-Instruct](#tab/mistral-8x22b-instruct) ++The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct, fine-tuned version of the Mixtral-8x22B-v0.1. The model is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. ++Mixtral 8x22B comes with the following strengths: ++* Fluent in English, French, Italian, German, and Spanish +* Strong mathematics and coding capabilities +* Natively capable of function calling; along with the constrained output mode implemented on la Plateforme, this enables application development and tech stack modernization at scale +* Pprecise information recall from large documents, due to its 64K tokens context window +++The following models are available: ++* [mistralai-Mixtral-8x22B-Instruct-v0-1](https://aka.ms/azureai/landing/mistralai-Mixtral-8x22B-Instruct-v0-1) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral-7B and Mixtral chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to a self-hosted managed compute** ++Mistral-7B and Mixtral chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. 
If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral-7B and Mixtral chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { DefaultAzureCredential } from "@azure/identity"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new DefaultAzureCredential() +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: mistralai-Mistral-7B-Instruct-v01 +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. 
++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++> [!NOTE] +> mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02 and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (`role="system"`). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence. ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: mistralai-Mistral-7B-Instruct-v01 +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. ++You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? 
""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + presence_penalty: "0.1", + frequency_penalty: "0.8", + max_tokens: 2048, + stop: ["<|endoftext|>"], + temperature: 0, + top_p: 1, + response_format: { type: "text" }, + } +}); +``` ++> [!WARNING] +> Mistral doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + headers: { + "extra-params": "pass-through" + }, + body: { + messages: messages, + logprobs: true + } +}); +``` ++The following extra parameters can be passed to Mistral-7B and Mixtral chat models: ++| Name | Description | Type | +| -- | | | +| `logit_bias` | Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. | `float` | +| `logprobs` | Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`. | `int` | +| `top_logprobs` | An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used. | `float` | +| `n` | How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. 
| `int` | ++++++## Mistral-7B and Mixtral chat models ++The Mistral-7B and Mixtral chat models include the following models: ++# [Mistral-7B-Instruct](#tab/mistral-7b-instruct) ++The Mistral-7B-Instruct Large Language Model (LLM) is an instruct, fine-tuned version of the Mistral-7B, a transformer model with the following architecture choices: ++* Grouped-Query Attention +* Sliding-Window Attention +* Byte-fallback BPE tokenizer +++The following models are available: ++* [mistralai-Mistral-7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v01) +* [mistralai-Mistral-7B-Instruct-v02](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v02) +++# [Mixtral-8x7B-Instruct](#tab/mistral-8x7B-instruct) ++The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks with 6x faster inference. ++Mixtral-8x7B-v0.1 is a decoder-only model with eight distinct groups or the "experts". At every layer, for every token, a router network chooses two of these experts to process the token and combine their output additively. Mixtral has 46.7B total parameters but only uses 12.9B parameters per token with this technique; therefore, the model can perform with the same speed and cost as a 12.9B model. +++The following models are available: ++* [mistralai-Mixtral-8x7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mixtral-8x7B-Instruct-v01) +++# [Mixtral-8x22B-Instruct](#tab/mistral-8x22b-instruct) ++The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct, fine-tuned version of the Mixtral-8x22B-v0.1. The model is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. ++Mixtral 8x22B comes with the following strengths: ++* Fluent in English, French, Italian, German, and Spanish +* Strong mathematics and coding capabilities +* Natively capable of function calling; along with the constrained output mode implemented on la Plateforme, this enables application development and tech stack modernization at scale +* Pprecise information recall from large documents, due to its 64K tokens context window +++The following models are available: ++* [mistralai-Mixtral-8x22B-Instruct-v0-1](https://aka.ms/azureai/landing/mistralai-Mixtral-8x22B-Instruct-v0-1) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral-7B and Mixtral chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to a self-hosted managed compute** ++Mistral-7B and Mixtral chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. 
If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral-7B and Mixtral chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++When you deploy the model to a self-hosted online endpoint with **Microsoft Entra ID** support, you can use the following code snippet to create a client. +++```csharp +client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new DefaultAzureCredential(includeInteractiveCredentials: true) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: mistralai-Mistral-7B-Instruct-v01 +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++> [!NOTE] +> mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02 and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (`role="system"`). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence. ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: mistralai-Mistral-7B-Instruct-v01 +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? 
Write an essay about it.") + }, + MaxTokens=4096 + }; ++ StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++ await PrintStream(streamResponse); +} +``` ++To stream completions, use the `CompleteStreamingAsync` method when you call the model. Notice that in this example, the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. ++```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ + await foreach (StreamingChatCompletionsUpdate chatUpdate in response) + { + if (chatUpdate.Role.HasValue) + { + Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); + } + if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) + { + Console.Write(chatUpdate.ContentUpdate); + } + } +} +``` ++You can visualize how streaming generates content: +++```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + PresencePenalty = 0.1f, + FrequencyPenalty = 0.8f, + MaxTokens = 2048, + StopSequences = { "<|endoftext|>" }, + Temperature = 0, + NucleusSamplingFactor = 1, + ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++> [!WARNING] +> Mistral doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.
+++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++The following extra parameters can be passed to Mistral-7B and Mixtral chat models: ++| Name | Description | Type | +| -- | | | +| `logit_bias` | Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. | `float` | +| `logprobs` | Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`. | `int` | +| `top_logprobs` | An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used. | `float` | +| `n` | How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. | `int` | ++++++## Mistral-7B and Mixtral chat models ++The Mistral-7B and Mixtral chat models include the following models: ++# [Mistral-7B-Instruct](#tab/mistral-7b-instruct) ++The Mistral-7B-Instruct Large Language Model (LLM) is an instruct, fine-tuned version of the Mistral-7B, a transformer model with the following architecture choices: ++* Grouped-Query Attention +* Sliding-Window Attention +* Byte-fallback BPE tokenizer +++The following models are available: ++* [mistralai-Mistral-7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v01) +* [mistralai-Mistral-7B-Instruct-v02](https://aka.ms/azureai/landing/mistralai-Mistral-7B-Instruct-v02) +++# [Mixtral-8x7B-Instruct](#tab/mistral-8x7B-instruct) ++The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks with 6x faster inference. ++Mixtral-8x7B-v0.1 is a decoder-only model with eight distinct groups or the "experts". At every layer, for every token, a router network chooses two of these experts to process the token and combine their output additively. Mixtral has 46.7B total parameters but only uses 12.9B parameters per token with this technique; therefore, the model can perform with the same speed and cost as a 12.9B model. +++The following models are available: ++* [mistralai-Mixtral-8x7B-Instruct-v01](https://aka.ms/azureai/landing/mistralai-Mixtral-8x7B-Instruct-v01) +++# [Mixtral-8x22B-Instruct](#tab/mistral-8x22b-instruct) ++The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct, fine-tuned version of the Mixtral-8x22B-v0.1. The model is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size. 
++Mixtral 8x22B comes with the following strengths: ++* Fluent in English, French, Italian, German, and Spanish +* Strong mathematics and coding capabilities +* Natively capable of function calling; along with the constrained output mode implemented on la Plateforme, this enables application development and tech stack modernization at scale +* Precise information recall from large documents, due to its 64K-token context window +++The following models are available: ++* [mistralai-Mixtral-8x22B-Instruct-v0-1](https://aka.ms/azureai/landing/mistralai-Mixtral-8x22B-Instruct-v0-1) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section for code examples. ++## Prerequisites ++To use Mistral-7B and Mixtral chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to a self-hosted managed compute** ++Mistral-7B and Mixtral chat models can be deployed to our self-hosted managed inference solution, which allows you to customize and control all the details about how the model is served. ++For deployment to a self-hosted managed compute, you must have enough quota in your subscription. If you don't have enough quota available, you can use our temporary quota access by selecting the option **I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.** ++> [!div class="nextstepaction"] +> [Deploy the model to managed compute](../concepts/deployments-overview.md) ++### A REST client ++Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. To use the REST client, you need the following prerequisites: ++* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral-7B and Mixtral chat models. ++### Create a client to consume the model ++When you use a REST client, you don't create a client object. Instead, store the endpoint URL and key in environment variables and reference them in each request: the endpoint URL is the request host, and the key (or a Microsoft Entra ID token, if you deployed the model to a self-hosted online endpoint with **Microsoft Entra ID** support) is passed as a bearer token in the `Authorization` header, as shown in the examples in the following sections. ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint.
Return the model's information by calling the following method: ++```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` ++The response is as follows: +++```json +{ + "model_name": "mistralai-Mistral-7B-Instruct-v01", + "model_type": "chat-completions", + "model_provider_name": "MistralAI" +} +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` ++> [!NOTE] +> mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02 and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (`role="system"`). When you use the Azure AI model inference API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but it's important for you to verify that the model is following the instructions in the system message with the right level of confidence. ++The response is as follows, where you can see the model's usage statistics: +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "mistralai-Mistral-7B-Instruct-v01", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" 
+ } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` ++You can visualize how streaming generates content: +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "mistralai-Mistral-7B-Instruct-v01", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` ++The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. +++```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "mistralai-Mistral-7B-Instruct-v01", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` +++```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "mistralai-Mistral-7B-Instruct-v01", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` ++> [!WARNING] +> Mistral doesn't support JSON output formatting (`response_format = { "type": "json_object" }`). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON. ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. 
Read the model's documentation to understand which extra parameters are supported. ++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "logprobs": true +} +``` ++The following extra parameters can be passed to Mistral-7B and Mixtral chat models: ++| Name | Description | Type | +| -- | | | +| `logit_bias` | Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. | `float` | +| `logprobs` | Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`. | `int` | +| `top_logprobs` | An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used. | `float` | +| `n` | How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. | `int` | ++++## More inference examples ++For more examples of how to use Mistral, see the following examples and tutorials: ++| Description | Language | Sample | +|-|-|--| +| CURL request | Bash | [Link](https://aka.ms/mistral-large/webrequests-sample) | +| Azure AI Inference package for JavaScript | JavaScript | [Link](https://aka.ms/azsdk/azure-ai-inference/javascript/samples) | +| Azure AI Inference package for Python | Python | [Link](https://aka.ms/azsdk/azure-ai-inference/python/samples) | +| Python web requests | Python | [Link](https://aka.ms/mistral-large/webrequests-sample) | +| OpenAI SDK (experimental) | Python | [Link](https://aka.ms/mistral-large/openaisdk) | +| LangChain | Python | [Link](https://aka.ms/mistral-large/langchain-sample) | +| Mistral AI | Python | [Link](https://aka.ms/mistral-large/mistralai-sample) | +| LiteLLM | Python | [Link](https://aka.ms/mistral-large/litellm-sample) | +++## Cost and quota considerations for Mistral family of models deployed to managed compute ++Mistral models deployed to managed compute are billed based on core hours of the associated compute instance. The cost of the compute instance is determined by the size of the instance, the number of instances running, and the run duration. ++It is a good practice to start with a low number of instances and scale up as needed. You can monitor the cost of the compute instance in the Azure portal. 
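++As a rough illustration of how these factors combine, the following minimal sketch estimates a monthly bill from the instance size's hourly rate, the instance count, and the run duration. The hourly rate used here is a hypothetical placeholder, not actual Azure pricing; check the Azure portal or the Azure pricing pages for the real rate of your instance size and region. ++```python +# Hypothetical values for illustration only; replace them with the actual hourly +# rate of your instance size and with your expected usage. +hourly_rate_per_instance = 3.40   # assumed USD per instance per hour (placeholder) +instance_count = 2                # number of instances running +hours_running = 24 * 30           # run duration: one month, always on ++# Total cost scales linearly with each factor. +estimated_cost = hourly_rate_per_instance * instance_count * hours_running +print(f"Estimated monthly cost: ${estimated_cost:,.2f} USD") +``` ++Starting with a single instance keeps this estimate at its lowest while you validate the deployment, and you can scale out later as traffic grows.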
++## Related content +++* [Azure AI Model Inference API](../reference/reference-model-inference-api.md) +* [Deploy models as serverless APIs](deploy-models-serverless.md) +* [Consume serverless API endpoints from a different Azure AI Studio project or hub](deploy-models-serverless-connect.md) +* [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md) +* [Plan and manage costs (marketplace)](costs-plan-manage.md#monitor-costs-for-models-offered-through-the-azure-marketplace) |
ai-studio | Deploy Models Mistral | https://github.com/MicrosoftDocs/azure-docs/commits/main/articles/ai-studio/how-to/deploy-models-mistral.md | Title: How to deploy Mistral family of models with Azure AI Studio + Title: How to use Mistral premium chat models with Azure AI Studio -description: Learn how to deploy Mistral family of models with Azure AI Studio. -+description: Learn how to use Mistral premium chat models with Azure AI Studio. + Previously updated : 5/21/2024 Last updated : 08/08/2024 reviewer: fkriti -++zone_pivot_groups: azure-ai-model-catalog-samples-chat +++# How to use Mistral premium chat models ++In this article, you learn about Mistral premium chat models and how to use them. +Mistral AI offers two categories of models. Premium models including [Mistral Large and Mistral Small](deploy-models-mistral.md), available as serverless APIs with pay-as-you-go token-based billing. Open models including [Mistral Nemo](deploy-models-mistral-nemo.md), [Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01](deploy-models-mistral-open.md); available to also download and run on self-hosted managed endpoints. ++++## Mistral premium chat models ++The Mistral premium chat models include the following models: ++# [Mistral Large](#tab/mistral-large) ++Mistral Large is Mistral AI's most advanced Large Language Model (LLM). It can be used on any language-based task, thanks to its state-of-the-art reasoning and knowledge capabilities. ++Additionally, Mistral Large is: ++* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32-K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**. Efficient guardrails baked in the model and extra safety layer with the safe_mode option. ++And attributes of Mistral Large (2407) include: ++* **Multi-lingual by design**. Supports dozens of languages, including English, French, German, Spanish, and Italian. +* **Proficient in coding**. Trained on more than 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran. +* **Agent-centric**. Possesses agentic capabilities with native function calling and JSON outputting. +* **Advanced in reasoning**. Demonstrates state-of-the-art mathematical and reasoning capabilities. +++The following models are available: ++* [Mistral-Large](https://aka.ms/azureai/landing/Mistral-Large) +* [Mistral-Large-2407](https://aka.ms/azureai/landing/Mistral-Large-2407) +++# [Mistral Small](#tab/mistral-small) ++Mistral Small is Mistral AI's most efficient Large Language Model (LLM). It can be used on any language-based task that requires high efficiency and low latency. ++Mistral Small is: ++* **A small model optimized for low latency**. Efficient for high volume and low latency workloads. Mistral Small is Mistral's smallest proprietary model, it outperforms Mixtral-8x7B and has lower latency. +* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. 
Dozens of other languages are supported. +* **Responsible AI compliant**. Efficient guardrails baked in the model, and extra safety layer with the safe_mode option. +++The following models are available: ++* [Mistral-Small](https://aka.ms/azureai/landing/Mistral-Small) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral premium chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral premium chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `azure-ai-inference` package with Python. To install this package, you need the following prerequisites: ++* Python 3.8 or later installed, including pip. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. + +Once you have these prerequisites, install the Azure AI inference package with the following command: ++```bash +pip install azure-ai-inference +``` ++Read more about the [Azure AI inference package and reference](https://aka.ms/azsdk/azure-ai-inference/python/reference). ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral premium chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```python +import os +from azure.ai.inference import ChatCompletionsClient +from azure.core.credentials import AzureKeyCredential ++client = ChatCompletionsClient( + endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"], + credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]), +) +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. 
Return the model's information by calling the following method: +++```python +model_info = client.get_model_info() +``` ++The response is as follows: +++```python +print("Model name:", model_info.model_name) +print("Model type:", model_info.model_type) +print("Model provider name:", model_info.model_provider) +``` ++```console +Model name: Mistral-Large +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```python +from azure.ai.inference.models import SystemMessage, UserMessage ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], +) +``` ++The response is as follows, where you can see the model's usage statistics: +++```python +print("Response:", response.choices[0].message.content) +print("Model:", response.model) +print("Usage:") +print("\tPrompt tokens:", response.usage.prompt_tokens) +print("\tTotal tokens:", response.usage.total_tokens) +print("\tCompletion tokens:", response.usage.completion_tokens) +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Mistral-Large +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```python +result = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + temperature=0, + top_p=1, + max_tokens=2048, + stream=True, +) +``` ++To stream completions, set `stream=True` when you call the model. ++To visualize the output, define a helper function to print the stream. ++```python +def print_stream(result): + """ + Prints the chat completion with streaming. Some delay is added to simulate + a real-time conversation. + """ + import time + for update in result: + if update.choices: + print(update.choices[0].delta.content, end="") + time.sleep(0.05) +``` ++You can visualize how streaming generates content: +++```python +print_stream(result) +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). 
++```python +from azure.ai.inference.models import ChatCompletionsResponseFormat ++response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + presence_penalty=0.1, + frequency_penalty=0.8, + max_tokens=2048, + stop=["<|endoftext|>"], + temperature=0, + top_p=1, + response_format={ "type": ChatCompletionsResponseFormat.TEXT }, +) +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral premium chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant that always generate responses in JSON format, using." + " the following format: { ""answer"": ""response"" }."), + UserMessage(content="How many languages are in the world?"), + ], + response_format={ "type": ChatCompletionsResponseFormat.JSON_OBJECT } +) +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "logprobs": True + } +) +``` ++The following extra parameters can be passed to Mistral premium chat models: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral premium chat models support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. 
++The Azure AI Model Inference API allows you to pass this extra parameter as follows: +++```python +response = client.complete( + messages=[ + SystemMessage(content="You are a helpful assistant."), + UserMessage(content="How many languages are in the world?"), + ], + model_extras={ + "safe_mode": True + } +) +``` ++### Use tools ++Mistral premium chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look up flight information for two different cities. +++```python +from azure.ai.inference.models import FunctionDefinition, ChatCompletionsFunctionToolDefinition ++flight_info = ChatCompletionsFunctionToolDefinition( + function=FunctionDefinition( + name="get_flight_info", + description="Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + parameters={ + "type": "object", + "properties": { + "origin_city": { + "type": "string", + "description": "The name of the city where the flight originates", + }, + "destination_city": { + "type": "string", + "description": "The flight destination city", + }, + }, + "required": ["origin_city", "destination_city"], + }, + ) +) ++tools = [flight_info] +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```python +def get_flight_info(loc_origin: str, loc_destination: str): + return { + "info": f"There are no flights available from {loc_origin} to {loc_destination}. You should take a train, especially if it helps to reduce CO2 emissions." + } +``` ++Prompt the model to book flights with the help of this function: +++```python +messages = [ + SystemMessage( + content="You are a helpful assistant that helps users find information about traveling, how to get" + " to places and the different transportation options. You care about the environment and you" + " always have that in mind when answering inquiries.", + ), + UserMessage( + content="When is the next flight from Miami to Seattle?", + ), +] ++response = client.complete( + messages=messages, tools=tools, tool_choice="auto" +) +``` ++You can inspect the response to find out whether a tool needs to be called. Inspect the finish reason to determine whether the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```python +response_message = response.choices[0].message +tool_calls = response_message.tool_calls ++print("Finish reason:", response.choices[0].finish_reason) +print("Tool call:", tool_calls) +``` ++To continue, append this message to the chat history: +++```python +messages.append( + response_message +) +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history.
+++```python +import json +from azure.ai.inference.models import ToolMessage ++for tool_call in tool_calls: ++ # Get the tool details: ++ function_name = tool_call.function.name + function_args = json.loads(tool_call.function.arguments.replace("\'", "\"")) + tool_call_id = tool_call.id ++ print(f"Calling function `{function_name}` with arguments {function_args}") ++ # Call the function defined above using `locals()`, which returns a dictionary of all the names + # available in the local scope. Notice that this is just done as a simple way to get + # the function callable from its string name. Then we can call it with the corresponding + # arguments. ++ callable_func = locals()[function_name] + function_response = callable_func(**function_args) ++ print("->", function_response) ++ # Once we have a response from the function and its arguments, we can append a new message to the chat + # history. Notice how we are telling the model that this chat message came from a tool: ++ messages.append( + ToolMessage( + tool_call_id=tool_call_id, + content=json.dumps(function_response) + ) + ) +``` ++View the response from the model: +++```python +response = client.complete( + messages=messages, + tools=tools, +) +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```python +from azure.ai.inference.models import AssistantMessage, UserMessage, SystemMessage +from azure.core.exceptions import HttpResponseError ++try: + response = client.complete( + messages=[ + SystemMessage(content="You are an AI assistant that helps people find information."), + UserMessage(content="Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills."), + ] + ) ++ print(response.choices[0].message.content) ++except HttpResponseError as ex: + if ex.status_code == 400: + response = ex.response.json() + if isinstance(response, dict) and "error" in response: + print(f"Your request triggered an {response['error']['code']} error:\n\t {response['error']['message']}") + else: + raise + else: + raise +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Mistral premium chat models ++The Mistral premium chat models include the following models: ++# [Mistral Large](#tab/mistral-large) ++Mistral Large is Mistral AI's most advanced Large Language Model (LLM). It can be used on any language-based task, thanks to its state-of-the-art reasoning and knowledge capabilities. ++Additionally, Mistral Large is: ++* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32-K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**.
Efficient guardrails baked in the model and extra safety layer with the safe_mode option. ++And attributes of Mistral Large (2407) include: ++* **Multi-lingual by design**. Supports dozens of languages, including English, French, German, Spanish, and Italian. +* **Proficient in coding**. Trained on more than 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran. +* **Agent-centric**. Possesses agentic capabilities with native function calling and JSON outputting. +* **Advanced in reasoning**. Demonstrates state-of-the-art mathematical and reasoning capabilities. +++The following models are available: ++* [Mistral-Large](https://aka.ms/azureai/landing/Mistral-Large) +* [Mistral-Large-2407](https://aka.ms/azureai/landing/Mistral-Large-2407) +++# [Mistral Small](#tab/mistral-small) ++Mistral Small is Mistral AI's most efficient Large Language Model (LLM). It can be used on any language-based task that requires high efficiency and low latency. ++Mistral Small is: ++* **A small model optimized for low latency**. Efficient for high volume and low latency workloads. Mistral Small is Mistral's smallest proprietary model, it outperforms Mixtral-8x7B and has lower latency. +* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**. Efficient guardrails baked in the model, and extra safety layer with the safe_mode option. +++The following models are available: ++* [Mistral-Small](https://aka.ms/azureai/landing/Mistral-Small) ++ -# How to deploy Mistral models with Azure AI Studio +> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral premium chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral premium chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `@azure-rest/ai-inference` package from `npm`. To install this package, you need the following prerequisites: ++* LTS versions of `Node.js` with `npm`. +* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. 
The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: ++```bash +npm install @azure-rest/ai-inference +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral premium chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. +++```javascript +import ModelClient from "@azure-rest/ai-inference"; +import { isUnexpected } from "@azure-rest/ai-inference"; +import { AzureKeyCredential } from "@azure/core-auth"; ++const client = new ModelClient( + process.env.AZURE_INFERENCE_ENDPOINT, + new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```javascript +var model_info = await client.path("/info").get() +``` ++The response is as follows: +++```javascript +console.log("Model name: ", model_info.body.model_name) +console.log("Model type: ", model_info.body.model_type) +console.log("Model provider name: ", model_info.body.model_provider_name) +``` ++```console +Model name: Mistral-Large +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}); +``` ++The response is as follows, where you can see the model's usage statistics: +++```javascript +if (isUnexpected(response)) { + throw response.body.error; +} ++console.log("Response: ", response.body.choices[0].message.content); +console.log("Model: ", response.body.model); +console.log("Usage:"); +console.log("\tPrompt tokens:", response.body.usage.prompt_tokens); +console.log("\tTotal tokens:", response.body.usage.total_tokens); +console.log("\tCompletion tokens:", response.body.usage.completion_tokens); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. 
+Model: Mistral-Large +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } +}).asNodeStream(); +``` ++To stream completions, use `.asNodeStream()` when you call the model. ++You can visualize how streaming generates content: +++```javascript +var stream = response.body; +if (!stream) { + stream.destroy(); + throw new Error(`Failed to get chat completions with status: ${response.status}`); +} ++if (response.status !== "200") { + throw new Error(`Failed to get chat completions: ${response.body.error}`); +} ++var sses = createSseStream(stream); ++for await (const event of sses) { + if (event.data === "[DONE]") { + return; + } + for (const choice of (JSON.parse(event.data)).choices) { + console.log(choice.delta?.content ?? ""); + } +} +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + presence_penalty: "0.1", + frequency_penalty: "0.8", + max_tokens: 2048, + stop: ["<|endoftext|>"], + temperature: 0, + top_p: 1, + response_format: { type: "text" }, + } +}); +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral premium chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant that always generate responses in JSON format, using." + + " the following format: { \"answer\": \"response\" }." 
}, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + response_format: { type: "json_object" } + } +}); +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + headers: { + "extra-params": "pass-through" + }, + body: { + messages: messages, + logprobs: true + } +}); +``` ++The following extra parameters can be passed to Mistral premium chat models: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral premium chat models support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: +++```javascript +var messages = [ + { role: "system", content: "You are a helpful assistant" }, + { role: "user", content: "How many languages are in the world?" }, +]; ++var response = await client.path("/chat/completions").post({ + headers: { + "extra-params": "pass-through" + }, + body: { + messages: messages, + safe_mode: true + } +}); +``` ++### Use tools ++Mistral premium chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```javascript +const flight_info = { + name: "get_flight_info", + description: "Returns information about the next flight between two cities. 
This includes the name of the airline, flight number and the date and time of the next flight", +    parameters: { +        type: "object", +        properties: { +            origin_city: { +                type: "string", +                description: "The name of the city where the flight originates", +            }, +            destination_city: { +                type: "string", +                description: "The flight destination city", +            }, +        }, +        required: ["origin_city", "destination_city"], +    }, +} ++const tools = [ +    { +        type: "function", +        function: flight_info, +    }, +]; +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. +++```javascript +function get_flight_info(loc_origin, loc_destination) { +    return { +        info: "There are no flights available from " + loc_origin + " to " + loc_destination + ". You should take a train, especially if it helps to reduce CO2 emissions." +    } +} +``` ++Prompt the model to book flights with the help of this function: +++```javascript +var result = await client.path("/chat/completions").post({ +    body: { +        messages: messages, +        tools: tools, +        tool_choice: "auto" +    } +}); +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine whether the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```javascript +const response_message = result.body.choices[0].message; +const tool_calls = response_message.tool_calls; ++console.log("Finish reason: " + result.body.choices[0].finish_reason); +console.log("Tool call: " + tool_calls); +``` ++To continue, append this message to the chat history: +++```javascript +messages.push(response_message); +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```javascript +function applyToolCall({ function: call, id }) { +    // Get the tool details: +    const tool_params = JSON.parse(call.arguments); +    console.log("Calling function " + call.name + " with arguments " + call.arguments); ++    // Call the function defined above by looking it up on `window`, which exposes the functions +    // available in the scope by name. This is just done as a simple way to get the function +    // callable from its string name. Then we can call it with the corresponding arguments. +    const function_response = window[call.name](...Object.values(tool_params)); +    console.log("-> " + JSON.stringify(function_response)); ++    return function_response; +} ++for (const tool_call of tool_calls) { +    var tool_response = applyToolCall(tool_call); ++    messages.push( +        { +            role: "tool", +            tool_call_id: tool_call.id, +            content: JSON.stringify(tool_response) +        } +    ); +} +``` ++View the response from the model: +++```javascript +var result = await client.path("/chat/completions").post({ +    body: { +        messages: messages, +        tools: tools, +    } +}); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. 
++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. +++```javascript +try { + var messages = [ + { role: "system", content: "You are an AI assistant that helps people find information." }, + { role: "user", content: "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." }, + ]; ++ var response = await client.path("/chat/completions").post({ + body: { + messages: messages, + } + }); ++ console.log(response.body.choices[0].message.content); +} +catch (error) { + if (error.status_code == 400) { + var response = JSON.parse(error.response._content); + if (response.error) { + console.log(`Your request triggered an ${response.error.code} error:\n\t ${response.error.message}`); + } + else + { + throw error; + } + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). +++++## Mistral premium chat models ++The Mistral premium chat models include the following models: ++# [Mistral Large](#tab/mistral-large) ++Mistral Large is Mistral AI's most advanced Large Language Model (LLM). It can be used on any language-based task, thanks to its state-of-the-art reasoning and knowledge capabilities. ++Additionally, Mistral Large is: ++* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32-K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**. Efficient guardrails baked in the model and extra safety layer with the safe_mode option. ++And attributes of Mistral Large (2407) include: ++* **Multi-lingual by design**. Supports dozens of languages, including English, French, German, Spanish, and Italian. +* **Proficient in coding**. Trained on more than 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran. +* **Agent-centric**. Possesses agentic capabilities with native function calling and JSON outputting. +* **Advanced in reasoning**. Demonstrates state-of-the-art mathematical and reasoning capabilities. +++The following models are available: ++* [Mistral-Large](https://aka.ms/azureai/landing/Mistral-Large) +* [Mistral-Large-2407](https://aka.ms/azureai/landing/Mistral-Large-2407) +++# [Mistral Small](#tab/mistral-small) ++Mistral Small is Mistral AI's most efficient Large Language Model (LLM). It can be used on any language-based task that requires high efficiency and low latency. ++Mistral Small is: ++* **A small model optimized for low latency**. Efficient for high volume and low latency workloads. Mistral Small is Mistral's smallest proprietary model, it outperforms Mixtral-8x7B and has lower latency. +* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**. 
Efficient guardrails baked in the model, and extra safety layer with the safe_mode option. +++The following models are available: ++* [Mistral-Small](https://aka.ms/azureai/landing/Mistral-Small) +++++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. ++## Prerequisites ++To use Mistral premium chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral premium chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). ++> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) ++### The inference package installed ++You can consume predictions from this model by using the `Azure.AI.Inference` package from [Nuget](https://www.nuget.org/). To install this package, you need the following prerequisites: ++* The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name` is your unique model deployment host name and `your-azure-region` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. ++Once you have these prerequisites, install the Azure AI inference library with the following command: ++```dotnetcli +dotnet add package Azure.AI.Inference --prerelease +``` ++You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). To use credential providers provided with the Azure SDK, install the `Azure.Identity` package: ++```dotnetcli +dotnet add package Azure.Identity +``` ++Import the following namespaces: +++```csharp +using Azure; +using Azure.Identity; +using Azure.AI.Inference; +``` ++This example also use the following namespaces but you may not always need them: +++```csharp +using System.Text.Json; +using System.Text.Json.Serialization; +using System.Reflection; +``` ++## Work with chat completions ++In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. ++> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral premium chat models. ++### Create a client to consume the model ++First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. 
+++```csharp +ChatCompletionsClient client = new ChatCompletionsClient( + new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")), + new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL")) +); +``` ++### Get the model's capabilities ++The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: +++```csharp +Response<ModelInfo> modelInfo = client.GetModelInfo(); +``` ++The response is as follows: +++```csharp +Console.WriteLine($"Model name: {modelInfo.Value.ModelName}"); +Console.WriteLine($"Model type: {modelInfo.Value.ModelType}"); +Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}"); +``` ++```console +Model name: Mistral-Large +Model type: chat-completions +Model provider name: MistralAI +``` ++### Create a chat completion request ++The following example shows how you can create a basic chat completions request to the model. ++```csharp +ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, +}; ++Response<ChatCompletions> response = client.Complete(requestOptions); +``` ++The response is as follows, where you can see the model's usage statistics: +++```csharp +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +Console.WriteLine($"Model: {response.Value.Model}"); +Console.WriteLine("Usage:"); +Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}"); +Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}"); +Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}"); +``` ++```console +Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred. +Model: Mistral-Large +Usage: + Prompt tokens: 19 + Total tokens: 91 + Completion tokens: 72 +``` ++Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. ++#### Stream content ++By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. ++You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. +++```csharp +static async Task StreamMessageAsync(ChatCompletionsClient client) +{ + ChatCompletionsOptions requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world? 
Write an essay about it.") +        }, +        MaxTokens=4096 +    }; ++    StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions); ++    await PrintStream(streamResponse); +} +``` ++To stream completions, use the `CompleteStreamingAsync` method when you call the model. Notice that, in this example, the call is wrapped in an asynchronous method. ++To visualize the output, define an asynchronous method to print the stream in the console. ++```csharp +static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response) +{ +    await foreach (StreamingChatCompletionsUpdate chatUpdate in response) +    { +        if (chatUpdate.Role.HasValue) +        { +            Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: "); +        } +        if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate)) +        { +            Console.Write(chatUpdate.ContentUpdate); +        } +    } +} +``` ++You can visualize how streaming generates content: +++```csharp +StreamMessageAsync(client).GetAwaiter().GetResult(); +``` ++#### Explore more parameters supported by the inference client ++Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). ++```csharp +requestOptions = new ChatCompletionsOptions() +{ +    Messages = { +        new ChatRequestSystemMessage("You are a helpful assistant."), +        new ChatRequestUserMessage("How many languages are in the world?") +    }, +    PresencePenalty = 0.1f, +    FrequencyPenalty = 0.8f, +    MaxTokens = 2048, +    StopSequences = { "<|endoftext|>" }, +    Temperature = 0, +    NucleusSamplingFactor = 1, +    ResponseFormat = new ChatCompletionsResponseFormatText() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral premium chat models can create JSON outputs. Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```csharp +requestOptions = new ChatCompletionsOptions() +{ +    Messages = { +        new ChatRequestSystemMessage( +            "You are a helpful assistant that always generates responses in JSON format, " + +            "using the following format: { \"answer\": \"response\" }." +        ), +        new ChatRequestUserMessage( +            "How many languages are in the world?" +        ) +    }, +    ResponseFormat = new ChatCompletionsResponseFormatJSON() +}; ++response = client.Complete(requestOptions); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++### Pass extra parameters to the model ++The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. ++Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. 
When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++The following extra parameters can be passed to Mistral premium chat models: ++| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | +++### Safe mode ++Mistral premium chat models support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: +++```csharp +requestOptions = new ChatCompletionsOptions() +{ + Messages = { + new ChatRequestSystemMessage("You are a helpful assistant."), + new ChatRequestUserMessage("How many languages are in the world?") + }, + AdditionalProperties = { { "safe_mode", BinaryData.FromString("true") } }, +}; ++response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough); +Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}"); +``` ++### Use tools ++Mistral premium chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. ++The following code example creates a tool definition that is able to look from flight information from two different cities. +++```csharp +FunctionDefinition flightInfoFunction = new FunctionDefinition("getFlightInfo") +{ + Description = "Returns information about the next flight between two cities. This includes the name of the airline, flight number and the date and time of the next flight", + Parameters = BinaryData.FromObjectAsJson(new + { + Type = "object", + Properties = new + { + origin_city = new + { + Type = "string", + Description = "The name of the city where the flight originates" + }, + destination_city = new + { + Type = "string", + Description = "The flight destination city" + } + } + }, + new JsonSerializerOptions() { PropertyNamingPolicy = JsonNamingPolicy.CamelCase } + ) +}; ++ChatCompletionsFunctionToolDefinition getFlightTool = new ChatCompletionsFunctionToolDefinition(flightInfoFunction); +``` ++In this example, the function's output is that there are no flights available for the selected route, but the user should consider taking a train. 
-In this article, you learn how to use Azure AI Studio to deploy the Mistral family of models as serverless APIs with pay-as-you-go token-based billing. -Mistral AI offers two categories of models in the [Azure AI Studio](https://ai.azure.com). These models are available in the [model catalog](model-catalog-overview.md): +```csharp +static string getFlightInfo(string loc_origin, string loc_destination) +{ +    return JsonSerializer.Serialize(new +    { +        info = $"There are no flights available from {loc_origin} to {loc_destination}. You " + +        "should take a train, especially if it helps to reduce CO2 emissions." +    }); +} +``` ++Prompt the model to book flights with the help of this function: +++```csharp +var chatHistory = new List<ChatRequestMessage>(){ +        new ChatRequestSystemMessage( +            "You are a helpful assistant that helps users find information about traveling, " + +            "how to get to places and the different transportation options. You care about the " + +            "environment and you always have that in mind when answering inquiries." +        ), +        new ChatRequestUserMessage("When is the next flight from Miami to Seattle?") +    }; ++requestOptions = new ChatCompletionsOptions(chatHistory); +requestOptions.Tools.Add(getFlightTool); +requestOptions.ToolChoice = ChatCompletionsToolChoice.Auto; ++response = client.Complete(requestOptions); +``` ++You can inspect the response to find out if a tool needs to be called. Inspect the finish reason to determine whether the tool should be called. Remember that multiple tool types can be indicated. This example demonstrates a tool of type `function`. +++```csharp +var responseMessage = response.Value.Choices[0].Message; +var toolsCall = responseMessage.ToolCalls; ++Console.WriteLine($"Finish reason: {response.Value.Choices[0].FinishReason}"); +Console.WriteLine($"Tool call: {toolsCall[0].Id}"); +``` ++To continue, append this message to the chat history: +++```csharp +requestOptions.Messages.Add(new ChatRequestAssistantMessage(response.Value.Choices[0].Message)); +``` ++Now, it's time to call the appropriate function to handle the tool call. The following code snippet iterates over all the tool calls indicated in the response and calls the corresponding function with the appropriate parameters. The response is also appended to the chat history. +++```csharp +foreach (ChatCompletionsToolCall tool in toolsCall) +{ +    if (tool is ChatCompletionsFunctionToolCall functionTool) +    { +        // Get the tool details: +        string callId = functionTool.Id; +        string toolName = functionTool.Name; +        string toolArgumentsString = functionTool.Arguments; +        Dictionary<string, object> toolArguments = JsonSerializer.Deserialize<Dictionary<string, object>>(toolArgumentsString); ++        // Here you have to call the function you defined. In this particular example, we use +        // reflection to find the method we defined before in a static class called +        // `ChatCompletionsExamples`. Using reflection allows us to call a function +        // by string name. Notice that this is just done for demonstration purposes as a +        // simple way to get the function callable from its string name. Then we can call +        // it with the corresponding arguments. 
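+        // Note: the argument values below are taken positionally from toolArguments.Values, so this +        // simple approach assumes the JSON properties arrive in the same order as the method's +        // parameters. In production code, you would typically bind the arguments by parameter name instead. 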
++ var flags = BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Static; + string toolResponse = (string)typeof(ChatCompletionsExamples).GetMethod(toolName, flags).Invoke(null, toolArguments.Values.Cast<object>().ToArray()); ++ Console.WriteLine("->", toolResponse); + requestOptions.Messages.Add(new ChatRequestToolMessage(toolResponse, callId)); + } + else + throw new Exception("Unsupported tool type"); +} +``` ++View the response from the model: +++```csharp +response = client.Complete(requestOptions); +``` ++### Apply content safety ++The Azure AI model inference API supports [Azure AI content safety](https://aka.ms/azureaicontentsafety). When you use deployments with Azure AI content safety turned on, inputs and outputs pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. ++The following example shows how to handle events when the model detects harmful content in the input prompt and content safety is enabled. -* __Premium models__: Mistral Large (2402), Mistral Large (2407), and Mistral Small. -* __Open models__: Mistral Nemo, Mixtral-8x7B-Instruct-v01, Mixtral-8x7B-v01, Mistral-7B-Instruct-v01, and Mistral-7B-v01. -All the premium models and Mistral Nemo (an open model) can be deployed as serverless APIs with pay-as-you-go token-based billing. The other open models can be deployed to managed computes in your own Azure subscription. +```csharp +try +{ + requestOptions = new ChatCompletionsOptions() + { + Messages = { + new ChatRequestSystemMessage("You are an AI assistant that helps people find information."), + new ChatRequestUserMessage( + "Chopping tomatoes and cutting them into cubes or wedges are great ways to practice your knife skills." + ), + }, + }; ++ response = client.Complete(requestOptions); + Console.WriteLine(response.Value.Choices[0].Message.Content); +} +catch (RequestFailedException ex) +{ + if (ex.ErrorCode == "content_filter") + { + Console.WriteLine($"Your query has trigger Azure Content Safeaty: {ex.Message}"); + } + else + { + throw; + } +} +``` ++> [!TIP] +> To learn more about how you can configure and control Azure AI content safety settings, check the [Azure AI content safety documentation](https://aka.ms/azureaicontentsafety). + -You can browse the Mistral family of models in the model catalog by filtering on the Mistral collection. -## Mistral family of models ++## Mistral premium chat models ++The Mistral premium chat models include the following models: # [Mistral Large](#tab/mistral-large) -Mistral Large is Mistral AI's most advanced Large Language Model (LLM). It can be used on any language-based task, thanks to its state-of-the-art reasoning and knowledge capabilities. There are two variants available for the Mistral Large model version: +Mistral Large is Mistral AI's most advanced Large Language Model (LLM). It can be used on any language-based task, thanks to its state-of-the-art reasoning and knowledge capabilities. ++Additionally, Mistral Large is: ++* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32-K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. 
Dozens of other languages are supported. +* **Responsible AI compliant**. Efficient guardrails baked in the model and extra safety layer with the safe_mode option. -- Mistral Large (2402)-- Mistral Large (2407)+And attributes of Mistral Large (2407) include: -Additionally, some attributes of _Mistral Large (2402)_ include: +* **Multi-lingual by design**. Supports dozens of languages, including English, French, German, Spanish, and Italian. +* **Proficient in coding**. Trained on more than 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran. +* **Agent-centric**. Possesses agentic capabilities with native function calling and JSON outputting. +* **Advanced in reasoning**. Demonstrates state-of-the-art mathematical and reasoning capabilities. -* __Specialized in RAG.__ Crucial information isn't lost in the middle of long context windows (up to 32-K tokens). -* __Strong in coding.__ Code generation, review, and comments. Supports all mainstream coding languages. -* __Multi-lingual by design.__ Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. -* __Responsible AI compliant.__ Efficient guardrails baked in the model and extra safety layer with the `safe_mode` option. -And attributes of _Mistral Large (2407)_ include: +The following models are available: -- **Multi-lingual by design.** Supports dozens of languages, including English, French, German, Spanish, and Italian.-- **Proficient in coding.** Trained on more than 80 coding languages, including Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran.-- **Agent-centric.** Possesses agentic capabilities with native function calling and JSON outputting.-- **Advanced in reasoning.** Demonstrates state-of-the-art mathematical and reasoning capabilities.+* [Mistral-Large](https://aka.ms/azureai/landing/Mistral-Large) +* [Mistral-Large-2407](https://aka.ms/azureai/landing/Mistral-Large-2407) # [Mistral Small](#tab/mistral-small) Mistral Small is Mistral AI's most efficient Large Language Model (LLM). It can Mistral Small is: -- **A small model optimized for low latency.** Efficient for high volume and low latency workloads. Mistral Small is Mistral's smallest proprietary model, it outperforms Mixtral-8x7B and has lower latency. -- **Specialized in RAG.** Crucial information isn't lost in the middle of long context windows (up to 32K tokens).-- **Strong in coding.** Code generation, review, and comments. Supports all mainstream coding languages.-- **Multi-lingual by design.** Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported.-- **Responsible AI compliant.** Efficient guardrails baked in the model, and extra safety layer with the `safe_mode` option.+* **A small model optimized for low latency**. Efficient for high volume and low latency workloads. Mistral Small is Mistral's smallest proprietary model, it outperforms Mixtral-8x7B and has lower latency. +* **Specialized in RAG**. Crucial information isn't lost in the middle of long context windows (up to 32K tokens). +* **Strong in coding**. Code generation, review, and comments. Supports all mainstream coding languages. +* **Multi-lingual by design**. Best-in-class performance in French, German, Spanish, Italian, and English. Dozens of other languages are supported. +* **Responsible AI compliant**. 
Efficient guardrails baked in the model, and extra safety layer with the safe_mode option. -# [Mistral Nemo](#tab/mistral-nemo) +The following models are available: -Mistral Nemo is a cutting-edge Language Model (LLM) boasting state-of-the-art reasoning, world knowledge, and coding capabilities within its size category. +* [Mistral-Small](https://aka.ms/azureai/landing/Mistral-Small) -Mistral Nemo is a 12B model, making it a powerful drop-in replacement for any system using Mistral 7B, which it supersedes. It supports a context length of 128K, and it accepts only text inputs and generates text outputs. -Additionally, Mistral Nemo is: +++> [!TIP] +> Additionally, MistralAI supports the use of a tailored API for use with specific features of the model. To use the model-provider specific API, check [MistralAI documentation](https://docs.mistral.ai/) or see the [inference examples](#more-inference-examples) section to code examples. -- **Jointly developed with Nvidia.** This collaboration has resulted in a powerful 12B model that pushes the boundaries of language understanding and generation.-- **Multilingual proficient.** Mistral Nemo is equipped with a tokenizer called Tekken, which is designed for multilingual applications. It supports over 100 languages, such as English, French, German, and Spanish. Tekken is more efficient than the Llama 3 tokenizer in compressing text for approximately 85% of all languages, with significant improvements in Malayalam, Hindi, Arabic, and prevalent European languages.-- **Agent-centric.** Mistral Nemo possesses top-tier agentic capabilities, including native function calling and JSON outputting.-- **Advanced in reasoning.** Mistral Nemo demonstrates state-of-the-art mathematical and reasoning capabilities within its size category.+## Prerequisites -+To use Mistral premium chat models with Azure AI Studio, you need the following prerequisites: ++### A model deployment ++**Deployment to serverless APIs** ++Mistral premium chat models can be deployed to serverless API endpoints with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. ++Deployment to a serverless API endpoint doesn't require quota from your subscription. If your model isn't deployed already, use the Azure AI Studio, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates to [deploy the model as a serverless API](deploy-models-serverless.md). -## Deploy Mistral family of models as a serverless API +> [!div class="nextstepaction"] +> [Deploy the model to serverless API endpoints](deploy-models-serverless.md) -Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription. +### A REST client -**Mistral Large (2402)**, **Mistral Large (2407)**, **Mistral Small**, and **Mistral Nemo** can be deployed as a serverless API with pay-as-you-go billing and are offered by Mistral AI through the Microsoft Azure Marketplace. Mistral AI can change or update the terms of use and pricing of these models. +Models deployed with the [Azure AI model inference API](https://aka.ms/azureai/modelinference) can be consumed using any REST client. 
To use the REST client, you need the following prerequisites: -### Prerequisites +* To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form `https://your-host-name.your-azure-region.inference.ai.azure.com`, where `your-host-name`` is your unique model deployment host name and `your-azure-region`` is the Azure region where the model is deployed (for example, eastus2). +* Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string. -- An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a [paid Azure account](https://azure.microsoft.com/pricing/purchase-options/pay-as-you-go) to begin.-- An [AI Studio hub](../how-to/create-azure-ai-resource.md). The serverless API model deployment offering for eligible models in the Mistral family is only available with hubs created in these regions:+## Work with chat completions - - East US - - East US 2 - - North Central US - - South Central US - - West US - - West US 3 - - Sweden Central +In this section, you use the [Azure AI model inference API](https://aka.ms/azureai/modelinference) with a chat completions model for chat. - For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see [Region availability for models in serverless API endpoints](deploy-models-serverless-availability.md). +> [!TIP] +> The [Azure AI model inference API](https://aka.ms/azureai/modelinference) allows you to talk with most models deployed in Azure AI Studio with the same code and structure, including Mistral premium chat models. -- An [Azure AI Studio project](../how-to/create-projects.md).-- Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the __Azure AI Developer role__ on the resource group. For more information on permissions, see [Role-based access control in Azure AI Studio](../concepts/rbac-ai-studio.md).+### Create a client to consume the model -### Create a new deployment +First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables. -The following steps demonstrate the deployment of Mistral Large (2402), but you can use the same steps to deploy Mistral Nemo or any of the premium Mistral models by replacing the model name. +### Get the model's capabilities -To create a deployment: +The `/info` route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: -1. Sign in to [Azure AI Studio](https://ai.azure.com). -1. Select **Model catalog** from the left sidebar. -1. Search for and select the Mistral Large (2402) model to open its Details page. +```http +GET /info HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +``` - :::image type="content" source="../media/deploy-monitor/mistral/mistral-large-deploy-directly-from-catalog.png" alt-text="A screenshot showing how to access the model details page by going through the model catalog." lightbox="../media/deploy-monitor/mistral/mistral-large-deploy-directly-from-catalog.png"::: +The response is as follows: -1. Select **Deploy** to open a serverless API deployment window for the model. -1. 
Alternatively, you can initiate a deployment by starting from your project in AI Studio. - 1. From the left sidebar of your project, select **Components** > **Deployments**. - 1. Select **+ Create deployment**. - 1. Search for and select the Mistral Large (2402) model to open the Model's Details page. +```json +{ + "model_name": "Mistral-Large", + "model_type": "chat-completions", + "model_provider_name": "MistralAI" +} +``` - :::image type="content" source="../media/deploy-monitor/mistral/mistral-large-deploy-starting-from-project.png" alt-text="A screenshot showing how to access the model details page by going through the Deployments page in your project." lightbox="../media/deploy-monitor/mistral/mistral-large-deploy-starting-from-project.png"::: +### Create a chat completion request - 1. Select **Confirm** to open a serverless API deployment window for the model. +The following example shows how you can create a basic chat completions request to the model. - :::image type="content" source="../media/deploy-monitor/mistral/mistral-large-deploy-pay-as-you-go.png" alt-text="A screenshot showing how to deploy a model as a serverless API." lightbox="../media/deploy-monitor/mistral/mistral-large-deploy-pay-as-you-go.png"::: +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ] +} +``` -1. Select the project in which you want to deploy your model. To use the serverless API model deployment offering, your project must belong to one of the regions listed in the [prerequisites](#prerequisites). -1. In the deployment wizard, select the link to **Azure Marketplace Terms** to learn more about the terms of use. -1. Select the **Pricing and terms** tab to learn about pricing for the selected model. -1. Select the **Subscribe and Deploy** button. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering. This step requires that your account has the **Azure AI Developer role** permissions on the resource group, as listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering of the model, which allows you to control and monitor spending. Currently, you can have only one deployment for each model within a project. -1. Once you subscribe the project for the particular Azure Marketplace offering, subsequent deployments of the _same_ offering in the _same_ project don't require subscribing again. If this scenario applies to you, there's a **Continue to deploy** option to select. +The response is as follows, where you can see the model's usage statistics: - :::image type="content" source="../media/deploy-monitor/mistral/mistral-large-existing-subscription.png" alt-text="A screenshot showing a project that is already subscribed to the offering." lightbox="../media/deploy-monitor/mistral/mistral-large-existing-subscription.png"::: -1. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region. - :::image type="content" source="../media/deploy-monitor/mistral/mistral-large-deployment-name.png" alt-text="A screenshot showing how to indicate the name of the deployment you want to create." 
lightbox="../media/deploy-monitor/mistral/mistral-large-deployment-name.png"::: +```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Mistral-Large", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` -1. Select **Deploy**. Wait until the deployment is ready and you're redirected to the Deployments page. -1. Select **Open in playground** to start interacting with the model. -1. Return to the Deployments page, select the deployment, and note the endpoint's **Target** URL and the Secret **Key**. For more information on using the APIs, see the [reference](#reference-for-mistral-family-of-models-deployed-as-a-service) section. -1. You can always find the endpoint's details, URL, and access keys by navigating to your **Project overview** page. Then, from the left sidebar of your project, select **Components** > **Deployments**. +Inspect the `usage` section in the response to see the number of tokens used for the prompt, the total number of tokens generated, and the number of tokens used for the completion. -To learn about billing for the Mistral AI model deployed as a serverless API with pay-as-you-go token-based billing, see [Cost and quota considerations for Mistral family of models deployed as a service](#cost-and-quota-considerations-for-mistral-family-of-models-deployed-as-a-service). +#### Stream content -### Consume the Mistral family of models as a service +By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take many seconds. -You can consume Mistral models by using the chat API. +You can _stream_ the content to get it as it's being generated. Streaming content allows you to start processing the completion as content becomes available. This mode returns an object that streams back the response as [data-only server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events). Extract chunks from the delta field, rather than the message field. -1. From your **Project overview** page, go to the left sidebar and select **Components** > **Deployments**. -1. Find and select the deployment you created. +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "stream": true, + "temperature": 0, + "top_p": 1, + "max_tokens": 2048 +} +``` -1. Copy the **Target** URL and the **Key** value. +You can visualize how streaming generates content: -1. Make an API request using to either the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` and the native [Mistral Chat API](#mistral-chat-api) on `/v1/chat/completions`. -For more information on using the APIs, see the [reference](#reference-for-mistral-family-of-models-deployed-as-a-service) section. 
+```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Mistral-Large", + "choices": [ + { + "index": 0, + "delta": { + "role": "assistant", + "content": "" + }, + "finish_reason": null, + "logprobs": null + } + ] +} +``` -## Reference for Mistral family of models deployed as a service +The last message in the stream has `finish_reason` set, indicating the reason for the generation process to stop. -Mistral models accept both the [Azure AI Model Inference API](../reference/reference-model-inference-api.md) on the route `/chat/completions` and the native [Mistral Chat API](#mistral-chat-api) on `/v1/chat/completions`. -### Azure AI Model Inference API +```json +{ + "id": "23b54589eba14564ad8a2e6978775a39", + "object": "chat.completion.chunk", + "created": 1718726371, + "model": "Mistral-Large", + "choices": [ + { + "index": 0, + "delta": { + "content": "" + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} +``` -The [Azure AI Model Inference API](../reference/reference-model-inference-api.md) schema can be found in the [reference for Chat Completions](../reference/reference-model-inference-chat-completions.md) article and an [OpenAPI specification can be obtained from the endpoint itself](../reference/reference-model-inference-api.md?tabs=rest#getting-started). +#### Explore more parameters supported by the inference client -### Mistral Chat API +Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see [Azure AI Model Inference API reference](https://aka.ms/azureai/modelinference). -Use the method `POST` to send the request to the `/v1/chat/completions` route: +```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "presence_penalty": 0.1, + "frequency_penalty": 0.8, + "max_tokens": 2048, + "stop": ["<|endoftext|>"], + "temperature" :0, + "top_p": 1, + "response_format": { "type": "text" } +} +``` -__Request__ -```rest -POST /v1/chat/completions HTTP/1.1 -Host: <DEPLOYMENT_URI> -Authorization: Bearer <TOKEN> -Content-type: application/json +```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718726686, + "model": "Mistral-Large", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 19, + "total_tokens": 91, + "completion_tokens": 72 + } +} ``` -#### Request schema +If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using *extra parameters*. See [Pass extra parameters to the model](#pass-extra-parameters-to-the-model). ++#### Create JSON outputs ++Mistral premium chat models can create JSON outputs. 
Set `response_format` to `json_object` to enable JSON mode and guarantee that the message the model generates is valid JSON. You must also instruct the model to produce JSON yourself via a system or user message. Also, the message content might be partially cut off if `finish_reason="length"`, which indicates that the generation exceeded `max_tokens` or that the conversation exceeded the max context length. +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant that always generate responses in JSON format, using the following format: { \"answer\": \"response\" }" + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "response_format": { "type": "json_object" } +} +``` -Payload is a JSON formatted string containing the following parameters: -| Key | Type | Default | Description | -|--|--|--|--| -| `messages` | `string` | No default. This value must be specified. | The message or history of messages to use to prompt the model. | -| `stream` | `boolean` | `False` | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. | -| `max_tokens` | `integer` | `8192` | The maximum number of tokens to generate in the completion. The token count of your prompt plus `max_tokens` can't exceed the model's context length. | -| `top_p` | `float` | `1` | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering `top_p` or `temperature`, but not both. | -| `temperature` | `float` | `1` | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly the distribution of tokens. Zero means greedy sampling. We recommend altering this parameter or `top_p`, but not both. | -| `ignore_eos` | `boolean` | `False` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | -| `safe_prompt` | `boolean` | `False` | Whether to inject a safety prompt before all conversations. | +```json +{ + "id": "0a1234b5de6789f01gh2i345j6789klm", + "object": "chat.completion", + "created": 1718727522, + "model": "Mistral-Large", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "{\"answer\": \"There are approximately 7,117 living languages in the world today, according to the latest estimates. However, this number can vary as some languages become extinct and others are newly discovered or classified.\"}", + "tool_calls": null + }, + "finish_reason": "stop", + "logprobs": null + } + ], + "usage": { + "prompt_tokens": 39, + "total_tokens": 87, + "completion_tokens": 48 + } +} +``` -The `messages` object has the following fields: +### Pass extra parameters to the model -| Key | Type | Value | -|--|--|| -| `content` | `string` | The contents of the message. Content is required for all messages. | -| `role` | `string` | The role of the message's author. One of `system`, `user`, or `assistant`. | +The Azure AI Model Inference API allows you to pass extra parameters to the model. The following code example shows how to pass the extra parameter `logprobs` to the model. +Before you pass extra parameters to the Azure AI model inference API, make sure your model supports those extra parameters. 
When the request is made to the underlying model, the header `extra-parameters` is passed to the model with the value `pass-through`. This value tells the endpoint to pass the extra parameters to the model. Use of extra parameters with the model doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported. -#### Request example +```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` -__Body__ ```json {- "messages": - [ - { - "role": "system", - "content": "You are a helpful assistant that translates English to Italian." + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." }, {- "role": "user", - "content": "Translate the following sentence from English to Italian: I love programming." + "role": "user", + "content": "How many languages are in the world?" } ],- "temperature": 0.8, - "max_tokens": 512, + "logprobs": true } ``` -#### Response schema +The following extra parameters can be passed to Mistral premium chat models: -The response payload is a dictionary with the following fields: +| Name | Description | Type | +| -- | | | +| `ignore_eos` | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | `boolean` | +| `safe_mode` | Whether to inject a safety prompt before all conversations. | `boolean` | -| Key | Type | Description | -|--|--|-| -| `id` | `string` | A unique identifier for the completion. | -| `choices` | `array` | The list of completion choices the model generated for the input messages. | -| `created` | `integer` | The Unix timestamp (in seconds) of when the completion was created. | -| `model` | `string` | The model_id used for completion. | -| `object` | `string` | The object type, which is always `chat.completion`. | -| `usage` | `object` | Usage statistics for the completion request. | -> [!TIP] -> In the streaming mode, for each chunk of response, `finish_reason` is always `null`, except from the last one which is terminated by a payload `[DONE]`. In each `choices` object, the key for `messages` is changed by `delta`. +### Safe mode ++Mistral premium chat models support the parameter `safe_prompt`. You can toggle the safe prompt to prepend your messages with the following system prompt: ++> Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. ++The Azure AI Model Inference API allows you to pass this extra parameter as follows: ++```http +POST /chat/completions HTTP/1.1 +Host: <ENDPOINT_URI> +Authorization: Bearer <TOKEN> +Content-Type: application/json +extra-parameters: pass-through +``` +++```json +{ + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "How many languages are in the world?" + } + ], + "safemode": true +} +``` ++### Use tools +Mistral premium chat models support the use of tools, which can be an extraordinary resource when you need to offload specific tasks from the language model and instead rely on a more deterministic system or even a different language model. The Azure AI Model Inference API allows you to define tools in the following way. 
-The `choices` object is a dictionary with the following fields: +The following code example creates a tool definition that is able to look from flight information from two different cities. -| Key | Type | Description | -||--|--| -| `index` | `integer` | Choice index. When `best_of` > 1, the index in this array might not be in order and might not be `0` to `n-1`. | -| `messages` or `delta` | `string` | Chat completion result in `messages` object. When streaming mode is used, `delta` key is used. | -| `finish_reason` | `string` | The reason the model stopped generating tokens: <br>- `stop`: model hit a natural stop point or a provided stop sequence. <br>- `length`: if max number of tokens have been reached. <br>- `content_filter`: When RAI moderates and CMP forces moderation <br>- `content_filter_error`: an error during moderation and wasn't able to make decision on the response <br>- `null`: API response still in progress or incomplete.| -| `logprobs` | `object` | The log probabilities of the generated tokens in the output text. | +```json +{ + |