Using the Azure AI Inference Service

If you are a generative AI developer who works with different LLMs, it can be cumbersome to make sure your code works with your LLM of choice. You might start with Azure OpenAI models and the OpenAI APIs, but later decide you want to use a Phi-3 model. What do you do in that case? Ideally, you want your code to work with either model. The Azure AI Inference Service allows you to do just that.

The API is available via SDKs in Python, JavaScript, and C#, and as a generic REST API. In this post, we will look at the Python SDK. Note that the API does not work with all models in the Azure AI Foundry model catalog. Below are some of the supported models:

  • Via serverless endpoints: Cohere, Llama, Mistral, Phi-3 and some others
  • Via managed inference (on VMs): Mistral, Mixtral, Phi-3 and Llama 3 instruct

In this post, we will use the serverless endpoints. Let’s stop talking about it and look at some code. Although you can use the inferencing service fully on its own, I will focus on some other ways to use it:

  • From GitHub Marketplace: for experimentation; authenticate with GitHub
  • From Azure AI Foundry: towards production quality code; authenticate with Entra ID

Getting started from GitHub Marketplace

Perhaps somewhat unexpectedly, an easy way to start exploring these APIs is via models in GitHub Marketplace. GitHub supports the inferencing service and allows you to authenticate via your GitHub personal access token (PAT).

If you have a GitHub account, even as a free user, simply go to the GitHub model catalog at https://github.com/marketplace/models/catalog. Select any model from the list and click Get API key:

Ministral 3B in the GitHub model catalog

In the Get API key screen, you can select your language and SDK. Below, I selected Python and Azure AI Inference SDK:

Steps to get started with Ministral and the AI Inference SDK

Instead of setting this up on your workstation, you can click Run codespace. A codespace will be opened with lots of sample code:

Codespace with sample code for different SDKs, including the AI Inference SDK

Above, I opened the Getting Started notebook for the Azure AI Inference SDK. You can run the cells in that notebook to see the results. To create a client, the following code is used:

import os
import dotenv
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

dotenv.load_dotenv()

if not os.getenv("GITHUB_TOKEN"):
    raise ValueError("GITHUB_TOKEN is not set")

github_token = os.environ["GITHUB_TOKEN"]
endpoint = "https://models.inference.ai.azure.com"


# Create a client
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(github_token),
)

The endpoint above is similar to the endpoint you would use without GitHub. The SDK, however, supports authenticating with your GITHUB_TOKEN which is available to the codespace as an environment variable.

When you have the ChatCompletionsClient, you can start using the client as if this were an OpenAI model. Indeed, the AI Inference SDK works similarly to the OpenAI SDK:

# Any of the supported GitHub models listed below works here
model_name = "gpt-4o-mini"

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What is the capital of France?"),
    ],
    model=model_name,
    # Optional parameters
    temperature=1.0,
    max_tokens=1000,
    top_p=1.0,
)

print(response.choices[0].message.content)

The code above is indeed similar to the OpenAI SDK. The model is set via the model_name variable, which can be any of the supported GitHub models:

  • AI21 Labs: `AI21-Jamba-Instruct`
  • Cohere: `Cohere-command-r`, `Cohere-command-r-plus`
  • Meta: `Meta-Llama-3-70B-Instruct`, `Meta-Llama-3-8B-Instruct` and others
  • Mistral AI: `Mistral-large`, `Mistral-large-2407`, `Mistral-Nemo`, `Mistral-small`
  • Azure OpenAI: `gpt-4o-mini`, `gpt-4o`
  • Microsoft: `Phi-3-medium-128k-instruct`, `Phi-3-medium-4k-instruct`, and others

The full list of models is in the notebook. It’s easy to get started with GitHub models to evaluate and try out models. Do note that these models are for experimentation only and are heavily throttled. In production, use models deployed in Azure. One of the ways to do that is with Azure AI Foundry.

Azure AI Foundry and its SDK

Another way to use the inferencing service is via Azure AI Foundry and its SDK. To use the inferencing service via Azure AI Foundry, simply create a project. If this is the first time you create a project, a hub will be created as well. Check Microsoft Learn for more information.

Project in AI Foundry with the inference endpoint

The endpoint above can be used directly with the Azure AI Inference SDK. There is no need to use the Azure AI Foundry SDK in that case. In what follows, I will focus on the Azure AI Foundry SDK and not use the inference SDK on its own.
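If you do want to use the inference SDK directly against that endpoint, a minimal sketch could look like the code below. The endpoint value is a placeholder for the inference endpoint shown in your project, the model has to be deployed in the project first (see the next section), and the credential_scopes value is my assumption for Entra ID authentication, so verify these against your own setup.

from azure.identity import DefaultAzureCredential
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage

# Placeholder: copy the inference endpoint from your project's overview page
endpoint = "<your-project-inference-endpoint>"

# Authenticate with Entra ID instead of a key; the scope below is an assumption
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(),
    credential_scopes=["https://cognitiveservices.azure.com/.default"],
)

response = client.complete(
    model="Phi-3-small-128k-instruct",  # a model deployed in the project
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What is the capital of France?"),
    ],
)
print(response.choices[0].message.content)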

Unlike with GitHub models, you need to deploy models in Azure before you can use them:

Deployment of Mistral Large and Phi-3 small 128k instruct

To deploy a model, simply click Deploy model and follow the steps. Choose the serverless deployment when asked. Above, I deployed Mistral Large and Phi-3 small 128k instruct.

The Azure AI Foundry SDK makes it easy to work with services available to your project. A service can be a model via the inferencing SDK but also Azure AI Search and other services.

In code, you connect to your project with a connection string and authenticate with Entra ID. From a project client, you then obtain a generic chat completion client. Under the hood, the correct AI inferencing endpoint is used.

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_connection_string = "your_conn_str"

# Connect to the AI Foundry project and authenticate with Entra ID
project = AIProjectClient.from_connection_string(
    conn_str=project_connection_string,
    credential=DefaultAzureCredential(),
)

model_name = "Phi-3-small-128k-instruct"

# Get a generic chat completions client; under the hood, the project
# resolves the correct AI inferencing endpoint
client = project.inference.get_chat_completions_client()

response = client.complete(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant"},
        {"role": "user", "content": "Write me a poem about flowers"},
    ],
)

print(response.choices[0].message.content)

Above, replace your_conn_str with the connection string from your project:

AI Foundry project connection string

Now, if you want to run your code with another model, simply deploy it and switch the model name in your code. Note that you do not use the deployment name. Instead, use the model name.
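For example, after deploying Mistral Large, the switch is a one-line change (the exact model name may differ; check the model card of your deployment):

# Switch models by changing the model name, not the deployment name
model_name = "Mistral-large"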

Note that these models are typically deployed with content filtering. If the filter is triggered, you will get an HttpResponseError with status code 400. This also happens with GitHub models because they use the same models and content filters.
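A minimal sketch of handling that case, assuming the standard azure-core HttpResponseError exception, could look like this:

from azure.core.exceptions import HttpResponseError

try:
    response = client.complete(
        model=model_name,
        messages=[{"role": "user", "content": "Write me a poem about flowers"}],
    )
    print(response.choices[0].message.content)
except HttpResponseError as e:
    # A 400 typically means the request was rejected, for example by the content filter
    if e.status_code == 400:
        print("Request rejected (possibly by the content filter):", e.message)
    else:
        raise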

Other capabilities of the inferencing service

Below, some of the other capabilities of the inferencing service are listed:

  • In addition to chat completions, text completions, text embeddings and image embeddings are supported
  • If the underlying model supports parameters not supported by the inferencing service, use model_extras. The properties you put in model extras are passed to the API that is specific to the model. One example is the safe_mode parameter in Mistral (see the sketch after this list).
  • You can configure the API to give you an error when you use a parameter the underlying model does not support
  • The API supports images as input with select models
  • Streaming is supported (see the sketch after this list)
  • Tools and function calling are supported
  • Prompt templates are supported, including Prompty.
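As a hedged sketch of the streaming and model_extras points (reusing the client and model_name from the Azure AI Foundry example above), the code could look like this:

# Streaming sketch: the response becomes an iterator of partial updates
response = client.complete(
    stream=True,
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a poem about flowers."},
    ],
    # For model-specific parameters, pass them via model_extras; for example,
    # with a Mistral model you could add: model_extras={"safe_mode": True}
)

for update in response:
    if update.choices:
        # Each update contains a small delta of the generated text
        print(update.choices[0].delta.content or "", end="")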

Should you use it?

Whether or not you should use the AI inferencing service is not easy to answer. If you use frameworks such as LangChain or Semantic Kernel, they already have abstractions to work with multiple models. They also make it easier to work with functions and tool calling, and they support prompt templates. If you use those, stick with them.

If you do not use those frameworks and you simply want to use an OpenAI-compatible API, the inferencing service in combination with Azure AI Foundry is a good fit! There are many developers who prefer using the OpenAI API directly without the abstractions of a higher-level framework. If you do, you can easily switch models.

It’s important to note that not all models support more advanced features such as tool calling. In practice, that means the number of models you can switch between is limited. In my experience, even with models that do support tool calling, things can go wrong easily. If your application depends heavily on function calling, it’s best to use frameworks like Semantic Kernel.

The service in general is useful in other ways though. Copilot Studio for example, can use custom models to answer questions and uses the inferencing service under the hood to make that happen!

