LLMOPS With Azure AI, PromptFlow, Bicep & GitHub Actions
LLMOPS (Large Language Model Operations) is becoming increasingly important in AI development. In this post, we look at how Azure AI Studio simplifies LLM deployment, monitoring, and scaling, at the challenges of managing large language models, and at how LLMOPS addresses them.
Section 1: What is LLMOPS?
Section 2: Overview of Azure AI Studio
Section 3: E2E Project : Onboarding Chat AI — Building and Deploying LLMs in Azure AI Studio using PromptFlow
Section 4: CI/CD for LLMs in Azure AI Studio
Code repository: charotAmine/LLMOPS-Medium-Post (github.com)
LLMOPS (Large Language Model Operations) refers to the set of practices, tools, and processes used to efficiently manage and operate large language models (LLMs) in production environments. As LLMs, like GPT-3.5 or GPT-4, become more prominent in various applications, managing them effectively is crucial to ensure reliable performance, scalability, and cost-effectiveness.
Key Components of LLMOPS:
Model Deployment:
- Deploying large language models into production involves making them accessible through APIs or integrated into applications. For LLMs, deployment is not just about running the model but also ensuring it is optimized for real-time usage.
- Deployment considerations include setting up proper endpoints, securing the APIs (often with authentication like OAuth), and ensuring low-latency response times to meet application needs.
- For example, deploying a chatbot that responds to user queries using a pre-trained GPT model.
Monitoring and Observability:
Once the model is deployed, continuous monitoring is essential to track its performance, availability, and user interaction. Monitoring can include metrics like:
- Latency: How quickly the model responds to a request.
- Token usage: The number of tokens used in each request, which directly affects the cost in many commercial APIs.
- Success rate: The percentage of successful responses vs. errors.
Observability tools can also provide insights into user behavior, helping to detect anomalies or performance bottlenecks in real-time.
Maintenance and Lifecycle Management:
- Managing the lifecycle of a large language model includes tasks like updating the model when new data becomes available, retraining the model to improve accuracy, and deprecating outdated models.
- Maintenance also includes ensuring that the model is compliant with security policies, regularly updating the code, and managing model versioning so that different versions can be tracked and rolled back if necessary.
- Continuous updates and maintenance are necessary to ensure that the LLM keeps providing relevant and accurate information.
Azure AI Studio is a cloud-based platform designed to help developers and data scientists build, deploy, and manage AI applications at scale. It integrates seamlessly with other Azure services to provide a robust environment for managing large language models, making it a powerful tool for LLMOPS.
Capabilities for LLMOPS:
Azure AI Studio offers several features that streamline the process of deploying, monitoring, scaling, and maintaining large language models in production:
Seamless Integration with Azure Machine Learning and OpenAI API:
- Azure AI Studio integrates tightly with Azure Machine Learning, allowing users to build and train custom models or leverage pre-trained models such as GPT models through the OpenAI API.
- Developers can deploy models to Azure’s scalable cloud infrastructure, ensuring that the models are ready for production usage and can handle high volumes of requests with low latency.
- The platform supports end-to-end workflows, from data preparation and model training to deployment and monitoring, simplifying the LLMOPS process.
Pre-built Support for GPT Models and Other Large Language Models:
- Azure AI Studio comes with pre-built support for OpenAI’s GPT models and other LLMs, which can be directly deployed and integrated into applications.
- The platform provides APIs for these models, allowing developers to quickly build applications such as chatbots, virtual assistants, or automated content generation tools.
- Azure also provides fine-tuning capabilities, so models can be adapted to meet the specific needs of different business use cases.
Model Versioning, Monitoring, and Logging Tools:
- Model Versioning: Azure AI Studio supports multiple versions of a model, making it easy to manage the lifecycle of the model. Teams can deploy new versions while keeping the older versions available as backups, allowing for smooth rollbacks if needed.
- Monitoring: The platform integrates with Azure Monitor and Application Insights, providing real-time visibility into the performance of the models. Key metrics like response times, error rates, and resource consumption can be tracked continuously.
- Logging Tools: Azure AI Studio logs each inference request, including inputs, outputs, and model performance, which is critical for debugging, troubleshooting, and improving the models over time.
Project Overview: Building an AI-Powered Onboarding Application
We’re working on building an onboarding application to streamline the process for new hires at our company. Imagine you’re a new team member (like Marouane, our use case), and you’re bombarded with tons of information, tools, and processes. Navigating all that can be overwhelming!
To make onboarding smoother and more efficient, we’re leveraging AI — specifically using LLMOPS (Large Language Model Operations) to handle the heavy lifting. By integrating PromptFlow and Azure AI Studio, we can create a smart assistant that pulls from relevant company data and answers questions, guides users, and helps them get up to speed without feeling lost.
Here’s the plan:
- LLMOPS: This will allow us to manage and scale large language models, helping our AI assistant interact in real-time with relevant information.
- PromptFlow: With PromptFlow, we can fine-tune our prompts and workflows, ensuring our AI is answering questions effectively and learning from each interaction.
- Azure AI Studio: This is where we deploy and manage all the AI services — whether it’s language models, cognitive services, or AI-powered search. It’s the backbone of our application.
The goal is to build an intelligent onboarding tool that helps new hires like Marouane find their way in the company by asking simple questions and getting accurate, up-to-date answers.
Step 0 : Setting Up Infrastructure for Success
First things first, we need to get our infrastructure ready. We’ll be using Bicep (the coding kind, not the muscle), and you can find all the code on GitHub.
In this architecture, we are deploying several key services to configure our Azure environment for Azure AI Studio. These include:
- Hub Workspace: This is our central control room, where all the magic happens. It connects to various services and resources, acting as the main management layer for everything.
- Project Workspace: Each project gets its own dedicated workspace. This keeps everything organized and ensures that different development environments are properly separated, while still being linked to the Hub Workspace.
IMPORTANT NOTE:
We can deploy the Hub Workspace in one subscription and the Project in another:
Hub Workspace => Subscription A
Project => Subscription B, linked to the Hub Workspace in Subscription A
The only condition is that both must be in the same region.
- Key Vault: We’ll be using Key Vault to securely store and manage sensitive information like secrets, keys, and certificates.
- Storage Account: This is where all the data — files, logs, and datasets — will live. It’s connected to the Hub Workspace so we can easily access the data for processing and analysis.
- AI Services: This includes a range of AI tools like Cognitive Services or other AI models that will power our AI-driven applications.
- Application Insights: This will help us keep an eye on the app, providing monitoring and telemetry data. It collects all the key metrics, logs, and traces we need to track performance.
- Log Analytics: This service will handle collecting, analyzing, and visualizing data from various sources, including Application Insights. It’s crucial for diagnosing issues and improving performance.
- AI Search: This is the intelligent search component that helps pull relevant data using AI. Whether it’s documents, knowledge bases, or other content, AI Search makes sure we can find exactly what we need, when we need it.
After we run the deployment, we’ll have everything up and running.
The code is available here: LLMOPS-Medium-Post/bicep at main · charotAmine/LLMOPS-Medium-Post (github.com)
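If you want to run the deployment yourself before wiring up CI/CD, the same command the pipeline uses later in this post also works locally (after an `az login`): `az deployment group create --resource-group <your-resource-group> --template-file ./bicep/aistudio-main.bicep --parameters ./bicep/main.bicepparam`.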
After running the deployment, we should see the following:
Now, let’s set up the project!
Meet Marouane: Our New Guy
Marouane is the latest addition to the Squad LTD team, and like anyone stepping into a new job, he’s got a lot to learn. With all the tools and processes we have in place, it can be a bit… well, let’s just say it’s like trying to drink from a fire hose. So, we had a thought: why not use AI to make onboarding smoother? We’ll create an AI-powered tool to pull relevant data and help Marouane with all his questions — no more frantic Googling.
With the infrastructure we just deployed, we can totally do that. But before we dive in, let’s generate some dummy data (because what’s a project without some random test data?). You can find the onboarding data on GitHub.
The onboarding data is here : LLMOPS-Medium-Post/data/onboarding.md at main · charotAmine/LLMOPS-Medium-Post (github.com)
Now, let’s set up the first version of our project using PromptFlow and AI Studio. To kick things off, let’s ask my trusty GPT a quick question to get us started.
All right, now that we’ve got everything set up, we can start building something that’ll really help Marouane — because let’s be honest, who wouldn’t want a personal AI assistant on their first day?
Step 1 : No Data Available
Uh-oh! It seems like GPT is clueless — our poor AI doesn’t even know who the CEO is! Why? Because we haven’t fed it the data yet. Let’s not leave it in the dark. Time to upload the necessary info and make sure it’s up to date with all the corporate gossip (okay, maybe just the important stuff). Here’s the script to get that data into Azure AI Search:
import os
from dotenv import load_dotenv
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Index
from promptflow.rag.config import (
LocalSource,
AzureAISearchConfig,
EmbeddingsModelConfig,
ConnectionConfig,
)
from pathlib import Path
from promptflow.rag import build_index
load_dotenv()
client = MLClient(
DefaultAzureCredential(),
os.getenv("AZURE_SUBSCRIPTION_ID"),
os.getenv("AZURE_RESOURCE_GROUP"),
os.getenv("AZUREAI_PROJECT_NAME"),
)
data_directory = Path(__file__).resolve().parent / "../data"
files = list(data_directory.glob('*')) if data_directory.exists() else None
if files:
print(f"Data directory '{data_directory}' exists and contains {len(files)} files.")
elif files is not None:
print(f"Data directory '{data_directory}' exists but is empty.")
exit()
else:
print(f"Data directory '{data_directory}' does not exist.")
exit()
index_name = "sqd-index"
index_path = build_index(
name=index_name,
vector_store="azure_ai_search",
embeddings_model_config=EmbeddingsModelConfig(
model_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
connection_config=ConnectionConfig(
subscription_id=client.subscription_id,
resource_group_name=client.resource_group_name,
workspace_name=client.workspace_name,
connection_name=os.getenv("AZURE_OPENAI_CONNECTION_NAME"),
),
),
input_source=LocalSource(input_data=data_directory),
index_config=AzureAISearchConfig(
ai_search_index_name=index_name,
ai_search_connection_config=ConnectionConfig(
subscription_id=client.subscription_id,
resource_group_name=client.resource_group_name,
workspace_name=client.workspace_name,
connection_name=os.getenv("AZURE_SEARCH_CONNECTION_NAME"),
),
),
tokens_per_chunk=800,
token_overlap_across_chunks=0,
)
client.indexes.create_or_update(Index(name=index_name, path=index_path))
Summary of the Script:
Environment Setup:
- The script loads environment variables using `dotenv` to retrieve Azure credentials and configuration (subscription ID, resource group, etc.).
Azure ML Client Initialization:
- It initializes an Azure ML Client using `DefaultAzureCredential` for authentication and environment variables for the subscription and resource group details.
Data Directory Check:
- It checks for the existence of a `data` directory and lists its files. If the directory doesn't exist or is empty, the script exits.
Index Creation:
- The script builds a search index using Azure AI Search and OpenAI embeddings.
- It uses configuration details such as model name, embedding model deployment, and Azure AI Search connection info (all loaded from environment variables).
Indexing Process:
- The `build_index` function creates the index using files from the local data source.
- The index is configured with specific parameters like chunk size (`tokens_per_chunk`) and overlap.
Deploy the Index:
- Finally, the script creates or updates the Azure AI Search index by calling the `create_or_update` method on the `client.indexes` object, using the generated index.
Key Components:
- Azure AI Search: Used for indexing the documents.
- OpenAI Embeddings: Used for building the index.
- Local Data Source: A directory containing the files to be indexed.
- Environment Variables: For configurations like subscription ID and connection names.
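To run the indexing script locally, use `python ./scripts/upload_data_to_index.py`, which is also what the CI pipeline runs in its data-upload job later in this post.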
Once the script is run:
We can verify that the index has been created:
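If you'd rather check from code instead of the portal, here is a minimal sketch using the Azure AI Search SDK; it assumes the same `AZURE_SEARCH_ENDPOINT` and `AZURE_SEARCH_API_KEY` environment variables that the flow uses later in this post:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient

# Connect to the Azure AI Search service (endpoint and admin key are assumed
# to be exported as environment variables).
index_client = SearchIndexClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"]),
)

# Fetch the index we just built; this raises an error if it does not exist.
index = index_client.get_index("sqd-index")
print(f"Index '{index.name}' exists with {len(index.fields)} fields.")
```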
Now let’s ask again to check whether it knows what SQD LTD is:
Step 2 : Create the flow
While the inner workings of LLMs may seem complex to many developers, the structure of LLM applications is straightforward — they typically consist of a sequence of calls to external services like LLMs, databases, or search engines, along with intermediate data processing steps that tie everything together.
Flex Flow and DAG Flow are two different approaches for building LLM apps in Promptflow, each catering to different development styles and needs.
Flex Flow
- Code-centric approach: Flex Flow allows you to create LLM apps by writing Python functions or classes that encapsulate your app’s logic. The entry point to the app can be a function or a class, making it flexible for developers who prefer writing code directly.
- Testing and execution: You can test or run these functions directly in a pure coding environment without a UI. Alternatively, you can define a `flow.flex.yaml` file that points to the entry points, which lets you view execution traces, test, and run the app using the Promptflow VS Code extension.
- Customization: Flex Flow is well suited for developers who want more control over their app logic, allowing for easier customization and integration with advanced Python libraries and workflows.
DAG Flow
- Graph-based approach: In DAG Flow, LLM apps are represented as Directed Acyclic Graphs (DAGs). Each function call is a node (referred to as a “tool”) in the DAG, and the flow is constructed by connecting these nodes via input/output dependencies.
- Execution order: The flow is executed based on the topology of the DAG, meaning that functions/tools are executed in the proper order, respecting their dependencies. This is managed by the Promptflow executor.
- Visualization: A DAG Flow is defined using a `flow.dag.yaml` file, which can be visualized using the Promptflow VS Code extension, providing a clear view of the app's logic and execution paths.
In summary, Flex Flow offers a code-first development experience with Python, while DAG Flow provides a UI-driven, graph-based approach for managing function dependencies. Both can be used depending on whether you prefer a more hands-on coding experience or a visually guided flow design.
In Promptflow, flows are organized into three types: Standard Flow, Chat Flow, and Evaluation Flow, each serving a different purpose in LLM app development.
1. Standard Flow
- Purpose: This is used to develop general-purpose LLM applications.
- Structure: Standard Flows are designed to run sequences of operations like processing data, querying LLMs, or interacting with APIs.
- Use case: It is ideal for LLM apps that don’t involve conversational input or a chat-based user interface.
2. Chat Flow
- Purpose: Specifically designed for developing chat-based LLM applications.
- Structure: Chat Flow adds additional support for handling elements specific to conversational AI, such as `chat_input`, `chat_output`, and `chat_history`. This allows developers to build applications that manage conversation history and provide a chat-like experience.
- Use case: It’s particularly useful for chatbots or any application where user input/output revolves around dialogue or conversation.
- Feature: It provides a sample chat interface during development and a ready-to-deploy chat application template.
3. Evaluation Flow
- Purpose: Used to test and evaluate the quality of an LLM application developed through a Standard Flow or Chat Flow.
- Structure: Evaluation flows usually process the outputs of the LLM app and compute metrics to assess performance.
- Metrics: This can include checking the accuracy of the LLM’s responses, fact-based validation, or other quality measures like fluency or relevance.
- Use case: Helps developers determine if their LLM is performing as expected before deployment. For instance, is the response accurate? Does it follow the expected patterns?
In summary, Standard Flows and Chat Flows are used to develop different kinds of LLM applications, while Evaluation Flows help validate the effectiveness and quality of those apps.
Create Prompty :
Let’s first create two prompties :
---
name: Onboarding Assistant Prompt
description: A prompty that uses the chat API to respond to queries grounded in the onboarding documents.
model:
api: chat
configuration:
type: azure_openai
inputs:
chat_input:
type: string
chat_history:
type: list
is_chat_history: true
default: []
documents:
type: object
---
system:
You are an AI assistant helping new employees with onboarding at sqd LTD.
If the question is not related to onboarding or company processes, just say 'Sorry, I can only answer queries related to onboarding at sqd LTD. How can I assist you?'
Don't fabricate answers.
If the question is related to onboarding but vague, ask for clarification before referring to documents. For example, if the user uses "it" or "they," ask them to specify what process or resource they are referring to.
Use the following pieces of context from the onboarding documents to answer questions as clearly, accurately, and briefly as possible.
Do not add document references in your response.
# Onboarding Documents
{{documents}}
{% for item in chat_history %}
{{item.role}}
{{item.content}}
{% endfor %}
user:
{{chat_input}}
The “Onboarding Assistant Prompt” is an AI-powered chatbot designed to assist new employees at sqd LTD with onboarding-related queries. It uses the Azure OpenAI chat API and pulls information from onboarding documents to provide accurate responses.
- Inputs: It accepts user questions (`chat_input`), previous conversation history (`chat_history`), and onboarding documents (`documents`) to ground its answers.
- Functionality: The AI answers questions specifically about onboarding at sqd LTD, requests clarification for vague queries, and ensures all responses are concise and based on the provided documents. The bot won’t answer non-onboarding-related questions and won’t fabricate responses. It asks for more information if a query is unclear.
---
name: Onboarding Intent Extraction Prompt
description: A prompty that extracts the user's query intent based on the current query and chat history during the onboarding process.
model:
api: chat
configuration:
type: azure_openai
inputs:
query:
type: string
chat_history:
type: list
is_chat_history: true
default: []
---
system:
- You are an AI assistant analyzing a user's current onboarding-related query and chat history.
- Based on the chat history and current user query, infer the user's intent regarding the onboarding process.
- Once you infer the intent, respond with a search query that can be used to retrieve relevant onboarding documents or information for the user's query.
- Be specific in identifying the user's intent, but disregard chat history that is not relevant to the current intent.
Example 1:
With a chat_history like below:
\```
chat_history: [
{
"role": "user",
"content": "How do I access my company email?"
},
{
"role": "assistant",
"content": "To access your company email, go to the Office 365 portal and log in with your company credentials."
}
]
\```
User query: "Where can I reset my password?"
Intent: "The user wants to know how to reset their company email password."
Search query: "password reset for company email access"
Example 2:
With a chat_history like below:
\```
chat_history: [
{
"role": "user",
"content": "How do I access my company email?"
},
{
"role": "assistant",
"content": "To access your company email, go to the Office 365 portal and log in with your company credentials."
},
{
"role": "user",
"content": "What other tools do I need to set up?"
},
{
"role": "assistant",
"content": "You will also need to set up access to Microsoft Teams and the internal wiki."
}
]
\```
User query: "How do I set up Microsoft Teams?"
Intent: "The user wants to know how to set up Microsoft Teams."
Search query: "setting up Microsoft Teams for company use"
{% for item in chat_history %}
{{item.role}}
{{item.content}}
{% endfor %}
Current user query:
{{query}}
Search query:
The “Onboarding Intent Extraction Prompt” is an AI assistant designed to infer the intent behind a user’s query during the onboarding process at sqd LTD. It analyzes both the current query and previous chat history to identify the user’s specific request and then generates a search query that can be used to retrieve relevant onboarding documents.
- Inputs: It takes the current user query (`query`) and previous conversations (`chat_history`) to understand the context and intent behind the query.
- Functionality: The AI assistant interprets the user’s intent by reviewing the chat history and current query, then formulates a search query that can be used to find information or documents relevant to the user’s onboarding needs.
- Examples: The system gives examples to show how it identifies specific intents, like password resets or tool setup, from the user’s questions and then generates precise search queries.
- Goal: The prompt helps to streamline the onboarding process by making it easier for users to get the information they need based on their intent, saving time and improving accuracy.
Now that we have our prompties, let’s create our function-based flow:
import os
from dotenv import load_dotenv
from pathlib import Path
from typing import TypedDict
from promptflow.core import Prompty, AzureOpenAIModelConfiguration
from promptflow.tracing import trace
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
load_dotenv()
# Helper function to initialize the AzureOpenAI client
def initialize_aoai_client() -> AzureOpenAI:
return AzureOpenAI(
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)
# Helper function to initialize the search client
def initialize_search_client() -> SearchClient:
key = os.environ["AZURE_SEARCH_API_KEY"]
index_name = os.getenv("AZUREAI_SEARCH_INDEX_NAME")
return SearchClient(
endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
credential=AzureKeyCredential(key),
index_name=index_name,
)
# <get_documents>
@trace
def get_documents(search_query: str, num_docs=3):
search_client = initialize_search_client()
aoai_client = initialize_aoai_client()
print("OPENAI:")
print(os.getenv("AZURE_OPENAI_ENDPOINT"))
print(os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"))
print(aoai_client.base_url)
# Generate vector embedding of the user's query
embedding = aoai_client.embeddings.create(
input=search_query, model=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")
)
embedding_to_query = embedding.data[0].embedding
# Vector search on the index
vector_query = VectorizedQuery(
vector=embedding_to_query, k_nearest_neighbors=num_docs, fields="contentVector"
)
results = search_client.search(
search_text="", vector_queries=[vector_query], select=["id", "content"]
)
# Combine search results into context string
context = "\n".join(
f">>> From: {result['id']}\n{result['content']}" for result in results
)
return context
# Data structure for chat response
class ChatResponse(TypedDict):
context: str
reply: str
# Get chat response
def get_chat_response(chat_input: str, chat_history: list = []) -> ChatResponse:
model_config = AzureOpenAIModelConfiguration(
azure_deployment=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT"),
api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
)
search_query = chat_input
# Extract intent from chat history if provided
if chat_history:
intent_prompty = Prompty.load(
f"{Path(__file__).parent.absolute().as_posix()}/queryIntent.prompty",
model={"configuration": model_config, "parameters": {"max_tokens": 256}},
)
search_query = intent_prompty(query=chat_input, chat_history=chat_history)
# Retrieve relevant documents based on query and chat history
documents = get_documents(search_query, 3)
# Generate chat response using the context from the documents
chat_prompty = Prompty.load(
f"{Path(__file__).parent.absolute().as_posix()}/chat.prompty",
model={
"configuration": model_config,
"parameters": {"max_tokens": 256, "temperature": 0.2},
},
)
result = chat_prompty(
chat_history=chat_history, chat_input=chat_input, documents=documents
)
return {"reply": result, "context": documents}
This code implements a conversational AI assistant designed to handle onboarding-related queries. It uses two core prompt flows:
- Onboarding Intent Extraction Prompt: This identifies the intent behind user queries based on the current question and chat history, ensuring relevant information is retrieved.
- Onboarding Assistant Prompt: It retrieves documents from Azure Cognitive Search, using them to generate accurate responses related to the onboarding process.
The assistant extracts user intent, fetches related documents, and uses these to generate helpful responses, all while relying on Azure OpenAI and Cognitive Search services.
Let’s test it using:
pf flow test --flow sqd_azure:get_chat_response --inputs chat_input="what is sqd ?"
Nice! It seems our Flow is now working!
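You can also call the flow’s entry function directly from Python, which is handy for quick local debugging. A minimal sketch, assuming it is run from the flow folder with the same environment variables configured:

```python
# Call the flex flow entry point directly, without going through the pf CLI.
from sqd_azure import get_chat_response

response = get_chat_response(chat_input="what is sqd ?", chat_history=[])
print(response["reply"])    # the generated answer
print(response["context"])  # the retrieved document chunks used for grounding
```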
Deploy to Azure :
To deploy to Azure, we need to use the following script :
import os
from dotenv import load_dotenv
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, Model, Environment, BuildContext
# Load environment variables
load_dotenv()
# Initialize MLClient
client = MLClient(
DefaultAzureCredential(),
os.getenv("AZURE_SUBSCRIPTION_ID"),
os.getenv("AZURE_RESOURCE_GROUP"),
os.getenv("AZUREAI_PROJECT_NAME"),
)
# Constants
endpoint_name = "sqd-endpoint"
deployment_name = "sqd-deployment"
script_dir = os.path.dirname(os.path.abspath(__file__))
copilot_path = os.path.join(script_dir, "sqd_flow")
# Define endpoint
endpoint = ManagedOnlineEndpoint(
name=endpoint_name,
properties={"enforce_access_to_default_secret_stores": "enabled"},
auth_mode="aad_token",
)
# Define deployment
deployment = ManagedOnlineDeployment(
name=deployment_name,
endpoint_name=endpoint_name,
model=Model(
name="copilot_flow_model",
path=copilot_path,
properties=[
["azureml.promptflow.source_flow_id", "basic-chat"],
["azureml.promptflow.mode", "chat"],
["azureml.promptflow.chat_input", "chat_input"],
["azureml.promptflow.chat_output", "reply"],
],
),
environment=Environment(
build=BuildContext(path=copilot_path),
inference_config={
"liveness_route": {"path": "/health", "port": 8080},
"readiness_route": {"path": "/health", "port": 8080},
"scoring_route": {"path": "/score", "port": 8080},
},
),
instance_type="Standard_DS3_v2",
instance_count=1,
environment_variables={
"PRT_CONFIG_OVERRIDE": f"deployment.subscription_id={client.subscription_id},deployment.resource_group={client.resource_group_name},deployment.workspace_name={client.workspace_name},deployment.endpoint_name={endpoint_name},deployment.deployment_name={deployment_name}",
"AZURE_OPENAI_ENDPOINT": client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME")).api_base,
"AZURE_SEARCH_ENDPOINT": client.connections.get(os.getenv("AZURE_SEARCH_CONNECTION_NAME")).api_base,
"AZURE_OPENAI_API_VERSION": os.getenv("AZURE_OPENAI_API_VERSION"),
"AZURE_OPENAI_CHAT_DEPLOYMENT": os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT"),
"AZURE_OPENAI_EVALUATION_DEPLOYMENT": os.getenv("AZURE_OPENAI_EVALUATION_DEPLOYMENT"),
"AZURE_OPENAI_EMBEDDING_DEPLOYMENT": os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
"AZUREAI_SEARCH_INDEX_NAME": os.getenv("AZUREAI_SEARCH_INDEX_NAME"),
"AZURE_OPENAI_API_KEY": client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME"), populate_secrets=True).api_key,
"AZURE_SEARCH_API_KEY": client.connections.get(os.getenv("AZURE_SEARCH_CONNECTION_NAME"), populate_secrets=True).api_key,
},
)
# Deploy endpoint and deployment
client.begin_create_or_update(endpoint).result()
client.begin_create_or_update(deployment).result()
# Update endpoint traffic
endpoint.traffic = {deployment_name: 100}
client.begin_create_or_update(endpoint).result()
# Get deployment URL
def get_ai_studio_url_for_deploy(client: MLClient, endpoint_name: str, deployment_name: str) -> str:
studio_base_url = "https://ai.azure.com"
return f"{studio_base_url}/projectdeployments/realtime/{endpoint_name}/{deployment_name}/detail?wsid=/subscriptions/{client.subscription_id}/resourceGroups/{client.resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{client.workspace_name}&deploymentName={deployment_name}"
# Print deployment details
print("\n ~~~Deployment details~~~")
print(f"Your online endpoint name is: {endpoint_name}")
print(f"Your deployment name is: {deployment_name}")
print("\n ~~~Test in the Azure AI Studio~~~")
print("\n Follow this link to your deployment in the Azure AI Studio:")
print(get_ai_studio_url_for_deploy(client, endpoint_name, deployment_name))
This script automates the deployment of a model to the AI project using the Azure SDK for Python. Here’s a summary of its functionality:
Environment Setup: Loads environment variables from a `.env` file using `dotenv`.
Client Initialization: Creates an instance of `MLClient` using `DefaultAzureCredential` for authentication. The client is configured with the subscription ID, resource group, and project name.
Constants Definition: Defines constants for the endpoint and deployment names, and constructs the path to the model directory.
Endpoint Definition: Creates a `ManagedOnlineEndpoint` with properties for secret store access and AAD token authentication.
Deployment Definition: Defines a `ManagedOnlineDeployment`, which includes:
- A model with configuration for a chat interface.
- Environment settings for the model deployment.
- Deployment-specific environment variables for configuration and API keys.
Deployment Execution:
- Creates or updates the endpoint and deployment.
- Sets the endpoint traffic to direct 100% to the newly created deployment.
Deployment URL Generation: Defines a function to generate the URL for the Azure AI Studio where the deployment can be tested.
Output: Prints details of the deployment and provides a link to view and test the deployment in the Azure AI Studio.
The script efficiently handles deployment creation and configuration while ensuring environment variables and API keys are securely managed.
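To run it locally, you can simply execute `python ./scripts/deploy_model.py`, which is the same command the CI pipeline runs in its deployment job later on.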
How do we test if it is working? Well, let’s go to the Azure AI Studio:
Testing automatically? Of course, it is possible using this script:
import requests
import argparse
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import os
from dotenv import load_dotenv
load_dotenv()
def get_client() -> MLClient:
# check if env variables are set and initialize client from those
client = MLClient(DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"], os.environ["AZURE_RESOURCE_GROUP"], os.environ["AZUREAI_PROJECT_NAME"])
if client:
return client
raise Exception("Necessary values for subscription, resource group, and project are not defined")
def invoke_deployment(endpoint_name: str, query: str, stream: bool = False):
client = get_client()
accept_header = "text/event-stream" if stream else "application/json"
scoring_url = client.online_endpoints.get(endpoint_name).scoring_uri
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {client._credential.get_token('https://ml.azure.com').token}",
"Accept": accept_header
}
response = requests.post(
scoring_url,
headers=headers,
json={"chat_input": query, "stream": stream}
)
if stream:
for item in response.iter_lines(chunk_size=None):
print(item)
else:
response_data = response.json()
chat_reply = response_data.get('reply', 'No reply in response')
print(f"\n{chat_reply}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Invoke a deployment endpoint.")
parser.add_argument("--endpoint-name", required=True, help="Endpoint name to use when deploying or invoking the flow")
parser.add_argument("--query", help="Query to test the deployment with")
parser.add_argument("--stream", action="store_true", help="Whether the response should be streamed or not")
args = parser.parse_args()
query = args.query if args.query else "who is the CEO of sqd ?"
invoke_deployment(args.endpoint_name, query=query, stream=args.stream)
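To invoke it, pass the endpoint name and an optional query, for example `python ./scripts/test_deployment.py --endpoint-name sqd-endpoint --query "what is sqd ?"`, which is the same call the CI pipeline makes in its final job.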
Step 3 : Evaluation
The evaluation step is really important! It lets us decide whether or not we should deploy the model.
import argparse
import os
import pandas as pd
from tabulate import tabulate
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluate import evaluate
from promptflow.evals.evaluators import (
CoherenceEvaluator,
F1ScoreEvaluator,
FluencyEvaluator,
GroundednessEvaluator,
RelevanceEvaluator,
SimilarityEvaluator,
QAEvaluator,
)
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from sqd_flow.sqd_azure import get_chat_response
from dotenv import load_dotenv
load_dotenv()
def calculate_percentage(metric_value):
"""Convert a metric score to a percentage."""
if isinstance(metric_value, (int, float)):
return (metric_value * 100)/5 # Convert to percentage with 2 decimal places
else:
raise ValueError("Metric value must be a number")
def display_metrics_with_percentages(metrics):
"""Display metric values and their corresponding percentages."""
for metric_name, metric_value in metrics.items():
percentage = calculate_percentage(metric_value)
print(f"{metric_name}: {percentage}%")
return percentage
# Initialize MLClient
client = MLClient(
DefaultAzureCredential(),
os.getenv("AZURE_SUBSCRIPTION_ID"),
os.getenv("AZURE_RESOURCE_GROUP"),
os.getenv("AZUREAI_PROJECT_NAME"),
)
azure_ai_project = {
"subscription_id": os.getenv("AZURE_SUBSCRIPTION_ID"),
"resource_group_name": os.getenv("AZURE_RESOURCE_GROUP"),
"project_name": os.getenv("AZUREAI_PROJECT_NAME"),
}
os.environ['AZURE_OPENAI_API_KEY'] = client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME"), populate_secrets=True).api_key
os.environ['AZURE_SEARCH_API_KEY'] = client.connections.get(os.getenv("AZURE_SEARCH_CONNECTION_NAME"), populate_secrets=True).api_key
os.environ['AZURE_SEARCH_ENDPOINT'] = client.connections.get(os.getenv("AZURE_SEARCH_CONNECTION_NAME")).api_base
os.environ['AZURE_OPENAI_ENDPOINT'] = client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME")).api_base
def get_model_config(evaluation_endpoint, evaluation_model):
"""Get the model configuration for the evaluation."""
if "AZURE_OPENAI_API_KEY" in os.environ:
api_key = client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME"), populate_secrets=True).api_key
model_config = AzureOpenAIModelConfiguration(
azure_endpoint=evaluation_endpoint,
api_key=api_key,
azure_deployment=evaluation_model,
)
else:
model_config = AzureOpenAIModelConfiguration(
azure_endpoint=evaluation_endpoint,
azure_deployment=evaluation_model,
)
return model_config
def run_evaluation(
evaluation_name,
evaluation_model_config,
evaluation_data_path,
metrics,
output_path=None,
):
"""Run the evaluation routine."""
completion_func = get_chat_response
evaluators = {}
evaluators_config = {}
for metric_name in metrics:
if metric_name == "coherence":
evaluators[metric_name] = CoherenceEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"question": "${data.chat_input}",
"answer": "${target.reply}",
}
elif metric_name == "f1score":
evaluators[metric_name] = F1ScoreEvaluator()
evaluators_config[metric_name] = {
"answer": "${target.reply}",
"ground_truth": "${data.ground_truth}",
}
elif metric_name == "fluency":
evaluators[metric_name] = FluencyEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"question": "${data.chat_input}",
"answer": "${target.reply}",
}
elif metric_name == "groundedness":
evaluators[metric_name] = GroundednessEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"answer": "${target.reply}",
"context": "${target.context}",
}
elif metric_name == "relevance":
evaluators[metric_name] = RelevanceEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"question": "${data.chat_input}",
"answer": "${target.reply}",
"context": "${target.context}",
}
elif metric_name == "similarity":
evaluators[metric_name] = SimilarityEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"question": "${data.chat_input}",
"answer": "${target.reply}",
"ground_truth": "${data.ground_truth}",
}
elif metric_name == "qa":
evaluators[metric_name] = QAEvaluator(evaluation_model_config)
evaluators_config[metric_name] = {
"question": "${data.chat_input}",
"answer": "${target.reply}",
"context": "${target.context}",
"ground_truth": "${data.ground_truth}",
}
elif metric_name == "latency":
raise NotImplementedError("Latency metric is not implemented yet")
else:
raise ValueError(f"Unknown metric: {metric_name}")
result = evaluate(
target=completion_func,
evaluation_name=evaluation_name,
evaluators=evaluators,
evaluator_config=evaluators_config,
data=evaluation_data_path,
azure_ai_project=azure_ai_project,
)
tabular_result = pd.DataFrame(result.get("rows"))
return result, tabular_result
def main():
"""Run the evaluation script."""
parser = argparse.ArgumentParser()
parser.add_argument(
"--evaluation-data-path",
help="Path to JSONL file containing evaluation dataset",
required=True,
)
parser.add_argument(
"--evaluation-name",
help="Evaluation name used to log the evaluation to AI Studio",
type=str,
default="eval-sdk-dev",
)
parser.add_argument(
"--evaluation-endpoint",
help="Azure OpenAI endpoint used for evaluation",
type=str,
default=client.connections.get(os.getenv("AZURE_OPENAI_CONNECTION_NAME")).api_base,
)
parser.add_argument(
"--evaluation-model",
help="Azure OpenAI model deployment name used for evaluation",
type=str,
default=os.getenv("AZURE_OPENAI_EVALUATION_DEPLOYMENT"),
)
parser.add_argument(
"--metrics",
nargs="+",
help="List of metrics to evaluate",
choices=[
"coherence",
"f1score",
"fluency",
"groundedness",
"relevance",
"similarity",
"qa",
"chat",
"latency",
],
required=True,
)
args = parser.parse_args()
eval_model_config = get_model_config(
args.evaluation_endpoint, args.evaluation_model
)
result, tabular_result = run_evaluation(
evaluation_name=args.evaluation_name,
evaluation_model_config=eval_model_config,
evaluation_data_path=args.evaluation_data_path,
metrics=args.metrics,
)
# Extract and display metrics with percentages
print("-----Summarized Metrics-----")
percentage = display_metrics_with_percentages(result["metrics"])
print("The final Percentage is: ", percentage)
print(f"View evaluation results in AI Studio: {result['studio_url']}")
if percentage < 90:
print("Not all metrics are above 90%. Please review the results.")
exit(1)
if __name__ == "__main__":
main()
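The script expects the evaluation dataset as a JSONL file passed via `--evaluation-data-path`. Here is a hypothetical sketch of what such a file could look like, using the `chat_input` and `ground_truth` fields referenced by the evaluator configuration above (the actual rows in the repository may differ):

```python
import json

# Hypothetical evaluation rows; the field names match the evaluator config
# mappings ${data.chat_input} and ${data.ground_truth} used above.
rows = [
    {
        "chat_input": "How do I access my company email?",
        "ground_truth": "Go to the Office 365 portal and log in with your company credentials.",
    },
]

# Write the rows as JSON Lines: one JSON object per line.
with open("scripts/evaluation_data/data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```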
Running the evaluation gives us:
Putting It All Together in GitHub Actions
Now that we have all the scripts ready, we need a consistent CI/CD process. We need to achieve the following:
I am using GitHub Actions, and here is my pipeline:
name: LLMOPS CI/CD Pipeline
on:
push:
branches:
- main
permissions:
id-token: write
contents: read
jobs:
# Job 1: Deploy Infrastructure using Bicep
deploy_infrastructure:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: "Az CLI Login"
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Deploy Bicep template
run: |
az deployment group create --resource-group ${{ secrets.AZURE_RESOURCE_GROUP }} --template-file ./bicep/aistudio-main.bicep --parameters ./bicep/main.bicepparam
# Job 2: Upload Data to Index
upload_data_to_index:
runs-on: ubuntu-latest
needs: deploy_infrastructure # Job 2 depends on Job 1
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: "Az CLI Login"
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Install dependencies
run: |
pip install -r ./scripts/requirements.txt
- name: Upload data to index
run: |
python ./scripts/upload_data_to_index.py
# Job 3: Run Evaluation
run_evaluation:
runs-on: ubuntu-latest
needs: upload_data_to_index # Job 3 depends on Job 2
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: "Az CLI Login"
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Install dependencies
run: |
pip install -r ./scripts/requirements.txt
- name: Run evaluation script
run: |
python ./scripts/evaluation.py --evaluation-data-path ./scripts/evaluation_data/data.jsonl --evaluation-name evaluationgha --metrics groundedness
# Job 4: Deploy Model
deploy_model:
runs-on: ubuntu-latest
needs: run_evaluation # Job 4 depends on Job 3
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: "Az CLI Login"
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Install dependencies
run: |
pip install -r ./scripts/requirements.txt
- name: Deploy model
run: |
python ./scripts/deploy_model.py
# Job 5: Test Model Deployment
test_model_deployment:
runs-on: ubuntu-latest
needs: deploy_model # Job 5 depends on Job 4
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: "Az CLI Login"
uses: azure/login@v1
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Install dependencies
run: |
pip install -r ./scripts/requirements.txt
- name: Test model deployment
run: |
python ./scripts/test_deployment.py --endpoint-name sqd-endpoint --query "what is sqd ?"
Workflow Structure:
Job 1 — Deploy Infrastructure (`deploy_infrastructure`):
- This job deploys the Azure infrastructure using the Bicep template.
- It runs the Bicep deployment via `az deployment group create`.
Job 2 — Upload Data to Index (`upload_data_to_index`):
- This job depends on the infrastructure deployment and uploads data to an index using a Python script.
- It runs the `upload_data_to_index.py` script.
Job 3 — Run Evaluation (`run_evaluation`):
- This job runs the evaluation using the provided metrics.
- It depends on the previous data upload job and runs `evaluation.py`.
Job 4 — Deploy Model (`deploy_model`):
- This job deploys the machine learning model.
- It depends on the evaluation being completed and runs `deploy_model.py`.
Job 5 — Test Model Deployment (`test_model_deployment`):
- This job tests the deployed model.
- It runs after the model deployment and uses `test_deployment.py` to verify everything works.
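Note that the workflow authenticates with OIDC (hence `permissions: id-token: write`) and assumes four repository secrets are configured: `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_SUBSCRIPTION_ID`, and `AZURE_RESOURCE_GROUP`.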
Let’s confirm that the deployment test is working well:
Let’s check the evaluation:
This is something we can check in the UI as well: