End-to-End MLOps Pipeline with TensorFlow, Azure Machine Learning, GitHub Actions, and Bicep

Amine Charot
17 min read · Aug 29, 2024


Introduction

In this blog post, we will explore an MLOps pipeline using TensorFlow, Azure Machine Learning (AML), GitHub Actions, and Bicep for infrastructure as code (IaC).

This pipeline will cover the entire lifecycle of machine learning (ML) model development, including data preprocessing, model training, hyperparameter tuning, model evaluation, deployment, and implementing CI/CD pipelines.

By the end, you’ll have a robust, scalable, and maintainable MLOps setup.

Prerequisites

Before starting, ensure you have the following:

  1. GitHub Account: For version control and CI/CD.
  2. Azure Subscription: To create and manage Azure resources.
  3. Azure CLI and Bicep: Installed locally to deploy infrastructure.
  4. Basic Knowledge: Familiarity with TensorFlow, Python, and Azure Machine Learning.

Project Overview

The project involves:

  1. Infrastructure Setup: Using Bicep to deploy Azure resources.
  2. Data Preprocessing: Preparing data with TensorFlow.
  3. Model Training: Training a CNN model on the CIFAR-10 dataset.
  4. Hyperparameter Tuning: Optimizing the model using AML’s HyperDrive.
  5. Model Evaluation: Assessing model performance.
  6. Model Deployment: Deploying the model as a REST API.
  7. CI/CD Pipeline: Automating the workflow with GitHub Actions, including model versioning and rollback mechanisms.

Here is the GitHub repo: charotAmine/mlops_test (github.com)

Let’s get started, and don’t worry — we’ll keep things light with some simple examples along the way!

Step 1: Setting Up Your Project Structure

Imagine your project as a well-organized kitchen. You need different sections for your ingredients (data), your recipe (code), and your cooking tools (scripts). Here’s how you should organize everything:

/mlops-project
├── data/
│   └── download_data.py
├── src/
│   ├── preprocess.py
│   ├── train.py
│   ├── evaluate.py
│   ├── hyperdrive_config.py
│   ├── deploy.py
│   ├── score.py
│   ├── model.py
│   ├── drift_detection.py
│   └── monitor_drift.py
├── .github/
│   └── workflows/
│       └── mlops-pipeline.yml
├── aml_config/
│   ├── environment.yml
│   ├── inferenceconfig.yml
│   └── data_drift.yml
├── bicep/
│   └── modules/
│       └── main.bicep
└── README.md

Step 2: Infrastructure Setup with Bicep

Imagine you’re moving into a new apartment. The first thing you need is a solid infrastructure: walls, a roof, plumbing, electricity — you know, the basics. In the cloud, setting up infrastructure is just like that, except instead of walls, we’re talking about resource groups, storage accounts, and machine learning workspaces.

Here is the link to the GitHub repo for the infra: charotAmine/mlops_test (github.com)
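The full templates live in the repo, but here is a rough sketch of what a subscription-scoped bicep/main.bicep can look like. This is illustrative only: the parameter names mirror the ones the CI pipeline passes later in this post, while the module layout and API versions are assumptions.

// Minimal sketch of bicep/main.bicep (illustrative, not the repo's actual template)
targetScope = 'subscription'

param resourceGroupName string
param workspaceName string
param clusterName string
param storageAccountName string
param spnObjectId string
param location string = 'westeurope'

// Create the resource group that will hold everything
resource rg 'Microsoft.Resources/resourceGroups@2022-09-01' = {
  name: resourceGroupName
  location: location
}

// Delegate the storage account, AML workspace, and GPU cluster to a module
module ml 'modules/main.bicep' = {
  name: 'mlWorkspace'
  scope: rg
  params: {
    workspaceName: workspaceName
    clusterName: clusterName
    storageAccountName: storageAccountName
    spnObjectId: spnObjectId
    location: location
  }
}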

Step 3: Data Preprocessing — Prepping the Ingredients

Before you cook up a storm, you need to wash, chop, and measure your ingredients. Similarly, before we can train our model, we need to preprocess our data.

3.1. Downloading the Data

In data/download_data.py, download the CIFAR-10 dataset, which we'll use as our ingredients:

import tensorflow as tf

def download_data():
    (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
    return (train_images, train_labels), (test_images, test_labels)

if __name__ == "__main__":
    download_data()
  • Function Definition: The download_data function is defined to download and return the CIFAR-10 dataset.
  • Loading Data: tf.keras.datasets.cifar10.load_data() is a TensorFlow function that downloads the CIFAR-10 dataset, which consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The dataset is split into 50,000 training images and 10,000 testing images.
  • Returning Data: The function returns the training and testing images along with their corresponding labels.

How This Fits into MLOps

In the context of MLOps (Machine Learning Operations), this script is a simple example of a data ingestion step. Here’s how it fits into the broader MLOps pipeline:

  • Data Ingestion: This script handles the downloading of the dataset, which is the first step in any machine learning pipeline.
  • Reproducibility: By using a standard dataset and a well-defined function, it ensures that the data ingestion process is reproducible.
  • Automation: This script can be integrated into automated workflows, such as CI/CD pipelines, to ensure that the data is always available and up-to-date.
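As a small illustration of the reproducibility and automation points, here is a hypothetical helper (not part of the repo) that snapshots the downloaded dataset into a single compressed file, so every later pipeline step reads one immutable copy instead of re-downloading:

# Hypothetical helper (illustrative): cache the dataset for downstream steps.
import numpy as np
from download_data import download_data

def cache_data(path="data/cifar10.npz"):
    (train_images, train_labels), (test_images, test_labels) = download_data()
    np.savez_compressed(
        path,
        train_images=train_images, train_labels=train_labels,
        test_images=test_images, test_labels=test_labels,
    )
    return path

if __name__ == "__main__":
    print(f"Dataset cached at {cache_data()}")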

3.2. Preprocessing the Data

In src/preprocess.py, normalize the data (like washing and chopping):

import tensorflow as tf

def preprocess_data(train_images, test_images):
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    return train_images, test_images
  • Function Definition: The preprocess_data function is defined to preprocess the training and testing images.
  • Normalization: The images are normalized by dividing each pixel value by 255.0. This scales the pixel values from the range [0, 255] to [0, 1], which is a common preprocessing step to help the neural network train more effectively.
  • Returning Data: The function returns the normalized training and testing images.

Now our data is clean and ready to be used in the model.
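If you want to convince yourself the preprocessing behaves as expected, a quick sanity check like the following (illustrative, not in the repo) verifies the pixel range and that shapes are untouched:

# Quick sanity check: normalized pixels should sit in [0, 1], shapes unchanged.
from download_data import download_data
from preprocess import preprocess_data

(train_images, _), (test_images, _) = download_data()
train_images, test_images = preprocess_data(train_images, test_images)

assert train_images.min() >= 0.0 and train_images.max() <= 1.0
assert train_images.shape == (50000, 32, 32, 3)
assert test_images.shape == (10000, 32, 32, 3)
print("Preprocessing looks good.")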

Step 4: Cooking the Model — Training Your Neural Network

Now, it’s time to cook! We’ll define a Convolutional Neural Network (CNN) to learn from our preprocessed data.

4.1. Defining the Model

In src/model.py, define your CNN. Think of this as your secret sauce:

import tensorflow as tf

def create_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(256, (3, 3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
  • Sequential Model: The model is created as a sequential model, which means that layers are stacked sequentially.
  • Conv2D Layers: These layers perform convolution operations. The first Conv2D layer has 64 filters, the second has 128 filters, and the third has 256 filters, all with a kernel size of (3, 3) and ReLU activation function.
  • MaxPooling2D Layers: These layers perform max pooling operations with a pool size of (2, 2), which reduces the spatial dimensions of the output volume.
  • Flatten Layer: This layer flattens the input, converting the 2D matrix data to a 1D vector.
  • Dense Layers: The first Dense layer has 512 units with ReLU activation, and the second Dense layer has 10 units with softmax activation, which is used for multi-class classification.
  • Optimizer: The model uses the Adam optimizer, which is an adaptive learning rate optimization algorithm.
  • Loss Function: The loss function used is sparse categorical crossentropy, which is suitable for multi-class classification problems where the labels are integers.
  • Metrics: The model is evaluated using accuracy as the metric.
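To see the resulting architecture at a glance, you can instantiate the model and print its summary, which lists each layer with its output shape and parameter count:

from model import create_model

model = create_model()
model.summary()  # prints layer-by-layer output shapes and parameter counts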

4.2. Training the Model

In src/train.py, cook (train) the model with your prepped data:

import argparse

import tensorflow as tf
import mlflow
import mlflow.tensorflow

from model import create_model
from download_data import download_data
from preprocess import preprocess_data

# Parse the hyperparameters that the sweep job passes on the command line;
# the defaults keep the script runnable standalone (e.g., in CI).
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--batch_size", type=int, default=32)
args = parser.parse_args()

class CustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        mlflow.log_metrics({
            "loss": logs["loss"],
            "accuracy": logs["accuracy"]
        })

# Start logging
mlflow.start_run()

# Enable autologging
mlflow.tensorflow.autolog()

(train_images, train_labels), (test_images, test_labels) = download_data()
train_images, test_images = preprocess_data(train_images, test_images)

model = create_model()
# Recompile with the swept learning rate so the hyperparameter takes effect.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=1, batch_size=args.batch_size,
                    validation_data=(test_images, test_labels), callbacks=[CustomCallback()])

model.save("outputs/model")

The script train.py is designed to handle the entire process of training a machine learning model, from data ingestion and preprocessing to model creation, training, and logging. This script is a crucial part of an MLOps pipeline, ensuring that each step is automated, reproducible, and well-documented.

MLflow Integration

MLflow is a platform for managing the end-to-end machine learning lifecycle. It provides tools for experiment tracking, model management, and deployment. In this script, MLflow is used to log metrics and parameters during the training process, which is essential for tracking the performance of different models and experiments.

  1. Starting an MLflow Run: The script begins by starting an MLflow run with mlflow.start_run(). This creates a new run in MLflow, which will track all the subsequent operations, including metrics, parameters, and artifacts.
  2. Autologging: The line mlflow.tensorflow.autolog() enables automatic logging of TensorFlow training parameters, metrics, and models to MLflow. This means that you don’t have to manually log each metric or parameter; MLflow handles it for you, ensuring that all relevant information is captured.
  3. Custom Callback for Logging: A custom callback class CustomCallback is defined to log additional metrics at the end of each epoch. This callback logs the loss and accuracy metrics to MLflow, providing a detailed record of the model’s performance over time.
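Once runs are logged, you can pull them back for comparison. Here is a minimal sketch, assuming a local or workspace-backed MLflow tracking store and that the accuracy and loss metrics were logged as above:

import mlflow

# Fetch recent runs as a pandas DataFrame, best accuracy first.
runs = mlflow.search_runs(order_by=["metrics.accuracy DESC"], max_results=5)
print(runs[["run_id", "metrics.accuracy", "metrics.loss"]])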

Data Ingestion and Preprocessing

The script uses functions from other modules to handle data ingestion and preprocessing. The download_data function downloads the CIFAR-10 dataset, which is a standard dataset used for image classification tasks. The preprocess_data function normalizes the images by scaling the pixel values to the range [0, 1], which helps the neural network train more effectively.

Model Creation and Training

The create_model function defines a Convolutional Neural Network (CNN) model using TensorFlow’s Keras API. This model consists of several convolutional layers, pooling layers, and dense layers, designed to classify images into one of ten categories.

The model is then trained using the model.fit method, which takes the training data, labels, and other parameters such as the number of epochs and validation data. The custom callback is passed to this method to ensure that metrics are logged to MLflow at the end of each epoch.

Finally, the trained model is saved to the specified directory using model.save("outputs/model"). This allows you to reuse the model for inference or further training without having to retrain it from scratch.

The primary purpose of this script is to automate the process of training a machine learning model while ensuring that all steps are reproducible and well-documented. By integrating MLflow, the script provides a robust framework for tracking experiments, which is essential for comparing different models and hyperparameters. This level of automation and tracking is crucial in an MLOps pipeline, where the goal is to streamline the development, deployment, and monitoring of machine learning models.

Step 5: Hyperparameter Tuning — Seasoning to Perfection

Just like tasting your dish and adjusting the seasoning, we can fine-tune our model using hyperparameter tuning. This ensures our model is as good as it can be.

5.1. Setting Up HyperDrive

In src/hyperdrive_config.py, configure Azure ML’s HyperDrive to search for the best model parameters:

from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment, Model
from azure.ai.ml.sweep import Choice, MedianStoppingPolicy
from azure.identity import DefaultAzureCredential
import os

credential = DefaultAzureCredential()
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
resource_group = os.getenv("AZURE_RESOURCE_GROUP")
workspace_name = os.getenv("AZURE_WORKSPACE_NAME")

ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Define the environment using the conda file
tf_env = Environment(
    name="tensorflow-env",
    conda_file="aml_config/environment.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04"
)

# Define the training job; the inputs hold concrete defaults that the sweep overrides
command_job = command(
    code="./src",  # source directory
    command="python train.py --learning_rate ${{inputs.learning_rate}} --batch_size ${{inputs.batch_size}}",
    environment=tf_env,
    inputs={
        "learning_rate": 0.001,
        "batch_size": 32
    },
    compute="gpu-cluster"
)

# Re-invoke the command job with sweep expressions over the hyperparameters
command_job_for_sweep = command_job(
    learning_rate=Choice([0.01, 0.001, 0.0001]),
    batch_size=Choice([16, 32, 64]),
)

# Call sweep() on the command job to sweep over the parameter expressions
sweep_job = command_job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="accuracy",
    goal="Maximize"
)

# Define the limits for this sweep
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=10, timeout=7200)

# Set early stopping so poorly performing trials are cancelled early
sweep_job.early_termination = MedianStoppingPolicy(delay_evaluation=5, evaluation_interval=2)

# Specify the experiment details
sweep_job.display_name = "cifar10-cnn-sweep"
sweep_job.experiment_name = "cifar10-cnn-sweep"
sweep_job.description = "Run a hyperparameter sweep job for the CIFAR-10 CNN."

# Submit the sweep
returned_sweep_job = ml_client.create_or_update(sweep_job)

# Stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# Refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

if returned_sweep_job.status == "Completed":
    # First, get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # Register the model from that run's outputs
    model = Model(
        path="azureml://jobs/{}/outputs/model/".format(best_run),
        name="cifar10-model",
        description="Model created from the best sweep run.",
        type="custom_model"
    )
    print("Best run: {}".format(best_run))
    registered_model = ml_client.models.create_or_update(model=model)
    print("Registered model: {}".format(registered_model.name))
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

The purpose of the hyperdrive_config.py script is to automate the process of hyperparameter tuning using Azure Machine Learning (Azure ML). Hyperparameter tuning is a critical step in optimizing machine learning models, as it involves searching for the best combination of hyperparameters that yield the highest model performance. This script leverages Azure ML’s capabilities to efficiently and effectively conduct this search, ensuring that the process is scalable and well-documented.

HyperDrive in Azure ML

HyperDrive is Azure ML’s hyperparameter tuning service. It allows data scientists and machine learning engineers to automate the process of hyperparameter optimization, which is essential for improving model performance. HyperDrive supports various sampling algorithms, such as random sampling, grid search, and Bayesian optimization, to explore the hyperparameter space. It also includes early stopping policies to terminate poorly performing trials early, saving computational resources and time.
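As a flavor of those options, swapping in a bandit-style early stopping policy is a small change to the configuration above. This is a sketch with illustrative parameter values, not the repo's configuration:

from azure.ai.ml.sweep import BanditPolicy

# Cancel trials whose primary metric falls more than 10% short of the
# best trial so far, checked every 2 evaluations after the first 5.
sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1, evaluation_interval=2, delay_evaluation=5
)

Similarly, sampling_algorithm accepts "grid" or "bayesian" in place of "random" if you want exhaustive or informed exploration of the search space.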

Importance in MLOps

In the context of MLOps, HyperDrive plays a crucial role in the model development lifecycle. Here’s why it’s important:

  1. Efficiency: HyperDrive automates the hyperparameter tuning process, allowing for the exploration of a large hyperparameter space without manual intervention. This efficiency is vital for quickly identifying the best model configurations.
  2. Scalability: By leveraging Azure’s cloud infrastructure, HyperDrive can scale the hyperparameter tuning process across multiple compute nodes. This scalability ensures that even complex models with extensive hyperparameter spaces can be optimized in a reasonable timeframe.
  3. Reproducibility: HyperDrive ensures that the hyperparameter tuning process is reproducible. By logging all experiments, configurations, and results, it allows data scientists to track and reproduce the best-performing models.
  4. Integration with MLOps Pipelines: HyperDrive integrates seamlessly with Azure ML pipelines, enabling automated workflows that include hyperparameter tuning as a step in the model training process. This integration ensures that hyperparameter optimization is a standard part of the model development lifecycle, leading to more robust and well-tuned models.
  5. Resource Management: With features like early stopping policies, HyperDrive helps manage computational resources efficiently. It stops trials that are unlikely to yield good results early, thus saving time and reducing costs.

Visualizing and analyzing the results of a hyperparameter sweep in Azure ML is crucial for understanding the performance of different hyperparameter combinations and selecting the best model. Here’s how you can do it:

Using Azure ML Studio

  1. Navigate to Azure ML Studio: Go to the Azure Machine Learning Studio (https://ml.azure.com/).
  2. Access the Experiment: In the left-hand menu, click on “Experiments” and find the experiment associated with your hyperparameter sweep.
  3. View the Sweep Job: Click on the sweep job to open its details. Here, you can see an overview of the sweep, including the status of each trial, the hyperparameters used, and the resulting metrics.
  4. Visualize Metrics: Azure ML Studio provides built-in visualizations for analyzing the results. You can view charts that plot the primary metric (e.g., accuracy) against different hyperparameters. This helps you identify trends and understand how different hyperparameters affect model performance.
  5. Compare Trials: You can compare different trials by selecting them and viewing their detailed metrics and logs. This comparison helps you understand which hyperparameter combinations performed best.

In our case, Azure ML Studio shows the sweep overview and the list of trials with their hyperparameters and resulting accuracy (screenshots omitted).

Step 6: Model Evaluation — The Taste Test

Before serving your dish, you need to taste it to ensure it’s perfect. We do the same with our model by evaluating its performance.

6.1. Evaluate the Model

In src/evaluate.py, check how well your model performs:

import tensorflow as tf
from download_data import download_data
from preprocess import preprocess_data

# Only the test split is needed for evaluation
_, (test_images, test_labels) = download_data()
# Normalize the test images (the first return value is ignored here)
_, test_images = preprocess_data(test_images, test_images)

model = tf.keras.models.load_model("outputs/model")
loss, accuracy = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

Purpose of the Script

The primary purpose of the evaluate.py script is to load a pre-trained model, preprocess the test data, and then evaluate the model’s performance on this data. This evaluation step is essential for understanding the model’s accuracy and identifying any potential issues before deploying the model to production.

Evaluation in Azure MLOps

In the context of Azure MLOps, the evaluation step is critical for several reasons:

  1. Model Validation: Before deploying a model, it is essential to validate its performance on a separate test dataset. This ensures that the model is not overfitting to the training data and can generalize well to new data.
  2. Performance Metrics: The script calculates key performance metrics, such as accuracy, which are logged and tracked in Azure ML. These metrics provide a quantitative measure of the model’s performance and are used to compare different models and hyperparameter configurations.
  3. Reproducibility: By using a standardized evaluation script, the evaluation process is reproducible. This means that the same evaluation can be performed consistently across different models and datasets, ensuring reliable and comparable results.
  4. Integration with MLOps Pipelines: The evaluation script can be integrated into an Azure ML pipeline, allowing for automated evaluation of models as part of the continuous integration and continuous deployment (CI/CD) process. This integration ensures that models are thoroughly tested before deployment, reducing the risk of deploying underperforming models.
  5. Feedback Loop: The results of the evaluation can be used to inform further iterations of model training and hyperparameter tuning. By analyzing the evaluation metrics, data scientists can identify areas for improvement and refine their models accordingly.

HyperDrive and Evaluation

When used in conjunction with HyperDrive, the evaluation script plays a vital role in the hyperparameter tuning process. After HyperDrive identifies the best hyperparameter configurations, the evaluation script is used to validate the performance of the resulting models. This validation step ensures that the models selected by HyperDrive are indeed the best-performing models and are ready for deployment.

By incorporating the evaluation script into the MLOps pipeline, organizations can ensure that their models are not only optimized for performance but also validated and ready for production use. This comprehensive approach to model evaluation and validation is essential for maintaining high standards of model quality and reliability in Azure MLOps.

Step 7: Model Deployment — Serving Your Dish

It’s time to serve your perfectly cooked dish! We’ll deploy our model so it can be accessed by others.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    Environment,
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    CodeConfiguration
)
from azure.identity import DefaultAzureCredential
import os
import time

# Load the workspace
credential = DefaultAzureCredential()
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
resource_group = os.getenv("AZURE_RESOURCE_GROUP")
workspace_name = os.getenv("AZURE_WORKSPACE_NAME")

client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Define the environment
env = Environment(
    name="cifar10-env",
    conda_file="aml_config/environment.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04",
)

client.environments.create_or_update(env)

# Create the managed online endpoint
endpoint = ManagedOnlineEndpoint(
    name="cifar10-endpoint",
    description="Endpoint for CIFAR-10 model",
    auth_mode="key"
)

# Wait for the endpoint to be provisioned before deploying to it
client.online_endpoints.begin_create_or_update(endpoint).result()

# Grab the latest registered version of the model
latest_model_version = max(
    int(m.version) for m in client.models.list(name="cifar10-model")
)
model = client.models.get("cifar10-model", latest_model_version)

# Create the deployment configuration
deployment_config = ManagedOnlineDeployment(
    name="cifar10-deployment",
    endpoint_name="cifar10-endpoint",
    model=model,
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment="cifar10-env@latest",
    instance_count=1,
    instance_type="Standard_DS3_v2"  # Example instance type, adjust as needed
)

client.online_deployments.begin_create_or_update(deployment_config)

# Poll the deployment until it reaches a terminal state
deployment = client.online_deployments.get(name="cifar10-deployment", endpoint_name="cifar10-endpoint")

# Function to wait for deployment completion
def wait_for_deployment_completion(ml_client, deployment, timeout=3600, interval=30):
    elapsed_time = 0
    while deployment.provisioning_state not in ["Succeeded", "Failed", "Canceled"] and elapsed_time < timeout:
        time.sleep(interval)
        elapsed_time += interval
        deployment = ml_client.online_deployments.get(name=deployment.name, endpoint_name="cifar10-endpoint")
        print(f"Deployment state: {deployment.provisioning_state}, elapsed time: {elapsed_time} seconds")

    if deployment.provisioning_state == "Succeeded":
        print("Deployment completed successfully.")
    else:
        print(f"Deployment failed with state: {deployment.provisioning_state}")

# Wait for the deployment to complete
wait_for_deployment_completion(client, deployment)

The deploy.py script is designed to deploy a trained machine learning model to an Azure Managed Online Endpoint. This deployment process is a critical step in making the model available for real-time predictions and integrating it into production systems.
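The deployment references src/score.py as its scoring script, which isn't shown above. Here is a minimal sketch of what such a script could look like, following the init/run contract that Azure ML scoring containers expect; the "model" subfolder and the JSON request shape ({"data": [...]} with raw 0–255 images) are assumptions:

# src/score.py (illustrative sketch)
import json
import os

import numpy as np
import tensorflow as tf

model = None

def init():
    # AZUREML_MODEL_DIR points at the registered model's root in the container;
    # the "model" subfolder is assumed to match model.save("outputs/model").
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = tf.keras.models.load_model(os.path.join(model_dir, "model"))

def run(raw_data):
    # Expect {"data": [[...32x32x3 image...], ...]} with raw 0-255 pixel values.
    images = np.array(json.loads(raw_data)["data"], dtype=np.float32) / 255.0
    predictions = model.predict(images)
    return {"predictions": predictions.argmax(axis=1).tolist()}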

The primary purpose of the deploy.py script is to automate the deployment of a machine learning model to an Azure Managed Online Endpoint. This involves setting up the environment, creating the endpoint, configuring the deployment, and ensuring that the model is ready to serve predictions.

When used in conjunction with HyperDrive, the deployment script plays a vital role in the MLOps pipeline. After HyperDrive identifies the best hyperparameter configurations and the evaluation script validates the model’s performance, the deployment script ensures that the best-performing model is deployed to production. This end-to-end automation, from hyperparameter tuning to deployment, is essential for maintaining a robust and efficient MLOps workflow.

By incorporating the deployment script into the MLOps pipeline, organizations can ensure that their models are not only optimized and validated but also deployed in a scalable, reproducible, and secure manner. This comprehensive approach to model deployment is crucial for delivering reliable and high-performing machine learning solutions in production environments.
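Once the deployment succeeds, it's worth smoke-testing the endpoint. A sketch using the same MLClient, where sample_request.json is a hypothetical file matching the JSON shape your score.py expects:

# Smoke-test the deployed endpoint with a sample payload.
response = client.online_endpoints.invoke(
    endpoint_name="cifar10-endpoint",
    deployment_name="cifar10-deployment",
    request_file="sample_request.json",  # hypothetical, e.g. {"data": [[...]]}
)
print(response)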

Step 8: CI/CD Pipeline with GitHub Actions — Automating the Kitchen

Imagine having a robot chef that automatically cooks, tastes, and serves your dishes. That’s what our CI/CD pipeline does! It automates everything from training to deployment.

8.1. The GitHub Actions Workflow

In .github/workflows/mlops-pipeline.yml, define the steps for our robot chef:

name: MLOps Pipeline

on:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  setup-infra:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: "Az CLI Login"
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Deploy Azure Resources
        run: |
          az deployment sub create \
            --location westeurope \
            --template-file bicep/main.bicep \
            --parameters spnObjectId=${{ secrets.AZURE_OBJECT_ID }} resourceGroupName=${{ secrets.AZURE_RESOURCE_GROUP }} workspaceName=${{ secrets.AZURE_WORKSPACE_NAME }} clusterName=gpu-cluster storageAccountName=mlopsstorageacct0026

      - name: Generate config.json
        run: |
          mkdir -p .azureml
          echo '{
            "subscription_id": "'"${{ secrets.AZURE_SUBSCRIPTION_ID }}"'",
            "resource_group": "'"${{ secrets.AZURE_RESOURCE_GROUP }}"'",
            "workspace_name": "'"${{ secrets.AZURE_WORKSPACE_NAME }}"'"
          }' > config.json

      - name: Upload config.json as artifact
        uses: actions/upload-artifact@v3
        with:
          name: config
          path: config.json

  build-and-train:
    runs-on: ubuntu-latest
    needs: setup-infra
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: Install dependencies
        run: |
          pip install azureml-sdk tensorflow mlflow

      - name: Train the model
        run: |
          python src/train.py
        env:
          AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          AZURE_RESOURCE_GROUP: ${{ secrets.AZURE_RESOURCE_GROUP }}
          AZURE_WORKSPACE_NAME: ${{ secrets.AZURE_WORKSPACE_NAME }}

      - name: Upload model as artifact
        uses: actions/upload-artifact@v3
        with:
          name: model
          path: outputs/model

  hyperparameter-tuning:
    runs-on: ubuntu-latest
    needs: build-and-train
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Download model
        uses: actions/download-artifact@v3
        with:
          name: model
          path: outputs/model

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: "Az CLI Login"
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Download config.json
        uses: actions/download-artifact@v3
        with:
          name: config

      - name: Install dependencies
        run: |
          pip install azure-ai-ml azure-identity tensorflow mlflow

      - name: Hyperparameter tuning
        run: |
          python src/hyperdrive_config.py
        env:
          AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          AZURE_RESOURCE_GROUP: ${{ secrets.AZURE_RESOURCE_GROUP }}
          AZURE_WORKSPACE_NAME: ${{ secrets.AZURE_WORKSPACE_NAME }}

  evaluate:
    runs-on: ubuntu-latest
    needs: hyperparameter-tuning
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Download model
        uses: actions/download-artifact@v3
        with:
          name: model
          path: outputs/model

      - name: Download config.json
        uses: actions/download-artifact@v3
        with:
          name: config

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: "Az CLI Login"
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Install dependencies
        run: |
          pip install azure-ai-ml azure-identity tensorflow mlflow

      - name: Evaluate the model
        run: |
          python src/evaluate.py

  deploy:
    runs-on: ubuntu-latest
    needs: evaluate
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Download config.json
        uses: actions/download-artifact@v3
        with:
          name: config

      - name: Download model
        uses: actions/download-artifact@v3
        with:
          name: model
          path: outputs/model

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.8"

      - name: "Az CLI Login"
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Install dependencies
        run: |
          pip install azure-ai-ml azure-identity tensorflow mlflow

      - name: Deploy the model
        run: |
          python src/deploy.py
        env:
          AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          AZURE_RESOURCE_GROUP: ${{ secrets.AZURE_RESOURCE_GROUP }}
          AZURE_WORKSPACE_NAME: ${{ secrets.AZURE_WORKSPACE_NAME }}

MLOps, or Machine Learning Operations, is a critical practice that bridges the gap between the development and deployment of machine learning models. It encompasses a set of best practices, tools, and processes that ensure the efficient, scalable, and reliable production of machine learning solutions. By integrating principles from DevOps, data engineering, and machine learning, MLOps aims to streamline the entire machine learning lifecycle, from data ingestion and preprocessing to model training, evaluation, and deployment.

One of the key benefits of MLOps is its ability to automate and standardize workflows, which enhances reproducibility and consistency. This automation reduces the manual effort required for model development and deployment, allowing data scientists and engineers to focus on innovation and improvement. Additionally, MLOps facilitates continuous integration and continuous deployment (CI/CD) of machine learning models, ensuring that updates and new models can be seamlessly integrated into production environments.

MLOps also emphasizes the importance of monitoring and managing deployed models. This includes tracking performance metrics, detecting anomalies, and ensuring that models remain accurate and reliable over time. By providing tools for version control, experiment tracking, and hyperparameter tuning, MLOps enables teams to maintain high standards of model quality and performance.
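The repo's drift_detection.py isn't shown in this post, but as a flavor of what monitoring can look like, here is a hypothetical sketch (the repo's actual script may differ) that flags drift when the pixel-intensity distribution of incoming data diverges from the training baseline:

# Hypothetical drift check: compare incoming images against the training baseline.
import numpy as np

def detect_drift(baseline_images, incoming_images, threshold=0.05):
    baseline_mean = baseline_images.mean()
    incoming_mean = incoming_images.mean()
    drift = abs(incoming_mean - baseline_mean)
    return drift > threshold, drift

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.uniform(0, 1, size=(1000, 32, 32, 3))
    incoming = rng.uniform(0.1, 1, size=(100, 32, 32, 3))  # slightly shifted
    drifted, score = detect_drift(baseline, incoming)
    print(f"Drift detected: {drifted} (score={score:.3f})")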

In the context of Azure Machine Learning, MLOps leverages powerful cloud-based tools and services to support the entire machine learning lifecycle. From HyperDrive for hyperparameter tuning to Managed Online Endpoints for scalable deployment, Azure ML provides a comprehensive platform for implementing MLOps practices. This integration ensures that machine learning models are not only optimized and validated but also deployed in a secure, scalable, and reproducible manner.

In summary, MLOps is essential for transforming machine learning projects from experimental phases to robust, production-ready solutions. By adopting MLOps practices, organizations can achieve greater efficiency, scalability, and reliability in their machine learning workflows, ultimately driving better business outcomes and innovation.

Next Steps

  1. Experimentation: Try adding more recipes (models) or integrating a feature store.
  2. Scaling: Deploy on AKS for high availability.
  3. Automation: Further refine CI/CD with automated rollback and multi-model versioning.
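For the rollback idea, managed online endpoints support traffic splitting across deployments, which gives you a blue/green pattern. A sketch using the MLClient from the deploy script, assuming a second deployment named cifar10-deployment-v2 already exists:

# Blue/green rollback sketch: shift traffic between two existing deployments.
endpoint = client.online_endpoints.get("cifar10-endpoint")

# Canary the new version with 10% of traffic...
endpoint.traffic = {"cifar10-deployment": 90, "cifar10-deployment-v2": 10}
client.online_endpoints.begin_create_or_update(endpoint).result()

# ...and roll back instantly by routing everything to the old deployment.
endpoint.traffic = {"cifar10-deployment": 100, "cifar10-deployment-v2": 0}
client.online_endpoints.begin_create_or_update(endpoint).result()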
