Semantic Kernel, Prompt Flow: Evaluate your plugins
Once again, I’ve embarked on a journey to grasp the fundamentals of deep learning and machine learning. My motivation stems from a desire to enhance my mastery of Azure OpenAI and gain a deeper understanding of its inner workings. As I delve into this realm, I’ve come to appreciate the significance of evaluating the model. This realization has sparked my curiosity about the methods employed in OpenAI for assessing the outcomes.
In a previous post, I delved into the concepts of Semantic Kernel and touched on some aspects of planners. Now, I want to shine a light on yet another intriguing tool: Prompt Flow.
Determining whether your descriptions are effective can be a challenge. In this segment, we will look at how you can leverage Prompt Flow to evaluate both plugins and planners and make sure they consistently produce the desired results. From the Microsoft Docs:
In the overview and planner articles, we demonstrated the importance of providing descriptions for your plugins so planners can effectively use them for autogenerated plans. Knowing whether or not your descriptions are effective, however, can be difficult. In this section, we’ll describe how you can use Prompt flow to evaluate plugins and planners to ensure that they are consistently producing the desired results.
If we parallel this with building our own model (seen in the previous Deep Learning posts), we understand that evaluation is a major concept in the AI field.
So, is evaluating plugins all it does? Nope:
- Empower Prompt Flow with the capabilities of planners: Prompt Flow excels at defining and executing static chains of functions, which suits many AI applications. However, it falls short in scenarios where you expect an AI application to dynamically adapt to new inputs and situations. This is where Semantic Kernel comes into play.
- Streamline the evaluation of Semantic Kernel: with Prompt Flow, you can harness the capabilities of Azure ML to assess the accuracy, performance, and error rates of your plugins and planners.
- Effortlessly deploy Semantic Kernel to Azure ML: finally, Prompt Flow’s deployment feature lets you deploy your Semantic Kernel applications to Azure Machine Learning with minimal effort, streamlining the whole deployment process.
Let’s Dive into Prompt Flow!
Okay, last time we created a Sherlock plugin that lets us fetch an investigation document and try to solve the case using our ChatGPT-like assistant!
Let’s now create a plugin and use Prompt Flow on top of it to evaluate our model... oops, sorry, let me say that again: to evaluate our PLUGIN! (Yes, the steps look a lot like the Deep Learning ones.)
Requirements:
In VS Code, you will need the Prompt Flow for VS Code extension:
You also need the promptflow package:
pip install promptflow promptflow-tools
Everything ready? Let’s go!
Plugin’s PromptFlow
- Create a PromptFlow:
pf flow init --flow performSherlock
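This scaffolds a new flow folder. The exact contents can vary with your promptflow version, but the files we care about here are the flow definition, the default Python node that we will replace with our planner code, and a requirements.txt for extra dependencies:
performSherlock/
├── flow.dag.yaml       # the flow definition we will edit below
├── hello.py            # the default Python node, soon to hold our planner code
└── requirements.txt    # extra Python dependencies for the flow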
- Create a Plugin:
Let’s first create a plugin; mine will simply answer two different questions, to keep things simple:
from semantic_kernel.skill_definition import (
    sk_function,
    sk_function_context_parameter,
)
from semantic_kernel.orchestration.sk_context import SKContext


class Sherlock:
    @sk_function(
        description="Returns the wheels number of a car",
        name="wheels"
    )
    def wheels_car(self) -> str:
        return "4"

    @sk_function(
        description="Returns the number of glasses in Marine's House",
        name="windows",
    )
    def windows_house(self) -> str:
        return "20"
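Before wiring this into a flow, a quick sanity check never hurts. Here is a minimal sketch that calls the native functions directly, with no kernel or planner involved (assuming we are in the same module as the Sherlock class above, or have imported it):
# Minimal sanity check of the plugin, outside of any kernel or planner
plugin = Sherlock()
print(plugin.wheels_car())     # expected: "4"
print(plugin.windows_house())  # expected: "20"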
- Create a Planner:
import asyncio

from promptflow import tool
import semantic_kernel as sk
from semantic_kernel.planning.action_planner import ActionPlanner
from plugins.SherlockPlugin.Sherlock import Sherlock as Sherlock
from promptflow.connections import (
    AzureOpenAIConnection,
)
from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    AzureTextCompletion,
)
import semantic_kernel.connectors.ai.open_ai as sk_oai


@tool
def my_python_tool(
    input: str,
    deployment_type: str,
    deployment_name: str,
    AzureOpenAIConnection: AzureOpenAIConnection,
) -> str:
    # Initialize the kernel
    kernel = sk.Kernel(log=sk.NullLogger())
    print(AzureOpenAIConnection)

    chat_service = sk_oai.AzureChatCompletion(
        deployment_name=deployment_name,
        endpoint=AzureOpenAIConnection.api_base,
        api_key=AzureOpenAIConnection.api_key,
        api_version="2023-12-01-preview",
    )
    kernel.add_chat_service("chat-gpt", chat_service)

    planner = ActionPlanner(kernel=kernel)

    # Import the native functions
    Sherlock_plugin = kernel.import_skill(Sherlock(), "SherlockPlugin")
    print("Kernel")
    print(kernel)

    ask = "Use the available Sherlock functions to solve this word problem: " + input
    plan = asyncio.run(planner.create_plan_async(ask))
    print("MY QUESTION :")
    print(plan)

    # Execute the plan
    result = asyncio.run(plan.invoke_async()).result

    for index, step in enumerate(plan._steps):
        print("Function: " + step.skill_name + "." + step._function.name)
        print("Input vars: " + str(step.parameters.variables))
        print("Output vars: " + str(step._outputs))
    print("Result: " + str(result))

    return str(result)
- Modify our YAML file to create a flow:
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
environment:
  python_requirements_txt: requirements.txt
inputs:
  text:
    type: string
    default: How many windows has marine ?
outputs:
  output_prompt:
    type: string
    reference: ${echo_my_prompt.output}
nodes:
- name: echo_my_prompt
  type: python
  source:
    type: code
    path: hello.py
  inputs:
    AzureOpenAIConnection: sherlock_plugin
    input: ${inputs.text}
    deployment_type: Standard
    deployment_name: gpt-35-turbo
Now we need an “AzureOpenAIConnection”. How do we create it?
In the Connections tab, create your connection.
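If you prefer the command line over the extension’s Connections tab, you can also create the connection from a YAML file. Here is a sketch of what that could look like (the exact schema may differ slightly between promptflow versions; the name must match the sherlock_plugin connection referenced in the flow above):
# connection.yaml: assumed layout for an Azure OpenAI connection
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/AzureOpenAIConnection.schema.json
name: sherlock_plugin
type: azure_open_ai
api_key: "<your-azure-openai-key>"
api_base: "https://<your-resource>.openai.azure.com/"
api_type: azure
api_version: "2023-12-01-preview"
Then register it with:
pf connection create --file connection.yaml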
- Run the PromptFlow:
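You can run it from the VS Code extension, or, if you prefer the terminal, test a single line with the CLI (a sketch, assuming you run it from inside the performSherlock folder):
pf flow test --flow . --inputs text="How many windows has marine ?"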
Looks good! How do we run a batch of questions?
Well, I have created a dataset file:
{"text": "How many wheels have a car","groundtruth":"4"}
{"text": "How many windows marine has","groundtruth":"20"}
Let’s run a batch by clicking here:
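If you would rather start the batch from the CLI than from the button, the command should look roughly like this (a sketch, assuming data.jsonl sits next to the flow folder):
pf run create --flow ./performSherlock --data ./data.jsonl --column-mapping text='${data.text}' --stream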
Once it’s done, let’s check our results:
Now that we have our outputs, let’s evaluate them.
Evaluation PromptFlow
Remember what we talked about in the last blog post on neural network classification? The loss and the accuracy? Well, let’s do the same to evaluate our Planner & Plugin:
- Create an aggregation function:
from typing import List

from promptflow import tool
from promptflow import log_metric


@tool
def accuracy_aggregate(processed_results: List[int]):
    num_exception = 0
    num_correct = 0

    for i in range(len(processed_results)):
        if processed_results[i] == -1:
            num_exception += 1
        elif processed_results[i] == 1:
            num_correct += 1

    num_total = len(processed_results)
    accuracy = round(1.0 * num_correct / num_total, 2)
    error_rate = round(1.0 * num_exception / num_total, 2)

    log_metric(key="accuracy", value=accuracy)
    log_metric(key="error_rate", value=error_rate)

    return {
        "num_total": num_total,
        "num_correct": num_correct,
        "num_exception": num_exception,
        "accuracy": accuracy,
        "error_rate": error_rate
    }


if __name__ == "__main__":
    # Quick local test: each entry is a processed result code (1 = correct, 0 = wrong, -1 = exception)
    numbers = [1, 1]
    accuracy = accuracy_aggregate(numbers)
    print("The accuracy is", accuracy)
- Create a line processing function:
from promptflow import tool


@tool
def line_process(groundtruth: str, prediction: str) -> int:
    processed_result = 0

    if prediction == "JSONDecodeError" or prediction.startswith("Unknown Error:"):
        processed_result = -1
        return processed_result

    try:
        groundtruth = int(groundtruth)
        prediction = int(prediction)
    except ValueError:
        processed_result = -1
        return processed_result

    if round(prediction, 2) == round(groundtruth, 2):
        processed_result = 1

    return processed_result


if __name__ == "__main__":
    processed_result = line_process("2", "2")
    print("The processed result is", processed_result)

    processed_result = line_process("2", "3")
    print("The processed result is", processed_result)

    processed_result = line_process("20", "2")
    print("The processed result is", processed_result)
- Create the Flow:
$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json
inputs:
  groundtruth:
    type: string
    default: "1"
  prediction:
    type: string
    default: "2"
outputs:
  score:
    type: string
    reference: ${line_process.output}
nodes:
- name: line_process
  type: python
  source:
    type: code
    path: line_process.py
  inputs:
    groundtruth: ${inputs.groundtruth}
    prediction: ${inputs.prediction}
- name: aggregate
  type: python
  source:
    type: code
    path: aggregate.py
  inputs:
    processed_results: ${line_process.output}
  aggregation: true
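Before running it against the batch outputs, you can sanity-check the scoring logic on a single line (a sketch; the aggregate node is meant for batch runs, so this mainly exercises line_process):
pf flow test --flow ./sherlockEvaluation --inputs groundtruth="20" prediction="20"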
Remember, all of this is to test the plugin we just created. So let’s calculate its accuracy:
pf run create --flow C:\Users\aminecharot\Documents\file\openAiProject01\backend\sherlockEvaluation --data ./data.jsonl --column-mapping groundtruth='${data.groundtruth}' prediction='${run.outputs.output_prompt}' --run performSherlock_default_20240112_155443_440000 --stream --name pe001
In this command, I am instructing it to execute the ‘sherlockEvaluation’ flow on the data stored in ‘./data.jsonl’ (created in a previous step). The evaluation essentially needs both the ground truth and the predicted output. We obtain the predicted output from the plugin’s prompt flow batch run, through the mapping prediction='${run.outputs.output_prompt}', taken from the run named 'performSherlock_default_20240112_155443_440000'.
To sum it up, this command launches the evaluation prompt flow, reusing the outputs obtained from the batch run of the plugin’s prompt flow.
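Once the evaluation run finishes, the aggregated metrics can also be pulled back from the CLI (a sketch, reusing the pe001 run name from the command above):
pf run show-metrics --name pe001
pf run show-details --name pe001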
Since it was not a huge plugin, the accuracy should be 100%:
Here it is!