Running Custom Evaluations with LMEval Llama Stack External Eval Provider

Prerequisites

  • Admin access to an OpenShift cluster

  • The TrustyAI operator installed in your OpenShift cluster

  • KServe set to Raw Deployment mode

  • A language model deployed on vLLM Serving Runtime in your OpenShift cluster

Overview

This tutorial demonstrates how to evaluate a language model on a custom dataset using the LMEval Llama Stack External Eval Provider. While Eleuther’s lm-evaluation-harness comes with 100+ out-of-the-box tasks, you might want to create a custom task to better evaluate the knowledge and behavior of your model. To run evaluations over a custom task, we need to 1) upload the task dataset to our OpenShift cluster and 2) register it as a benchmark with Llama Stack.

In this tutorial, you will learn how to:

  • Register a custom benchmark dataset

  • Run a benchmark evaluation job on a language model

Usage

This tutorial extends Getting Started with LMEval Llama Stack External Provider; see the Usage and Configuring the Llama Stack Server sections there to start your Llama Stack server.

Upload Your Custom Task Dataset to OpenShift

With the Llama Stack server running, create a Python script or Jupyter notebook to interact with the server and run an evaluation.
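
If you do not already have a client from the Getting Started tutorial, the sketch below shows one way to connect to the running server. It assumes the llama-stack-client Python package is installed and that the server is listening at http://localhost:8321; adjust the base URL to match your deployment.

from llama_stack_client import LlamaStackClient

# Connect to the running Llama Stack server (adjust the URL/port for your deployment)
client = LlamaStackClient(base_url="http://localhost:8321")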

Create a PersistentVolumeClaim (PVC) object named my-pvc to store your task dataset on your OpenShift cluster:

oc apply -n <MODEL_NAMESPACE> -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF

Create a Pod object named dataset-storage-pod that mounts the PVC so you can copy the task dataset into it:

oc apply -n <MODEL_NAMESPACE> -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: dataset-storage-pod
spec:
  containers:
  - name: dataset-container
    image: 'quay.io/prometheus/busybox:latest'
    command: ["/bin/sh", "-c", "sleep 3600"]
    volumeMounts:
    - mountPath: "/data/upload_files"
      name: dataset-storage
  volumes:
  - name: dataset-storage
    persistentVolumeClaim:
      claimName: my-pvc
EOF

Copy your locally stored task dataset to the Pod. In this example, the dataset is named example-dk-bench-input-bmo.jsonl and we are copying it to the dataset-storage-pod under the path /data/upload_files/:

oc cp example-dk-bench-input-bmo.jsonl dataset-storage-pod:/data/upload_files/example-dk-bench-input-bmo.jsonl -n <MODEL_NAMESPACE>
Replace <MODEL_NAMESPACE> with the namespace where the language model you wish to evaluate is deployed.

Register the Custom Dataset as a Benchmark

Once the dataset is uploaded to the PVC, we can register it as a benchmark for evaluations. At a minimum, we need to provide the following metadata:

  • The TrustyAI LM-Eval Tasks GitHub repository URL, branch, commit SHA, and the path of the custom task within the repository

  • The PVC that holds the dataset and the path to the dataset file inside it

client.benchmarks.register(
    benchmark_id="trustyai_lmeval::dk-bench",
    dataset_id="trustyai_lmeval::dk-bench",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "custom_task": {
            "git": {
                "url": "https://github.com/trustyai-explainability/lm-eval-tasks.git",
                "branch": "main",
                "commit": "8220e2d73c187471acbe71659c98bccecfe77958",
                "path": "tasks/",
            }
        },
        "env": {
            # Path of the dataset inside the PVC
            "DK_BENCH_DATASET_PATH": "/opt/app-root/src/hf_home/example-dk-bench-input-bmo.jsonl",
            "JUDGE_MODEL_URL": "http://phi-3-predictor:8080/v1/chat/completions",
            # For simplicity, we use the same model as the one being evaluated
            "JUDGE_MODEL_NAME": "phi-3",
            "JUDGE_API_KEY": "",
        },
        "tokenized_requests": False,
        "tokenizer": "google/flan-t5-small",
        "input": {"storage": {"pvc": "my-pvc"}}
    },
)
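
As a quick sanity check, you can list the benchmarks registered with the server and confirm that trustyai_lmeval::dk-bench appears. This is a minimal sketch; the exact attributes on the returned objects (for example, identifier) may vary between llama-stack-client versions.

# Confirm the custom benchmark was registered
for benchmark in client.benchmarks.list():
    print(benchmark.identifier, benchmark.provider_id)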

Run a benchmark evaluation on your model:

job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::dk-bench",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256
            },
        },
        "num_examples": 1000,
    },
)

print(f"Starting job '{job.job_id}'")

Monitor the status of the evaluation job. The job will run asynchronously, so you can check its status periodically:

import time

def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)
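
If you need to stop a long-running evaluation, the eval jobs API also exposes a cancel operation. The call below is a sketch; the exact method signature may vary between llama-stack-client versions.

# Cancel the evaluation job if it is no longer needed
client.eval.jobs.cancel(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench")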

Get the job’s results:

import pprint

pprint.pprint(
    client.eval.jobs.retrieve(
        job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench"
    ).scores
)
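
Each key in scores is a scoring function name mapped to its results. Assuming the response follows the standard Llama Stack ScoringResult shape, with per-row score_rows and an aggregated_results summary, you can print a compact overview of the aggregated metrics:

# Print only the aggregated metrics for each scoring function
results = client.eval.jobs.retrieve(
    job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench"
).scores

for scoring_fn, result in results.items():
    print(f"{scoring_fn}: {result.aggregated_results}")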