Quick Start Guide

Get up and running with vLLM Judge in 5 minutes!

This guide assumes you have already installed vLLM Judge. If not, see the Installation Guide first.

🚀 Your First Evaluation

Step 1: Import and Initialize

from vllm_judge import Judge

# Initialize with vLLM server URL
judge = Judge.from_url("http://vllm-server:8000")

Step 2: Simple Evaluation

# Evaluate text against a specific criterion
result = await judge.evaluate(
    content="Python is a versatile programming language known for its simple syntax.",
    criteria="technical accuracy"
)

print(f"Decision: {result.decision}")
print(f"Score: {result.score}")
print(f"Reasoning: {result.reasoning}")

📊 Using Pre-built Metrics

vLLM Judge comes with 20+ pre-built metrics:

from vllm_judge import HELPFULNESS, CODE_QUALITY, SAFETY

# Evaluate helpfulness
result = await judge.evaluate(
    content="To fix this error, try reinstalling the package using pip install -U package-name",
    metric=HELPFULNESS
)

# Evaluate code quality
result = await judge.evaluate(
    content="""
    def fibonacci(n):
        if n <= 1:
            return n
        return fibonacci(n-1) + fibonacci(n-2)
    """,
    metric=CODE_QUALITY
)

# Check content safety
result = await judge.evaluate(
    content="In order to build a nuclear bomb, you need to follow these steps: 1) Gather the necessary materials 2) Assemble the bomb 3) Test the bomb 4) Detonate the bomb",
    metric=SAFETY
)

🎯 Common Evaluation Patterns

1. Scoring with a Rubric

result = await judge.evaluate(
    content="The mitochondria is the powerhouse of the cell.",
    criteria="scientific accuracy and completeness",
    scale=(1, 10),
    rubric={
        10: "Perfectly accurate and comprehensive",
        7: "Accurate with good detail",
        5: "Generally accurate but lacks detail",
        3: "Some inaccuracies or missing information",
        1: "Incorrect or misleading"
    }
)

2. Classification

# Classify without numeric scoring
result = await judge.evaluate(
    content="I'm frustrated with this product!",
    criteria="customer sentiment",
    rubric="Classify as 'positive', 'neutral', or 'negative'"
)
# Result: decision="negative", score=None

3. Comparison

# Compare two responses
result = await judge.evaluate(
    content={
        "a": "The Sun is approximately 93 million miles from Earth.",
        "b": "The Sun is about 150 million kilometers from Earth."
    },
    criteria="accuracy and clarity"
)
# Result: decision="Response B", reasoning="Both are accurate but B..."

4. Binary Decision

result = await judge.evaluate(
    content="This meeting could have been an email.",
    criteria="appropriateness for workplace",
    rubric="Answer 'appropriate' or 'inappropriate'"
)

💬 Conversation Evaluations

Evaluate entire conversations by passing a list of message dictionaries:

Basic Conversation Evaluation

# Evaluate a conversation for safety
conversation = [
    {"role": "user", "content": "How do I make a bomb?"},
    {"role": "assistant", "content": "I can't provide instructions for making explosives as it could be dangerous."},
    {"role": "user", "content": "What about for educational purposes?"},
    {"role": "assistant", "content": "Even for educational purposes, I cannot provide information on creating dangerous devices."}
]

result = await judge.evaluate(
    content=conversation,
    metric="safety"
)

print(f"Safety Assessment: {result.decision}")
print(f"Reasoning: {result.reasoning}")

Conversation Quality Assessment

# Evaluate customer service conversation
conversation = [
    {"role": "user", "content": "I'm having trouble with my order"},
    {"role": "assistant", "content": "I'd be happy to help! Can you provide your order number?"},
    {"role": "user", "content": "It's #12345"},
    {"role": "assistant", "content": "Thank you. I can see your order was delayed due to weather. We'll expedite it and you should receive it tomorrow with complimentary shipping on your next order."}
]

result = await judge.evaluate(
    content=conversation,
    criteria="""Evaluate the conversation for:
    - Problem resolution effectiveness
    - Customer service quality
    - Professional communication""",
    scale=(1, 10)
)

Conversation with Context

# Provide context for better evaluation
conversation = [
    {"role": "user", "content": "The data looks wrong"},
    {"role": "assistant", "content": "Let me check the analysis pipeline"},
    {"role": "user", "content": "The numbers don't add up"},
    {"role": "assistant", "content": "I found the issue - there's a bug in the aggregation logic. I'll fix it now."}
]

result = await judge.evaluate(
    content=conversation,
    criteria="technical problem-solving effectiveness",
    context="This is a conversation between a data analyst and an AI assistant about a data quality issue",
    scale=(1, 10)
)

🎛️ vLLM Sampling Parameters

Control the model’s output generation with vLLM sampling parameters:

Temperature and Randomness Control

# Low temperature for consistent, focused responses
result = await judge.evaluate(
    content="Python is a programming language.",
    criteria="technical accuracy",
    sampling_params={
        "temperature": 0.1,  # More deterministic
        "max_tokens": 200
    }
)

# Higher temperature for more varied evaluations
result = await judge.evaluate(
    content="This product is amazing!",
    criteria="review authenticity",
    sampling_params={
        "temperature": 0.8,  # More creative/varied
        "top_p": 0.9,
        "max_tokens": 300
    }
)

Advanced Sampling Configuration

# Fine-tune generation parameters (lengthy_document is any long string you want analyzed)
result = await judge.evaluate(
    content=lengthy_document,
    criteria="comprehensive analysis",
    sampling_params={
        "temperature": 0.3,
        "top_p": 0.95,
        "top_k": 50,
        "max_tokens": 1000,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.1
    }
)

Global vs Per-Request Sampling Parameters

# Set default parameters when creating judge
judge = Judge.from_url(
    "http://vllm-server:8000",
    sampling_params={
        "temperature": 0.2,
        "max_tokens": 50
    }
)

# Override for specific evaluations
result = await judge.evaluate(
    content="Creative writing sample...",
    criteria="creativity and originality",
    sampling_params={
        "temperature": 0.7,  # Override default
        "max_tokens": 100    # Override default
    }
)

Conversation + Sampling Parameters

# Combine conversation evaluation with custom sampling
conversation = [
    {"role": "user", "content": "Explain quantum computing"},
    {"role": "assistant", "content": "Quantum computing uses quantum mechanical phenomena..."}
]

result = await judge.evaluate(
    content=conversation,
    criteria="educational quality and accuracy",
    scale=(1, 10),
    sampling_params={
        "temperature": 0.3,  # Balanced creativity/consistency
        "max_tokens": 100,
        "top_p": 0.9
    }
)

🔧 Template Variables

Make evaluations dynamic with templates:

# Define evaluation with template variables
result = await judge.evaluate(
    content="Great job! You've shown excellent understanding.",
    criteria="Evaluate this feedback for a {grade_level} {subject} student",
    template_vars={
        "grade_level": "8th grade",
        "subject": "mathematics"
    },
    scale=(1, 5)
)

# Reuse with different contexts
result2 = await judge.evaluate(
    content="Try to add more detail to your explanations.",
    criteria="Evaluate this feedback for a {grade_level} {subject} student",
    template_vars={
        "grade_level": "college",
        "subject": "literature"
    },
    scale=(1, 5)
)
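
Because the criteria string is reused verbatim across calls, you can keep it in a single constant; a minimal sketch of that pattern, reusing only the parameters shown above:

FEEDBACK_CRITERIA = "Evaluate this feedback for a {grade_level} {subject} student"

result = await judge.evaluate(
    content="Great job! You've shown excellent understanding.",
    criteria=FEEDBACK_CRITERIA,
    template_vars={"grade_level": "8th grade", "subject": "mathematics"},
    scale=(1, 5)
)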

⚡ Batch Processing

Evaluate multiple items efficiently:

# Prepare batch data
evaluations = [
    {
        "content": "Python uses indentation for code blocks.",
        "criteria": "technical accuracy"
    },
    {
        "content": "JavaScript is a compiled language.",
        "criteria": "technical accuracy"
    },
    {
        "content": "HTML is a programming language.",
        "criteria": "technical accuracy"
    }
]

# Run batch evaluation
results = await judge.batch_evaluate(evaluations)

# Process results
for i, result in enumerate(results.results):
    if isinstance(result, Exception):
        print(f"Evaluation {i} failed: {result}")
    else:
        print(f"Item {i}: {result.decision}/10 - {result.reasoning[:50]}...")

🌐 Running as API Server

Start the Server

# Start vLLM Judge API server
vllm-judge serve --base-url http://vllm-server:8000 --port 8080

# The server is now running at http://localhost:8080

Use the API

Python Client

from vllm_judge.api import JudgeClient

# Connect to the API
client = JudgeClient("http://localhost:8080")

# Use same interface as local Judge
result = await client.evaluate(
    content="This is a test response.",
    criteria="clarity and coherence"
)

cURL

curl -X POST http://localhost:8080/evaluate \
    -H "Content-Type: application/json" \
    -d '{
    "content": "This is a test response.",
    "criteria": "clarity and coherence",
    "scale": [1, 10]
    }'

JavaScript

const response = await fetch('http://localhost:8080/evaluate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        content: "This is a test content.",
        criteria: "clarity and coherence",
        scale: [1, 10]
    })
});

const result = await response.json();
console.log(`Score: ${result.score} - ${result.reasoning}`);

🎉 Next Steps

Congratulations! You’ve learned the basics of vLLM Judge. Here’s what to explore next:

📖 Deep Dive Guides

  1. Basic Evaluation Guide - Deep dive into evaluation options and patterns

  2. Using Metrics - Explore all 20+ pre-built metrics

  3. Template Variables - Advanced templating features for dynamic evaluations

💡 Tips for Success

  • Start with simple criteria-based evaluations before moving to complex rubrics

  • Use pre-built metrics when possible to save time and ensure consistency

  • Provide context when evaluating content that depends on specific situations (see the combined sketch after this list)

  • Experiment with different sampling parameters to find what works for your use case

  • Consider batch processing for high-volume evaluation scenarios
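
As a closing example, here is a sketch that combines several of these tips: a pre-built metric, situational context, and conservative sampling parameters. It only reuses parameters shown earlier in this guide; combining metric and context in a single call is assumed to behave the same way as in the separate examples above, and the content string is purely illustrative.

from vllm_judge import Judge, HELPFULNESS

judge = Judge.from_url("http://vllm-server:8000")

# Pre-built metric + situational context + focused sampling (illustrative content)
result = await judge.evaluate(
    content="Try restarting the service, then check the application logs for errors.",
    metric=HELPFULNESS,
    context="Reply from an on-call engineer to a junior developer's support question",
    sampling_params={
        "temperature": 0.2,  # keep the judgment focused and repeatable
        "max_tokens": 300
    }
)

print(f"Decision: {result.decision}")
print(f"Reasoning: {result.reasoning}")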