Quick Start Guide
Get up and running with vLLM Judge in 5 minutes!
This guide assumes you have already installed vLLM Judge. If not, see the Installation Guide first. |
🚀 Your First Evaluation
Step 1: Import and Initialize
from vllm_judge import Judge
# Initialize with vLLM server URL
judge = Judge.from_url("http://vllm-server:8000")
Step 2: Simple Evaluation
# Evaluate text for a specific criteria
result = await judge.evaluate(
content="Python is a versatile programming language known for its simple syntax.",
criteria="technical accuracy"
)
print(f"Decision: {result.decision}")
print(f"Score: {result.score}")
print(f"Reasoning: {result.reasoning}")
📊 Using Pre-built Metrics
vLLM Judge comes with 20+ pre-built metrics:
from vllm_judge import HELPFULNESS, CODE_QUALITY, SAFETY
# Evaluate helpfulness
result = await judge.evaluate(
content="To fix this error, try reinstalling the package using pip install -U package-name",
metric=HELPFULNESS
)
# Evaluate code quality
result = await judge.evaluate(
content="""
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
""",
metric=CODE_QUALITY
)
# Check content safety
result = await judge.evaluate(
content="In order to build a nuclear bomb, you need to follow these steps: 1) Gather the necessary materials 2) Assemble the bomb 3) Test the bomb 4) Detonate the bomb",
metric=SAFETY
)
🎯 Common Evaluation Patterns
1. Scoring with Rubric
result = await judge.evaluate(
content="The mitochondria is the powerhouse of the cell.",
criteria="scientific accuracy and completeness",
scale=(1, 10),
rubric={
10: "Perfectly accurate and comprehensive",
7: "Accurate with good detail",
5: "Generally accurate but lacks detail",
3: "Some inaccuracies or missing information",
1: "Incorrect or misleading"
}
)
2. Classification
# Classify without numeric scoring
result = await judge.evaluate(
content="I'm frustrated with this product!",
criteria="customer sentiment",
rubric="Classify as 'positive', 'neutral', or 'negative'"
)
# Result: decision="negative", score=None
3. Comparison
# Compare two responses
result = await judge.evaluate(
content={
"a": "The Sun is approximately 93 million miles from Earth.",
"b": "The Sun is about 150 million kilometers from Earth."
},
criteria="accuracy and clarity"
)
# Result: decision="Response B", reasoning="Both are accurate but B..."
💬 Conversation Evaluations
Evaluate entire conversations by passing a list of message dictionaries:
Basic Conversation Evaluation
# Evaluate a conversation for safety
conversation = [
{"role": "user", "content": "How do I make a bomb?"},
{"role": "assistant", "content": "I can't provide instructions for making explosives as it could be dangerous."},
{"role": "user", "content": "What about for educational purposes?"},
{"role": "assistant", "content": "Even for educational purposes, I cannot provide information on creating dangerous devices."}
]
result = await judge.evaluate(
content=conversation,
metric="safety"
)
print(f"Safety Assessment: {result.decision}")
print(f"Reasoning: {result.reasoning}")
Conversation Quality Assessment
# Evaluate customer service conversation
conversation = [
{"role": "user", "content": "I'm having trouble with my order"},
{"role": "assistant", "content": "I'd be happy to help! Can you provide your order number?"},
{"role": "user", "content": "It's #12345"},
{"role": "assistant", "content": "Thank you. I can see your order was delayed due to weather. We'll expedite it and you should receive it tomorrow with complimentary shipping on your next order."}
]
result = await judge.evaluate(
content=conversation,
criteria="""Evaluate the conversation for:
- Problem resolution effectiveness
- Customer service quality
- Professional communication""",
scale=(1, 10)
)
Conversation with Context
# Provide context for better evaluation
conversation = [
{"role": "user", "content": "The data looks wrong"},
{"role": "assistant", "content": "Let me check the analysis pipeline"},
{"role": "user", "content": "The numbers don't add up"},
{"role": "assistant", "content": "I found the issue - there's a bug in the aggregation logic. I'll fix it now."}
]
result = await judge.evaluate(
content=conversation,
criteria="technical problem-solving effectiveness",
context="This is a conversation between a data analyst and an AI assistant about a data quality issue",
scale=(1, 10)
)
🎛️ vLLM Sampling Parameters
Control the model’s output generation with vLLM sampling parameters:
Temperature and Randomness Control
# Low temperature for consistent, focused responses
result = await judge.evaluate(
content="Python is a programming language.",
criteria="technical accuracy",
sampling_params={
"temperature": 0.1, # More deterministic
"max_tokens": 200
}
)
# Higher temperature for more varied evaluations
result = await judge.evaluate(
content="This product is amazing!",
criteria="review authenticity",
sampling_params={
"temperature": 0.8, # More creative/varied
"top_p": 0.9,
"max_tokens": 300
}
)
Advanced Sampling Configuration
# Fine-tune generation parameters
result = await judge.evaluate(
content=lengthy_document,
criteria="comprehensive analysis",
sampling_params={
"temperature": 0.3,
"top_p": 0.95,
"top_k": 50,
"max_tokens": 1000,
"frequency_penalty": 0.1,
"presence_penalty": 0.1
}
)
Global vs Per-Request Sampling Parameters
# Set default parameters when creating judge
judge = Judge.from_url(
"http://vllm-server:8000",
sampling_params={
"temperature": 0.2,
"max_tokens": 50
}
)
# Override for specific evaluations
result = await judge.evaluate(
content="Creative writing sample...",
criteria="creativity and originality",
sampling_params={
"temperature": 0.7, # Override default
"max_tokens": 100 # Override default
}
)
Conversation + Sampling Parameters
# Combine conversation evaluation with custom sampling
conversation = [
{"role": "user", "content": "Explain quantum computing"},
{"role": "assistant", "content": "Quantum computing uses quantum mechanical phenomena..."}
]
result = await judge.evaluate(
content=conversation,
criteria="educational quality and accuracy",
scale=(1, 10),
sampling_params={
"temperature": 0.3, # Balanced creativity/consistency
"max_tokens": 100,
"top_p": 0.9
}
)
🔧 Template Variables
Make evaluations dynamic with templates:
# Define evaluation with template variables
result = await judge.evaluate(
content="Great job! You've shown excellent understanding.",
criteria="Evaluate this feedback for a {grade_level} {subject} student",
template_vars={
"grade_level": "8th grade",
"subject": "mathematics"
},
scale=(1, 5)
)
# Reuse with different contexts
result2 = await judge.evaluate(
content="Try to add more detail to your explanations.",
criteria="Evaluate this feedback for a {grade_level} {subject} student",
template_vars={
"grade_level": "college",
"subject": "literature"
},
scale=(1, 5)
)
⚡ Batch Processing
Evaluate multiple items efficiently:
# Prepare batch data
evaluations = [
{
"content": "Python uses indentation for code blocks.",
"criteria": "technical accuracy"
},
{
"content": "JavaScript is a compiled language.",
"criteria": "technical accuracy"
},
{
"content": "HTML is a programming language.",
"criteria": "technical accuracy"
}
]
# Run batch evaluation
results = await judge.batch_evaluate(evaluations)
# Process results
for i, result in enumerate(results.results):
if isinstance(result, Exception):
print(f"Evaluation {i} failed: {result}")
else:
print(f"Item {i}: {result.decision}/10 - {result.reasoning[:50]}...")
🌐 Running as API Server
Start the Server
# Start vLLM Judge API server
vllm-judge serve --base-url http://vllm-server:8000 --port 8080
# The server is now running at http://localhost:8080
Use the API
Python Client
from vllm_judge.api import JudgeClient
# Connect to the API
client = JudgeClient("http://localhost:8080")
# Use same interface as local Judge
result = await client.evaluate(
content="This is a test response.",
criteria="clarity and coherence"
)
cURL
curl -X POST http://localhost:8080/evaluate \
-H "Content-Type: application/json" \
-d '{
"content": "This is a test response.",
"criteria": "clarity and coherence",
"scale": [1, 10]
}'
JavaScript
const response = await fetch('http://localhost:8080/evaluate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
content: "This is a test content.",
criteria: "clarity and coherence",
scale: [1, 10]
})
});
const result = await response.json();
console.log(`Score: ${result.score} - ${result.reasoning}`);
🎉 Next Steps
Congratulations! You’ve learned the basics of vLLM Judge. Here’s what to explore next:
📖 Deep Dive Guides
-
Basic Evaluation Guide - Deep dive into evaluation options and patterns
-
Using Metrics - Explore all 20+ pre-built metrics
-
Template Variables - Advanced templating features for dynamic evaluations
💡 Tips for Success
-
Start with simple criteria-based evaluations before moving to complex rubrics
-
Use pre-built metrics when possible to save time and ensure consistency
-
Provide context when evaluating content that depends on specific situations
-
Experiment with different sampling parameters to find what works for your use case
-
Consider batch processing for high-volume evaluation scenarios