# Basic Evaluation Guide
This guide covers the fundamental evaluation capabilities of vLLM Judge, progressing from simple to advanced usage.
## Understanding the Universal Interface

vLLM Judge uses a single `evaluate()` method that adapts to your needs:
```python
result = await judge.evaluate(
    content="...",   # What to evaluate
    criteria="...",  # What to evaluate for
    # Optional parameters to control evaluation
)
```
The method automatically determines the evaluation type based on what you provide.
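For example, the shape of `content` alone selects the mode (both forms are covered in detail below):

```python
# A plain string is evaluated on its own
result = await judge.evaluate(content="...", criteria="clarity")

# A dict with "a" and "b" keys triggers a comparison (see Level 3)
result = await judge.evaluate(
    content={"a": "...", "b": "..."},
    criteria="clarity"
)
```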
## Level 1: Simple Criteria-Based Evaluation
The simplest form - just provide text and criteria:
```python
# Basic evaluation
result = await judge.evaluate(
    content="The Earth is the third planet from the Sun.",
    criteria="scientific accuracy"
)

# Multiple criteria
result = await judge.evaluate(
    content="Dear customer, thank you for your feedback...",
    criteria="professionalism, empathy, and clarity"
)
```
What happens behind the scenes:
- Judge creates a prompt asking the model to evaluate the content against your criteria
- The LLM provides a score (if a scale is provided) and reasoning
- You get a structured result with `decision`, `reasoning`, and `score`
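For instance, you can read those fields directly off the result (the printed values are illustrative, not guaranteed outputs):

```python
result = await judge.evaluate(
    content="The Earth is the third planet from the Sun.",
    criteria="scientific accuracy",
    scale=(0, 10)
)
print(result.decision)   # e.g. 9
print(result.score)      # e.g. 9.0
print(result.reasoning)  # e.g. "The statement is factually correct..."
```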
## Level 2: Adding Structure with Scales and Rubrics

### Numeric Scales

Control the scoring range:
```python
# 5-point scale
result = await judge.evaluate(
    content="The product works as advertised.",
    criteria="review helpfulness",
    scale=(1, 5)
)

# 100-point scale for fine-grained scoring
result = await judge.evaluate(
    content=essay_text,
    criteria="writing quality",
    scale=(0, 100)
)
```
### String Rubrics

Provide evaluation guidance as text:
```python
result = await judge.evaluate(
    content="I hate this product!",
    criteria="sentiment analysis",
    rubric="Classify as 'positive', 'neutral', or 'negative' based on emotional tone"
)
# Result: decision="negative", score=None
```
### Detailed Rubrics

Define specific score meanings:
```python
result = await judge.evaluate(
    content=code_snippet,
    criteria="code quality",
    scale=(1, 10),
    rubric={
        10: "Production-ready, follows all best practices",
        8: "High quality with minor improvements possible",
        6: "Functional but needs refactoring",
        4: "Works but has significant issues",
        2: "Barely functional with major problems",
        1: "Broken or completely incorrect"
    }
)
```
## Level 3: Comparison Evaluations

Compare two responses by providing a dictionary with `"a"` and `"b"` keys:
```python
# Compare two responses
result = await judge.evaluate(
    content={
        "a": "Python is great for beginners due to its simple syntax.",
        "b": "Python's intuitive syntax makes it ideal for newcomers."
    },
    criteria="clarity and informativeness"
)

# With additional context
result = await judge.evaluate(
    content={
        "a": customer_response_1,
        "b": customer_response_2
    },
    criteria="helpfulness and professionalism",
    context="Customer asked about refund policy"
)
```
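The returned decision identifies the preferred response, so you can branch on it directly. A minimal sketch, assuming the decision uses the same `"a"`/`"b"` labels as the input dictionary (check your installed version's exact label format):

```python
if result.decision == "a":
    print("Response A preferred:", result.reasoning)
else:
    print("Response B preferred:", result.reasoning)
```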
## Level 4: Adding Context and Examples

### Providing Context

Add context to improve evaluation accuracy:
```python
result = await judge.evaluate(
    content="Just use the default settings.",
    criteria="helpfulness",
    context="User asked how to configure advanced security settings"
)
# Low score due to dismissive response to a specific question
```
### Few-Shot Examples

Guide the evaluation with examples:
```python
result = await judge.evaluate(
    content="Your code has a bug on line 5.",
    criteria="constructive feedback quality",
    scale=(1, 10),
    examples=[
        {
            "content": "This doesn't work. Fix it.",
            "score": 2,
            "reasoning": "Too vague and dismissive"
        },
        {
            "content": "Line 5 has a syntax error. Try adding a closing parenthesis.",
            "score": 8,
            "reasoning": "Specific, actionable, and helpful"
        }
    ]
)
```
## Level 5: Custom System Prompts

Take full control of the evaluator's persona:
```python
# Expert evaluator
result = await judge.evaluate(
    content=medical_advice,
    criteria="medical accuracy and safety",
    system_prompt="""You are a licensed medical professional reviewing
    health information for accuracy and potential harm. Be extremely
    cautious about unsafe advice."""
)

# Specific domain expert
result = await judge.evaluate(
    content=legal_document,
    criteria="legal compliance",
    system_prompt="""You are a corporate lawyer specializing in GDPR
    compliance. Evaluate for regulatory adherence."""
)
```
## Understanding Output Types

### Numeric Scores

When you provide a scale, you get numeric scoring:
```python
result = await judge.evaluate(
    content="Great product!",
    criteria="review quality",
    scale=(1, 5)
)
# decision: 4 (numeric)
# score: 4.0
# reasoning: "Brief but positive..."
```
### Classifications

Without a scale but with a category rubric:
```python
result = await judge.evaluate(
    content="This might be considered offensive.",
    criteria="content moderation",
    rubric="Classify as 'safe', 'warning', or 'unsafe'"
)
# decision: "warning" (string)
# score: None
# reasoning: "Contains potentially sensitive content..."
```
### Binary Decisions

For yes/no evaluations:
```python
result = await judge.evaluate(
    content=user_message,
    criteria="spam detection",
    rubric="Determine if this is 'spam' or 'not spam'"
)
# decision: "not spam"
# score: None
```
### Mixed Evaluation

You can request both a classification and a score:
```python
result = await judge.evaluate(
    content=essay,
    criteria="academic quality",
    rubric="""
    Grade the essay:
    - 'A' (90-100): Exceptional work
    - 'B' (80-89): Good work
    - 'C' (70-79): Satisfactory
    - 'D' (60-69): Below average
    - 'F' (0-59): Failing
    Provide both letter grade and numeric score.
    """
)
# decision: "B"
# score: 85.0
# reasoning: "Well-structured argument with minor issues..."
```
## Common Patterns

### Quality Assurance
```python
async def qa_check(response: str, threshold: float = 7.0):
    """Check if response meets quality threshold."""
    result = await judge.evaluate(
        content=response,
        criteria="helpfulness, accuracy, and professionalism",
        scale=(1, 10)
    )
    passed = result.score >= threshold
    return {
        "passed": passed,
        "score": result.score,
        "feedback": result.reasoning,
        "improve": None if passed else "Consider improving: " + result.reasoning
    }
```
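A quick usage sketch (the response text here is made up for illustration):

```python
outcome = await qa_check(
    "Thanks for reaching out! To enable two-factor authentication, "
    "open Settings > Security and follow the prompts."
)
if not outcome["passed"]:
    print(outcome["improve"])
```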
### A/B Testing
```python
async def compare_models(prompt: str, response_a: str, response_b: str):
    """Compare two model responses."""
    result = await judge.evaluate(
        content={"a": response_a, "b": response_b},
        criteria="helpfulness, accuracy, and clarity",
        context=f"User prompt: {prompt}"
    )
    return {
        "winner": result.decision,
        "reason": result.reasoning,
        "prompt": prompt
    }
```
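A usage sketch (both candidate responses are invented for illustration):

```python
verdict = await compare_models(
    prompt="How do I reset my password?",
    response_a="Click 'Forgot password' on the login page and follow the emailed link.",
    response_b="Contact support."
)
print(verdict["winner"], "-", verdict["reason"])
```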
### Multi-Aspect Evaluation
```python
async def comprehensive_evaluation(content: str):
    """Evaluate content on multiple dimensions."""
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
        "engagement": "interesting and engaging presentation"
    }
    results = {}
    for aspect, criteria in aspects.items():
        result = await judge.evaluate(
            content=content,
            criteria=criteria,
            scale=(1, 10)
        )
        results[aspect] = {
            "score": result.score,
            "feedback": result.reasoning
        }
    # Calculate overall score
    avg_score = sum(r["score"] for r in results.values()) / len(results)
    results["overall"] = avg_score
    return results
```
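The loop above issues the judge calls one at a time. Since each aspect is evaluated independently, they can also run concurrently; below is a sketch using `asyncio.gather`, assuming your judge client handles concurrent requests:

```python
import asyncio

async def comprehensive_evaluation_concurrent(content: str):
    """Same evaluation as above, with the judge calls running in parallel."""
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
        "engagement": "interesting and engaging presentation"
    }
    # Issue all evaluations at once and wait for them together;
    # gather preserves the order of the awaitables
    evaluations = await asyncio.gather(
        *(judge.evaluate(content=content, criteria=criteria, scale=(1, 10))
          for criteria in aspects.values())
    )
    results = {
        aspect: {"score": r.score, "feedback": r.reasoning}
        for aspect, r in zip(aspects, evaluations)
    }
    # Average is computed over the aspects before "overall" is added
    avg_score = sum(r["score"] for r in results.values()) / len(results)
    results["overall"] = avg_score
    return results
```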
## 💡 Best Practices

- Be specific with your criteria (see the sketch after this list).
- Design rubrics carefully:
    - Make score distinctions clear and meaningful
    - Avoid overlapping descriptions
    - Include specific indicators for each level
- Add a system prompt to control the evaluator's persona.
- Provide context when the evaluation depends on understanding the situation.
- Provide the input that generated the content being evaluated, when available.
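To make the first point concrete, here is a sketch contrasting a vague criterion with a specific one (the `reply` variable and the context string are hypothetical):

```python
# Vague: the judge has to guess what "quality" means
result = await judge.evaluate(content=reply, criteria="quality")

# Specific: the judge knows exactly what to look for
result = await judge.evaluate(
    content=reply,
    criteria="technical accuracy, completeness, and appropriate tone for a support reply",
    scale=(1, 10),
    context="Customer asked why their API requests return 429 errors"
)
```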
## Next Steps

- Learn about Using Metrics for common evaluation tasks
- Explore Template Variables for dynamic evaluations