Basic Evaluation Guide

This guide covers the fundamental evaluation capabilities of vLLM Judge, progressing from simple to advanced usage.

Understanding the Universal Interface

vLLM Judge uses a single evaluate() method that adapts to your needs:

result = await judge.evaluate(
    content="...",        # What to evaluate
    criteria="...",        # What to evaluate for
    # Optional parameters to control evaluation
)

The method automatically determines the evaluation type based on what you provide.

Level 1: Simple Criteria-Based Evaluation

In its simplest form, just provide the text and the criteria:

# Basic evaluation
result = await judge.evaluate(
    content="The Earth is the third planet from the Sun.",
    criteria="scientific accuracy"
)

# Multiple criteria
result = await judge.evaluate(
    content="Dear customer, thank you for your feedback...",
    criteria="professionalism, empathy, and clarity"
)

What happens behind the scenes:

  • The Judge builds a prompt asking the model to evaluate the content against your criteria

  • The LLM returns a score (if a scale is provided) along with its reasoning

  • You get a structured result with decision, reasoning, and score (see the sketch below)
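
Every evaluation returns the same structured result. Here is a minimal sketch of inspecting it, using the decision, score, and reasoning fields referenced throughout this guide:

# Inspect the structured result
print(result.decision)   # the main verdict: a score, label, or winner
print(result.score)      # numeric score, or None for pure classifications
print(result.reasoning)  # the model's explanation of its decision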

Level 2: Adding Structure with Scales and Rubrics

Numeric Scales

Control the scoring range:

# 5-point scale
result = await judge.evaluate(
    content="The product works as advertised.",
    criteria="review helpfulness",
    scale=(1, 5)
)

# 0-100 scale for fine-grained scoring
result = await judge.evaluate(
    content=essay_text,
    criteria="writing quality",
    scale=(0, 100)
)

String Rubrics

Provide evaluation guidance as text:

result = await judge.evaluate(
    content="I hate this product!",
    criteria="sentiment analysis",
    rubric="Classify as 'positive', 'neutral', or 'negative' based on emotional tone"
)
# Result: decision="negative", score=None

Detailed Rubrics

Define specific score meanings:

result = await judge.evaluate(
    content=code_snippet,
    criteria="code quality",
    scale=(1, 10),
    rubric={
        10: "Production-ready, follows all best practices",
        8: "High quality with minor improvements possible",
        6: "Functional but needs refactoring",
        4: "Works but has significant issues",
        2: "Barely functional with major problems",
        1: "Broken or completely incorrect"
    }
)

Level 3: Comparison Evaluations

Compare two responses by providing a dictionary:

# Compare two responses
result = await judge.evaluate(
    content={
        "a": "Python is great for beginners due to its simple syntax.",
        "b": "Python's intuitive syntax makes it ideal for newcomers."
    },
    criteria="clarity and informativeness"
)

# With additional context
result = await judge.evaluate(
    content={
        "a": customer_response_1,
        "b": customer_response_2
    },
    criteria="helpfulness and professionalism",
    context="Customer asked about refund policy"
)
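
In a comparison, the decision field identifies the preferred response and the reasoning explains the choice (the A/B testing pattern later in this guide relies on this). A minimal sketch of reporting the outcome:

# Report which response the judge preferred and why
print(f"Preferred response: {result.decision}")
print(f"Reasoning: {result.reasoning}")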

Level 4: Adding Context and Examples

Providing Context

Add context to improve evaluation accuracy:

result = await judge.evaluate(
    content="Just use the default settings.",
    criteria="helpfulness",
    context="User asked how to configure advanced security settings"
)
# Low score due to dismissive response to specific question

Few-Shot Examples

Guide the evaluation with examples:

result = await judge.evaluate(
    content="Your code has a bug on line 5.",
    criteria="constructive feedback quality",
    scale=(1, 10),
    examples=[
        {
            "content": "This doesn't work. Fix it.",
            "score": 2,
            "reasoning": "Too vague and dismissive"
        },
        {
            "content": "Line 5 has a syntax error. Try adding a closing parenthesis.",
            "score": 8,
            "reasoning": "Specific, actionable, and helpful"
        }
    ]
)

Level 5: Custom System Prompts

Take full control of the evaluator's persona:

# Expert evaluator
result = await judge.evaluate(
    content=medical_advice,
    criteria="medical accuracy and safety",
    system_prompt="""You are a licensed medical professional reviewing 
    health information for accuracy and potential harm. Be extremely 
    cautious about unsafe advice."""
)

# Specific domain expert
result = await judge.evaluate(
    content=legal_document,
    criteria="legal compliance",
    system_prompt="""You are a corporate lawyer specializing in GDPR 
    compliance. Evaluate for regulatory adherence."""
)

Understanding Output Types

Numeric Scores

When you provide a scale, you get numeric scoring:

result = await judge.evaluate(
    content="Great product!",
    criteria="review quality",
    scale=(1, 5)
)
# decision: 4 (numeric)
# score: 4.0
# reasoning: "Brief but positive..."

Classifications

Without a scale but with a category rubric:

result = await judge.evaluate(
    content="This might be considered offensive.",
    criteria="content moderation",
    rubric="Classify as 'safe', 'warning', or 'unsafe'"
)
# decision: "warning" (string)
# score: None
# reasoning: "Contains potentially sensitive content..."

Binary Decisions

For yes/no evaluations:

result = await judge.evaluate(
    content=user_message,
    criteria="spam detection",
    rubric="Determine if this is 'spam' or 'not spam'"
)
# decision: "not spam"
# score: None
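
If downstream code needs a boolean, you can map the label yourself; a small sketch, assuming the labels come back exactly as written in the rubric:

# Convert the label to a boolean for filtering logic
is_spam = result.decision == "spam"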

Mixed Evaluation

You can request both classification and scoring:

result = await judge.evaluate(
    content=essay,
    criteria="academic quality",
    rubric="""
    Grade the essay:
    - 'A' (90-100): Exceptional work
    - 'B' (80-89): Good work
    - 'C' (70-79): Satisfactory
    - 'D' (60-69): Below average
    - 'F' (0-59): Failing

    Provide both letter grade and numeric score.
    """
)
# decision: "B"
# score: 85.0
# reasoning: "Well-structured argument with minor issues..."

Common Patterns

Quality Assurance

async def qa_check(response: str, threshold: float = 7.0):
    """Check if response meets quality threshold."""
    result = await judge.evaluate(
        content=response,
        criteria="helpfulness, accuracy, and professionalism",
        scale=(1, 10)
    )

    passed = result.score >= threshold
    return {
        "passed": passed,
        "score": result.score,
        "feedback": result.reasoning,
        "improve": None if passed else "Consider improving: " + result.reasoning
    }

A/B Testing

async def compare_models(prompt: str, response_a: str, response_b: str):
    """Compare two model responses."""
    result = await judge.evaluate(
        content={"a": response_a, "b": response_b},
        criteria="helpfulness, accuracy, and clarity",
        context=f"User prompt: {prompt}"
    )

    return {
        "winner": result.decision,
        "reason": result.reasoning,
        "prompt": prompt
    }

Multi-Aspect Evaluation

async def comprehensive_evaluation(content: str):
    """Evaluate content on multiple dimensions."""
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
        "engagement": "interesting and engaging presentation"
    }

    results = {}
    for aspect, criteria in aspects.items():
        result = await judge.evaluate(
            content=content,
            criteria=criteria,
            scale=(1, 10)
        )
        results[aspect] = {
            "score": result.score,
            "feedback": result.reasoning
        }

    # Calculate overall score
    avg_score = sum(r["score"] for r in results.values()) / len(results)
    results["overall"] = avg_score

    return results
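
Because each aspect is evaluated independently, the calls can also run concurrently. Below is a sketch using asyncio.gather, assuming judge.evaluate can safely be called concurrently against your vLLM server:

import asyncio

async def comprehensive_evaluation_concurrent(content: str):
    """Evaluate all aspects concurrently instead of sequentially."""
    aspects = {
        "accuracy": "factual correctness",
        "clarity": "ease of understanding",
        "completeness": "thoroughness of coverage",
        "engagement": "interesting and engaging presentation",
    }

    # Launch one evaluation per aspect and wait for all of them
    evaluations = await asyncio.gather(*[
        judge.evaluate(content=content, criteria=criteria, scale=(1, 10))
        for criteria in aspects.values()
    ])

    results = {
        aspect: {"score": r.score, "feedback": r.reasoning}
        for aspect, r in zip(aspects, evaluations)
    }

    # Average the per-aspect scores for an overall figure
    results["overall"] = sum(r["score"] for r in results.values()) / len(aspects)
    return results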

💡 Best Practices

  • Be specific with your criteria: "professionalism, empathy, and clarity" guides the judge better than "quality".

  • Rubric design:

    • Make score distinctions clear and meaningful
    • Avoid overlapping descriptions
    • Include specific indicators for each level

  • Add a system prompt to control the evaluator's persona.

  • Provide context when the evaluation depends on understanding the situation.

  • Where possible, include the input that generated the content being evaluated (see the combined example below).
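
Putting several of these practices together; the criteria, context, and system prompt below are illustrative values, not required ones:

# Specific criteria + context + a persona, combining the practices above
result = await judge.evaluate(
    content=support_reply,  # the response being evaluated (illustrative variable)
    criteria="accuracy about the refund policy, empathy, and actionable next steps",
    scale=(1, 10),
    context="Customer asked whether a 45-day-old purchase is still refundable",
    system_prompt="You are a customer support quality auditor. Be strict about policy accuracy."
)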

Next Steps