Evaluating fine-tuned models: MT-Bench, AlpacaEval, custom evals

hard

Why this matters

Perplexity on a held-out set tells you how well the model predicts the next token, but it says nothing about whether the model follows instructions, stays in character, refuses harmful requests, or answers your domain questions correctly. Evaluating an instruction-tuned model requires task-specific benchmarks: MT-Bench measures multi-turn conversation and instruction following; AlpacaEval measures how often a judge prefers your model's responses over those of a fixed baseline model (text-davinci-003 in the original release); MMLU measures knowledge across academic subjects; HellaSwag measures commonsense reasoning. For a production fine-tuned model, all of these are secondary to your custom eval set: the 50–200 task-specific examples that directly measure what your model needs to do. Building that custom eval set, running it on every fine-tuning run, and tracking regressions is what separates a research fine-tuning project from a production one.
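To make the first point concrete, perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch, assuming you already have per-token log-probabilities (natural log) for a held-out sequence; the function name and the numbers are illustrative and not taken from any particular library:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over tokens (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Illustrative values only: a model that assigns probability 0.25 to every token
uniform_quarter = [math.log(0.25)] * 8
print(perplexity(uniform_quarter))  # approximately 4.0
```

Nothing in this number depends on whether the scored text is a helpful answer, which is why it cannot stand in for an instruction-following eval.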

Demo

An LLM-judge harness closes the gap that perplexity leaves: for each eval prompt you generate a response from both the candidate model and a baseline, send the pair to a strong judge model (GPT-4, Claude), and count wins. The win rate becomes the metric you track across fine-tuning runs, which is much more actionable than a loss curve. The demo below stays offline: the reference answer stands in for the baseline response, and a word-overlap rule stands in for the judge.

# Custom eval harness: pairwise win rate (model vs baseline)
# A real harness calls an LLM API for judging; this demo swaps in a rule-based scorer

import json

# Eval dataset: (instruction, reference_answer) pairs
eval_set = [
    {
        "instruction": "Explain gradient descent in one sentence.",
        "reference": "Gradient descent minimizes a loss function by iteratively moving parameters in the opposite direction of the gradient.",
    },
    {
        "instruction": "What is the bias-variance tradeoff?",
        "reference": "The bias-variance tradeoff describes how model complexity affects underfitting (high bias) vs overfitting (high variance).",
    },
    {
        "instruction": "Name three regularization techniques for neural networks.",
        "reference": "Dropout, L2 weight decay, and early stopping.",
    },
]

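# Prompt template for a real LLM judge. The rule-based demo below does not use it;
# json is imported for parsing the judge's reply once an API call is wired in.
judge_prompt = """Compare two responses to the same instruction. Which is better?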
Instruction: {instruction}
Reference: {reference}
Response A: {response_a}
Response B: {response_b}
Answer with JSON: {{"winner": "A" or "B" or "tie", "reason": "<one sentence>"}}"""

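def evaluate_model_output(instruction, response, reference):
    """In production: send to LLM API. Here: rule-based word overlap for demo."""
    def words(text):
        # Lowercase and strip surrounding punctuation so "gradient." matches "gradient"
        return {w.strip(".,;:!?()\"'") for w in text.lower().split()} - {""}
    ref_words = words(reference)
    resp_words = words(response)
    # Fraction of reference words that also appear in the response
    overlap = len(ref_words & resp_words) / len(ref_words) if ref_words else 0.0
    if overlap > 0.5:   return "A"   # response (A) wins
    elif overlap > 0.3: return "tie"
    else:               return "B"   # reference (B) wins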

# Simulate model outputs
model_responses = [
    "Gradient descent updates model weights by taking small steps against the gradient of the loss.",
    "It balances model complexity: simple models underfit (high bias), complex ones overfit (high variance).",
    "L1/L2 regularization, dropout, batch normalization.",
]

wins, ties, losses = 0, 0, 0
for ex, resp in zip(eval_set, model_responses):
    verdict = evaluate_model_output(ex["instruction"], resp, ex["reference"])
    if verdict == "A": wins   += 1
    elif verdict == "tie": ties += 1
    else: losses += 1

total = len(eval_set)
print(f"Win rate: {wins/total:.0%}  Tie: {ties/total:.0%}  Loss: {losses/total:.0%}")
print(f"Win+Tie rate (vs reference): {(wins+ties)/total:.0%}")

Run: python3 main.py
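When you replace the rule-based scorer with a real judge, two details matter: parsing the JSON verdict defensively, and controlling for position bias (a judge may favor whichever response is shown first). A minimal sketch, assuming a hypothetical `call_judge(prompt) -> str` function that wraps whatever LLM API you use. `fake_judge` is a stand-in so the block runs offline, and `JUDGE_PROMPT` is a trimmed copy of `judge_prompt` without the reference field:

```python
import json

JUDGE_PROMPT = """Compare two responses to the same instruction. Which is better?
Instruction: {instruction}
Response A: {response_a}
Response B: {response_b}
Answer with JSON: {{"winner": "A" or "B" or "tie", "reason": "<one sentence>"}}"""

def parse_verdict(raw):
    """Return "A", "B", or "tie"; treat unparseable output as a tie."""
    try:
        winner = json.loads(raw).get("winner")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return "tie"
    return winner if winner in ("A", "B", "tie") else "tie"

def pairwise_verdict(instruction, candidate, baseline, call_judge):
    """Judge twice with the order swapped; only a consistent result is a win or loss."""
    first = parse_verdict(call_judge(JUDGE_PROMPT.format(
        instruction=instruction, response_a=candidate, response_b=baseline)))
    second = parse_verdict(call_judge(JUDGE_PROMPT.format(
        instruction=instruction, response_a=baseline, response_b=candidate)))
    if first == "A" and second == "B":
        return "win"
    if first == "B" and second == "A":
        return "loss"
    return "tie"  # orderings disagree, or the judge said tie

def fake_judge(prompt):
    """Offline stand-in (NOT a real judge): prefers the longer response."""
    a = prompt.split("Response A: ")[1].split("\nResponse B: ")[0]
    b = prompt.split("Response B: ")[1].split("\nAnswer with JSON")[0]
    winner = "A" if len(a) > len(b) else "B" if len(b) > len(a) else "tie"
    return json.dumps({"winner": winner, "reason": "length heuristic"})

print(pairwise_verdict(
    "Name three regularization techniques for neural networks.",
    "Dropout, L2 weight decay, and early stopping.",
    "Dropout.",
    fake_judge,
))
```

Counting an order-dependent verdict as a tie is one conservative choice among several; averaging the two verdicts is another.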

Try it yourself

Run your fine-tuned model on all 3 eval instructions (or use a public HuggingFace model). Send both the model response and the reference to an LLM judge (replace evaluate_model_output with a real API call). What win rate do you get?

Add a 4th eval example where the correct answer requires knowledge not in the training data. Does the model hallucinate? How would you handle expected failures in your eval harness — do they count as losses?

Research MT-Bench: it has 80 multi-turn questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). Download the questions from the llm_judge directory of the lm-sys/FastChat GitHub repository, where MT-Bench is published, and run 5 of them manually through any model to see the format.

Design a custom eval set for a specific domain (e.g., medical Q&A, legal summarization, code review). What are the 3 dimensions you'd measure (accuracy, format compliance, safety/refusals)? Write 5 eval examples for each dimension.
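The exercises on expected failures and on eval dimensions can share one harness: tag each example with the dimension it measures and, where a failure is anticipated (for instance, knowledge outside the training data), mark it so it is reported separately instead of being counted as an ordinary loss. A minimal sketch with rule-based checks; the field names (`dimension`, `check`, `expected_fail`), the sample examples, the model outputs, and the refusal phrases are all illustrative assumptions:

```python
import json
from collections import defaultdict

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def looks_like_refusal(text):
    # Crude keyword heuristic, for illustration only
    return any(p in text.lower() for p in ("i can't", "i cannot", "i won't"))

# Each example names its dimension and a check that returns True on success.
eval_examples = [
    {"id": "acc-1", "dimension": "accuracy",
     "check": lambda r: "dropout" in r.lower(), "expected_fail": False},
    {"id": "fmt-1", "dimension": "format",
     "check": is_valid_json, "expected_fail": False},
    {"id": "safe-1", "dimension": "safety",
     "check": looks_like_refusal, "expected_fail": False},
    {"id": "acc-2", "dimension": "accuracy",  # needs knowledge outside training data
     "check": lambda r: "2031" in r, "expected_fail": True},
]

# Hypothetical model outputs keyed by example id
responses = {
    "acc-1": "Dropout, weight decay, and early stopping.",
    "fmt-1": '{"answer": "ok"}',
    "safe-1": "I can't help with that request.",
    "acc-2": "I don't have that information.",
}

def score(examples, outputs):
    """Per-dimension counts; expected failures are tallied in their own bucket."""
    report = defaultdict(lambda: {"pass": 0, "fail": 0, "expected_fail": 0})
    for ex in examples:
        passed = ex["check"](outputs[ex["id"]])
        if passed:
            bucket = "pass"
        elif ex["expected_fail"]:
            bucket = "expected_fail"
        else:
            bucket = "fail"
        report[ex["dimension"]][bucket] += 1
    return {dim: dict(counts) for dim, counts in report.items()}

print(score(eval_examples, responses))
```

Keeping the third bucket means a regression on an ordinary example is not hidden among failures you already anticipated.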

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why perplexity is a poor metric for evaluating an instruction-tuned model. What does perplexity measure, and what important properties does it fail to capture?

2. Why it works (the mechanism)

Walk me through how MT-Bench works: what types of questions does it use, how does it measure multi-turn ability (the second turn depends on the first), and why does it use GPT-4 as a judge rather than automatic metrics like ROUGE?

3. Advanced — application & what's next

I've fine-tuned three LoRA checkpoints of Llama-3 8B on customer support data, varying rank (8, 16, 32) and alpha. I want to select the best checkpoint. Walk me through a rigorous evaluation plan: (1) which public benchmarks to run to detect catastrophic forgetting, (2) how to build a domain-specific eval set efficiently, (3) whether to use LLM-as-judge or rule-based scoring, and (4) how to report results to stakeholders in a way they can act on.
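For a checkpoint-selection question like the one above, it helps to check whether the gap between checkpoints exceeds the noise of a small eval set before declaring a winner. A minimal sketch using a percentile bootstrap over per-example verdicts; the checkpoint names and verdict lists are made-up placeholders, not measured results:

```python
import random

def win_rate(verdicts):
    """Fraction of wins, counting a tie as half a win."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

def bootstrap_ci(verdicts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the win rate."""
    rng = random.Random(seed)
    n = len(verdicts)
    stats = sorted(
        win_rate([verdicts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder verdicts for illustration only (100 eval examples per checkpoint)
checkpoints = {
    "rank8":  ["win"] * 52 + ["tie"] * 10 + ["loss"] * 38,
    "rank16": ["win"] * 58 + ["tie"] * 8 + ["loss"] * 34,
    "rank32": ["win"] * 55 + ["tie"] * 12 + ["loss"] * 33,
}

for name, verdicts in checkpoints.items():
    lo, hi = bootstrap_ci(verdicts)
    print(f"{name}: win rate {win_rate(verdicts):.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
```

If the intervals overlap heavily, the eval set is likely too small to separate the checkpoints, and a larger set or a paired comparison on the same prompts is a reasonable next step.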