Perplexity on a held-out set tells you how well the model predicts tokens, but it says nothing about whether the model follows instructions, stays in character, refuses harmful requests, or answers your domain questions correctly. Evaluating an instruction-tuned model requires task-specific benchmarks: MT-Bench measures multi-turn conversation and instruction following; AlpacaEval measures how often a judge prefers the model's output over a reference model's (text-davinci-003 in the original release); MMLU measures knowledge across academic and professional subjects; HellaSwag measures commonsense reasoning. For a production fine-tuned model, all of these are secondary to your custom eval set: the 50–200 task-specific examples that directly measure what your model needs to do. Building that custom eval set, running it on every fine-tuning run, and tracking regressions is what separates a research fine-tuning project from a production one.
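Tracking regressions on a set of only 50–200 examples means small differences between runs can be noise. As a minimal sketch (not a prescribed method), the snippet below puts a Wilson score interval around each run's pass rate and flags a run only when its whole interval falls below the baseline's rate. The run names, the pass counts, and the helper names `wilson_interval` and `find_regressions` are all hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def find_regressions(history: dict[str, tuple[int, int]], baseline_run: str) -> list[str]:
    """Return runs whose entire interval lies below the baseline's pass rate."""
    base_passed, base_total = history[baseline_run]
    base_rate = base_passed / base_total
    flagged = []
    for run, (passed, total) in history.items():
        if run == baseline_run:
            continue
        _, upper = wilson_interval(passed, total)
        if upper < base_rate:
            flagged.append(run)
    return flagged


# Hypothetical results: run name -> (examples passed, examples evaluated)
history = {
    "run-01": (82, 100),
    "run-02": (85, 100),
    "run-03": (60, 100),
}
print(find_regressions(history, baseline_run="run-01"))  # ['run-03']
```

With 100 examples, the interval for 60 passes is roughly 0.50 to 0.69, so only a drop of that size is flagged with confidence; a two- or three-point dip would not be.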
Perplexity tells you how surprised the model is by held-out tokens; it says nothing about whether the model follows instructions, answers your domain questions, or avoids harmful outputs. An LLM-judge harness closes that gap: for each eval prompt you generate a response from both the candidate model and a baseline, send the pair to a strong judge model (such as GPT-4 or Claude), and count wins. The win rate becomes the metric you track and optimize across fine-tuning runs, and it is far more actionable than a loss curve.
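The pairwise loop described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the judge is any callable that returns a JSON string with a "winner" field, and `length_judge` is a hypothetical stand-in that simply prefers the longer response so the code runs without an API key. Judging each pair in both orders is a design choice added here to reduce sensitivity to which response appears first; it is not something the lesson itself specifies.

```python
import json
from typing import Callable

# A judge takes (instruction, response_a, response_b) and returns a JSON string
# shaped like {"winner": "A" | "B" | "tie"}. In practice it would wrap an LLM API call.
Judge = Callable[[str, str, str], str]


def parse_verdict(raw: str) -> str:
    """Extract the winner from the judge's JSON; treat malformed output as a tie."""
    try:
        winner = json.loads(raw).get("winner")
    except (json.JSONDecodeError, AttributeError):
        return "tie"
    return winner if winner in {"A", "B", "tie"} else "tie"


def pairwise_verdict(judge: Judge, instruction: str, candidate: str, baseline: str) -> str:
    """Judge both orderings; the candidate wins or loses only if both agree."""
    first = parse_verdict(judge(instruction, candidate, baseline))   # candidate shown as A
    second = parse_verdict(judge(instruction, baseline, candidate))  # candidate shown as B
    if first == "A" and second == "B":
        return "win"
    if first == "B" and second == "A":
        return "loss"
    return "tie"


def win_rate(judge: Judge, examples: list[tuple[str, str, str]]) -> float:
    """Fraction of (instruction, candidate, baseline) examples the candidate wins."""
    verdicts = [pairwise_verdict(judge, *example) for example in examples]
    return verdicts.count("win") / len(verdicts)


def length_judge(instruction: str, response_a: str, response_b: str) -> str:
    """Hypothetical stand-in judge: prefers the longer response. Not a quality signal."""
    if len(response_a) == len(response_b):
        winner = "tie"
    else:
        winner = "A" if len(response_a) > len(response_b) else "B"
    return json.dumps({"winner": winner})


examples = [
    ("Explain gradient descent in one sentence.",
     "It steps downhill along the negative gradient of the loss.",
     "It trains models."),
    ("Name one regularizer.",
     "Dropout",
     "Weight decay, also called L2 regularization."),
]
print(f"Win rate: {win_rate(length_judge, examples):.0%}")  # Win rate: 50%
```

Swapping `length_judge` for a function that calls a real judge model leaves the rest of the harness unchanged.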
# Custom eval harness: pairwise win rate (model vs baseline)
# Requires an LLM API for judging (or swap in a rules-based scorer)
import json  # for parsing the judge's JSON verdict once a real API call is wired in

# Eval dataset: (instruction, reference_answer) pairs
eval_set = [
    {
        "instruction": "Explain gradient descent in one sentence.",
        "reference": "Gradient descent minimizes a loss function by iteratively moving parameters in the opposite direction of the gradient.",
    },
    {
        "instruction": "What is the bias-variance tradeoff?",
        "reference": "The bias-variance tradeoff describes how model complexity affects underfitting (high bias) vs overfitting (high variance).",
    },
    {
        "instruction": "Name three regularization techniques for neural networks.",
        "reference": "Dropout, L2 weight decay, and early stopping.",
    },
]

# Prompt template for the LLM judge (not used by the rule-based demo scorer below)
judge_prompt = """Compare two responses to the same instruction. Which is better?
Instruction: {instruction}
Reference: {reference}
Response A: {response_a}
Response B: {response_b}
Answer with JSON: {{"winner": "A" or "B" or "tie", "reason": "<one sentence>"}}"""


def evaluate_model_output(instruction, response, reference):
    """In production: send to LLM API. Here: rule-based for demo."""
    ref_words = set(reference.lower().split())
    resp_words = set(response.lower().split())
    # Fraction of the reference's words that also appear in the response
    overlap = len(ref_words & resp_words) / len(ref_words)
    if overlap > 0.5:
        return "A"  # response (A) wins
    elif overlap > 0.3:
        return "tie"
    else:
        return "B"  # reference (B) wins


# Simulate model outputs
model_responses = [
    "Gradient descent updates model weights by taking small steps against the gradient of the loss.",
    "It balances model complexity: simple models underfit (high bias), complex ones overfit (high variance).",
    "L1/L2 regularization, dropout, batch normalization.",
]

# Tally verdicts; in this demo the reference answer stands in for the baseline
wins, ties, losses = 0, 0, 0
for ex, resp in zip(eval_set, model_responses):
    verdict = evaluate_model_output(ex["instruction"], resp, ex["reference"])
    if verdict == "A":
        wins += 1
    elif verdict == "tie":
        ties += 1
    else:
        losses += 1

total = len(eval_set)
print(f"Win rate: {wins/total:.0%} Tie: {ties/total:.0%} Loss: {losses/total:.0%}")
print(f"Win+Tie rate (vs baseline): {(wins+ties)/total:.0%}")

Run it with: python3 main.py

Try it: replace the rule-based scorer in evaluate_model_output with a real API call. What win rate do you get?

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain why perplexity is a poor metric for evaluating an instruction-tuned model. What does perplexity measure, and what important properties does it fail to capture?
Walk me through how MT-Bench works: what types of questions does it use, how does it measure multi-turn ability (the second turn depends on the first), and why does it use GPT-4 as a judge rather than automatic metrics like ROUGE?
I've fine-tuned three LoRA checkpoints of Llama-3 8B on customer support data, varying rank (8, 16, 32) and alpha. I want to select the best checkpoint. Walk me through a rigorous evaluation plan: (1) which public benchmarks to run to detect catastrophic forgetting, (2) how to build a domain-specific eval set efficiently, (3) whether to use LLM-as-judge or rule-based scoring, and (4) how to report results to stakeholders in a way they can act on.