Most RAG demos look great on the developer's curated questions and fall apart on real user queries. The fix is a real eval set that you build once and run after every change: 30-50 (question, expected-fact, expected-source) triples covering the queries you actually see. Score each query on retrieval quality (was the right chunk in the top-k results?) and answer quality (did the model produce the expected fact?). Without an eval, every change is a guess; with one, you make tradeoffs based on numbers.
Build the eval set from real query logs, or write the triples yourself with help from a teammate who is a domain expert (NOT with the model: questions it writes for its own test are contaminated). 30-50 triples is enough to start; expand over time. Run the eval before and after every change to chunking, embedding, threshold, or prompt, and log recall@k, answer quality (judged by another LLM, or by humans for the first 100), and latency. Track the numbers in a CSV or sheet over time. This is the discipline that turns RAG from 'works on the demo' to 'works for users'.
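One way to keep that record is to append one row per eval run to a CSV file. The sketch below is only an illustration, not part of the lesson's demo: the function name `log_eval_run`, the column names, and the free-text `change` field are all assumptions, and the metric values would come from whatever eval functions you run.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column layout; rename to match what you actually measure.
FIELDS = ["timestamp", "change", "recall_at_k", "answer_acc", "latency_s"]


def log_eval_run(path, change, recall_at_k, answer_acc, latency_s):
    """Append one eval run to a CSV file, writing the header on first use."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "change": change,  # short note on what was changed before this run
        "recall_at_k": round(recall_at_k, 4),
        "answer_acc": round(answer_acc, 4),
        "latency_s": round(latency_s, 3),
    }
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

A call such as `log_eval_run("eval_history.csv", "baseline", 0.80, 0.70, 1.5)` after each run gives you one line per change, which is enough to spot a regression by eye or to chart in a sheet.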
Use these three in order. Each builds on the one before.
Why do most RAG systems look great in demos and fail in production? What does an eval set fix?
Walk me through 'LLM-as-judge': how do you use another LLM to score answers automatically, and what biases does this introduce?
I have 200 real user queries from logs. Help me design a minimum-viable eval pipeline: how many to label by hand, how to use LLM-judge for the rest, what biases to control for, and how often to re-run.
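# Minimum-viable eval
import anthropic

# Judge client (assumption: the Anthropic SDK; it reads ANTHROPIC_API_KEY from the environment).
llm = anthropic.Anthropic()

# search(), docs_to_id(), and answer() are assumed to come from your existing RAG pipeline.

EVAL_SET = [
    {
        "q": "Why does a long transaction cause table bloat?",
        "expected_fact": "long transactions prevent vacuum from reclaiming dead tuples",
        "expected_chunk_id": 17,  # the chunk that should appear in the top-k
    },
    {
        "q": "What's the recommended connection pool size?",
        "expected_fact": "2 * cpus + 1",
        "expected_chunk_id": 23,
    },
    # ... 30-50 of these
]
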
def eval_retrieval(eval_set, k=5):
    """Recall@k: fraction of cases whose expected chunk is in the top-k hits."""
    recall_at_k = 0
    for case in eval_set:
        hits = search(case["q"], k=k)
        ids = [docs_to_id(h) for h in hits]
        if case["expected_chunk_id"] in ids:
            recall_at_k += 1
    return recall_at_k / len(eval_set)

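def eval_answers(eval_set):
    """Fraction of cases where the judge says the answer contains the expected fact."""
    correct = 0
    for case in eval_set:
        ans = answer(case["q"])
        # judge with another LLM
        judge_msg = llm.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=64,
            system="Answer YES or NO only. Does the answer contain the expected fact?",
            messages=[{
                "role": "user",
                "content": f"Answer: {ans}\nExpected fact: {case['expected_fact']}"
            }]
        )
        # Match the verdict at the start of the reply, ignoring case and whitespace,
        # so a stray "YES" later in the text is not counted as a pass.
        verdict = judge_msg.content[0].text.strip().upper()
        if verdict.startswith("YES"):
            correct += 1
    return correct / len(eval_set)
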
# Track over time
print(f"Recall@5: {eval_retrieval(EVAL_SET):.2%}")
print(f"Answer acc: {eval_answers(EVAL_SET):.2%}")

python3 main.py
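Because the judge is itself a model, it is worth checking its verdicts against human labels before trusting the accuracy number. The sketch below assumes you have hand-labeled a batch of answers as correct or incorrect and recorded the judge's YES/NO for the same answers; the function name `judge_agreement` and the list-of-booleans layout are hypothetical, chosen only for illustration.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare LLM-judge verdicts with human verdicts on the same answers.

    Both arguments are equal-length lists of booleans (True = answer judged correct).
    Returns the agreement rate plus the rate of each kind of disagreement.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled answer")
    n = len(human_labels)
    pairs = list(zip(human_labels, judge_labels))
    agree = sum(h == j for h, j in pairs)
    too_lenient = sum((not h) and j for h, j in pairs)  # judge said YES, human said no
    too_strict = sum(h and (not j) for h, j in pairs)   # judge said NO, human said yes
    return {
        "agreement": agree / n,
        "judge_too_lenient": too_lenient / n,
        "judge_too_strict": too_strict / n,
    }


# Made-up labels for five answers, purely to show the call.
human = [True, True, False, False, True]
judge = [True, False, False, True, True]
print(judge_agreement(human, judge))
```

If the judge is mostly too lenient, the answer accuracy you track is inflated; if it is mostly too strict, tightening the judge prompt or the wording of the expected fact is probably a better fix than changing the RAG pipeline.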