Evaluating naive RAG honestly

Difficulty: hard

Why this matters

Most RAG demos look great on the developer's curated questions and fall apart on real user queries. The fix is a real eval set that you build once and run after every change: 30 to 50 (question, expected-fact, expected-source) triples covering the queries you actually see. Score each query on retrieval quality (was the right chunk in the top-k?) and answer quality (did the model produce the expected fact?). Without an eval, every change is a guess; with one, you make tradeoffs based on numbers.
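A minimal sketch of one such triple and its two scores. The names `EvalCase`, `retrieval_hit`, `answer_has_fact`, and `score_case` are hypothetical, the chunk IDs are invented for illustration, and the case-insensitive substring match is only a crude stand-in for a real answer-quality judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One (question, expected-fact, expected-source) triple."""
    question: str
    expected_fact: str
    expected_source: str  # assumption: sources are identified by a chunk/doc ID


def retrieval_hit(retrieved_ids: list[str], case: EvalCase, k: int = 5) -> bool:
    """Was the expected source among the top-k retrieved chunks?"""
    return case.expected_source in retrieved_ids[:k]


def answer_has_fact(answer: str, case: EvalCase) -> bool:
    """Crude answer check: case-insensitive substring match on the fact."""
    return case.expected_fact.casefold() in answer.casefold()


def score_case(case: EvalCase, retrieved_ids: list[str], answer: str,
               k: int = 5) -> dict[str, bool]:
    """Score one query on retrieval quality and answer quality."""
    return {
        "retrieval_hit": retrieval_hit(retrieved_ids, case, k),
        "answer_correct": answer_has_fact(answer, case),
    }


# Illustrative data only; not taken from any real system.
case = EvalCase(
    question="How long is the refund window?",
    expected_fact="30 days",
    expected_source="refund-policy#2",
)
scores = score_case(
    case,
    retrieved_ids=["shipping#1", "refund-policy#2", "faq#7"],
    answer="Refunds are accepted within 30 days of delivery.",
)
print(scores)  # {'retrieval_hit': True, 'answer_correct': True}
```

Keeping the two scores separate is the point: it lets you tell a retrieval failure apart from a generation failure on the same query.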

Demo

Build the eval set from real query logs, or write the questions yourself with help from a teammate who is a domain expert (NOT the model: model-written questions are contaminated). 30 to 50 triples is enough to start; expand over time. Run the eval after every change to chunking, embedding, threshold, or prompt, before you merge it, and log retrieval@k, answer quality (judged by another LLM, or by humans for the first 100), and latency. Track the numbers in a CSV or sheet over time. This is the discipline that turns RAG from 'works on the demo' to 'works for users'.
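The loop described here could be sketched roughly as follows. Everything named in it is an assumption for illustration: `retrieve`, `generate`, and `judge` are placeholders for your own retriever, model call, and answer grader (LLM or human), and the CSV column names are invented.

```python
import csv
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

# (question, expected_fact, expected_source)
Case = tuple[str, str, str]


def run_eval(
    cases: list[Case],
    retrieve: Callable[[str], list[str]],       # question -> ranked chunk IDs
    generate: Callable[[str, list[str]], str],  # question, chunk IDs -> answer
    judge: Callable[[str, str], bool],          # answer, expected fact -> correct?
    k: int = 5,
) -> dict[str, float]:
    """Run every case once and aggregate retrieval, answer, and latency numbers."""
    if not cases:
        raise ValueError("eval set is empty")
    hits, correct, latencies = [], [], []
    for question, expected_fact, expected_source in cases:
        start = time.perf_counter()
        chunk_ids = retrieve(question)[:k]
        answer = generate(question, chunk_ids)
        latencies.append(time.perf_counter() - start)  # judging time excluded
        hits.append(expected_source in chunk_ids)
        correct.append(judge(answer, expected_fact))
    n = len(cases)
    return {
        "recall_at_k": sum(hits) / n,
        "answer_accuracy": sum(correct) / n,
        "mean_latency_s": sum(latencies) / n,
    }


def append_row(path: Path, change: str, metrics: dict[str, float]) -> None:
    """Append one eval run to a CSV history file, writing the header if new."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change": change,
        **metrics,
    }
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A call such as `append_row(Path("eval_history.csv"), "smaller chunks", run_eval(cases, retrieve, generate, judge))` after each change would give you one row per run to compare over time.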

Try it yourself

Write 30 (question, expected-fact) pairs from your domain, and note the expected source for each so you can score retrieval. Pull from real user queries if you have them; otherwise from your top-N FAQs.
Run the retrieval eval. Recall@5 below 70% means your retriever is broken. Retrieval is the biggest lever, so fix it first.
Run the answer-quality eval. If retrieval is 95% but answers are 60% correct, the prompt or model is the problem, not retrieval.
Re-run the eval before merging any change to chunking, threshold, or prompt. Each number should hold or go up; if one drops, you've introduced a regression.
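The triage and merge check in this list can be sketched as two small functions. This is an illustration under stated assumptions: the 70% recall floor comes from the text, while the 80% answer-accuracy floor, the `tolerance` parameter, and the function and metric names are invented here.

```python
def diagnose(recall_at_5: float, answer_accuracy: float,
             min_recall: float = 0.70, min_accuracy: float = 0.80) -> str:
    """Decide where to look first. min_accuracy is an assumed threshold."""
    if recall_at_5 < min_recall:
        # The right chunk is not reaching the model; prompt work will not help yet.
        return "fix retrieval first"
    if answer_accuracy < min_accuracy:
        # Retrieval is fine, so the prompt or model is the likely problem.
        return "check prompt or model"
    return "ok"


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Names of higher-is-better metrics that fell below their baseline.

    A metric missing from the candidate run counts as a regression.
    """
    return [
        name for name, old in baseline.items()
        if candidate.get(name, float("-inf")) < old - tolerance
    ]


# Illustrative numbers only.
print(diagnose(recall_at_5=0.95, answer_accuracy=0.60))
dropped = regressions(
    baseline={"recall_at_5": 0.80, "answer_accuracy": 0.70},
    candidate={"recall_at_5": 0.85, "answer_accuracy": 0.65},
)
print(dropped)
```

With only 30 to 50 cases, a single flipped answer moves a metric by a couple of points, so a small nonzero `tolerance` may be a reasonable choice; that trade-off is yours to make.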

Prompt your AI

Use these three prompts in order. Each builds on the one before.

1. Basics & terminology

Why do most RAG systems look great in demos and fail in production? What does an eval set fix?

2. Why it works (the mechanism)

Walk me through 'LLM-as-judge': how do you use another LLM to score answers automatically, and what biases does this introduce?

3. Advanced — application & what's next

I have 200 real user queries from logs. Help me design a minimum-viable eval pipeline: how many to label by hand, how to use LLM-judge for the rest, what biases to control for, and how often to re-run.