Retrieval-Augmented Generation (RAG) is one of the most widely deployed production AI patterns as of 2025. The idea: instead of relying on a language model to memorize everything in its weights (expensive to update and quickly stale), you retrieve relevant documents at query time and include them in the context. This lets a general-purpose LLM answer questions about your proprietary data, recent events, or specialized knowledge without any fine-tuning. Many enterprise AI assistants, document Q&A systems, and AI agents use some form of RAG. Building one from scratch (chunking strategy, embedding model, vector store, retrieval, prompt engineering) means you can implement and debug these systems rather than treating them as a black box.
RAG decouples what the model knows (parametric memory, fixed at training time) from what it can look up (retrieved documents, refreshed whenever the index is rebuilt or extended). The demo walks through all five stages: sentence-level chunking, encoding with all-MiniLM-L6-v2, indexing in FAISS, inner-product search, and assembling a prompt template ready to send to any LLM. This makes the handoffs between stages explicit before any production abstraction layer obscures them.
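The demo normalizes its embeddings (normalize_embeddings=True) and searches with an inner-product index, relying on the fact that for unit-length vectors the inner product equals cosine similarity and also ranks results the same way as Euclidean distance. The sketch below checks that relationship with NumPy alone. The vectors are random stand-ins, not real model output, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for 7 chunk embeddings and 1 query embedding (384 dims).
# These are NOT produced by all-MiniLM-L6-v2; they only illustrate the math.
docs = rng.normal(size=(7, 384))
query = rng.normal(size=384)

# Normalize to unit length, mirroring normalize_embeddings=True in the demo.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

inner = docs @ query                        # higher means more similar
sq_l2 = ((docs - query) ** 2).sum(axis=1)   # lower means more similar

# For unit vectors: ||d - q||^2 = 2 - 2 * (d . q)
identity_holds = bool(np.allclose(sq_l2, 2.0 - 2.0 * inner))

# Rank by descending inner product and by ascending squared distance.
rank_by_inner = np.argsort(-inner)
rank_by_l2 = np.argsort(sq_l2)
same_ranking = bool(np.array_equal(rank_by_inner, rank_by_l2))

print("identity holds:", identity_holds)
print("same ranking:", same_ranking)
```

If the normalization step is skipped, the identity no longer holds and the two rankings can diverge, which is why the normalization flag and the index type need to be chosen together.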
# pip install sentence-transformers faiss-cpu transformers
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
# Step 1: Chunk a document
document = """
Machine learning models learn from data by adjusting parameters to minimize a loss function.
Gradient descent is the standard optimization algorithm used in training neural networks.
The learning rate controls how large each parameter update step is.
Overfitting occurs when a model memorizes training data instead of learning general patterns.
Regularization techniques like L2 weight decay and dropout help prevent overfitting.
Cross-validation gives a more reliable estimate of generalization performance than a single train/test split.
Transfer learning reuses pretrained model weights as a starting point for a new task.
"""
chunks = [s.strip() for s in document.strip().split("\n") if s.strip()]
print(f"Chunks: {len(chunks)}")
# Step 2: Embed chunks
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embs = embedder.encode(chunks, normalize_embeddings=True) # (7, 384)
# Step 3: Store in a vector index
index = faiss.IndexFlatIP(384) # inner product (= cosine sim for unit vectors)
index.add(chunk_embs.astype(np.float32))
# Step 4: Retrieve
query = "How do I prevent my model from memorizing training data?"
q_emb = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(q_emb, k=2)
print(f"\nQuery: {query}")
print("Top retrieved chunks:")
for score, idx in zip(scores[0], ids[0]):
    print(f"  [{score:.3f}] {chunks[idx]}")
# Step 5: Generate (prompt template — swap in any LLM)
context = "\n".join(chunks[i] for i in ids[0])
print(f"\nPrompt to send to LLM:\n---\nContext:\n{context}\n\nQuestion: {query}\nAnswer:")

python3 main.py

Change k=2 to k=3. Does the third retrieved chunk still seem relevant? This is a common RAG tuning decision: more context is not always better.
Add the line 'Overfitting is great for memorizing multiplication tables.' to the document. Re-run the query. Does it appear in the top 2? This simulates noisy retrieval, a real RAG failure mode.
Change the query to 'What is the capital of France?'. What chunks are retrieved? This is the hallucination risk in RAG: if the retrieved context is irrelevant, the LLM may still generate a plausible-sounding answer.
Swap faiss.IndexFlatIP for faiss.IndexFlatL2 (Euclidean distance). Are the retrieved chunks different? For unit-normalized vectors, cosine similarity and negative L2 distance should produce the same ranking. Verify this, keeping in mind that IndexFlatL2 reports squared distances, where smaller means closer.

Use these three in order. Each builds on the one before.
In one paragraph, explain what RAG is and why it's useful. Why is retrieval better than fine-tuning for answering questions about frequently changing information (e.g., daily news)?
Walk me through the five steps of a RAG pipeline: chunking, embedding, indexing, retrieval, and generation. For each step, name one key decision (e.g., chunk size for chunking) and explain how a wrong choice affects the final answer quality.
My RAG system retrieves the top-5 chunks by cosine similarity but the LLM still gives wrong answers. Name four root causes (chunking strategy, embedding model quality, query-document distribution mismatch, retrieval count) and for each: a concrete diagnostic test you'd run and a specific fix. Also explain when you'd switch from dense retrieval (embeddings) to hybrid search (BM25 + dense).
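The last prompt mentions hybrid search (BM25 + dense) without saying how two rankings might be combined. One widely used option is reciprocal rank fusion (RRF), sketched below. This is an assumption about how one might do it, not part of the lesson's demo: the function name, the chunk ids, and both rankings are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default (assumed).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings of chunk ids for the same query.
dense_ranking = [3, 4, 0, 6]  # e.g. from the embedding index
bm25_ranking = [4, 2, 3, 5]   # e.g. from a keyword (BM25) scorer

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to put BM25 scores and cosine similarities on a common scale. Chunks that both retrievers rank highly (ids 4 and 3 here) rise to the top of the fused list.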