Your first RAG pipeline — chunk, embed, store, retrieve, generate

hard

Why this matters

Retrieval-Augmented Generation (RAG) is one of the most widely deployed patterns in production AI systems. The idea: instead of relying on a language model to memorize everything in its weights (expensive to update and quickly stale), you retrieve relevant documents at query time and include them in the prompt context. This lets a general-purpose LLM answer questions about your proprietary data, recent events, or specialized knowledge without fine-tuning. Many enterprise AI assistants, document Q&A systems, and AI agents use some form of RAG. Knowing how to build one from scratch (chunking strategy, embedding model, vector store, retrieval, prompt engineering) means you can implement and debug these systems rather than treating them as a black box.

Demo

RAG decouples what the model knows (parametric memory, fixed at training time) from what it can look up (retrieved documents, refreshed whenever the index is rebuilt). The demo walks through all five stages: line-based chunking with one sentence per chunk, embedding with all-MiniLM-L6-v2, storing the vectors in a FAISS flat index, retrieving by inner-product search, and filling a prompt template ready to send to any LLM. The handoffs between stages stay explicit, before any production abstraction layer obscures them.

# pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import numpy as np, faiss

# Step 1: Chunk a document
document = """
Machine learning models learn from data by adjusting parameters to minimize a loss function.
Gradient descent is the standard optimization algorithm used in training neural networks.
The learning rate controls how large each parameter update step is.
Overfitting occurs when a model memorizes training data instead of learning general patterns.
Regularization techniques like L2 weight decay and dropout help prevent overfitting.
Cross-validation gives a more reliable estimate of generalization performance than a single train/test split.
Transfer learning reuses pretrained model weights as a starting point for a new task.
"""

# One chunk per non-empty line; each line of this document holds a single sentence
chunks = [s.strip() for s in document.strip().split("\n") if s.strip()]
print(f"Chunks: {len(chunks)}")

# Step 2: Embed chunks
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embs = embedder.encode(chunks, normalize_embeddings=True)  # (7, 384)

# Step 3: Store in a vector index
index = faiss.IndexFlatIP(384)  # inner product (= cosine sim for unit vectors)
index.add(chunk_embs.astype(np.float32))

# Step 4: Retrieve
query = "How do I prevent my model from memorizing training data?"
q_emb = embedder.encode([query], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(q_emb, k=2)

print(f"\nQuery: {query}")
print("Top retrieved chunks:")
for score, idx in zip(scores[0], ids[0]):
    print(f"  [{score:.3f}] {chunks[idx]}")

# Step 5: Generate (prompt template — swap in any LLM)
context = "\n".join(chunks[i] for i in ids[0])
print(f"\nPrompt to send to LLM:\n---\nContext:\n{context}\n\nQuestion: {query}\nAnswer:")

Run: python3 main.py
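
Step 5 only prints a prompt. As a minimal sketch, the same assembly can be wrapped in a helper that also drops weakly matching chunks before they reach the LLM. The function name build_prompt, the min_score parameter, its 0.3 default, and the added instruction line are all assumptions for illustration, not part of the demo; a sensible cutoff depends on your embedding model and data.

```python
def build_prompt(query, chunks, scores, ids, min_score=0.3):
    """Assemble a RAG prompt, keeping only chunks that clear a similarity cutoff.

    min_score is a hypothetical threshold on cosine similarity; 0.3 is a
    placeholder value, not a recommendation.
    """
    # FAISS pads the id array with -1 when it finds fewer than k results, so skip those.
    kept = [chunks[i] for s, i in zip(scores, ids) if i >= 0 and s >= min_score]
    context = "\n".join(kept) if kept else "(no relevant context retrieved)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Stand-in data so the sketch runs without the embedding model or FAISS.
demo_chunks = [
    "Regularization techniques like L2 weight decay and dropout help prevent overfitting.",
    "The learning rate controls how large each parameter update step is.",
]
prompt = build_prompt(
    "How do I prevent overfitting?", demo_chunks, scores=[0.62, 0.11], ids=[0, 1]
)
print(prompt)
```

In the demo above, the equivalent call would be build_prompt(query, chunks, scores[0], ids[0]), since index.search returns one row of scores and ids per query.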

Try it yourself

Change k=2 to k=3. Does the third retrieved chunk still seem relevant? This is a common RAG tuning decision — more context is not always better.

Add a chunk that is semantically misleading: 'Overfitting is great for memorizing multiplication tables.' Re-run the query. Does it appear in the top 2? This simulates noisy retrieval, a real RAG failure mode.

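Try a query that is NOT answered by any chunk: 'What is the capital of France?'. What chunks are retrieved, and how do their scores compare with the earlier query? Nearest-neighbor search always returns k results, however weak the match. This is the hallucination risk in RAG: if the retrieved context is irrelevant, the LLM may still generate a plausible-sounding answer.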

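Swap faiss.IndexFlatIP for faiss.IndexFlatL2 (Euclidean distance). Are the retrieved chunks different? For unit-normalized vectors, ranking by highest cosine similarity and ranking by lowest L2 distance should give the same order. Verify this, keeping in mind that IndexFlatL2 reports squared L2 distances where lower is better, so compare the order of the results, not the raw scores.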
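
The ranking claim in the last exercise rests on the identity ‖a − b‖² = 2 − 2 a·b for unit vectors a and b: a larger inner product always means a smaller squared distance. The sketch below checks this numerically in plain Python, using random vectors as stand-ins for real embeddings. The helper names (normalize, dot, sq_l2) and the dimension of 8 are illustrative choices, not part of the demo.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def sq_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
dim = 8  # small stand-in for the 384 dimensions of all-MiniLM-L6-v2
docs = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(5)]
query = normalize([random.gauss(0, 1) for _ in range(dim)])

# Rank documents two ways: highest inner product first, lowest squared distance first.
by_ip = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: sq_l2(query, docs[i]))

# Check the identity for every document, within floating-point tolerance.
identity_holds = all(
    math.isclose(sq_l2(query, d), 2 - 2 * dot(query, d), abs_tol=1e-9) for d in docs
)

print("same ranking:", by_ip == by_l2)
print("identity holds:", identity_holds)
```

If you drop the normalize calls, the identity no longer holds and the two rankings can diverge, which is why the demo passes normalize_embeddings=True.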

Prompt your AI

Use these three prompts in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what RAG is and why it's useful. Why is retrieval better than fine-tuning for answering questions about frequently changing information (e.g., daily news)?

2. Why it works (the mechanism)

Walk me through the five steps of a RAG pipeline: chunking, embedding, indexing, retrieval, and generation. For each step, name one key decision (e.g., chunk size for chunking) and explain how a wrong choice affects the final answer quality.

3. Advanced — application & what's next

My RAG system retrieves the top-5 chunks by cosine similarity but the LLM still gives wrong answers. Name four root causes (chunking strategy, embedding model quality, query-document distribution mismatch, retrieval count) and for each: a concrete diagnostic test you'd run and a specific fix. Also explain when you'd switch from dense retrieval (embeddings) to hybrid search (BM25 + dense).