Build an end-to-end semantic search system over a corpus of your choice (e.g., Wikipedia abstracts, product descriptions, or your own notes). The system must: chunk documents, embed them with a sentence-transformer, store them in a vector index (FAISS or pgvector), expose a `/search?q=...` REST endpoint, and return the top-5 results with their similarity scores. Wrap it in Docker and include a one-line eval script that computes Recall@5 on 20 test queries you write.
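The chunking step can be sketched as a sliding window over words. This is a minimal illustration, not a prescribed method: it assumes whitespace-separated words are an acceptable stand-in for model tokens, and the function name `chunk_words` and the `overlap` default are hypothetical choices.

```python
def chunk_words(text: str, max_words: int = 150, overlap: int = 30) -> list[str]:
    """Split text into overlapping windows of at most max_words words.

    Assumption: whitespace words approximate tokens; a real tokenizer
    would give slightly different chunk boundaries.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a somewhat larger index.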
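Use `all-MiniLM-L6-v2` for the embedding model: it's fast and good enough for most corpora. Switch to `BAAI/bge-large-en-v1.5` only if you need higher accuracy and can accept slower indexing.

`IndexFlatIP` (exact inner-product search) is fine for ≤100K vectors; L2-normalize the embeddings first so that the inner product equals cosine similarity. For larger corpora, use `IndexIVFFlat` with `nlist=100` for approximate search, and remember that an IVF index must be trained on a sample of the vectors before you add them.

$ python build_index.py --input corpus.jsonl --output index.faiss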
Loading 500 documents...
Chunking (avg 150 tokens/chunk): 1240 chunks
Embedding (batch_size=64): 100%|████| 20/20 [00:08<00:00]
Saved index.faiss (1240 vectors, 384 dims)
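To make the `IndexFlatIP` recommendation concrete, here is a pure-Python sketch of what an exhaustive inner-product search computes. It is a stand-in for illustration only, not FAISS's implementation. The helper names `normalize` and `top_k_inner_product` are hypothetical, and in the real system you would call FAISS instead.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k_inner_product(query, vectors, k=5):
    """Return the k best (score, index) pairs by inner product.

    Both sides are unit-normalized first, so each score is the cosine
    similarity between the query and the stored vector.
    """
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Because every stored vector is compared against the query, the cost grows linearly with corpus size, which is why an approximate index becomes attractive for larger corpora.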
$ curl 'http://localhost:8000/search?q=how+does+attention+work&k=3'
{"results": [
{"score": 0.872, "text": "The attention mechanism...", "doc_id": "doc_42"},
{"score": 0.854, "text": "Scaled dot-product attention...", "doc_id": "doc_17"},
{"score": 0.831, "text": "Multi-head attention...", "doc_id": "doc_55"}
]}
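The request and response shown above can be handled by a small framework-independent function. This is a sketch under assumptions: `handle_search`, the injected `search_fn(query, k)` callable, and the `max_k` cap are hypothetical names and choices, and a web framework of your choice would wrap this function to serve the actual route.

```python
from urllib.parse import parse_qs

def handle_search(query_string, search_fn, default_k=5, max_k=50):
    """Parse 'q' and 'k' from a raw query string and build the JSON body.

    search_fn(query, k) is assumed to return (score, text, doc_id) tuples
    sorted by descending score. Returns (http_status, response_dict).
    """
    params = parse_qs(query_string)  # decodes '+' and %xx escapes
    q = params.get("q", [""])[0].strip()
    if not q:
        return 400, {"error": "missing query parameter 'q'"}
    try:
        k = int(params.get("k", [default_k])[0])
    except ValueError:
        return 400, {"error": "'k' must be an integer"}
    k = max(1, min(k, max_k))  # clamp to a sane range
    results = [
        {"score": round(float(score), 3), "text": text, "doc_id": doc_id}
        for score, text, doc_id in search_fn(q, k)
    ]
    return 200, {"results": results}
```

Keeping the parsing and response shaping separate from the framework makes the endpoint easy to unit-test without starting a server.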
$ python eval.py
Recall@5: 0.80 (16/20 queries retrieved the expected document in top 5)
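The metric reported above, the fraction of queries whose expected document appears in the top 5, can be computed as follows. This is a minimal sketch assuming one expected `doc_id` per test query; the function name `recall_at_k` and the dictionary layout are hypothetical.

```python
def recall_at_k(retrieved, expected, k=5):
    """Fraction of queries whose expected doc_id is in the top-k results.

    retrieved: dict mapping query -> ranked list of doc_ids
    expected:  dict mapping query -> the single doc_id that should appear
    Returns (recall, number_of_hits).
    """
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(
        1
        for query, doc_id in expected.items()
        if doc_id in retrieved.get(query, [])[:k]
    )
    return hits / len(expected), hits
```

If several chunks of the same document are returned, deduplicate by `doc_id` before slicing to the top k, otherwise one document can crowd out others.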
Add BM25 (`rank-bm25` package) and implement hybrid search: combine BM25 and dense scores with a weighted sum. Measure whether hybrid beats pure dense on your eval set.

Add an `/ask` endpoint that takes a question, retrieves top-3 chunks, and calls an LLM to generate an answer grounded in the retrieved context (a full RAG pipeline).

Add a cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`) and re-rank the retrieved candidates by cross-encoder score. Does Recall@5 improve?
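For the hybrid search item, the weighted sum only makes sense once BM25 and dense scores are on a comparable scale. One possible sketch, assuming min-max normalization per query (a choice made here, not one stated in the brief) and hypothetical names `min_max`, `hybrid_scores`, and `alpha`:

```python
def min_max(scores):
    """Rescale a dict of id -> score to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so there is no ranking signal to keep.
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_scores(dense, bm25, alpha=0.5):
    """Fuse two score dicts: alpha * dense + (1 - alpha) * bm25.

    A document missing from one retriever contributes 0 for that side.
    Returns (doc_id, fused_score) pairs sorted by descending score.
    """
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    ids = set(dense_n) | set(bm25_n)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Sweeping `alpha` over a few values on the 20 eval queries is a cheap way to see whether any weighting beats pure dense retrieval (`alpha=1.0`).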