Project · A semantic document search API

Difficulty: hard

Ship something real. Submit your work when you're done.

Brief

Build an end-to-end semantic search system over a corpus of your choice (e.g., Wikipedia abstracts, product descriptions, or your own notes). The system must chunk documents, embed the chunks with a sentence-transformer model, store the embeddings in a vector index (FAISS or pgvector), expose a `/search?q=...` REST endpoint, and return the top-5 results with their similarity scores. Wrap it in Docker and include an eval script, runnable with a single command, that computes Recall@5 on 20 test queries you write.

Deliverables

A `build_index.py` script that reads documents from a file, chunks, embeds, and saves the index to disk.
A FastAPI server (`server.py`) with `GET /search?q=<query>&k=5` that loads the index and returns ranked results as JSON.
A `Dockerfile` and `docker-compose.yml` that build and run the full system.
An `eval.py` with 20 hand-written (query, expected_doc_id) pairs and a Recall@5 score printed to stdout.
A `README.md` with: corpus description, indexing command, curl example, and measured Recall@5.
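
As an illustration of the `eval.py` deliverable, here is a minimal sketch of the Recall@5 computation. The `fake_search` function and its canned results are hypothetical stand-ins for a call to your own `/search` endpoint; only the (query, expected_doc_id) pair format and the printed summary line come from this brief.

```python
from typing import Callable, Sequence


def recall_at_k(
    pairs: Sequence[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> tuple[float, int]:
    """Return (recall, hits): the fraction and the count of queries whose
    expected doc_id appears among the top-k doc_ids returned by `search`."""
    if not pairs:
        raise ValueError("need at least one (query, expected_doc_id) pair")
    hits = sum(1 for query, expected in pairs if expected in search(query, k)[:k])
    return hits / len(pairs), hits


# Hypothetical stand-in for GET /search; a real eval.py would query the
# running server (or load the index directly) and return ranked doc_ids.
def fake_search(query: str, k: int) -> list[str]:
    canned = {
        "how does attention work": ["doc_42", "doc_17", "doc_55"],
        "what is a vector index": ["doc_03", "doc_88"],
    }
    return canned.get(query, [])[:k]


pairs = [
    ("how does attention work", "doc_17"),  # expected doc is ranked 2nd: hit
    ("what is a vector index", "doc_99"),   # expected doc is not returned: miss
]
recall, hits = recall_at_k(pairs, fake_search, k=5)
print(
    f"Recall@5: {recall:.2f} "
    f"({hits}/{len(pairs)} queries retrieved the expected document in top 5)"
)
```

Passing the search function in as an argument keeps the metric independent of how results are fetched, so the same code can score pure dense retrieval and any of the stretch-goal variants.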

How we grade it

Index builds in under 60 seconds for a corpus of at least 500 documents.
Recall@5 ≥ 0.70 on your 20 test queries.
`docker compose up` starts the server cleanly; `curl 'http://localhost:8000/search?q=test'` returns valid JSON.
Each search request returns in under 200 ms for a corpus of ≤10K documents.
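
To check the latency criterion yourself before submitting, something like the following sketch may help. `measure_latency_ms` is a hypothetical helper, and the timed callable here is a cheap placeholder; in practice it would issue a request to `http://localhost:8000/search?q=test`.

```python
import math
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], runs: int = 50) -> dict[str, float]:
    """Call `fn` repeatedly and return wall-clock latency statistics in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: the smallest sample that is >= 95% of all samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "min": samples[0],
        "median": samples[len(samples) // 2],  # upper median when runs is even
        "p95": samples[p95_index],
        "max": samples[-1],
    }


# Placeholder workload; swap in a function that performs one search request.
stats = measure_latency_ms(lambda: sum(range(10_000)), runs=20)
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting a percentile alongside the median shows whether occasional slow requests (for example, the first query after startup) exceed the budget.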

Hints

Use all-MiniLM-L6-v2 for the embedding model — it's fast and good enough for most corpora. Switch to BAAI/bge-large-en-v1.5 only if you need higher accuracy and can accept slower indexing.

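Chunk size matters more than you think: chunks of 100-200 tokens work better than full documents for most Q&A tasks because each chunk covers a single topic, so retrieval is more precise. Short chunks also stay within the model's input limit: by default, all-MiniLM-L6-v2 truncates input longer than 256 word pieces.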
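
A minimal chunking sketch, under two assumptions that are not part of the brief: whitespace-separated words stand in for model tokens (a real tokenizer will usually count more tokens than words), and consecutive chunks overlap by a few words so that sentences cut at a boundary still appear intact in one chunk. The function name `chunk_words` and its defaults are illustrative.

```python
def chunk_words(text: str, max_words: int = 150, overlap: int = 30) -> list[str]:
    """Split `text` into chunks of at most `max_words` words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the document
    return chunks


document = " ".join(f"w{i}" for i in range(400))  # synthetic 400-word document
chunks = chunk_words(document, max_words=150, overlap=30)
print(len(chunks), [len(c.split()) for c in chunks])
```

Keep the source `doc_id` alongside each chunk when you index it, since the eval pairs are defined at the document level.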

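FAISS IndexFlatIP (exact inner-product search) is fine for ≤100K vectors; L2-normalize the embeddings first so that inner-product scores equal cosine similarity. For larger corpora, switch to IndexIVFFlat for approximate search: start with nlist=100, train the index before adding vectors, and tune nprobe to trade speed against recall.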
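
To make concrete what an exact inner-product index computes, here is a pure-Python stand-in. It is not FAISS code, just a brute-force sketch with hypothetical helper names. It also shows why normalization matters: once vectors have unit length, the inner product is the cosine similarity, so scores fall in [-1, 1].

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale `vec` to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_inner_product(
    query: list[float], vectors: list[list[float]], k: int = 5
) -> list[tuple[int, float]]:
    """Score every stored vector against the query and return the k best
    (index, score) pairs, highest score first."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dims.
docs = [l2_normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
query = l2_normalize([2.0, 0.1])
results = top_k_inner_product(query, docs, k=2)
print(results)
```

The cost of this exhaustive scan grows linearly with the number of vectors, which is the reason approximate indexes become attractive for larger corpora.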

Write your 20 eval queries before building the system: otherwise it is tempting to write queries that your system trivially answers, which inflates Recall@5.

Expected output

$ python build_index.py --input corpus.jsonl --output index.faiss
Loading 500 documents...
Chunking (avg 150 tokens/chunk): 1240 chunks
Embedding (batch_size=64): 100%|████| 20/20 [00:08<00:00]
Saved index.faiss (1240 vectors, 384 dims)

$ curl 'http://localhost:8000/search?q=how+does+attention+work&k=3'
{"results": [
  {"score": 0.872, "text": "The attention mechanism...", "doc_id": "doc_42"},
  {"score": 0.854, "text": "Scaled dot-product attention...", "doc_id": "doc_17"},
  {"score": 0.831, "text": "Multi-head attention...", "doc_id": "doc_55"}
]}

$ python eval.py
Recall@5: 0.80 (16/20 queries retrieved the expected document in top 5)

Stretch goals

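Add BM25 sparse retrieval (with the rank-bm25 package) and implement hybrid search: combine BM25 and dense scores with a weighted sum. BM25 scores are unbounded while cosine similarities lie in [-1, 1], so normalize both to a common scale before summing. Measure whether hybrid beats pure dense on your eval set.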
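
One possible way to fuse the two score lists is sketched below. It assumes min-max normalization per query and a single weight `alpha`; both are illustrative choices rather than requirements of the brief, and the score dictionaries are made-up inputs standing in for real BM25 and dense retrieval output.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Weighted sum of normalized dense and sparse scores, best first.
    A document missing from one list contributes 0 for that component."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Made-up scores for one query: cosine similarities and raw BM25 scores.
dense = {"doc_42": 0.87, "doc_17": 0.85, "doc_55": 0.83}
bm25 = {"doc_17": 12.4, "doc_08": 9.1, "doc_42": 3.0}
ranked = hybrid_scores(dense, bm25, alpha=0.5)
print(ranked)
```

Sweeping `alpha` from 0 (pure BM25) to 1 (pure dense) on your 20 eval queries gives a direct answer to whether hybrid search helps on your corpus.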

Add a /ask endpoint that takes a question, retrieves top-3 chunks, and calls an LLM to generate an answer grounded in the retrieved context (a full RAG pipeline).
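
The retrieve-then-generate flow behind such an endpoint might look like the sketch below. The `retrieve` and `generate` callables are hypothetical stubs: in your system the first would wrap the vector search and the second would wrap whichever LLM client you choose. The prompt wording is likewise only an example.

```python
from typing import Callable


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that asks the model to answer from the chunks only."""
    context = "\n\n".join(
        f"[{i}] (doc_id={c['doc_id']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def ask(
    question: str,
    retrieve: Callable[[str, int], list[dict]],
    generate: Callable[[str], str],
    k: int = 3,
) -> dict:
    """Retrieve the top-k chunks, call the model, and return the answer
    together with the doc_ids it was grounded in."""
    chunks = retrieve(question, k)
    answer = generate(build_prompt(question, chunks))
    return {"answer": answer, "sources": [c["doc_id"] for c in chunks]}


# Stubs so the sketch runs without an index or an LLM.
def stub_retrieve(question: str, k: int) -> list[dict]:
    corpus = [
        {"doc_id": "doc_42", "text": "The attention mechanism..."},
        {"doc_id": "doc_17", "text": "Scaled dot-product attention..."},
        {"doc_id": "doc_55", "text": "Multi-head attention..."},
        {"doc_id": "doc_08", "text": "An unrelated chunk."},
    ]
    return corpus[:k]


def stub_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


response = ask("how does attention work", stub_retrieve, stub_generate, k=3)
print(response)
```

Returning the source doc_ids with the answer makes it possible to check whether the generated text is actually supported by what was retrieved.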

Add query caching: if the same query (or one with cosine similarity > 0.98 to a recent query) arrives within 60 seconds, return the cached result. Measure p95 latency with and without the cache.
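
A small in-memory sketch of such a cache follows. The class name `SemanticCache`, the linear scan over entries, and the injectable `clock` (which makes expiry easy to test) are all illustrative assumptions; the 60-second window and the 0.98 threshold come from the goal above.

```python
import math
import time
from typing import Any, Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(
        self,
        ttl_seconds: float = 60.0,
        threshold: float = 0.98,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl = ttl_seconds
        self.threshold = threshold
        self._clock = clock
        self._entries: list[tuple[float, str, Sequence[float], Any]] = []

    def get(self, query: str, embedding: Sequence[float]) -> Optional[Any]:
        """Return a cached result for an identical or near-identical recent
        query, or None on a miss. Expired entries are dropped first."""
        now = self._clock()
        self._entries = [e for e in self._entries if now - e[0] <= self.ttl]
        for _, cached_query, cached_emb, result in self._entries:
            if cached_query == query or cosine(embedding, cached_emb) > self.threshold:
                return result
        return None

    def put(self, query: str, embedding: Sequence[float], result: Any) -> None:
        self._entries.append((self._clock(), query, embedding, result))


cache = SemanticCache()
cache.put("how does attention work", [1.0, 0.0], {"results": ["doc_42"]})
print(cache.get("how does attention work?", [0.999, 0.01]))  # near-duplicate query
```

Note that a similarity lookup still requires embedding the incoming query, so the cache saves the index search (and any re-ranking) rather than the embedding step; exact string matches could be checked before embedding to skip that cost too.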

Implement re-ranking: after retrieval, send the top-10 chunks and the query to a cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) and re-rank by cross-encoder score. Does Recall@5 improve?
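
The re-ranking step itself is a short sort once a pair scorer is available. In the sketch below the scorer is passed in as a function; `overlap_scorer` is a toy word-overlap stand-in so the example runs without downloading a model. Assuming the sentence-transformers package is installed, the `predict` method of a `CrossEncoder` loaded with the model name above could be passed as `score_pairs` instead.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Score each (query, chunk text) pair and return the top_n candidates,
    best first, with the new score attached as `rerank_score`."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_n]]


# Toy stand-in for a cross-encoder: counts words shared by query and text.
def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


candidates = [
    {"doc_id": "doc_42", "text": "The attention mechanism weighs tokens"},
    {"doc_id": "doc_17", "text": "Scaled dot-product attention does the work"},
    {"doc_id": "doc_55", "text": "Convolutions slide a kernel"},
]
reranked = rerank("how does scaled attention work", candidates, overlap_scorer, top_n=2)
print([(c["doc_id"], c["rerank_score"]) for c in reranked])
```

Because the cross-encoder only reorders what retrieval already returned, it can raise Recall@5 only when the expected document sits in positions 6 to 10 of the initial top-10.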