Embeddings — what they are, cosine similarity, use cases

medium

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.


Why this matters

An embedding is a dense vector representation of discrete data (a word, sentence, image, or code snippet) that encodes semantic meaning in its geometry: similar things sit close together in vector space. This is why vector arithmetic such as 'king - man + woman ≈ queen' works, why a semantic search over 10M documents can take milliseconds (approximate nearest-neighbor lookup over indexed vectors), and why RAG (retrieval-augmented generation) works at all. Embeddings underpin much of modern AI infrastructure: search, recommendation, anomaly detection, and RAG pipelines commonly depend on them.
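To make "close together in vector space" concrete, here is a minimal sketch using hand-made 2-dimensional toy vectors. The values, the distractor words, and the meaning assigned to each axis are assumptions chosen for illustration, not the output of any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "maleness" (illustrative only)
toy = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.7, 0.9],
    "girl":   [0.05, 0.1],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Exclude the three input words, then pick the nearest remaining word by cosine similarity
candidates = [w for w in toy if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(target, toy[w]))
print(best)
```

With real embeddings the result is only approximately nearest to 'queen', which is why the analogy is written with ≈ rather than =.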

Demo

A sentence embedding encodes semantic meaning as a point in high-dimensional space, so the cosine similarity between two vectors measures how related the sentences are, even when they share no words. The demo encodes a small corpus with all-MiniLM-L6-v2 and ranks documents against a query, showing how semantic search differs from keyword matching.
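The ranking step can be sketched as follows. The demo's own code is not reproduced here, so the function names `encode_with_minilm` and `rank_by_cosine` and the toy vectors are assumptions. The encoder helper assumes the `sentence-transformers` package is installed and imports it lazily, so the ranking logic runs without it.

```python
import math

def encode_with_minilm(texts):
    """Encode texts with all-MiniLM-L6-v2 (assumes sentence-transformers is installed)."""
    from sentence_transformers import SentenceTransformer  # lazy import: optional dependency
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # normalize_embeddings=True returns unit-length vectors
    return model.encode(texts, normalize_embeddings=True)

def _unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_by_cosine(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar to the query."""
    q = _unit(query_vec)
    scores = []
    for i, doc in enumerate(corpus_vecs):
        d = _unit(doc)
        # Dot product of two unit vectors equals their cosine similarity
        scores.append((i, sum(a * b for a, b in zip(q, d))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-d stand-ins for real 384-d embeddings (illustrative values only)
toy_corpus = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.1, 0.9]]
toy_query = [1.0, 0.2, 0.0]
ranking = rank_by_cosine(toy_query, toy_corpus)
print(ranking)
```

To use the real model instead of the toy vectors, encode the corpus and the query together with `encode_with_minilm` and pass the last row as `query_vec` and the remaining rows as `corpus_vecs`.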

Try it yourself

Run the query 'What is supervised learning?' against the corpus. Check that sentences about gradient descent and backprop rank higher than the one about dogs. This is semantic search — it works even without keyword overlap.
Add a new corpus entry: 'Stochastic gradient descent updates weights using a mini-batch of samples.' Re-run the query about neural network learning. Does the new entry rank above the backpropagation sentence? Why or why not?
Compute the pairwise similarity matrix of all 5 corpus sentences: corpus_emb_n @ corpus_emb_n.T. Print it. Verify that the two ML-related sentences about gradient descent and backprop are more similar to each other than to the dogs sentence.
Try a different query: 'Animals as pets'. Verify the dogs sentence now ranks highest. This demonstrates that ranking is query-dependent: the document embeddings stay fixed, but each document's score changes with the query it is compared against.
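The pairwise matrix and the query-dependence from the exercises above can be previewed with toy vectors before running the real model. This sketch assumes NumPy. The name `corpus_emb_n` follows the exercise and is taken here to mean row-normalised embeddings. The 3-d vectors and the two query vectors are made-up stand-ins for model output.

```python
import numpy as np

# Hypothetical stand-ins for sentence embeddings (rows: gradient descent, backprop, dogs)
corpus_emb = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.4, 0.1],
    [0.1, 0.1, 0.9],
])

# L2-normalise each row so that a dot product equals cosine similarity
corpus_emb_n = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)

# Pairwise cosine similarities: entry [i, j] compares sentence i with sentence j
sim = corpus_emb_n @ corpus_emb_n.T
print(np.round(sim, 2))

def rank(query_vec):
    """Indices of corpus rows, most similar to the query first."""
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(corpus_emb_n @ q))

ml_query = np.array([1.0, 0.3, 0.0])   # toy "machine learning" direction
pet_query = np.array([0.0, 0.1, 1.0])  # toy "pets" direction
print(rank(ml_query), rank(pet_query))
```

The matrix is symmetric with ones on the diagonal, since every sentence has cosine similarity 1 with itself. The corpus rows never change between the two `rank` calls; only the query vector does.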

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what a sentence embedding is and why two semantically similar sentences have similar embedding vectors even if they share no words.

2. Why it works (the mechanism)

Walk me through how a sentence-transformer model converts a variable-length sentence into a fixed-size vector. What is mean pooling, and why is it applied to the token embeddings? Why is cosine similarity often preferred over Euclidean distance for comparing embeddings?
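As a reference when checking your AI's answer to this prompt, mean pooling and the cosine/Euclidean comparison can be sketched in a few lines. The helper names, the 2-d token vectors, and the attention mask are made up for illustration. Real models produce hundreds of dimensions per token.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors where mask == 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical token embeddings: 3 real tokens plus 1 padding token (mask 0)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vec = mean_pool(tokens, mask)  # the padding row does not affect the result
print(sentence_vec)

# Same direction, different length: cosine treats them as identical, Euclidean does not
short_vec, long_vec = [1.0, 2.0], [10.0, 20.0]
print(cosine(short_vec, long_vec), euclidean(short_vec, long_vec))
```

Whatever the sentence length, the pooled vector has the same number of dimensions as one token vector. For unit-length vectors the two measures give the same ranking, because squared Euclidean distance equals 2 - 2 × cosine similarity.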

3. Advanced — application & what's next

I need to build a semantic search system over 5M product descriptions. Walk me through the full stack: embedding model choice (all-MiniLM vs BGE vs OpenAI text-embedding-3-small), vector database options (Pinecone vs Weaviate vs pgvector), approximate nearest neighbor algorithms (HNSW vs IVF), and batched indexing strategy for 5M items. What are the approximate storage size and p95 query latency I should expect?
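A rough starting point for sanity-checking the storage part of the answer is simple arithmetic: number of vectors × dimensions × bytes per value. The sketch below assumes uncompressed float32 vectors and counts raw vectors only. Index structures such as HNSW graphs, metadata, and replication add more, and it makes no claim about latency. The dimensions used are 384 for all-MiniLM-L6-v2 and 1536 for the default output of text-embedding-3-small. BGE is left out because its dimension depends on the variant. The helper names and the batch size are illustrative assumptions.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of the raw embedding matrix, assuming dense float32 values by default."""
    return num_vectors * dim * bytes_per_value

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, for batched encoding/indexing."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

NUM_ITEMS = 5_000_000
DIMS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,  # default output dimension
}

sizes_gb = {name: raw_vector_bytes(NUM_ITEMS, dim) / 1e9 for name, dim in DIMS.items()}
for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.2f} GB of raw float32 vectors")

# Batching demo on a tiny list; a real run would iterate over the 5M descriptions
demo_batches = list(batches(list(range(5)), batch_size=2))
print(demo_batches)
```

Quadrupling the dimension quadruples raw storage, so model choice and any quantisation the vector store applies dominate the estimate. Latency depends on the index type, its parameters, and the hardware, so measure it on your own data.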