The dot product is everything

medium

Why this matters

A single neuron is a dot product (plus a bias) followed by a non-linearity. A linear layer is many dot products bundled as a matrix multiply. Attention scores are scaled dot products between queries and keys. Word similarity is usually measured with a normalized dot product between embeddings. If you understand the dot product as a weighted sum, and you understand that matrix multiplication is just many dot products stacked, you've unlocked most of the math in an LLM.
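To make "many dot products stacked" concrete, here is a minimal Python sketch. The function names, the choice of ReLU as the non-linearity, and the toy numbers are all assumptions made for illustration; the bias is left out of the layer to keep it short.

```python
def dot(a, b):
    """Weighted sum: multiply matching coordinates, then add them up."""
    return sum(x * y for x, y in zip(a, b))


def relu(z):
    """One common non-linearity, picked here only as an example."""
    return max(0.0, z)


def neuron(weights, bias, x):
    """A single neuron: dot product, plus a bias, then a non-linearity."""
    return relu(dot(weights, x) + bias)


def linear_layer(W, x):
    """A linear layer without bias: one dot product per row of W."""
    return [dot(row, x) for row in W]


# Hypothetical toy values, not taken from any real model.
x = [1.0, 2.0, 3.0]
W = [
    [0.5, -1.0, 0.25],
    [1.0, 0.0, -1.0],
]

print(linear_layer(W, x))    # one output per row of W
print(neuron(W[0], 0.5, x))  # the first row reused as a single neuron
```

Each output of `linear_layer` is the same `dot` call applied to a different row of `W`, which is all a matrix-vector multiply does.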

Demo

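The dot product of two vectors a and b is sum(a_i * b_i). Geometrically it's |a| * |b| * cos(θ), where θ is the angle between them, which is why the normalized dot product (cosine similarity) measures "how aligned are these two vectors?" That's also the measure behind "king - man + woman ≈ queen": you do the vector arithmetic, then look for the embedding with the highest cosine similarity to the result.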
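For readers who want to see where the geometric form comes from, here is a short derivation sketch using the law of cosines. It assumes a and b are non-zero real vectors and θ is the angle between them.

```latex
% Triangle with sides a, b and a - b; \theta is the angle between a and b.
\begin{align*}
\|a - b\|^2 &= \|a\|^2 + \|b\|^2 - 2\,\|a\|\,\|b\|\cos\theta
  && \text{(law of cosines)} \\
\|a - b\|^2 &= (a - b)\cdot(a - b) = \|a\|^2 - 2\,a\cdot b + \|b\|^2
  && \text{(expand the dot product)} \\
a \cdot b &= \|a\|\,\|b\|\cos\theta
  && \text{(equate and simplify)} \\
\cos\theta &= \frac{a \cdot b}{\|a\|\,\|b\|}
  && \text{(cosine similarity)}
\end{align*}
% Scale invariance: for any c > 0,
% (c a) \cdot b / (\|c a\| \|b\|) = c (a \cdot b) / (c \|a\| \|b\|) = \cos\theta.
```

Dividing by the magnitudes cancels any positive rescaling of either vector, so only the direction is left.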

Below: the dot product from scratch in three languages, plus cosine similarity on a tiny embedding example.
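A minimal Python sketch of that demo follows. The three-dimensional "embeddings" are hypothetical values invented for illustration (real embeddings have hundreds of dimensions), with the third coordinate loosely standing for "car-like".

```python
import math


def dot(a, b):
    """Dot product from scratch: sum of coordinate-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length |a|, which is sqrt(a . a)."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """cos(theta) between a and b; raises ZeroDivisionError for a zero vector."""
    return dot(a, b) / (norm(a) * norm(b))


# Hypothetical toy embeddings (assumed values, not from a trained model).
king = [0.6, 0.5, 0.1]
queen = [0.55, 0.6, 0.05]
car = [0.1, 0.05, 0.95]

print("dot(king, queen):", dot(king, queen))
print("cos(king, queen):", cosine_similarity(king, queen))
print("cos(king, car):  ", cosine_similarity(king, car))
```

With these assumed values, king sits much closer to queen than to car. Rescaling any vector changes its dot products but leaves the cosines unchanged.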

Try it yourself

Compute by hand: dot([1,2,3], [4,5,6]). Then verify with code.

Flip the sign of one coordinate in queen and observe what happens to cosine similarity. Can you make it negative?

Swap king's 3rd coordinate from 0.1 to 0.95 (making it more car-like). Re-run both cosines — they should cross over.

Implement the dot product without an explicit loop, using the vectorized primitives in your language's standard library. Time both versions on 1M-element vectors.
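One possible starting point for the timing exercise, sketched in Python with only the standard library. The function names are made up for this example. Note that `sum(map(...))` removes the explicit loop but is not truly vectorized; in Python, a library such as NumPy (`numpy.dot`) is the usual choice for that, and it is not part of the standard library.

```python
import operator
import random
import timeit


def dot_loop(a, b):
    """Explicit index loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_no_loop(a, b):
    """No explicit loop: multiply pairwise with map, then sum."""
    return sum(map(operator.mul, a, b))


n = 1_000_000
random.seed(0)  # fixed seed so repeated runs use the same data
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]

for fn in (dot_loop, dot_no_loop):
    # Average over a few calls to smooth out noise.
    seconds = timeit.timeit(lambda: fn(a, b), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```

Timings depend on your machine and Python version, so compare the two numbers to each other, not to someone else's results.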

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Define the dot product and cosine similarity. Give one non-ML example and one ML example where each is used.

2. How it actually works (the mechanism)

Explain why `a · b = |a||b|cos(θ)`. Walk through the derivation using the law of cosines, then explain why dividing by magnitudes gives a scale-invariant similarity.

3. Advanced — application & what's next

In an LLM, when is a dot product NOT the right similarity? Name two: (a) when embeddings aren't normalized and lengths encode frequency, (b) in IR with learned-sparse vectors. Explain what to use instead.