Where the FLOPs and the time actually go

hard

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

To optimize serving you must know where the work is. A transformer forward pass spends its FLOPs on the attention computation and the big feed-forward (MLP) matmuls, scaled by model size and sequence length. Knowing the rough cost breakdown — that compute scales with parameters and (for attention) with sequence length squared — tells you why bigger models and longer contexts cost more, and where techniques like quantization (shrink the weights) and attention optimizations (tame the n²) pay off. This is the quantitative intuition behind every later optimization.

Demo

The demo estimates FLOPs per token (~2 × parameters for the matmuls) and shows attention's sequence-length dependence, giving you a back-of-envelope for why a 70B model is ~10× a 7B and why long contexts strain attention.

Try it yourself

Compare FLOPs/token for a 7B vs. 70B model and confirm it scales ~linearly with parameters.
Increase seq_len and watch attention's share of the work grow — the motivation for attention optimizations.
Connect: halving the weights' precision (quantization) roughly halves the matmul cost/memory.
Reason about why decode (1 new token, but full weights read) is memory-bound despite low FLOPs — setup for Module 3.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In an LLM forward pass, where do the compute (FLOPs) and time actually go?

2. Why it works (the mechanism)

Explain the rough FLOP breakdown (matmuls ~2×params/token, attention growing with sequence length) and why model size and context length drive cost.

3. Advanced — application & what's next

Using the FLOP breakdown, explain where quantization and attention optimizations pay off, and why decode is memory-bound while prefill is compute-bound despite both running the same weights.