Measuring it — wrk, hey, oha, and percentiles

hard

Why this matters

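If you can't measure latency, you can't improve it. 'Average latency' is a lie: one 2-second request among a thousand 20ms requests moves the average by only about 2ms, and p99 does not move at all; only the maximum shows it. Let that slow path reach 2% of requests and p99 jumps from 20ms to 2000ms while the average is still only about 60ms. What matters is the distribution: p50, p90, p95, p99, p99.9. You need a load-testing tool that reports percentiles, not just the mean. wrk is the classic (pass --latency to get its percentile breakdown); hey is a friendlier alternative written in Go; oha is a Rust tool inspired by hey, with a pretty terminal UI. Learning to read a percentile distribution is half the job.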
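A quick way to check this arithmetic is to compute the numbers directly. The sketch below uses a nearest-rank percentile (one common definition, chosen here as an assumption; load-testing tools may compute percentiles differently) on two made-up samples: one slow request in a thousand, and twenty slow requests in a thousand.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the mean next to the percentiles the lesson cares about."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "p99.9": percentile(samples, 99.9),
        "max": max(samples),
    }


# Hypothetical latencies in milliseconds.
one_slow = [20] * 999 + [2000]          # 0.1% of requests are slow
twenty_slow = [20] * 980 + [2000] * 20  # 2% of requests are slow

for name, data in [("one slow", one_slow), ("twenty slow", twenty_slow)]:
    print(name, summarize(data))
```

With one slow request the mean is 21.98ms and p99 stays at 20ms; only the maximum reveals the outlier. With twenty slow requests the mean is 59.6ms while p99 is 2000ms.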

Demo

Load-testing tools differ in how they model concurrency, measure latency, and report percentiles, so the same server produces different-looking numbers depending on which tool you choose. Running hey, oha, and wrk against the same endpoint makes the differences concrete: request rate, connection model, and histogram bucketing vary from tool to tool, and each tool's defaults make different trade-offs. Calibrating with all three means you can trust the numbers you report.
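To see why bucketing alone can change a reported percentile, here is a minimal sketch that computes p90 on the same samples three ways: exactly, through 10ms buckets, and through 50ms buckets. It illustrates the general effect only; it is not a description of how wrk, hey, or oha store latencies internally, and the sample values and the helper names are made up.

```python
import math


def exact_percentile(samples, p):
    """Nearest-rank percentile on the raw samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def bucketed_percentile(samples, p, bucket_width_ms):
    """Percentile seen through a fixed-width histogram.

    Reports the upper edge of the bucket that contains the target rank,
    so the answer can only be as precise as the bucket width.
    """
    counts = {}
    for s in samples:
        idx = int(s // bucket_width_ms)
        counts[idx] = counts.get(idx, 0) + 1
    target = max(1, math.ceil(p * len(samples) / 100))
    seen = 0
    for idx in sorted(counts):
        seen += counts[idx]
        if seen >= target:
            return (idx + 1) * bucket_width_ms
    raise ValueError("no samples")


# Hypothetical latencies in milliseconds.
latencies = [12, 14, 15, 17, 18, 21, 23, 26, 31, 94]

print("exact p90:       ", exact_percentile(latencies, 90))
print("10ms buckets p90:", bucketed_percentile(latencies, 90, 10))
print("50ms buckets p90:", bucketed_percentile(latencies, 90, 50))
```

The exact p90 is 31ms, the 10ms histogram reports 40ms, and the 50ms histogram reports 50ms: same data, three different-looking answers.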

Try it yourself

Install oha (cargo install oha, or brew install oha). Run it against your server. Look at the pretty TUI.
Same server, same load: run wrk and hey for comparison. Note which gives you percentiles, which gives you a histogram, and which is easiest to read.
Compare p50 vs p99 latency. How much bigger is p99? As a rule of thumb, in a healthy service it's roughly 3-5x; in an unhealthy one it can be 50-100x.
Add a 'slow path' (a 1% chance of a 500ms sleep). Rerun. Watch p99.9 explode while p50 stays the same; p99 sits right at the 1% boundary, so it may jump or barely move from run to run. This is the tail-latency problem.
Pick a realistic SLO for your hypothetical service — e.g. 'p99 < 100ms at 5000 rps'. Your numbers: pass or fail?
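If you don't have a server handy for the slow-path step, a throwaway one is enough. The sketch below uses Python's standard-library http.server and takes the slow path with a 1% chance of a 500ms sleep. The names SLOW_PROBABILITY, SLOW_DELAY_S, and make_server, and the port 8080, are illustrative choices, not part of the lesson. A standard-library Python server is itself slow, so treat the absolute numbers as illustrative and watch the shape of the distribution.

```python
import random
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

SLOW_PROBABILITY = 0.01  # assumed: 1% of requests take the slow path
SLOW_DELAY_S = 0.5       # assumed: the slow path sleeps for 500 ms


class Handler(BaseHTTPRequestHandler):
    # HTTP/1.1 plus an explicit Content-Length lets clients reuse connections.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # Decide per request whether to take the slow path.
        if random.random() < SLOW_PROBABILITY:
            time.sleep(SLOW_DELAY_S)
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so it does not distort the measurement.
        pass


def make_server(port=8080):
    """Build (but do not start) a threaded HTTP server bound to localhost."""
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Start it with make_server().serve_forever(), point each tool at http://127.0.0.1:8080/, then set SLOW_PROBABILITY to 0 and rerun to get the baseline for comparison.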

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain why 'average latency' is a misleading metric. Use a concrete example with a distribution of response times.

2. Why it works (the mechanism)

Walk me through how a load tester like wrk or oha actually works: it opens N concurrent connections, sends requests as fast as the server accepts them, records per-request latencies, and at the end computes percentiles. Where can the load tester itself become the bottleneck?
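The loop this prompt describes can be sketched in a few lines. This is a toy closed-loop tester, not a reimplementation of wrk or oha: each worker thread stands in for one connection, sends its next request only after the previous one returns, and the percentiles are computed at the end. The function run_load and its send_request parameter are hypothetical names; send_request is any callable that performs one request.

```python
import math
import threading
import time


def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def run_load(send_request, connections, requests_per_connection):
    """Closed-loop load: each worker waits for a response before sending again."""
    latencies_ms = []
    lock = threading.Lock()

    def worker():
        local = []
        for _ in range(requests_per_connection):
            start = time.perf_counter()
            send_request()
            local.append((time.perf_counter() - start) * 1000)
        # Merge once per worker to keep lock contention out of the timings.
        with lock:
            latencies_ms.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(connections)]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - started

    return {
        "requests": len(latencies_ms),
        "rps": len(latencies_ms) / elapsed,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
    }


# Stand-in for a real request: a 1 ms sleep.
print(run_load(lambda: time.sleep(0.001), connections=4, requests_per_connection=25))
```

To aim it at a real endpoint, pass something like lambda: urllib.request.urlopen(url).read() as send_request, keeping in mind that urlopen opens a fresh connection per call. The sketch also hints at one answer to the prompt's last question: Python threads share one interpreter lock, so this client will saturate long before a tool like wrk does, and past that point you are measuring the tester rather than the server.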

3. Advanced — application & what's next

Your service has p50=10ms, p99=300ms, and you're told to fix the tail. The symptoms: most requests are fast, 1% are 30x slower. Give me a checklist of the 5 most likely causes, in order of how often they're the real answer (GC pause? DB lock? slow disk? retry storm? cold cache?).