Health endpoints — /healthz vs /readyz

hard

Why this matters

A 'health check' sounds simple until you have a load balancer that kills your app whenever Redis blips. The trick is two endpoints. /healthz answers 'is the process alive?': always 200 unless the process itself is broken. /readyz answers 'can this instance serve traffic right now?': 503 if a dependency that requests will need is down. Load balancers route on /readyz. Process supervisors restart on /healthz. Conflate them and either your LB will yank a perfectly fine app out of rotation every time S3 has a hiccup, or your supervisor will never restart a wedged process. Kubernetes formalized this distinction as liveness and readiness probes; the rest of us should adopt it.

Demo

The right shape: /healthz is a one-liner returning 200 (or 503 only if the process has detected that it is broken). /readyz checks every dependency this instance needs in order to serve traffic (DB, cache, maybe one critical upstream) with short timeouts (50-200 ms) and returns 503 if any of them is down. Then your LB polls /readyz and your supervisor or orchestrator polls /healthz. Two endpoints, two semantics, no overlap.
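The shape described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a reference implementation: the names tcp_check, DEPENDENCIES, liveness, readiness and HealthHandler are made up for this example, the hosts and ports are placeholders, and a bare TCP connect is only a stand-in for a real DB or cache probe.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CHECK_TIMEOUT_S = 0.2  # upper end of the 50-200 ms range


def tcp_check(host, port, timeout=CHECK_TIMEOUT_S):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Hypothetical dependencies: swap in real probes for your DB and cache.
DEPENDENCIES = {
    "db": lambda: tcp_check("127.0.0.1", 5432),
    "cache": lambda: tcp_check("127.0.0.1", 6379),
}


def liveness():
    """/healthz: the process is up and can answer. No dependency checks."""
    return 200, {"status": "ok"}


def readiness(dependencies):
    """/readyz: 200 only if every dependency check passes, otherwise 503."""
    results = {}
    for name, check in dependencies.items():
        try:
            results[name] = "ok" if check() else "down"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status, body = liveness()
        elif self.path == "/readyz":
            status, body = readiness(DEPENDENCIES)
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, one option is to run ThreadingHTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever() and curl both paths (port 8080 is an arbitrary choice). Note that liveness() never touches a dependency, which is the point: a Redis outage should change the /readyz answer and leave /healthz alone.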

Try it yourself

Implement both endpoints in your service. Curl them — /healthz should always be 200 (or you have a problem); /readyz should be 200 with all dependencies up.
Stop your local Redis or DB, then hit /readyz: you should get a 503 with a clear JSON body saying which dependency failed. Restart it and /readyz should go back to 200.
Configure your LB / proxy / Kubernetes to poll /readyz (NOT /healthz) for traffic routing. Configure your process supervisor (or k8s livenessProbe) to poll /healthz for restarts.
Add a /readyz timeout of 200ms. Make the DB check sleep 500ms. Confirm /readyz returns 503 due to the timeout, not the DB itself — and that you can tell the difference from the response body.
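For the last exercise, one possible way to tell a timeout apart from a failed dependency is to run the checks on a thread pool against a shared deadline and record a distinct outcome for each case. The sketch below assumes that approach; readyz, slow_db_check, READYZ_TIMEOUT_S and the "ok" / "down" / "timeout" labels are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

READYZ_TIMEOUT_S = 0.2  # total budget for one /readyz call

# Shared pool: a check that times out keeps running in the background,
# so it must not block the response.
_executor = ThreadPoolExecutor(max_workers=4)


def readyz(checks, timeout=READYZ_TIMEOUT_S):
    """Run all checks concurrently and classify each as ok, down, timeout or error."""
    futures = {name: _executor.submit(check) for name, check in checks.items()}
    deadline = time.monotonic() + timeout
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = "ok" if future.result(timeout=remaining) else "down"
        except FutureTimeout:
            results[name] = "timeout"  # no answer within the budget
        except Exception as exc:
            results[name] = f"error: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}


def slow_db_check():
    """Simulated DB check for the exercise: healthy, but takes 500 ms."""
    time.sleep(0.5)
    return True
```

Calling readyz({"db": slow_db_check}) should return 503 with the db entry marked "timeout", while a check that answers quickly with False is marked "down", so the response body shows which of the two happened. One caveat of this design: the timed-out check is not cancelled, it simply finishes later on its worker thread, so real checks should also carry their own client-side timeouts.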

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain the difference between liveness and readiness probes. Why does Kubernetes have both, and why does the distinction matter even outside k8s?

2. Why it works (the mechanism)

Walk me through what happens when /readyz starts failing on one of three load-balanced instances: how does the LB notice, how long until traffic is pulled, what happens to in-flight requests, and how does it come back?

3. Advanced — application & what's next

Design a 'cascading-failure-safe' /readyz: I want to remove an instance from rotation when it's truly bad, but NOT remove all 10 instances at once just because the shared DB has a 5-second blip. What's the algorithm?