Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
A 'health check' sounds simple until you have a load balancer that kills your app whenever Redis blips. The trick: two endpoints. /healthz answers 'is the process alive?' and always returns 200 unless the process itself is broken. /readyz answers 'can this instance serve traffic right now?' and returns 503 if a dependency that requests need is down. Load balancers route on /readyz. Process supervisors kill on /healthz. Conflate them and either your LB will yank a perfectly fine app out of rotation every time S3 has a hiccup, or your supervisor will never restart a wedged process. Kubernetes formalized this distinction as liveness and readiness probes; the rest of us should adopt it.
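To make the split concrete, here is a minimal sketch of the two consumers reading the two endpoints. Everything in it is illustrative: `probe`, `decide`, and `fakeInstance` are hypothetical names, and the probes run in-process through `httptest` rather than over the network as a real load balancer or supervisor would.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe calls path on the handler in-process and returns the status code.
// A real load balancer or supervisor would make a network request instead.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

// decide mirrors the two consumers: the load balancer routes on /readyz,
// the supervisor restarts on /healthz.
func decide(h http.Handler) (route, restart bool) {
	route = probe(h, "/readyz") == http.StatusOK
	restart = probe(h, "/healthz") != http.StatusOK
	return route, restart
}

// fakeInstance is a hypothetical instance whose process is alive and whose
// dependencies are up or down according to depsUp.
func fakeInstance(depsUp bool) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !depsUp {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

func main() {
	// Dependencies are down, but the process is alive.
	route, restart := decide(fakeInstance(false))
	fmt.Printf("route=%v restart=%v\n", route, restart) // route=false restart=false
}
```

With a dependency down, the instance leaves rotation but is not restarted, which is the behaviour a single conflated endpoint cannot express.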
The right shape: /healthz is a one-liner returning 200 (or 503 only if the process has detected itself as broken). /readyz checks every dependency this instance's traffic depends on — DB, cache, maybe one critical upstream — with short timeouts (50-200ms) and returns 503 if any are down. Then your LB polls /readyz and your supervisor / orchestrator polls /healthz. Two endpoints, two semantics, no overlap.
Run the demo and request both endpoints: /healthz should always be 200 (or you have a problem); /readyz should be 200 with all dependencies up.
Stop one dependency and request /readyz: you should get 503 with a clear JSON body saying which dep failed. Restart it; back to 200.
Configure your load balancer to poll /readyz (NOT /healthz) for traffic routing. Configure your process supervisor (or k8s livenessProbe) to poll /healthz for restarts.
Set a /readyz timeout of 200ms. Make the DB check sleep 500ms. Confirm /readyz returns 503 due to the timeout, not the DB itself, and that you can tell the difference from the response body.
Use these three in order. Each builds on the one before.
Explain the difference between liveness and readiness probes. Why does Kubernetes have both, and why does the distinction matter even outside k8s?
Walk me through what happens when /readyz starts failing on one of three load-balanced instances: how does the LB notice, how long until traffic is pulled, what happens to in-flight requests, and how does it come back?
Design a 'cascading-failure-safe' /readyz: I want to remove an instance from rotation when it's truly bad, but NOT remove all 10 instances at once just because the shared DB has a 5-second blip. What's the algorithm?
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// db and redisOK are the dependencies /readyz checks.
var db *sql.DB
var redisOK func(context.Context) error

// /healthz: process liveness. Always 200 unless the process itself is broken.
func healthz(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(200)
	_, _ = w.Write([]byte("ok"))
}

// /readyz: can this instance serve a real user request right now?
func readyz(w http.ResponseWriter, r *http.Request) {
	// Both checks share one 200ms budget.
	ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
	defer cancel()

	status := map[string]string{}
	code := 200

	if err := db.PingContext(ctx); err != nil {
		status["db"] = "fail: " + err.Error()
		code = 503
	} else {
		status["db"] = "ok"
	}

	if err := redisOK(ctx); err != nil {
		status["redis"] = "fail: " + err.Error()
		code = 503
	} else {
		status["redis"] = "ok"
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(status)
}

func main() {
	// Assumption: db and redisOK are initialized before serving. The SQL
	// driver, DSN, and Redis client are not shown in the original, and
	// readyz panics if either is nil.
	http.HandleFunc("/healthz", healthz)
	http.HandleFunc("/readyz", readyz)
	// Port 8080 is an assumption.
	log.Fatal(http.ListenAndServe(":8080", nil))
}

go run main.go