Capstok — learn by doing

Why this matters

A service that 'feels slow' is not fixable — there is no test for feelings, no alert that fires on feelings, and no way to declare victory against feelings. A Service Level Objective (SLO) converts subjective discomfort into a measurable, alertable target: 'p99 latency < 300 ms, measured over a rolling 7-day window.' This single sentence defines what fast means for your service, what metric to instrument, at what threshold to page the on-call engineer, and when a performance improvement is officially done. Without an SLO, every optimisation is driven by whoever complains loudest rather than by data — and performance work never ends because 'done' is never defined.

Demo

SLOs are written as percentile targets over rolling windows, for example "p99 < 500 ms over the last 30 days", not as instantaneous snapshots. Evaluating a rolling window against the threshold shows how much error budget (slow requests, or bad minutes) remains before a breach, which drives prioritization: a service burning budget at 2x the safe rate needs attention today, not at the next quarterly review. Running this evaluation on raw latency samples is the conceptual foundation of every error-budget dashboard.
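
The "2x the safe rate" idea can be made concrete with a small Go sketch. This is an illustration under stated assumptions, not a reference implementation: the names burnRate and daysToExhaustion are hypothetical, the budget is counted in requests (a p99 target allows 1% of requests to be slow), and the burn rate is assumed to stay constant.

```go
package main

import "fmt"

// burnRate compares the observed share of slow requests with the share the
// SLO allows. A result of 1 uses the budget up exactly at the end of the
// window; 2 uses it up in half the window.
// allowedBadFraction is 0.01 for a p99 target (hypothetical parameter name).
func burnRate(bad, total int, allowedBadFraction float64) float64 {
	if total == 0 || allowedBadFraction <= 0 {
		return 0
	}
	observed := float64(bad) / float64(total)
	return observed / allowedBadFraction
}

// daysToExhaustion estimates how long a full budget lasts at a constant
// burn rate. ok is false when nothing is being burned.
func daysToExhaustion(windowDays, rate float64) (days float64, ok bool) {
	if rate <= 0 {
		return 0, false
	}
	return windowDays / rate, true
}

func main() {
	// Made-up numbers: 20 of 1000 requests exceeded the latency threshold.
	rate := burnRate(20, 1000, 0.01)
	fmt.Printf("burn rate: %.1fx\n", rate)
	if days, ok := daysToExhaustion(30, rate); ok {
		fmt.Printf("a 30-day budget lasts about %.0f days at this rate\n", days)
	}
}
```

A burn rate above 1 means the budget runs out before the window ends, which is the signal that the work belongs in today's queue rather than next quarter's.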

Try it yourself

Run the demo with the 1% slow-request fraction. Is your p99 compliant? Now increase the slow fraction to 2% (rand.Float64() < 0.02). At what fraction does the SLO breach? This is your 'error budget' boundary.
Change the SLO threshold from 300ms to 150ms. How many samples are now 'burned'? What does this tell you about choosing an appropriate threshold for your service?
Write a function that takes two SLO evaluations (before/after a deploy) and prints whether the deploy improved, degraded, or had no measurable impact on the SLO. What would you need to declare 'no measurable impact' confidently?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what an SLO is, how it differs from an SLA, and why error budgets matter.

2. Why it works (the mechanism)

Walk me through how Google's SRE book defines error budgets and how a team should respond when 50%, 75%, and 100% of the budget is consumed.

3. Advanced — application & what's next

My API has two user-facing endpoints: /search (called 10× per session, p99 target 200ms) and /checkout (called once per purchase, p99 target 1000ms). How would you design separate SLOs for each, and how would you roll them up into a single service-level indicator for an executive dashboard?

References

// main.go — evaluate an SLO against a latency sample window
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

const (
	sloPercentile  = 99   // p99
	sloThresholdMs = 300  // p99 must be < 300 ms
	sloWindowSize  = 1000 // samples in the window
)

// generateSamples returns n synthetic latencies in milliseconds:
// about 1% slow requests (200-800 ms), the rest fast (30-110 ms).
func generateSamples(n int) []float64 {
	// Since Go 1.20 math/rand seeds itself, so the deprecated rand.Seed
	// call is not needed. On older Go versions every run repeats.
	ms := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		var v float64
		if rand.Float64() < 0.01 { // 1% slow requests
			v = 200 + rand.Float64()*600 // 200-800ms
		} else {
			v = 30 + rand.Float64()*80 // 30-110ms
		}
		ms = append(ms, v)
	}
	return ms
}

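// evalSLO computes the configured percentile of samples, compares it with
// the threshold, and reports how much of the error budget has been used.
func evalSLO(samples []float64) {
	if len(samples) == 0 {
		fmt.Println("Window: 0 samples, nothing to evaluate")
		return
	}
	// Sort a copy so the caller's slice is left untouched.
	s := make([]float64, len(samples))
	copy(s, samples)
	sort.Float64s(s)

	// Index of the percentile in the sorted window (truncated, no interpolation).
	idx := int(float64(sloPercentile) / 100 * float64(len(s)-1))
	p99 := s[idx]

	status := "✅ COMPLIANT"
	if p99 >= float64(sloThresholdMs) {
		status = "❌ BREACHED"
	}
	fmt.Printf("Window: %d samples\n", len(s))
	fmt.Printf("p%d:    %.1f ms (threshold: < %d ms)\n",
		sloPercentile, p99, sloThresholdMs)
	fmt.Printf("SLO:    %s\n", status)

	// Count requests at or over the threshold. A p99 target tolerates 1% of
	// the window being slow, so that 1% is the error budget.
	burned := 0
	for _, v := range s {
		if v >= float64(sloThresholdMs) {
			burned++
		}
	}
	budget := float64(100-sloPercentile) / 100 * float64(len(s))
	fmt.Printf("Budget: %d slow requests against %.0f allowed (%.1f%% of error budget)\n",
		burned, budget, float64(burned)/budget*100)
}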

func main() { evalSLO(generateSamples(sloWindowSize)) }

Run: go run main.go
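
The demo evaluates one fixed batch of samples. A window that actually rolls could be sketched as a ring buffer that keeps the most recent samples and recomputes the percentile on demand. This is a minimal sketch under assumptions: the type rollingWindow and its methods are hypothetical names, the window is count-based (the last N samples) rather than time-based like a real 7-day or 30-day window, and the percentile uses the same truncated-index rule as the demo.

```go
package main

import (
	"fmt"
	"sort"
)

// rollingWindow keeps the most recent len(buf) latency samples.
type rollingWindow struct {
	buf  []float64
	next int  // slot the next sample overwrites
	full bool // true once every slot has been written
}

func newRollingWindow(size int) *rollingWindow {
	if size < 1 {
		size = 1
	}
	return &rollingWindow{buf: make([]float64, size)}
}

// add records one sample, evicting the oldest once the window is full.
func (w *rollingWindow) add(ms float64) {
	w.buf[w.next] = ms
	w.next = (w.next + 1) % len(w.buf)
	if w.next == 0 {
		w.full = true
	}
}

// percentile returns the p-th percentile of the samples currently held.
// ok is false while the window is empty.
func (w *rollingWindow) percentile(p int) (value float64, ok bool) {
	n := w.next
	if w.full {
		n = len(w.buf)
	}
	if n == 0 {
		return 0, false
	}
	s := make([]float64, n)
	copy(s, w.buf[:n])
	sort.Float64s(s)
	return s[int(float64(p)/100*float64(n-1))], true
}

func main() {
	w := newRollingWindow(100)
	report := func(label string) {
		if p, ok := w.percentile(99); ok {
			fmt.Printf("%-22s p99 = %.1f ms\n", label, p)
		}
	}
	for i := 0; i < 100; i++ {
		w.add(50) // fast requests fill the window
	}
	report("after 100 fast:")
	w.add(900) // two slow requests arrive
	w.add(900)
	report("after 2 slow:")
	for i := 0; i < 100; i++ {
		w.add(50) // the slow samples age out
	}
	report("after 100 more fast:")
}
```

In this run the two slow samples push p99 over the 300 ms threshold until they age out of the window, which is the behavior a rolling evaluation is meant to capture.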

SLOs — Turning 'It Feels Slow' Into a Number