Capstok — learn by doing

Why this matters

It's easy to treat MLOps practices as optional polish until you've lived through what their absence costs. Three failure stories recur across teams: a model silently decays for months because no one is watching accuracy, and by the time revenue drops the trail is cold; a model that made a bad decision can't be reproduced because the training data and seed were never captured, so you can neither explain nor fix it; and a bad deploy takes down quality with no way to roll back because the previous model artifact was overwritten. Each of these is not bad luck — it's a specific missing practice (monitoring, versioning, registries/rollback). Pricing out the failure is what justifies the work to a skeptical team, because 'we can't reproduce or revert our own model' is a far scarier sentence than 'we skipped experiment tracking.'

Demo

The demo maps three real failure modes to the exact MLOps practice that prevents each, and computes a crude cost of the gap so you can see why the practice pays for itself.

Try it yourself

Pick the one failure story most likely to hit your team and name the exact practice (from a later module) that closes it.
Change days_undetected on the monitoring story from 90 to the number of days YOUR system could decay unnoticed today.
Add a fourth failure your team has actually experienced and map it to a missing practice.
Total the avoidable cost and compare it to the effort of adding the one practice that prevents the biggest line item.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what concretely goes wrong when a team skips MLOps, like I'm new to it.

2. Why it works (the mechanism)

Walk me through how a model can silently decay for months undetected, and which single MLOps practice would have caught it and when.

3. Advanced — application & what's next

Given a model that made a costly bad decision that we now cannot reproduce, walk me through exactly which versioning gaps caused that and how I'd close them.

References

Chat about this lesson

# Each failure story -> the missing practice that would have prevented it.
failures = [
    {"story": "Accuracy fell for 3 months before anyone noticed",
     "missing": "production monitoring / drift detection",
     "days_undetected": 90, "revenue_per_day_lost": 400},
    {"story": "Model made a bad call; can't reproduce to debug it",
     "missing": "data + seed + weight versioning (reproducibility)",
     "days_undetected": 14, "revenue_per_day_lost": 250},
    {"story": "Bad deploy tanked quality; no previous artifact to revert to",
     "missing": "model registry + rollback",
     "days_undetected": 2, "revenue_per_day_lost": 3000},
]

total = 0
for f in failures:
    cost = f["days_undetected"] * f["revenue_per_day_lost"]
    total += cost
    print(f"{f['story']}\n  prevented by: {f['missing']}  (cost: {cost:,} USD)\n")

print(f"Total avoidable cost across three incidents: {total:,} USD")
# The point: the practices are cheap; the gaps are not.

Run: python3 main.py

The cost of NOT doing MLOps