Hotfixes — test strategy for emergency production patches

hard

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

A hotfix is a patch applied to production to fix a critical bug, on a timeline measured in hours — not sprints. The testing challenge: how do you maintain quality while moving faster than your normal release process allows? The answer is a pre-defined, minimal hotfix test protocol: a regression pack of the smallest set of tests that covers the changed code and its most critical neighbors, a manual smoke test of the affected feature, and a monitoring watch period after deploy. 'We don't have time to test' is never acceptable — but 'we have a 30-minute targeted test protocol for hotfixes' is how responsible teams ship emergency fixes without creating more incidents.

Demo

A hotfix test protocol is a pre-defined, scope-limited set of tests that gives enough confidence to deploy a production fix in under two hours: unit tests for the exact changed function, integration tests for the affected endpoint, and a smoke test of the adjacent critical paths that share the most risk with the change. Running the full regression suite during a hotfix is usually wrong — it takes too long and the signal is too diffuse — but running nothing is never acceptable. Defining this protocol before an incident forces the team to think clearly about risk scope rather than improvising under pressure.

import subprocess, sys, time

# Hotfix test protocol — run this instead of the full suite for emergency patches

HOTFIX_CONTEXT = """
Bug:    Login returns HTTP 500 when email contains a + character
Fix:    Added email normalization in auth/login.py:validate_credentials()
Commit: a3f91bc
"""

# ── Step 1: Unit tests for the exact changed function ──────────────────────
UNIT_TESTS = [
    "tests/unit/auth/test_validate_credentials.py::test_email_with_plus_sign",
    "tests/unit/auth/test_validate_credentials.py::test_email_normalization",
    "tests/unit/auth/test_validate_credentials.py",  # full module
]

# ── Step 2: Integration tests for the affected API endpoint ──────────────────
INTEGRATION_TESTS = [
    "tests/integration/test_auth_api.py::test_login_success",
    "tests/integration/test_auth_api.py::test_login_invalid_credentials",
    "tests/integration/test_auth_api.py::test_login_with_special_email_chars",
    "tests/integration/test_auth_api.py",
]

# ── Step 3: Smoke tests for adjacent features (regression risk) ──────────────
SMOKE_TESTS = [
    "tests/smoke/test_auth_smoke.py",
    "tests/smoke/test_user_profile_smoke.py",
]

ALL_HOTFIX_TESTS = UNIT_TESTS + INTEGRATION_TESTS + SMOKE_TESTS

def run_hotfix_protocol():
    print(f"HOTFIX TEST PROTOCOL")
    print(f"Context: {HOTFIX_CONTEXT.strip()}")
    print(f"Running {len(ALL_HOTFIX_TESTS)} targeted test targets\n")

    for test_path in ALL_HOTFIX_TESTS:
        print(f"  pytest {test_path}")

    print("\nPost-deploy checklist:")
    checklist = [
        "✓ Deploy to staging and run smoke tests",
        "✓ Manual test: login with user+tag@example.com",
        "✓ Check error rate in Datadog/Grafana for 15 min after deploy",
        "✓ Check login success rate in analytics for 30 min",
        "✓ Notify on-call if error rate > baseline",
        "✗ DO NOT run full 2-hour regression suite — deploy to production first, run nightly",
    ]
    for item in checklist: print(f"  {item}")

run_hotfix_protocol()

Run: python3 main.py

Try it yourself

Think of a recent bug fix in a codebase you know. Write the hotfix test protocol for it: identify the unit tests, integration tests, and smoke tests that would give you enough confidence to ship in 2 hours without the full suite.

Research the concept of a 'canary deploy': the hotfix is deployed to 5% of traffic first, metrics are monitored for 15 minutes, and it's rolled out to 100% only if error rates look healthy. How does this reduce the risk of a hotfix? What monitoring would you set up?

Write a CI job that only runs tests in files that were modified by the current commit: git diff --name-only HEAD~1 | grep test_ | xargs pytest. This is a minimal hotfix CI gate. What are the risks of this approach?

Create a template bug report for a production incident that includes: impact (how many users affected), SLA (when it needs to be fixed by), and the hotfix test protocol (exactly which tests to run). This is a runbook for your on-call engineer.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what a hotfix is and why its testing process is different from a normal release. What's the minimum testing you should do before deploying a hotfix to production?

2. Why it works (the mechanism)

Walk me through how to select the right test subset for a hotfix: how do you identify which unit tests cover the changed function, which integration tests cover the affected endpoint, and which smoke tests cover the adjacent risk area?

3. Advanced — application & what's next

I'm an SDET at a company that ships 3 hotfixes per month. Each one requires a judgment call: how much testing is 'enough' given the severity and the time pressure? Design a hotfix testing matrix: for each severity level (P0 production down, P1 major feature broken, P2 minor bug), specify the test protocol, the maximum time before deploy, the post-deploy monitoring window, and the rollback trigger criteria.