Capstone: learn by doing

Why this matters

A fine-tune rarely produces a dramatic, obvious leap. It produces a model that more reliably does the specific thing you trained it for. People expect magic, get something like a 10-15% improvement on their target metric plus better consistency, and then conclude it 'didn't work' because they were measuring the wrong thing or had no baseline. Setting expectations means defining, before you train, exactly what 'better' looks like and how you'll measure it. This task replaces a vague goal, and the disappointment that follows from it, with a concrete success criterion you can hold the result against.

Demo

The demo forces you to write down a measurable success criterion and a baseline before training, so 'did it work?' has a numeric answer instead of a gut feeling.
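One way to get the baseline number is to score your best-prompt outputs with the same check you will use after training. The sketch below is only an illustration: the schema in REQUIRED_FIELDS, the helper names, and the sample outputs are all made up for this example, and a real project would likely use its own schema and a proper validator.

```python
import json

# Hypothetical schema: required keys and their expected Python types.
REQUIRED_FIELDS = {"name": str, "quantity": int}


def is_schema_valid(raw_output: str) -> bool:
    """Return True if the text parses as a JSON object with the required fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        key in parsed and isinstance(parsed[key], expected_type)
        for key, expected_type in REQUIRED_FIELDS.items()
    )


def schema_valid_fraction(outputs: list[str]) -> float:
    """Fraction of outputs that pass the schema check."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(is_schema_valid(o) for o in outputs) / len(outputs)


# Made-up outputs standing in for model responses on a held-out set.
sample_outputs = [
    '{"name": "widget", "quantity": 3}',  # valid
    '{"name": "widget"}',                 # missing a required field
    'Sure! Here is the JSON: {...}',      # not parseable as JSON
    '{"name": "gear", "quantity": 12}',   # valid
]

baseline = schema_valid_fraction(sample_outputs)
print(f"baseline_with_best_prompt = {baseline:.2f}")
```

Running the identical function on the fine-tuned model's outputs keeps the before and after numbers comparable.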

Try it yourself

Fill in the plan dict for your own task with a real metric and threshold.
Measure your baseline-with-best-prompt number first — you can't claim improvement without it.
Add a guardrail metric (a capability that must NOT regress) and justify the tolerance.
Decide your success threshold and defend why it's the right bar for shipping.
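The guardrail and the success threshold can be checked together, so a result only counts as shippable when both hold. This is a minimal sketch under assumptions: the function names, the example scores, and the 0.02 tolerance (2 points expressed as a fraction) are illustrative, not measurements.

```python
def guardrail_ok(before: float, after: float, max_drop: float) -> bool:
    """True if the guardrail metric fell by no more than max_drop."""
    # Small epsilon so float rounding does not flip a borderline case.
    return (before - after) <= max_drop + 1e-9


def ship_decision(target_after: float, threshold: float,
                  guard_before: float, guard_after: float,
                  max_drop: float) -> str:
    """Combine the target-metric bar with the no-regression guardrail."""
    if not guardrail_ok(guard_before, guard_after, max_drop):
        return "Blocked: guardrail metric regressed beyond tolerance"
    if target_after >= threshold:
        return "Ship: target met and guardrail held"
    return "Hold: guardrail held but target is short of the bar"


# Illustrative numbers only.
print(ship_decision(0.96, 0.95, guard_before=0.81, guard_after=0.80, max_drop=0.02))
print(ship_decision(0.96, 0.95, guard_before=0.81, guard_after=0.76, max_drop=0.02))
print(ship_decision(0.90, 0.95, guard_before=0.81, guard_after=0.81, max_drop=0.02))
```

Checking the guardrail first reflects one possible design choice: a regression blocks shipping even when the target metric looks excellent.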

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, what should I realistically expect a fine-tune to improve, and by how much?

2. Why it works (the mechanism)

Walk me through how to define a measurable success criterion and baseline before I start fine-tuning.

3. Advanced — application & what's next

Given a task where my best prompt already hits 72% on my metric, how should I set a success threshold and a no-regression guardrail so the fine-tune's value is unambiguous?

Reference code

# Write this BEFORE you train. If you can't fill it in, you're not ready.
plan = {
    "target_behavior": "Output valid JSON matching our schema on first try",
    "metric": "schema-valid fraction on a 100-example held-out set",
    "baseline_with_best_prompt": 0.72,        # measure this FIRST
    "success_threshold": 0.95,                # the bar the fine-tune must clear
    "guardrail": "general QA accuracy must not drop >2 points (no forgetting)",
}


def verdict(measured_after):
    """Compare the post-fine-tune score against the plan's threshold and baseline."""
    if measured_after >= plan["success_threshold"]:
        return f"Success: {measured_after:.2f} >= {plan['success_threshold']}"
    if measured_after > plan["baseline_with_best_prompt"]:
        return f"Improved but short of bar: {measured_after:.2f}"
    return f"No gain over prompting ({measured_after:.2f}) -- rethink data or approach"


print(verdict(0.96))  # clears the 0.95 bar
print(verdict(0.83))  # beats the 0.72 baseline but misses the bar

Run: python3 main.py
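Because the metric is measured on a 100-example held-out set, each score carries sampling noise, which is worth keeping in mind when reading a verdict. The sketch below uses the standard normal-approximation interval for a proportion. The helper name is made up for illustration, and the approximation is rough for scores near 0 or 1.

```python
import math


def proportion_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a fraction measured on n examples."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


for label, score in [("baseline", 0.72), ("after fine-tune", 0.83), ("success bar", 0.95)]:
    low, high = proportion_interval(score, n=100)
    print(f"{label}: {score:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

At n=100 the interval around 0.72 is roughly plus or minus 9 points, so a modest gain can be hard to tell apart from noise. A larger held-out set narrows it.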

Setting realistic expectations