Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
A fine-tune rarely produces a dramatic, obvious leap — it produces a model that more reliably does the specific thing you trained it for. People expect magic and get a 10-15% improvement on their target metric plus better consistency, then conclude it 'didn't work' because they were measuring the wrong thing or had no baseline. Setting expectations means defining, before you train, exactly what 'better' looks like and how you'll measure it. This task is about avoiding the disappointment that comes from a vague goal, and replacing it with a concrete success criterion you can hold the result against.
The demo forces you to write down a measurable success criterion and a baseline before training, so 'did it work?' has a numeric answer instead of a gut feeling.
Use these three in order. Each builds on the one before.
In one paragraph, what should I realistically expect a fine-tune to improve, and by how much?
Walk me through how to define a measurable success criterion and baseline before I start fine-tuning.
Given a task where my best prompt already hits 72% on my metric, how should I set a success threshold and a no-regression guardrail so the fine-tune's value is unambiguous?
# Write this BEFORE you train. If you can't fill it in, you're not ready.
plan = {
"target_behavior": "Output valid JSON matching our schema on first try",
"metric": "schema-valid fraction on a 100-example held-out set",
"baseline_with_best_prompt": 0.72, # measure this FIRST
"success_threshold": 0.95, # the bar the fine-tune must clear
"guardrail": "general QA accuracy must not drop >2 points (no forgetting)",
}
def verdict(measured_after):
if measured_after >= plan["success_threshold"]:
return f"Success: {measured_after:.2f} >= {plan['success_threshold']}"
if measured_after > plan["baseline_with_best_prompt"]:
return f"Improved but short of bar: {measured_after:.2f}"
return f"No gain over prompting ({measured_after:.2f}) -- rethink data or approach"
print(verdict(0.96)); print(verdict(0.83))python3 main.py