Fine-tuning vs prompting — when to do which

medium


Why this matters

Prompting (or in-context learning) means giving a pretrained model instructions and examples in the prompt, with no weight updates. Fine-tuning means updating the model's weights on your labeled data. Prompting is fast to iterate on, cheap to start with, and surprisingly powerful for well-defined tasks with clear instructions. Fine-tuning is slower and needs labeled data, but it can deliver lower latency and lower cost per query (a small fine-tuned model can replace a large general-purpose one, and the prompt no longer has to carry instructions and examples), and it often gives higher accuracy when the task distribution is far from the model's pretraining data. Getting this decision wrong means either paying as much as 100× more per query than necessary, or spending weeks fine-tuning a model that a 5-line prompt would have matched.

Demo

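Zero-shot classification reframes the task as natural language inference (NLI). The BART-large-MNLI model was never trained on your label names; instead, each candidate label is turned into a hypothesis such as 'This example is finance.' and the model scores whether the input text entails it. That is how it can assign 'finance' to 'The Fed raised rates' without any task-specific training. Fine-tuned classifiers learn explicit decision boundaries from labeled data and, being typically much smaller, run at a fraction of the cost per query. The right choice depends on how much labeled data you have, how stable the label set is, and your inference budget.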

# pip install transformers torch
from transformers import pipeline

# Approach 1: Zero-shot prompting (no training data needed)
zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")

texts = [
    "The stock market crashed after the Fed raised rates.",
    "Manchester United won 3-0 against Arsenal.",
    "Scientists discover new exoplanet in habitable zone.",
]
labels = ["finance", "sports", "science"]

print("Zero-shot classification:")
for text in texts:
    result = zero_shot(text, candidate_labels=labels)
    top = result["labels"][0]
    score = result["scores"][0]
    print(f"  [{top:8s} {score:.2f}]  {text[:50]}")

# Approach 2: Few-shot prompting (GPT-style, no weight updates)
prompt_template = """Classify the following text as finance, sports, or science.

Examples:
"Fed raises interest rates by 25 basis points." → finance
"LeBron James scores 40 points in playoff win." → sports
"CRISPR used to reverse genetic blindness in mice." → science

Text: "{text}"
Answer:"""

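print("\nFew-shot prompt (send to any LLM API):")
for text in texts:
    # Print the whole prompt: slicing it to 200 characters would cut it off
    # before the "Text:" line, hiding the sentence being classified.
    print(prompt_template.format(text=text))
    print("---")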

Run: python3 main.py
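
To make the entailment mechanism behind Approach 1 concrete, here is a minimal sketch of what a zero-shot NLI classifier does with your labels. It assumes the single-label behaviour of the Hugging Face zero-shot pipeline: each label is slotted into a hypothesis template ("This example is {}." is the pipeline's default), and the entailment scores are normalised across labels with a softmax. The helper names and the logit values are hypothetical and chosen only for illustration; no model is loaded.

```python
import math


def build_nli_pairs(text, labels, hypothesis_template="This example is {}."):
    """Pair the input text (premise) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Turn per-label entailment logits into scores that sum to 1, best first."""
    scores = softmax(entailment_logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


text = "The stock market crashed after the Fed raised rates."
labels = ["finance", "sports", "science"]

pairs = build_nli_pairs(text, labels)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r}  hypothesis: {hypothesis!r}")

# ASSUMPTION: made-up entailment logits standing in for real model output.
fake_entailment_logits = [3.2, -1.5, -0.4]
ranking = rank_labels(labels, fake_entailment_logits)
for label, score in ranking:
    print(f"{label:8s} {score:.2f}")
```

Because the label only ever appears inside the hypothesis sentence, rewording a label changes the sentence the model judges, which is why label wording shifts the confidence scores.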

Try it yourself

Add a new label 'politics' to the zero-shot classifier and try 'Congress passes new infrastructure bill.' Does it classify correctly? Zero-shot works because the MNLI model understands natural language entailment — not because it knows your label names.

Try a deliberately ambiguous text: 'Apple stock rose 5% after the product launch.' Does zero-shot correctly classify it as finance over science? Adjust the label names to see how label wording affects confidence scores.

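Write a few-shot prompt with 3 labeled examples for sentiment analysis (positive/negative/neutral). Test it on 5 new sentences. Compare to pipeline('sentiment-analysis'), keeping in mind that its default model predicts only positive or negative, so decide up front how you will score neutral sentences. Which is more accurate on domain-specific text (e.g., technical product reviews)?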
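
One possible starting point for the sentiment exercise is sketched below. It builds a few-shot prompt from labeled examples and maps a model's free-text reply back to a label. The example sentences, the helper names, and the reply format are all hypothetical; the call to an actual LLM API is deliberately left out, so you would pass the prompt to whichever provider you use.

```python
LABELS = ("positive", "negative", "neutral")

# ASSUMPTION: invented example sentences, for illustration only.
EXAMPLES = [
    ("The new firmware fixed every crash I was seeing.", "positive"),
    ("The SDK documentation is outdated and the samples do not compile.", "negative"),
    ("The device ships with a USB-C cable and a quick-start guide.", "neutral"),
]


def build_few_shot_prompt(examples, text, labels=LABELS):
    """Assemble instructions, labeled examples, and the new text into one prompt."""
    lines = [
        f"Classify the sentiment of the text as one of: {', '.join(labels)}.",
        "",
        "Examples:",
    ]
    for example_text, label in examples:
        lines.append(f'"{example_text}" -> {label}')
    lines += ["", f'Text: "{text}"', "Answer:"]
    return "\n".join(lines)


def parse_label(completion, labels=LABELS):
    """Return the first word of the reply if it is a known label, else None."""
    words = completion.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return first if first in labels else None


prompt = build_few_shot_prompt(EXAMPLES, "Battery life is fine, nothing special.")
print(prompt)
print(parse_label(" Positive.\n"))
print(parse_label("I am not sure"))
```

Returning None for replies that are not a known label lets you count unparseable answers separately from wrong ones when you compare accuracy.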

Research the cost tradeoff: if zero-shot costs $0.003 per 1,000 tokens and you need 10M classifications per day, calculate the daily and annual API cost. At what query volume does fine-tuning a smaller model (that costs $20 to fine-tune and $0.0001/1K tokens) break even?
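
The arithmetic for this exercise can be set up as a small calculator. The sketch below uses the prices from the exercise, but the number of tokens per classification is not given there, so the value of 100 is an assumption you should replace with a measurement from your own data. It also ignores hosting, labeling, and maintenance costs for the fine-tuned model, which a real comparison would need to include.

```python
import math


def cost_per_query(price_per_1k_tokens, tokens_per_query):
    """Dollar cost of one classification at a given per-1K-token price."""
    return price_per_1k_tokens * tokens_per_query / 1000


def break_even_queries(upfront_cost, api_price_per_1k, tuned_price_per_1k,
                       tokens_per_query):
    """Queries needed before per-query savings repay the fine-tuning cost.

    Returns None if the fine-tuned model is not cheaper per query.
    """
    saving = (cost_per_query(api_price_per_1k, tokens_per_query)
              - cost_per_query(tuned_price_per_1k, tokens_per_query))
    if saving <= 0:
        return None
    return math.ceil(upfront_cost / saving)


TOKENS_PER_QUERY = 100          # ASSUMPTION: not stated in the exercise
QUERIES_PER_DAY = 10_000_000    # from the exercise
API_PRICE_PER_1K = 0.003        # from the exercise
TUNED_PRICE_PER_1K = 0.0001     # from the exercise
FINE_TUNE_COST = 20.0           # from the exercise

api_daily = cost_per_query(API_PRICE_PER_1K, TOKENS_PER_QUERY) * QUERIES_PER_DAY
tuned_daily = cost_per_query(TUNED_PRICE_PER_1K, TOKENS_PER_QUERY) * QUERIES_PER_DAY
break_even = break_even_queries(FINE_TUNE_COST, API_PRICE_PER_1K,
                                TUNED_PRICE_PER_1K, TOKENS_PER_QUERY)

print(f"API cost:        ${api_daily:,.2f}/day, ${api_daily * 365:,.2f}/year")
print(f"Fine-tuned cost: ${tuned_daily:,.2f}/day, ${tuned_daily * 365:,.2f}/year")
print(f"Break-even after {break_even:,} queries")
```

Under the 100-token assumption the API bill is about $3,000 per day, and the $20 fine-tuning cost is recovered after roughly 69,000 queries, well under one percent of a single day's traffic. Try varying TOKENS_PER_QUERY to see how the break-even point moves.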

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the difference between prompting and fine-tuning a language model. When does prompting fail — what kinds of tasks reliably require fine-tuning?

2. Why it works (the mechanism)

Walk me through what 'in-context learning' means mechanically: when you include 3 examples in the prompt (few-shot), what is the model doing with those examples? Is it updating its weights? How does it 'learn' from them?

3. Advanced — application & what's next

I'm building a customer support classifier for 80 fine-grained intent categories (e.g., 'billing dispute', 'password reset', 'shipping delay'). I have 500 labeled examples per category. Compare: (1) few-shot prompting with GPT-4, (2) fine-tuning a BERT-base classifier, (3) fine-tuning a Mistral-7B with LoRA. For each: expected accuracy, cost to build, cost per query at 1M queries/day, and latency.