Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Everything in this course is more concrete once you've run a model yourself and watched the phases happen. Running a small model locally (via Hugging Face transformers or llama.cpp) lets you observe prefill, decode, tokens-per-second, and memory use directly — no API abstraction. This hands-on baseline is what you'll optimize against in later modules, and it removes the mystery: an LLM is a program you can run, profile, and instrument. It also frees you from rate limits and cost while learning the mechanics.
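The demo runs a small model locally and prints the generated text and its overall generation speed. Running a 1-3B-parameter model on CPU or GPU is enough to feel the autoregressive loop and measure tokens/sec, which becomes your baseline for the rest of the course. Keep in mind that this single tokens/sec figure blends prefill and decode time into one number.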
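As a rough check on why a 1-3B model fits on ordinary hardware, the weights alone take roughly (parameter count × bytes per parameter). The sketch below is back-of-envelope arithmetic only: the parameter counts are nominal, the helper name `weight_memory_gb` is made up for illustration, and the KV cache, activations, and framework overhead are all ignored, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Nominal sizes; actual parameter counts differ slightly from the label.
SIZES = {"1B": 1e9, "1.5B": 1.5e9, "3B": 3e9}
# Bytes per parameter for common storage formats.
FORMATS = {"fp32": 4.0, "fp16": 2.0, "4-bit": 0.5}

if __name__ == "__main__":
    for label, n_params in SIZES.items():
        row = ", ".join(
            f"{fmt}: {weight_memory_gb(n_params, nbytes):.2f} GB"
            for fmt, nbytes in FORMATS.items()
        )
        print(f"{label:>5} -> {row}")
```

A 1.5B-parameter model in float16 needs about 3 GB for its weights, which is why it is a comfortable choice for a laptop or a free Colab/Kaggle GPU.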
Use these three in order. Each builds on the one before.
How do I run a small LLM locally to see inference happen, and what should I measure?
Walk me through what happens when I run model.generate locally: tokenization, prefill, the decode loop, and where tokens/sec comes from.
I want a local baseline to optimize against. Help me instrument prefill time, decode tok/s, and memory use for a small model, and interpret the numbers in terms of prefill/decode and memory-vs-compute bounds.
Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter plus a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
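The first three tactics above (backoff with jitter and a retry cap, a concurrency limiter, and a request cache) can be sketched in a few lines of standard-library Python. Everything named here is hypothetical: `RateLimited` stands in for however your HTTP client reports a 429, `send` is whatever function actually calls the provider, and the default delays are placeholders, not provider recommendations.

```python
import random
import threading
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from a provider."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, parsed from Retry-After if present


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server's hint takes priority
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


class ThrottledClient:
    """Wraps a request function with a cache, a concurrency cap, and retries."""

    def __init__(self, send, max_in_flight=2, max_retries=5, sleep=time.sleep):
        self._send = send
        self._slots = threading.Semaphore(max_in_flight)
        self._max_retries = max_retries
        self._sleep = sleep
        self._cache = {}

    def request(self, prompt):
        if prompt in self._cache:  # identical request: spend no quota
            return self._cache[prompt]
        for attempt in range(self._max_retries + 1):
            try:
                with self._slots:  # at most max_in_flight calls at once
                    result = self._send(prompt)
            except RateLimited as err:
                if attempt == self._max_retries:
                    raise  # retry budget exhausted, surface the error
                self._sleep(backoff_delay(attempt, err.retry_after))
            else:
                self._cache[prompt] = result
                return result
```

The slot is released before sleeping, so a backing-off caller does not block others. The cache here is a plain in-memory dict keyed on the prompt; a real eval loop would likely key on the full request (model, parameters, messages) and persist it to disk.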
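# pip install transformers torch accelerate  (accelerate is needed for device_map="auto")
# Use a small model so it runs anywhere.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
# float16 halves weight memory on a GPU; fall back to float32 on CPU, where fp16 is often slow.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=dtype, device_map="auto")

ids = tok("Explain prefill vs decode in one sentence.", return_tensors="pt").to(model.device)
prompt_len = ids.input_ids.shape[1]

# Time the whole call: one prefill pass over the prompt plus the token-by-token decode loop.
t0 = time.perf_counter()
out = model.generate(**ids, max_new_tokens=80)
dt = time.perf_counter() - t0

new_tokens = out.shape[1] - prompt_len  # generation may stop early at end-of-sequence
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
print(f"{new_tokens} tokens in {dt:.2f}s = {new_tokens/dt:.1f} tok/s")

# Run with: python3 main.py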
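The single tok/s figure above mixes prefill and decode. One simple way to separate them, sketched here as an approximation and not a profiler-grade measurement, is to time a one-token generation (roughly prefill, i.e. time to first token) and a full generation, then attribute the difference to the remaining decode steps. The helper names and the `generate(k)` callable are hypothetical. The commented wiring assumes the demo's `model` and `ids` and uses `min_new_tokens` to keep generation from stopping early.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0


def prefill_decode_split(generate, n_new=80):
    """Estimate time-to-first-token and decode speed.

    `generate(k)` is a hypothetical callable that generates up to k new
    tokens and returns how many it actually produced.
    """
    _, ttft = timed(lambda: generate(1))        # prefill plus the first token
    produced, total = timed(lambda: generate(n_new))
    decode_steps = produced - 1                 # tokens after the first one
    decode_time = max(total - ttft, 1e-9)       # guard against timer noise
    return {
        "ttft_s": ttft,
        "decode_tok_per_s": decode_steps / decode_time,
        "total_s": total,
        "new_tokens": produced,
    }


# Wiring it to the demo's objects (model, ids) might look like:
#   def generate(k):
#       out = model.generate(**ids, max_new_tokens=k, min_new_tokens=k)
#       return out.shape[1] - ids.input_ids.shape[1]
#   generate(1)  # warm-up call so one-time setup cost is not counted
#   print(prefill_decode_split(generate))
```

With a short prompt the time to first token will be small. Lengthen the prompt and it grows, while decode tok/s stays roughly flat. On a GPU, calling `torch.cuda.synchronize()` before reading the clock may tighten the numbers.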