Your first local inference run

Difficulty: hard

Why this matters

Everything in this course becomes more concrete once you have run a model yourself and watched the phases happen. Running a small model locally (via Hugging Face transformers or llama.cpp) lets you observe prefill, decode, tokens per second, and memory use directly, with no API abstraction in between. This hands-on baseline is what you'll optimize against in later modules, and it removes the mystery: an LLM is a program you can run, profile, and instrument. Running locally also frees you from rate limits and API cost while you learn the mechanics.

Demo

The demo runs a small model locally and prints the generation speed and the prefill/decode split. A model with 1B to 3B parameters, on CPU or GPU, is enough to see the autoregressive loop at work and to measure tokens/sec, which becomes your baseline for the rest of the course.
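
The lesson does not include the demo's source, so the following is only a minimal sketch of how such a measurement could look with Hugging Face transformers. It approximates prefill as the time to generate a single token and attributes the remaining time to decode. The function names measure and decode_tokens_per_sec are hypothetical, the model id is left to you, and the sketch assumes torch and transformers are installed and that the model runs on CPU (GPU timing would additionally need device placement and synchronization).

```python
import time


def decode_tokens_per_sec(prefill_s: float, total_s: float, new_tokens: int) -> float:
    """Decode speed, excluding the first token (prefill produces that one)."""
    decode_s = total_s - prefill_s
    if new_tokens < 2 or decode_s <= 0:
        raise ValueError("need at least 2 new tokens and total_s > prefill_s")
    return (new_tokens - 1) / decode_s


def measure(model_id: str, prompt: str, max_new_tokens: int = 64) -> dict:
    """Hypothetical helper: time prefill and decode for one prompt on CPU."""
    # Imported lazily so the arithmetic helper above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    inputs = tok(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        # Untimed warm-up so one-off setup cost does not inflate the prefill number.
        model.generate(**inputs, max_new_tokens=1, do_sample=False)

        # One new token: the whole prompt is processed in a single pass (prefill).
        t0 = time.perf_counter()
        model.generate(**inputs, max_new_tokens=1, do_sample=False)
        prefill_s = time.perf_counter() - t0

        # Full run: prefill plus the token-by-token decode loop.
        t0 = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        total_s = time.perf_counter() - t0

    new_tokens = out.shape[1] - prompt_len
    # Generation may stop early at an end-of-sequence token; skip the rate then.
    rate = None
    if new_tokens >= 2 and total_s > prefill_s:
        rate = decode_tokens_per_sec(prefill_s, total_s, new_tokens)
    return {
        "prompt_tokens": prompt_len,
        "new_tokens": new_tokens,
        "prefill_s": prefill_s,
        "total_s": total_s,
        "decode_tok_s": rate,
    }
```

Calling measure with the id of any small causal language model you have access to, first with a short prompt and then with a long one, should show prefill_s growing with prompt length while decode_tok_s stays roughly steady. Treat the numbers as approximate: the two timed runs are separate calls, so the split is an estimate, not an exact trace.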

Try it yourself

Run a 1.5B model locally and record your tokens/sec baseline — you'll improve on it later.
Try a longer prompt and note the first-token delay (prefill) vs. steady decode speed.
Watch memory (nvidia-smi or Activity Monitor) and see weights + KV cache occupy VRAM/RAM.
Swap to llama.cpp with a quantized GGUF and compare speed/memory (preview of quantization).
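
Before watching memory, it can help to estimate what to expect. The sketch below uses back-of-envelope arithmetic: weights take parameter count times bytes per parameter, and the KV cache holds one key and one value vector per layer, per KV head, per token. The model dimensions in the example (28 layers, 2 KV heads, head size 128) are illustrative assumptions, not values from this lesson, and the 4-bit figure ignores the scales and metadata that real quantized formats add.

```python
def weight_bytes(n_params: int, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16 cache entries
    batch: int = 1,
) -> int:
    """Approximate KV cache size: a K and a V tensor for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


GIB = 1024 ** 3

# Illustrative 1.5B-parameter model (assumed dimensions, see the note above).
fp16_weights = weight_bytes(1_500_000_000, 2)    # 16-bit weights
q4_weights = weight_bytes(1_500_000_000, 0.5)    # idealized 4-bit weights
cache = kv_cache_bytes(n_layers=28, n_kv_heads=2, head_dim=128, seq_len=4096)

print(f"fp16 weights : {fp16_weights / GIB:.2f} GiB")
print(f"4-bit weights: {q4_weights / GIB:.2f} GiB (idealized)")
print(f"KV cache @4096 tokens: {cache / GIB:.3f} GiB")
```

The observed numbers in nvidia-smi or Activity Monitor will usually be higher than these estimates, since the runtime also allocates activations, buffers, and framework overhead. The estimate is mainly useful for checking that the measured figure is in the right range and for seeing how the cache grows with sequence length.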

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

How do I run a small LLM locally to see inference happen, and what should I measure?

2. Why it works (the mechanism)

Walk me through what happens when I run model.generate locally: tokenization, prefill, the decode loop, and where tokens/sec comes from.

3. Advanced — application & what's next

I want a local baseline to optimize against. Help me instrument prefill time, decode tok/s, and memory use for a small model, and interpret the numbers in terms of prefill/decode and memory-vs-compute bounds.

References

Working within free-tier limits. Free and low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. To survive it:

Read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter, plus a max-retry cap) instead of hammering.
Cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling.
Cache identical requests so retries don't re-spend quota.
Downshift to a smaller, cheaper model for practice runs.
Use the provider's Batch API for non-interactive jobs.
Or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
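
As one possible shape for the backoff advice, here is a minimal sketch of exponential backoff with full jitter and a max-retry cap. It is not tied to any provider SDK: RateLimited, backoff_delay, and call_with_retries are hypothetical names, and the sketch assumes your own request function raises RateLimited on a 429, passing along the Retry-After value already parsed into seconds (the header can also carry an HTTP date, which this sketch does not handle).

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a request wrapper raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt, base=1.0, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server told us how long to wait; respect it.
        return max(0.0, float(retry_after))
    # Full jitter: a random point in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, max_retries=5, sleep=time.sleep):
    """Call fn(); on RateLimited, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            sleep(backoff_delay(attempt, retry_after=exc.retry_after))
```

The jitter spreads retries from parallel workers so they do not all return at the same instant, and the cap keeps a long outage from producing unbounded waits. The sleep and rng parameters exist only so the behaviour can be checked without real delays.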