100 challenges, from a scalar gradient to a fine-tuned model streaming behind your own API.
A rebuildable LLM course. Start with tensors and autograd (in Go, Python, and Rust), build a micrograd, train a bigram, author attention from first principles, stitch a GPT, train it on TinyShakespeare, scale it with mixed precision and FlashAttention, fine-tune with LoRA and DPO, quantize to int4, and ship behind a streaming HTTP API. Every module has runnable code and a module-level project; the capstone is a small but real LLM you trained, fine-tuned, and deployed.
Built by Lakshya Kumar
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
End-to-end: BPE tokenizer on a dataset you picked, a 10–50M-param GPT trained to a measurable val loss, instruction fine-tune with LoRA, DPO preference tune, int4 quantize, deploy behind a streaming chat API with moderation and structured logs. Ship as a repo with README, measurements, sample outputs, and 5 concrete failure modes you discovered.
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
I'm considering a "How to create an LLM from scratch" course. It builds up from tensors/autograd, to neural nets, to tokenization, to attention, to a full Transformer, to training GPT on TinyShakespeare, to training-at-scale tricks (FlashAttention, ZeRO, mixed precision), to SFT + LoRA + DPO fine-tuning, to int4 quantization and deployment behind a streaming chat API. 100 challenges total. Python throughout, with Go + Rust + Node on the engineering/deployment modules via code tabs.

Context about me:
1. My current role/focus: [e.g. "backend dev who's curious about ML", "data scientist who only ever calls model.fit", "undergrad who's already watched Karpathy's videos once"]
2. The deepest I've gone into ML so far: [e.g. "nothing", "sklearn + XGBoost", "trained a CNN in PyTorch once", "fine-tuned Llama with a HuggingFace trainer"]
3. What I'm hoping this course changes about me: [e.g. "I can read an LLM paper and implement it", "I can deploy my own fine-tuned model at work", "I can start my own AI startup"]

Answer these:
- For my background, which 2 modules will give me the highest leverage in the next 3 months, and why?
- Name a concrete artifact I'd build during the course that I could actually use on my resume or at work.
- Is 60 hours worth it for me, or should I do something shorter first (just Karpathy, a course on ML basics, etc.)? Give your honest pick.
- What should I explicitly NOT expect (e.g. "you will not train a 70B model", "you will not beat GPT-4 at anything", "you will not learn RAG")?
Implement scaled dot-product attention, multi-head attention, and a full transformer block from scratch (no torch.nn.MultiheadAttention). Compare correctness against the reference impl on synthetic data. Profile memory and FLOPs at sequence lengths 128, 512, 2048.
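As a starting point for the first part of this project, here is a minimal sketch of single-head scaled dot-product attention in dependency-free Python. It is an illustration under assumptions, not the course's reference solution: the function names and the nested-list tensor layout are hypothetical, and a real submission would use a tensor library and add the multi-head projections.

```python
import math

def softmax(row):
    # Subtract the row max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single-head attention over nested lists.

    q: (T_q, d), k: (T_k, d), v: (T_k, d_v). Returns (output, weights).
    Hypothetical helper for illustration only.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out, weights = [], []
    for i, q_i in enumerate(q):
        # Dot product of this query with every key, scaled by 1/sqrt(d).
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        if causal:
            # Mask out future positions so token i only attends to j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                    for c in range(len(v[0]))])
    return out, weights
```

The `weights` matrix has one entry per query-key pair, so its size grows quadratically with sequence length: 16,384 entries per head at length 128, 262,144 at 512, and 4,194,304 at 2048. That quadratic term is what the memory profile in this project should make visible.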
Implement BPE tokenizer training and encoding from scratch. Train on 100MB of text; produce a 32k-vocab tokenizer. Encode/decode round-trip 10MB of text and compare compression ratio + speed against tiktoken. Document the byte-level fallback.
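The core training loop for this project can be sketched in a few lines. The version below is a deliberately naive byte-level BPE, assuming the base vocabulary is the 256 byte values and that merges are applied in the order they were learned. The function names are hypothetical, and it omits the pre-tokenization regex and the performance work a 100MB corpus would need.

```python
from collections import Counter

def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in learned order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    # Dicts preserve insertion order, so merges replay in learned order.
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Because every input is first reduced to raw bytes, text the tokenizer never saw during training (emoji, other scripts) still encodes and round-trips: unmerged bytes simply stay as single-byte tokens. That is the byte-level fallback the project asks you to document. A compression ratio can be measured as `len(text.encode("utf-8")) / len(encode(text, merges))`.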
Take your trained base model and instruction-tune it on a small instruction dataset (Alpaca-style, 5k examples). Evaluate the instruct model on held-out instructions: response quality (manual scoring) + structured task success (JSON output, classification). Compare to base model.
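For the structured-task half of the evaluation, one simple way to score JSON output is to count how many responses parse as a JSON object containing the expected keys. The sketch below assumes that definition of "success"; the function name, the required key, and the sample strings are all hypothetical and are not model outputs from the course.

```python
import json

def json_task_success(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object with all required keys."""
    if not outputs:
        return 0.0
    passed = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and all(key in obj for key in required_keys):
            passed += 1
    return passed / len(outputs)

# Made-up strings standing in for base-model and instruct-model responses.
base_like = ['Sure! Here is the JSON: {"label": "pos"}', '{"label": "neg"}']
instruct_like = ['{"label": "pos"}', '{"label": "neg"}']

print("base:", json_task_success(base_like, ["label"]))
print("instruct:", json_task_success(instruct_like, ["label"]))
```

Strict parsing like this penalizes a model that wraps correct JSON in chatty text, which is usually the behavior instruction tuning is meant to fix. Whether to extract embedded JSON before scoring is a design choice worth stating in the write-up.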
Train 4 sizes of your model (1M, 10M, 100M, ~500M params) on the same data budget. Plot loss vs parameter count and vs FLOPs. Identify your local Chinchilla-optimal point. Compare findings to the canonical scaling-laws papers and explain deviations.
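The two plots in this project need a FLOPs estimate and a curve fit. The sketch below assumes the widely used approximation of about 6 FLOPs per parameter per training token, and fits a pure power law by linear regression in log-log space. The helper names, the token budget, and the loss values are hypothetical; the losses are generated from a formula purely to exercise the fit and are not measurements.

```python
import math

def train_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward).
    return 6.0 * n_params * n_tokens

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

# Model sizes from the project brief; token budget is an assumed placeholder.
sizes = [1e6, 1e7, 1e8, 5e8]
tokens = 1e9
# Synthetic losses generated from a known power law, NOT real results.
losses = [5.0 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
for n in sizes:
    print(f"N={n:.0e}  FLOPs={train_flops(n, tokens):.1e}  fit={a * n ** b:.3f}")
```

A pure power law has no irreducible-loss term, so real curves that flatten at larger sizes will fit poorly. With a fixed data budget, that flattening is one of the deviations worth explaining when comparing against the published scaling-law results.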
The book version of this course. Read a chapter when a module doesn't fully land.