Gradient descent — the update rule, learning rate, convergence

medium

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

Gradient descent is the engine of all modern machine learning. Every training loop — from linear regression to GPT — does: compute the gradient of the loss w.r.t. parameters, move parameters a small step opposite to the gradient. Understanding why this works (gradient points uphill; negative gradient points downhill), what can go wrong (saddle points, vanishing gradients, lr too high/low), and the variants (SGD, mini-batch, Adam) means you can diagnose training failures instead of blindly tuning hyperparameters.

Demo

A 2D bowl loss surface — L(w) = w₀² + 4w₁² — has an analytically known minimum at the origin, which means you can verify that gradient descent actually finds it and measure exactly how the learning rate changes the path. Setting the learning rate above 1/(largest eigenvalue) causes the steps to overshoot and diverge; dropping below that threshold produces stable but potentially slow convergence that illustrates why learning-rate choice is the most consequential hyperparameter decision you make before training.

Try it yourself

Run with lr=0.3 and watch the loss diverge. The maximum stable lr for this loss is 1/(largest eigenvalue) = 1/8 = 0.125. Verify by trying lr=0.124 vs lr=0.126.
Add momentum: v = 0.9*v + lr*grad(w); w -= v (init v = np.zeros(2)). Compare convergence speed at lr=0.01 with and without momentum.
Change the bowl to w[0]**2 + 100*w[1]**2. Run with lr=0.01. Notice it converges much slower along the steep dimension. This is the ill-conditioning problem that Adam's per-parameter scaling addresses.
Implement mini-batch GD on random 2D data: generate 1000 samples, sample 32 per step, compute gradient on the mini-batch. Does it converge to the same weights? Is it noisier?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why subtracting the gradient moves parameters toward a minimum. Why subtract (not add)? What is the learning rate physically controlling?

2. Why it works (the mechanism)

Walk me through batch GD vs SGD vs mini-batch SGD. For each: what data subset is used per step, how good is the gradient estimate, memory footprint, and typical convergence behavior on 1M samples.

3. Advanced — application & what's next

Explain the Adam optimizer from first principles: what are the m and v moment estimates, why do they need bias correction, and what problem does per-parameter lr adaptation solve vs vanilla SGD? Name two scenarios where Adam is worse than SGD with momentum.

References

Ruder — An overview of gradient descent optimization algorithms