Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Gradient descent is the engine of most modern machine learning. Nearly every training loop, from linear regression to GPT, repeats the same two steps: compute the gradient of the loss with respect to the parameters, then move the parameters a small step in the opposite direction. Understanding why this works (the gradient points uphill, so the negative gradient points downhill), what can go wrong (saddle points, vanishing gradients, a learning rate that is too high or too low), and the main variants (SGD, mini-batch, Adam) means you can diagnose training failures instead of blindly tuning hyperparameters.
A 2D bowl loss surface, L(w) = w₀² + 4w₁², has an analytically known minimum at the origin, so you can verify that gradient descent actually finds it and measure exactly how the learning rate changes the path. The Hessian is diag(2, 8), so each step multiplies w₀ by (1 − 2·lr) and w₁ by (1 − 8·lr); the iterates shrink only while both factors have magnitude below 1. Setting the learning rate above 2/(largest eigenvalue) = 2/8 = 0.25 causes the steps to overshoot and diverge; staying below that threshold gives stable but potentially slow convergence, which illustrates why learning-rate choice is often the most consequential hyperparameter decision you make before training.
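The stability limit for this bowl can be checked directly from the curvatures. The snippet below is a minimal sketch, not part of the lesson's demo: the helper names `step_factors` and `is_stable` are made up for illustration, and it assumes plain gradient descent with no momentum.

```python
# Hessian of L(w) = w0^2 + 4*w1^2 is diag(2, 8); its eigenvalues are the curvatures.
eigenvalues = [2.0, 8.0]
lr_max = 2.0 / max(eigenvalues)  # stability limit for plain gradient descent

def step_factors(lr):
    """Multiplier applied to each coordinate of w on every step: 1 - lr * eigenvalue."""
    return [1.0 - lr * lam for lam in eigenvalues]

def is_stable(lr):
    """Plain GD shrinks every coordinate only if every |factor| is below 1."""
    return all(abs(f) < 1.0 for f in step_factors(lr))

print("lr_max =", lr_max)
for lr in (0.10, 0.20, 0.249, 0.251, 0.28):
    factors = [round(f, 3) for f in step_factors(lr)]
    print(f"lr={lr:<5} factors={factors} stable={is_stable(lr)}")
```

For learning rates between 0.125 and 0.25 the w₁ factor is negative, so w₁ flips sign on every step while still shrinking in magnitude.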
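Set lr=0.3 and watch the loss diverge. The maximum stable lr for this loss is 2/(largest eigenvalue) = 2/8 = 0.25. Verify by trying lr=0.249 vs lr=0.251 (raise steps to a few hundred to make the trend obvious).
Add momentum: v = 0.9*v + lr*grad(w); w -= v (init v = np.zeros(2)). Compare convergence speed at lr=0.01 with and without momentum.
Change the loss to w[0]**2 + 100*w[1]**2 (and grad to match: [2*w[0], 200*w[1]]). Run with lr=0.01. Notice that the steep w[1] direction caps the usable learning rate (its limit is 2/200 = 0.01, so at this lr w[1] just flips sign each step without shrinking), while the shallow w[0] direction shrinks by only 2% per step. This is the ill-conditioning problem that Adam's per-parameter scaling addresses.

Use these three in order. Each builds on the one before.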
In one paragraph, explain why subtracting the gradient moves parameters toward a minimum. Why subtract (not add)? What is the learning rate physically controlling?
Walk me through batch GD vs SGD vs mini-batch SGD. For each: what data subset is used per step, how good is the gradient estimate, memory footprint, and typical convergence behavior on 1M samples.
Explain the Adam optimizer from first principles: what are the m and v moment estimates, why do they need bias correction, and what problem does per-parameter lr adaptation solve vs vanilla SGD? Name two scenarios where Adam is worse than SGD with momentum.
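As a companion to the third prompt, here is a minimal sketch of one common formulation of the Adam update, applied to the lesson's bowl. The values beta1=0.9, beta2=0.999 and eps=1e-8 are the commonly cited defaults; lr=0.1, the step counts and the function name `adam` are assumptions chosen for this toy problem, not tuned settings.

```python
import numpy as np

def loss(w):
    return w[0]**2 + 4 * w[1]**2

def grad(w):
    return np.array([2 * w[0], 8 * w[1]])

def adam(lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Minimal Adam loop on the bowl, starting from (3, 2)."""
    w = np.array([3.0, 2.0])
    m = np.zeros(2)  # running mean of gradients (first moment)
    v = np.zeros(2)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        # m and v start at zero, so early values are biased toward zero;
        # dividing by (1 - beta**t) undoes that bias.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Per-parameter step: each coordinate is scaled by its own sqrt(v_hat).
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

first = adam(steps=1)
final = adam(steps=200)
print("after 1 step:   ", first.round(4))
print("after 200 steps:", final.round(4), "loss:", round(float(loss(final)), 6))
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so each coordinate moves by roughly lr even though the two gradients (6 and 16) differ in size. That is the per-parameter scaling the prompt asks about.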
import numpy as np

# 2D bowl: L(w) = w0^2 + 4*w1^2, minimum at (0, 0)
loss = lambda w: w[0]**2 + 4*w[1]**2
grad = lambda w: np.array([2*w[0], 8*w[1]])  # analytic gradient of the bowl

def gd(lr, steps=30):
    """Run plain gradient descent from (3, 2), printing progress every 5 steps."""
    w = np.array([3.0, 2.0])
    for i in range(steps):
        w = w - lr * grad(w)  # step opposite to the gradient
        if i % 5 == 0:
            print(f"  step {i:2d}: L={loss(w):.6f}  w={w.round(4)}")
    return w

print("lr=0.01 (slow, stable)")
gd(lr=0.01)
print("\nlr=0.10 (faster)")
gd(lr=0.10)
print("\nlr=0.28 (above the 0.25 stability limit, watch loss grow)")
gd(lr=0.28)

Run it with: python3 main.py
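The momentum exercise from the try-it list can be sketched as follows. The update rule (v = 0.9*v + lr*grad(w); w -= v, with v initialised to np.zeros(2)) is the one the lesson gives; the helper `steps_to_reach`, the loss tolerance and the step cap are assumptions added here so the two runs can be compared by a single number.

```python
import numpy as np

loss = lambda w: w[0]**2 + 4*w[1]**2
grad = lambda w: np.array([2*w[0], 8*w[1]])

def steps_to_reach(lr, beta, tol=1e-3, max_steps=5000):
    """Count steps until loss < tol. beta=0 reduces to plain gradient descent."""
    w = np.array([3.0, 2.0])
    v = np.zeros(2)  # velocity: decaying sum of past scaled gradients
    for i in range(1, max_steps + 1):
        v = beta * v + lr * grad(w)
        w = w - v
        if loss(w) < tol:
            return i
    return None  # did not reach tol within max_steps

plain = steps_to_reach(lr=0.01, beta=0.0)
with_momentum = steps_to_reach(lr=0.01, beta=0.9)
print(f"plain GD:      {plain} steps")
print(f"momentum 0.9:  {with_momentum} steps")
```

At lr=0.01 the momentum run should need noticeably fewer steps, because the velocity keeps accumulating along the shallow w[0] direction, where plain gradient descent shrinks the coordinate by only 2% per step.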