Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Most ML tutorials start with model.fit(X, y) and never explain what's inside. A model is a function with learnable parameters: in the simplest case, a single neuron, it multiplies the input features by weights, adds a bias, and passes the result through a non-linearity. Seeing this not as magic but as f(x; θ) means you can read any architecture diagram, debug shape mismatches, and understand why 'training' is just parameter optimization. Skip this and you'll cargo-cult .fit() calls without knowing what they actually do.
A machine learning model is just a function with tunable numbers: for a single neuron those numbers are one weight vector and one bias. Writing it by hand in NumPy — before any framework gets involved — makes the computation concrete: a dot product, an addition, and a non-linearity. Every architecture from logistic regression to GPT is a variation on this same skeleton.
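One way to make the "function with tunable numbers" idea concrete is to write the neuron as f(x; θ), with the parameters passed in explicitly. The sketch below is illustrative only: the names `neuron` and `theta`, and the choice of a dict to hold the parameters, are assumptions rather than part of the lesson's demo. It also shows that the same function handles a batch of samples, which is where shape mismatches usually first appear.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, theta):
    """f(x; theta), where theta holds the weight vector 'w' and the bias 'b'."""
    return sigmoid(x @ theta["w"] + theta["b"])

# Hypothetical parameter container; the values match the lesson's demo.
theta = {"w": np.array([0.5, -1.2, 0.8]), "b": 0.3}

# One sample with 3 features gives one scalar prediction.
x = np.array([2.0, 1.0, 0.5])
single = neuron(x, theta)

# A batch of 4 samples, shape (4, 3), gives 4 predictions, shape (4,).
X = np.stack([x, np.zeros(3), np.ones(3), -x])
batch = neuron(X, theta)

print(single.shape, batch.shape)  # () (4,)
```

The function body never changes; only the numbers in `theta` would change during training.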
Set w = np.zeros(3). What does y_hat become? This is the uninitialised-model baseline: with zero weights it predicts sigmoid(b) regardless of input (about 0.5744 for b = 0.3, and exactly 0.5 only if b is also 0).
Set x = np.zeros(3). Show that no matter what w is, y_hat depends only on b. This reveals the role of the bias term.
Replace sigmoid with a linear pass-through (lambda z: z). Feed x = np.array([1, 0, 0]) and verify y_hat == w[0] + b. This is linear regression in its simplest form.
Print w.shape, x.shape, and np.dot(w, x). Confirm the shapes are compatible. This is the first shape-matching constraint you'll hit in every real model.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain what a machine learning model's parameters are. Where do they come from — are they set by the programmer, or found automatically? Use the single-neuron example as a concrete reference.
Walk me through what happens numerically when the neuron receives `x = [2, 1, 0.5]`, `w = [0.5, -1.2, 0.8]`, `b = 0.3`: show each step — dot product, add bias, sigmoid. Why use sigmoid instead of outputting z directly?
Distinguish a model's architecture (layer count, activation shapes) from its weights (the learned numbers). Why does this distinction matter when saving a checkpoint, fine-tuning a pretrained model, or debugging a NaN loss? Give one concrete example for each scenario.
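For the checkpoint scenario, a minimal sketch of the architecture/weights split might look like the following. It assumes the single-neuron setup from this lesson; the function name `forward` and the use of an in-memory buffer in place of a real checkpoint file are illustrative choices, not how any particular framework stores its models.

```python
import io
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Architecture: the fixed computation. It contains no learned numbers.
def forward(x, w, b):
    return sigmoid(np.dot(w, x) + b)

# Weights: the numbers that training would adjust.
w = np.array([0.5, -1.2, 0.8])
b = 0.3

# "Checkpoint": store only the numbers (a buffer stands in for a file on disk).
buffer = io.BytesIO()
np.savez(buffer, w=w, b=b)

# Later: reload the numbers and plug them into the same architecture.
buffer.seek(0)
ckpt = np.load(buffer)
w_loaded, b_loaded = ckpt["w"], float(ckpt["b"])

x = np.array([2.0, 1.0, 0.5])
same = np.isclose(forward(x, w, b), forward(x, w_loaded, b_loaded))
print(same)  # True
```

The saved data is useless without code that knows how to apply it, and the code predicts nothing useful without the saved numbers; fine-tuning starts from the loaded numbers and keeps adjusting them.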
import numpy as np
# A single neuron: y = sigmoid(w · x + b)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
w = np.array([0.5, -1.2, 0.8]) # weights (learnable parameters)
b = 0.3 # bias (also learnable)
x = np.array([2.0, 1.0, 0.5]) # one input sample (3 features)
z = np.dot(w, x) + b # linear part
y_hat = sigmoid(z) # squash to (0, 1)
print(f"z = {z:.4f}") # z = 0.5000
print(f"y_hat = {y_hat:.4f}") # y_hat = 0.6225
# Training means adjusting w and b so y_hat gets close to the true label.
# That adjustment loop is gradient descent, covered in task 5.

Run it with: python3 main.py
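To check your answers to the try-it exercises, a sketch along these lines may help. It reuses the demo's values for w, b, and x; the variable names `y_zero_w`, `y_zero_x`, and `y_linear` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 0.8])
b = 0.3
x = np.array([2.0, 1.0, 0.5])

# 1. Zero weights: the input is ignored, so the output is sigmoid(b).
#    It would be exactly 0.5 only if b were also 0.
y_zero_w = sigmoid(np.dot(np.zeros(3), x) + b)

# 2. Zero input: the weights are ignored, so the output is again sigmoid(b).
y_zero_x = sigmoid(np.dot(w, np.zeros(3)) + b)

# 3. Identity activation: with x = [1, 0, 0] the output is w[0] + b.
identity = lambda z: z
y_linear = identity(np.dot(w, np.array([1.0, 0.0, 0.0])) + b)

# 4. Shapes: two length-3 vectors give a scalar dot product.
print(w.shape, x.shape, np.dot(w, x).shape)  # (3,) (3,) ()
```

Exercises 1 and 2 land on the same value for different reasons: in the first the weights contribute nothing, in the second the input contributes nothing, and in both cases only the bias is left.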