Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
The loss function is the only thing that tells the model what 'correct' means. MSE penalizes errors quadratically, so one big outlier can dominate the entire loss. MAE penalizes linearly and is robust to outliers, but its gradient has constant magnitude and is undefined at zero, which complicates gradient descent near the optimum. Cross-entropy is the right choice for classification: its penalty, -log(p) for the probability p assigned to the true class, grows without bound as a confident prediction turns out wrong, while MSE on class labels produces shallow gradients once a sigmoid output saturates, which slows training to a crawl. Choosing the wrong loss is one of the most common silent bugs in ML: the model still trains, just poorly.
MSE and MAE measure error differently: MSE squares each residual, so a single outlier with a residual of 95 contributes 9025 to the sum of squared errors, while five normal residuals of 0.2 contribute only 0.2 in total (5 × 0.04). MAE would weigh the same residuals at 95 versus 1.0. That sensitivity is not a bug: for regression on clean targets, MSE's large gradient on large errors is useful. Classification is a different problem, though. The prediction is a probability between 0 and 1, so the squared error on any single example can never exceed 1, and you want steep gradients exactly when the model is confidently wrong.
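To make the gradient argument concrete, here is a minimal sketch. It assumes the predicted probability comes from a sigmoid over a logit z, which the lesson does not state, and the helper names sigmoid, grad_bce and grad_mse are illustrative rather than taken from the lesson. For a true label of 1, it compares the gradient of each loss with respect to z as the prediction becomes more confidently wrong.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_bce(z, y):
    # d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z) simplifies to p - y
    return sigmoid(z) - y

def grad_mse(z, y):
    # d/dz of (p - y)**2 with p = sigmoid(z): the chain rule adds the factor p*(1-p)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

y = 1.0  # true label
for z in [0.0, -2.0, -5.0]:  # increasingly confident wrong predictions
    p = sigmoid(z)
    print(f"p={p:.4f}  BCE grad={grad_bce(z, y):+.4f}  MSE grad={grad_mse(z, y):+.4f}")
```

As p moves toward 0 for a true label of 1, the BCE gradient approaches -1 while the MSE gradient shrinks toward 0 because of the p*(1-p) factor. That vanishing factor is the shallow-gradient problem described above.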
Set y_prob[3] = 0.01 (a very confident wrong prediction). Recompute BCE and MSE. Which loss grows more steeply? This is why BCE is used for classification.
Implement Huber loss with a threshold of 1: np.where(np.abs(delta) < 1, 0.5*delta**2, np.abs(delta) - 0.5), where delta is the residual y_true - y_pred. Compare it to MSE and MAE on the outlier dataset; it should be more robust than MSE but smoother than MAE.
Check that binary_cross_entropy([1], [1.0]) approaches 0 and binary_cross_entropy([1], [0.001]) is large. This is the penalty for confident wrong predictions that MSE doesn't have.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain why cross-entropy is the standard loss for classification instead of MSE. What does cross-entropy measure, and why does MSE give poor gradients for probabilities near 0 or 1?
Derive binary cross-entropy from the log-likelihood of a Bernoulli distribution. Start from P(y=1) = p, write the likelihood of a batch, take the negative log, and arrive at the BCE formula. Why 'negative' log-likelihood?
My production regression target has heavy-tailed outliers (99th-pct errors are 100× the median). I'm choosing between MSE, MAE, Huber, and log-cosh. Walk me through the tradeoffs: optimization stability, outlier resistance, gradient behavior near zero, and how to tune the Huber delta.
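For reference, the derivation the second prompt asks for can be sketched as follows. It assumes N independent labels y_i in {0, 1} with predicted probabilities p_i; the index i and the batch size N are notation introduced here, not taken from the lesson.

```latex
\begin{align*}
P(y_i \mid p_i) &= p_i^{\,y_i}\,(1 - p_i)^{1 - y_i} \\
\mathcal{L}(p) &= \prod_{i=1}^{N} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i} \\
-\log \mathcal{L}(p) &= -\sum_{i=1}^{N} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr] \\
\mathrm{BCE} &= -\frac{1}{N}\sum_{i=1}^{N} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
\end{align*}
```

The sign is negative because optimizers minimize: maximizing the likelihood is equivalent to minimizing its negative log, and dividing by N only rescales the objective.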
import numpy as np

# Regression: four well-predicted targets plus one outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # outlier at 100
y_pred = np.array([1.1, 2.0, 2.9, 4.2, 5.0])  # misses the outlier

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
print(f"MSE: {mse:.2f}")  # 1805.01, dominated by the outlier (95**2 = 9025)
print(f"MAE: {mae:.2f}")  # 19.08, the outlier counts only linearly


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Accept lists as well as arrays, and clip so log(0) never occurs
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


# Classification: the last prediction (0.4 for a true label of 1) is the weakest
y_cls = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.4])
bce = binary_cross_entropy(y_cls, y_prob)
mse_cls = np.mean((y_cls - y_prob) ** 2)
print(f"\nBCE: {bce:.4f}")  # 0.3375
print(f"MSE for cls: {mse_cls:.4f}")  # 0.1050
# BCE gives steeper gradients for confident wrong predictions

Run it with: python3 main.py
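The Huber exercise from the try-it list can be sketched as below. This is one possible solution, not the lesson's reference answer: the function name huber and its delta parameter are assumptions made here for illustration. Here delta is the switch point between the quadratic and linear regimes, set to 1 to match the expression in the exercise (where the name delta denotes the residual instead).

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Hypothetical helper: quadratic for small residuals, linear for large ones
    r = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r < delta, quadratic, linear))

# Same outlier dataset as the demo
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_pred = np.array([1.1, 2.0, 2.9, 4.2, 5.0])

print(f"MSE:   {np.mean((y_true - y_pred) ** 2):.2f}")
print(f"MAE:   {np.mean(np.abs(y_true - y_pred)):.2f}")
print(f"Huber: {huber(y_true, y_pred):.2f}")
```

With delta=1 the outlier contributes 94.5 to the sum instead of 9025, so the Huber value lands close to MAE while the loss stays differentiable at a residual of zero. A larger delta pushes the behavior toward the quadratic, MSE-like regime, and a smaller one toward the linear, MAE-like regime.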