Loss functions — MSE, MAE, cross-entropy, and why they differ

medium

Why this matters

The loss function is the only thing that tells the model what 'correct' means. MSE penalizes errors quadratically, so one large outlier can dominate the entire loss. MAE penalizes linearly and is more robust to outliers, but it is not differentiable at zero and its gradient has constant magnitude everywhere else, which makes it harder for gradient descent to settle on the minimum. Cross-entropy is the standard choice for classification: its penalty, -log(p) for the probability p assigned to the true class, grows without bound as a confident prediction turns out wrong, whereas MSE on class labels through a sigmoid or softmax output produces shallow gradients in exactly that regime and slows training to a crawl. Choosing the wrong loss is one of the most common silent bugs in ML: the model still trains, just poorly.
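The "shallow gradients" claim can be checked numerically. The sketch below assumes a single sigmoid output with logit z (an assumption for illustration, not something the lesson specifies) and compares the gradient of each loss with respect to z when the true label is 1. The helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_bce_wrt_logit(y, z):
    # Derivative of -[y*log(p) + (1-y)*log(1-p)] w.r.t. z, with p = sigmoid(z)
    return sigmoid(z) - y

def grad_mse_wrt_logit(y, z):
    # Derivative of (p - y)**2 w.r.t. z, with p = sigmoid(z)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

y = 1.0
# Very negative z means the model is confidently predicting class 0 (wrong here)
for z in (-8.0, -4.0, 0.0):
    print(f"z={z:5.1f}  p={sigmoid(z):.5f}  "
          f"dBCE/dz={grad_bce_wrt_logit(y, z):.5f}  "
          f"dMSE/dz={grad_mse_wrt_logit(y, z):.5f}")
```

Under this setup the cross-entropy gradient stays close to -1 when the model is confidently wrong, while the MSE gradient is scaled by p(1 - p) and shrinks toward zero.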

Demo

MSE and MAE measure error differently. MSE squares each residual, so a single outlier with a residual of 95 contributes 9025 to the sum of squared errors, while five normal residuals of 0.2 contribute only 0.2 combined (5 × 0.04). That asymmetry is not a bug: for regression on clean targets, MSE's steep gradient at large errors is useful. It does make MSE a poor fit for classification, where the prediction is a probability between 0 and 1 and you want steep gradients only when the model is confidently wrong.
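A minimal sketch of what such a demo could look like follows. The arrays are hypothetical values chosen to reproduce the residuals quoted above (five of 0.2, one of 95), and the classification labels and probabilities are likewise made up for illustration. The names y_prob and binary_cross_entropy follow the try-it list, and the eps clipping is an added assumption to keep log() finite.

```python
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip probabilities away from exactly 0 and 1 so log() stays finite
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# Regression block (hypothetical data): the last target, 100, is the outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
y_pred = np.array([1.2, 2.2, 3.2, 4.2, 5.2, 5.0])
squared = (y_true - y_pred) ** 2
print("squared error, five normal points:", squared[:5].sum())
print("squared error, outlier:           ", squared[5])
print("MSE:", mse(y_true, y_pred), " MAE:", mae(y_true, y_pred))

# Classification block (hypothetical data): index 3 has true label 1
y_label = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.6])
print("BCE:", binary_cross_entropy(y_label, y_prob), " MSE:", mse(y_label, y_prob))
```

With these particular numbers the MSE is about 1504 and the MAE is 16, so the single outlier accounts for nearly all of the MSE.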

Try it yourself

Change the outlier from 100 to 10 and recompute MSE and MAE. Track the MSE/MAE ratio as you vary the outlier from 5 to 500 — it demonstrates how MSE's quadratic penalty amplifies outliers non-linearly.
In the classification block, set y_prob[3] = 0.01 (a very confident wrong prediction when the true label is 1). Recompute BCE and MSE. Which loss grows more steeply? This is why BCE is used for classification.
Implement Huber loss: np.where(np.abs(delta) < 1, 0.5*delta**2, np.abs(delta) - 0.5), where delta is the residual and the Huber threshold is fixed at 1. Compare it to MSE and MAE on the outlier dataset: it should be more robust than MSE but smoother than MAE near zero.
Verify that binary_cross_entropy([1], [1.0]) is approximately 0 and that binary_cross_entropy([1], [0.001]) is large (about 6.9, since -ln(0.001) ≈ 6.91). MSE on the same prediction cannot exceed 1, so it has no comparable penalty for confident wrong predictions.
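For the Huber exercise, one possible generalization makes the threshold a parameter instead of hard-coding 1. This is a sketch, not the lesson's reference solution. The parameter is named threshold here (a hypothetical name) to avoid clashing with delta, which the list uses for the residual. The dataset is the same made-up outlier data with residuals of 0.2, swept over a few outlier values.

```python
import numpy as np

def huber(residual, threshold=1.0):
    # Quadratic inside the threshold, linear outside; the two pieces meet at |r| = threshold
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = threshold * (r - 0.5 * threshold)
    return np.where(r <= threshold, quadratic, linear)

# Hypothetical predictions; only the last target (the outlier) is varied
y_pred = np.array([1.2, 2.2, 3.2, 4.2, 5.2, 5.0])
for outlier in (10.0, 100.0, 500.0):
    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, outlier])
    r = y_true - y_pred
    print(f"outlier={outlier:6.1f}  MSE={np.mean(r ** 2):10.2f}  "
          f"MAE={np.mean(np.abs(r)):7.2f}  Huber={np.mean(huber(r)):7.2f}")
```

With threshold=1.0 this reduces to the np.where expression in the list. A larger threshold treats more residuals quadratically (closer to MSE), while a smaller one behaves more like MAE.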

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why cross-entropy is the standard loss for classification instead of MSE. What does cross-entropy measure, and why does MSE give poor gradients for probabilities near 0 or 1?

2. Why it works (the mechanism)

Derive binary cross-entropy from the log-likelihood of a Bernoulli distribution. Start from P(y=1) = p, write the likelihood of a batch, take the negative log, and arrive at the BCE formula. Why 'negative' log-likelihood?

3. Advanced — application & what's next

My production regression target has heavy-tailed outliers (99th-pct errors are 100× the median). I'm choosing between MSE, MAE, Huber, and log-cosh. Walk me through the tradeoffs: optimization stability, outlier resistance, gradient behavior near zero, and how to tune the Huber delta.