MSE / Huber Loss
Mean squared error (MSE) — also called L2 loss — and its robust cousin Huber loss (smooth L1) are the standard losses for regression tasks: predicting continuous values like bounding box coordinates, noise in diffusion models, or Q-values in reinforcement learning. (Not to be confused with L2 regularisation / weight decay, which penalises weight magnitudes — see regularisation/weight-decay.md.)
Intuition
MSE asks: “how far off were you, squared?” Squaring has two effects. First, it penalises large errors much more than small ones — an error of 10 costs 100x more than an error of 1. This makes MSE aggressively chase outliers. Second, it gives a smooth, everywhere-differentiable loss surface with a clean gradient: just the error itself (times 2).
The problem is that squaring cuts both ways. If your data has outliers or noisy targets, MSE will warp the entire model to reduce those few extreme errors. Huber loss fixes this by being quadratic for small errors (behaving like MSE) and linear for large errors (behaving like MAE/L1). The transition point, delta, controls where “small” ends and “large” begins. PyTorch’s smooth_l1_loss uses delta=1 by default. Below delta, you get MSE-like gradients that shrink as you approach zero (good for precise convergence). Above delta, you get constant-magnitude gradients that don’t explode on outliers.
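To make the outlier behaviour concrete, here is a minimal numeric sketch (plain Python, per-element losses, delta = 1 as in the PyTorch default):

```python
def mse(e):
    # Squared error for a single residual e.
    return e ** 2

def huber(e, delta=1.0):
    # Quadratic inside delta, linear beyond it.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

for e in [0.1, 1.0, 10.0]:
    print(f"error={e:5.1f}  mse={mse(e):7.2f}  huber={huber(e):5.2f}")
# An error of 10 costs 100.0 under MSE but only 9.5 under Huber:
# beyond delta the quadratic penalty is capped to linear growth.
```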
In deep RL, this distinction matters: Q-learning targets are notoriously noisy (they depend on the max over a changing network), so DQN uses Huber loss instead of MSE to avoid destabilising gradient spikes.
MSE / L2 loss (mean squared error):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

Gradient with respect to prediction $\hat{y}_i$: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$. The gradient is proportional to the error — large errors get large gradients.

Huber loss (smooth L1):

$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \le \delta \\ \delta\left(|e| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$

where $e = \hat{y} - y$. For $|e| \le \delta$, the gradient is $e$ (like MSE). For $|e| > \delta$, the gradient is $\delta \cdot \operatorname{sign}(e)$ (constant magnitude, like L1).

MAE / L1 loss (for comparison):

$$\mathcal{L}_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$$

Gradient is $\operatorname{sign}(\hat{y}_i - y_i) = \pm 1$ everywhere (except at zero, where it’s undefined). Robust to outliers but doesn’t converge as precisely near zero because the gradient never shrinks.
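As a sanity check on the piecewise definition, a short plain-Python sketch confirms the two Huber branches and their gradients meet smoothly at $|e| = \delta$, which is what makes the loss everywhere-differentiable:

```python
import math

def huber(e, delta=1.0):
    # 0.5*e^2 inside delta, delta*(|e| - 0.5*delta) outside.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def huber_grad(e, delta=1.0):
    # Gradient: e inside delta, delta*sign(e) outside.
    if abs(e) <= delta:
        return e
    return math.copysign(delta, e)

delta = 1.0
eps = 1e-9
# Both branches give 0.5*delta^2 at the transition point ...
assert math.isclose(0.5 * delta ** 2, delta * (delta - 0.5 * delta))
assert math.isclose(huber(delta - eps), huber(delta + eps), rel_tol=1e-6)
# ... and both gradients equal delta there, so there is no kink.
assert math.isclose(huber_grad(delta), huber_grad(delta + eps), rel_tol=1e-6)
# Far from the optimum the gradient is capped at delta, unlike MSE's 2*e.
assert huber_grad(100.0) == delta
```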
```python
import torch
import torch.nn.functional as F

# ── MSE loss ─────────────────────────────────────────────────────
pred = model(x)                        # (B, D) — predictions
target = y                             # (B, D) — ground truth
loss = F.mse_loss(pred, target)        # scalar, reduction='mean'

# ── Huber / smooth L1 (default delta=1.0) ────────────────────────
loss = F.smooth_l1_loss(pred, target)  # scalar
# Custom delta (called 'beta' in PyTorch):
loss = F.smooth_l1_loss(pred, target, beta=0.5)

# ── Huber loss (explicit, equivalent but different API) ──────────
loss = F.huber_loss(pred, target, delta=1.0)

# WARNING: F.smooth_l1_loss and F.huber_loss differ by a factor of
# (1/delta) in the quadratic region when delta != 1. Use huber_loss
# if you want the standard textbook Huber; use smooth_l1_loss if
# following DQN-style papers that expect the smooth L1 convention.
```
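The factor between the two conventions is easy to verify without running PyTorch. The sketch below reimplements the per-element formulas from the PyTorch documentation for `smooth_l1_loss` (parameter `beta`) and `huber_loss` (parameter `delta`) in plain Python:

```python
def smooth_l1(e, beta=1.0):
    # F.smooth_l1_loss convention: quadratic term divided by beta.
    if abs(e) < beta:
        return 0.5 * e ** 2 / beta
    return abs(e) - 0.5 * beta

def huber(e, delta=1.0):
    # F.huber_loss / textbook convention: linear term multiplied by delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

# With delta = beta != 1 the two differ by exactly that factor:
for e in [0.5, 1.0, 5.0]:
    assert abs(huber(e, delta=2.0) - 2.0 * smooth_l1(e, beta=2.0)) < 1e-12

# At delta = beta = 1 they coincide, which is why the mismatch is easy to miss.
assert huber(5.0, delta=1.0) == smooth_l1(5.0, beta=1.0)
```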
Manual Implementation

```python
import numpy as np

def mse_loss(pred, target):
    """
    Equivalent to F.mse_loss(pred, target, reduction='mean').
    pred:   (B, D) or (B,) predictions
    target: same shape as pred
    """
    return ((pred - target) ** 2).mean()

def huber_loss(pred, target, delta=1.0):
    """
    Equivalent to F.huber_loss(pred, target, delta=delta).
    Quadratic for |error| <= delta, linear beyond.
    pred:   (B, D) predictions
    target: (B, D) ground truth
    """
    error = pred - target                                   # (B, D)
    abs_error = np.abs(error)                               # (B, D)
    quadratic = 0.5 * error ** 2                            # (B, D)
    linear = delta * (abs_error - 0.5 * delta)              # (B, D)
    loss = np.where(abs_error <= delta, quadratic, linear)  # (B, D)
    return loss.mean()
```
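A quick usage sketch (NumPy assumed; the functions are repeated so the snippet runs standalone): with a single corrupted target, MSE is dominated by the outlier while Huber barely moves:

```python
import numpy as np

def mse_loss(pred, target):
    return ((pred - target) ** 2).mean()

def huber_loss(pred, target, delta=1.0):
    error = pred - target
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear).mean()

pred = np.zeros(5)
clean = np.full(5, 0.1)    # small residuals everywhere
dirty = clean.copy()
dirty[0] = 100.0           # one corrupted label

# MSE jumps by a factor of roughly 200,000; Huber by roughly 4,000.
print(mse_loss(pred, clean), mse_loss(pred, dirty))      # ≈ 0.01 vs ≈ 2000
print(huber_loss(pred, clean), huber_loss(pred, dirty))  # ≈ 0.005 vs ≈ 19.9
```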
Popular Uses

- Diffusion noise prediction (see diffusion/): MSE between predicted and actual noise is the core DDPM training objective — $\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$
- Q-learning (see q-learning/): Huber loss between Q-values and bootstrap targets. DQN switched from MSE to Huber to tame noisy target gradients
- Bounding box regression (Faster R-CNN, YOLO): smooth L1 for predicting box coordinates — robust to annotation noise
- Autoencoders / VAEs (see variational-inference-vae/): MSE reconstruction loss when pixel-level fidelity matters (vs. cross-entropy for binary images)
- Regression heads in multi-task models (e.g. predicting age, price, temperature)
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Cross-entropy | Classification (discrete targets) | Probabilistic interpretation; not applicable to continuous targets |
| MAE / L1 loss | Need maximum outlier robustness | Constant gradient doesn’t shrink near the optimum — slower final convergence |
| Log-cosh loss | Want smooth L1-like behaviour without the piecewise definition | Approximately MSE for small errors, L1 for large; twice differentiable everywhere |
| Quantile loss | Predicting intervals or specific percentiles | Asymmetric — penalises over/under-prediction differently based on the quantile |
| Cosine similarity loss | Comparing directions, not magnitudes (embeddings) | Ignores scale entirely; only measures angular distance |
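As a concrete example of one row above, log-cosh behaves as claimed and can be checked numerically; the sketch below uses a numerically stable identity for $\log\cosh(e)$ (plain Python, no framework assumed):

```python
import math

def log_cosh(e):
    # Stable form: log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2),
    # which avoids overflow in cosh for large |e|.
    a = abs(e)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

# Behaves like 0.5*e^2 for small errors (MSE-like) ...
assert abs(log_cosh(0.01) - 0.5 * 0.01 ** 2) < 1e-8
# ... and like |e| - log(2) for large ones (L1-like, bounded gradient).
assert abs(log_cosh(20.0) - (20.0 - math.log(2.0))) < 1e-8
# Unlike Huber, there is no delta to tune and the loss is smooth everywhere.
```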
Historical Context
MSE traces back to Gauss and Legendre (early 1800s) as the foundation of least-squares estimation. It became the default neural network loss for regression because it corresponds to maximum likelihood under Gaussian noise assumptions and has clean, well-behaved gradients.
Huber loss was introduced by Peter Huber in 1964 in robust statistics, specifically to reduce the influence of outliers in estimation. It entered deep learning through DQN (Mnih et al., 2015), where the smooth L1 variant stabilised Q-learning by capping gradient magnitudes from noisy bootstrap targets. The distinction between Huber and smooth L1 conventions (a factor of delta in the quadratic region) continues to cause minor confusion across frameworks, so always check which convention your codebase uses.