MSE / Huber Loss
Mean squared error (MSE) — also called L2 loss — and its robust cousin Huber loss (smooth L1) are the standard losses for regression tasks: predicting continuous values like bounding box coordinates, noise in diffusion models, or Q-values in reinforcement learning. (Not to be confused with L2 regularisation / weight decay, which penalises weight magnitudes — see regularisation/weight-decay.md.)
Intuition
MSE asks: “how far off were you, squared?” Squaring has two effects. First, it penalises large errors much more than small ones — an error of 10 costs 100x more than an error of 1. This makes MSE aggressively chase outliers. Second, it gives a smooth, everywhere-differentiable loss surface with a clean gradient: just the error itself (times 2).
The problem is that squaring cuts both ways. If your data has outliers or noisy targets, MSE will warp the entire model to reduce those few extreme errors. Huber loss fixes this by being quadratic for small errors (behaving like MSE) and linear for large errors (behaving like MAE/L1). The transition point, delta, controls where “small” ends and “large” begins. PyTorch’s smooth_l1_loss uses delta=1 by default. Below delta, you get MSE-like gradients that shrink as you approach zero (good for precise convergence). Above delta, you get constant-magnitude gradients that don’t explode on outliers.
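To make the outlier behaviour concrete, here is a minimal numeric sketch (plain Python, per-element losses, delta = 1 as in the PyTorch default):

```python
def mse(e):
    # Squared error for a single residual e.
    return e ** 2

def huber(e, delta=1.0):
    # Quadratic inside delta, linear beyond it.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

for e in [0.1, 1.0, 10.0]:
    print(f"error={e:5.1f}  mse={mse(e):7.2f}  huber={huber(e):5.2f}")
# An error of 10 costs 100.0 under MSE but only 9.5 under Huber:
# beyond delta the quadratic penalty is capped to linear growth.
```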
In deep RL, this distinction matters: Q-learning targets are notoriously noisy (they depend on the max over a changing network), so DQN uses Huber loss instead of MSE to avoid destabilising gradient spikes.
MSE / L2 loss (mean squared error):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

Gradient with respect to prediction $\hat{y}_i$: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$. The gradient is proportional to the error — large errors get large gradients.

Huber loss (smooth L1):

$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \le \delta \\ \delta\left(|e| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$

where $e = \hat{y} - y$. For $|e| \le \delta$, the gradient is $e$ (like MSE). For $|e| > \delta$, the gradient is $\delta \cdot \operatorname{sign}(e)$ (constant magnitude, like L1).

MAE / L1 loss (for comparison):

$$\mathcal{L}_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$$

Gradient is $\operatorname{sign}(\hat{y}_i - y_i) = \pm 1$ everywhere (except at zero, where it’s undefined). Robust to outliers but doesn’t converge as precisely near zero because the gradient never shrinks.
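As a sanity check on the piecewise definition, a short plain-Python sketch confirms the two Huber branches and their gradients meet smoothly at $|e| = \delta$, which is what makes the loss everywhere-differentiable:

```python
import math

def huber(e, delta=1.0):
    # 0.5*e^2 inside delta, delta*(|e| - 0.5*delta) outside.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def huber_grad(e, delta=1.0):
    # Gradient: e inside delta, delta*sign(e) outside.
    if abs(e) <= delta:
        return e
    return math.copysign(delta, e)

delta = 1.0
eps = 1e-9
# Both branches give 0.5*delta^2 at the transition point ...
assert math.isclose(0.5 * delta ** 2, delta * (delta - 0.5 * delta))
assert math.isclose(huber(delta - eps), huber(delta + eps), rel_tol=1e-6)
# ... and both gradients equal delta there, so there is no kink.
assert math.isclose(huber_grad(delta), huber_grad(delta + eps), rel_tol=1e-6)
# Far from the optimum the gradient is capped at delta, unlike MSE's 2*e.
assert huber_grad(100.0) == delta
```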
```python
import torch
import torch.nn.functional as F

# ── MSE loss ─────────────────────────────────────────────────────
pred = model(x)                        # (B, D) — predictions
target = y                             # (B, D) — ground truth
loss = F.mse_loss(pred, target)        # scalar, reduction='mean'

# ── Huber / smooth L1 (default delta=1.0) ────────────────────────
loss = F.smooth_l1_loss(pred, target)  # scalar
# Custom delta (called 'beta' in PyTorch):
loss = F.smooth_l1_loss(pred, target, beta=0.5)

# ── Huber loss (explicit, equivalent but different API) ──────────
loss = F.huber_loss(pred, target, delta=1.0)

# WARNING: F.smooth_l1_loss and F.huber_loss differ by a factor of
# (1/delta) in the quadratic region when delta != 1. Use huber_loss
# if you want the standard textbook Huber; use smooth_l1_loss if
# following DQN-style papers that expect the smooth L1 convention.
```
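The factor between the two conventions is easy to verify without running PyTorch. The sketch below reimplements the per-element formulas from the PyTorch documentation for `smooth_l1_loss` (parameter `beta`) and `huber_loss` (parameter `delta`) in plain Python:

```python
def smooth_l1(e, beta=1.0):
    # F.smooth_l1_loss convention: quadratic term divided by beta.
    if abs(e) < beta:
        return 0.5 * e ** 2 / beta
    return abs(e) - 0.5 * beta

def huber(e, delta=1.0):
    # F.huber_loss / textbook convention: linear term multiplied by delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

# With delta = beta != 1 the two differ by exactly that factor:
for e in [0.5, 1.0, 5.0]:
    assert abs(huber(e, delta=2.0) - 2.0 * smooth_l1(e, beta=2.0)) < 1e-12

# At delta = beta = 1 they coincide, which is why the mismatch is easy to miss.
assert huber(5.0, delta=1.0) == smooth_l1(5.0, beta=1.0)
```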
Manual Implementation

```python
import numpy as np

def mse_loss(pred, target):
    """
    Equivalent to F.mse_loss(pred, target, reduction='mean').
    pred:   (B, D) or (B,) predictions
    target: same shape as pred
    """
    return ((pred - target) ** 2).mean()

def huber_loss(pred, target, delta=1.0):
    """
    Equivalent to F.huber_loss(pred, target, delta=delta).
    Quadratic for |error| <= delta, linear beyond.
    pred:   (B, D) predictions
    target: (B, D) ground truth
    """
    error = pred - target                                   # (B, D)
    abs_error = np.abs(error)                               # (B, D)
    quadratic = 0.5 * error ** 2                            # (B, D)
    linear = delta * (abs_error - 0.5 * delta)              # (B, D)
    loss = np.where(abs_error <= delta, quadratic, linear)  # (B, D)
    return loss.mean()
```
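A quick usage sketch (NumPy assumed; the functions are repeated so the snippet runs standalone): with a single corrupted target, MSE is dominated by the outlier while Huber barely moves:

```python
import numpy as np

def mse_loss(pred, target):
    return ((pred - target) ** 2).mean()

def huber_loss(pred, target, delta=1.0):
    error = pred - target
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear).mean()

pred = np.zeros(5)
clean = np.full(5, 0.1)    # small residuals everywhere
dirty = clean.copy()
dirty[0] = 100.0           # one corrupted label

# MSE jumps by a factor of roughly 200,000; Huber by roughly 4,000.
print(mse_loss(pred, clean), mse_loss(pred, dirty))      # ≈ 0.01 vs ≈ 2000
print(huber_loss(pred, clean), huber_loss(pred, dirty))  # ≈ 0.005 vs ≈ 19.9
```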
Popular Uses

- Diffusion noise prediction (see diffusion/): MSE between predicted and actual noise is the core DDPM training objective — $\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$
- Q-learning (see q-learning/): Huber loss between Q-values and bootstrap targets. DQN switched from MSE to Huber to tame noisy target gradients
- Bounding box regression (Faster R-CNN, YOLO): smooth L1 for predicting box coordinates — robust to annotation noise
- Autoencoders / VAEs (see variational-inference-vae/): MSE reconstruction loss when pixel-level fidelity matters (vs. cross-entropy for binary images)
- Regression heads in multi-task models (e.g. predicting age, price, temperature)
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Cross-entropy | Classification (discrete targets) | Probabilistic interpretation; not applicable to continuous targets |
| MAE / L1 loss | Need maximum outlier robustness | Constant gradient doesn’t shrink near the optimum — slower final convergence |
| Log-cosh loss | Want smooth L1-like behaviour without the piecewise definition | Approximately MSE for small errors, L1 for large; twice differentiable everywhere |
| Quantile loss | Predicting intervals or specific percentiles | Asymmetric — penalises over/under-prediction differently based on the quantile |
| Cosine similarity loss | Comparing directions, not magnitudes (embeddings) | Ignores scale entirely; only measures angular distance |
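As a concrete example of one row above, log-cosh behaves as claimed and can be checked numerically; the sketch below uses a numerically stable identity for $\log\cosh(e)$ (plain Python, no framework assumed):

```python
import math

def log_cosh(e):
    # Stable form: log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2),
    # which avoids overflow in cosh for large |e|.
    a = abs(e)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

# Behaves like 0.5*e^2 for small errors (MSE-like) ...
assert abs(log_cosh(0.01) - 0.5 * 0.01 ** 2) < 1e-8
# ... and like |e| - log(2) for large ones (L1-like, bounded gradient).
assert abs(log_cosh(20.0) - (20.0 - math.log(2.0))) < 1e-8
# Unlike Huber, there is no delta to tune and the loss is smooth everywhere.
```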
Historical Context
MSE traces back to Gauss and Legendre (early 1800s) as the foundation of least-squares estimation. It became the default neural network loss for regression because it corresponds to maximum likelihood under Gaussian noise assumptions and has clean, well-behaved gradients.
Huber loss was introduced by Peter Huber in 1964 in robust statistics, specifically to reduce the influence of outliers in estimation. It entered deep learning through DQN (Mnih et al., 2015), where the smooth L1 variant stabilised Q-learning by capping gradient magnitudes from noisy bootstrap targets. The distinction between Huber and smooth L1 conventions (a factor of delta in the quadratic region) continues to cause minor confusion across frameworks, so always check which convention your codebase uses.