Label Smoothing
Softens one-hot target distributions by mixing with a uniform distribution: q = (1 − α)·e_y + α·u. Prevents the model from producing overconfident predictions by never asking it to assign probability 1.0 to any class. Improves calibration and generalisation with zero architectural cost.
Intuition
Without label smoothing, the loss drives the model to put 100% probability on the correct class. Achieving this requires pushing the correct logit to +∞ — which means the logits grow without bound, the model becomes infinitely confident, and the softmax saturates. At that point, the model has memorised “this is definitely a cat” rather than learning “this is probably a cat because of these features.”
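The saturation claim is easy to check numerically: softmax can approach, but never reach, probability 1.0, so chasing a hard 1.0 target pushes the logit gap without bound (a small numpy sketch with illustrative values):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# Correct-class probability for K = 3 as the winning logit's margin grows:
# p_correct = 1 / (1 + (K - 1) * exp(-margin)) -> 1 only as margin -> infinity
for margin in [2.0, 5.0, 10.0, 20.0]:
    p = softmax(np.array([margin, 0.0, 0.0]))
    print(f"margin={margin:5.1f}  p_correct={p[0]:.8f}")
```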
Label smoothing says: “don’t aim for 100%. Aim for 90% on the correct class and spread the remaining 10% evenly across all others.” This caps how large the logits need to grow, keeping the model in a regime where softmax outputs are informative rather than saturated. The model learns to be confident but not certain — and that hedge turns out to generalise better.
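That cap can be made concrete: the gradient of the smoothed loss with respect to the logits is p − q, so the optimum is reached when the predicted distribution exactly equals the smoothed targets, which corresponds to a finite logit gap (a numpy sketch with illustrative K and α):

```python
import numpy as np

K, alpha = 5, 0.1
p_correct = 1 - alpha + alpha / K   # 0.92: smoothed target on the true class

# Gap m between the correct logit and the (equal) others that achieves it:
# p_correct = 1 / (1 + (K - 1) * exp(-m))  =>  solve for m
gap = -np.log((1 / p_correct - 1) / (K - 1))
print(f"optimal logit gap with smoothing: {gap:.2f}")  # finite, ~3.83 here
# With hard targets (alpha = 0) the same formula sends the gap to infinity.
```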
A useful side effect: label smoothing improves calibration. A well-calibrated model’s confidence matches its accuracy — when it says “80% sure,” it’s right 80% of the time. By preventing extreme confidences during training, label smoothing nudges the model toward this property without any explicit calibration objective.
Smoothed target distribution (class y is correct, K classes total):

q_i = 1 − α + α/K if i = y, otherwise α/K

This is equivalent to q = (1 − α)·e_y + α·u, where e_y is the one-hot vector and u = (1/K, …, 1/K) is the uniform distribution.
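The two forms can be checked against each other in a few lines (illustrative values for K, α, and the class index):

```python
import numpy as np

K, alpha, y = 5, 0.1, 2   # illustrative: 5 classes, class 2 is correct

# Direct construction: alpha/K everywhere, plus (1 - alpha) on the true class
q = np.full(K, alpha / K)
q[y] += 1 - alpha

# Mixture form: (1 - alpha) * one-hot + alpha * uniform
q_mix = (1 - alpha) * np.eye(K)[y] + alpha * np.full(K, 1.0 / K)

assert np.allclose(q, q_mix)
assert np.isclose(q.sum(), 1.0)   # still a valid probability distribution
print(q)                          # [0.02 0.02 0.92 0.02 0.02]
```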
Cross-entropy with smoothed targets:

L = −Σ_i q_i · log p_i

This decomposes as L = (1 − α)·H(e_y, p) + α·H(u, p). The second term equals KL(u ‖ p) + log K, i.e. up to a constant it is a KL penalty that pushes the model’s output toward uniform — preventing any class from dominating too strongly.
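Both the decomposition and the KL identity can be verified numerically for arbitrary logits (a self-contained sketch; the values are random):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, y = 5, 0.1, 2
logits = rng.normal(size=K)

# Stable log-softmax
m = logits.max()
log_p = logits - (m + np.log(np.exp(logits - m).sum()))

# Cross-entropy against the smoothed targets, computed directly
q = np.full(K, alpha / K)
q[y] += 1 - alpha
ce_smoothed = -(q * log_p).sum()

# ... and via the decomposition (1 - alpha) * H(e_y, p) + alpha * H(u, p)
ce_onehot = -log_p[y]
ce_uniform = -log_p.mean()
assert np.isclose(ce_smoothed, (1 - alpha) * ce_onehot + alpha * ce_uniform)

# The uniform term is a KL penalty up to a constant: H(u, p) = KL(u || p) + log K
kl_u_p = (np.full(K, 1 / K) * (np.log(1 / K) - log_p)).sum()
assert np.isclose(ce_uniform, kl_u_p + np.log(K))
```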
Typical α: 0.1 (the near-universal default). Values above 0.2 generally hurt accuracy.
```python
import torch
import torch.nn.functional as F

# ── Built-in PyTorch support (since 1.10) ──────────────────────
logits = model(x)    # (B, K)
targets = labels     # (B,)
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)

# WARNING: label_smoothing expects integer targets (class indices),
# not soft targets. If you already have soft targets, don't use this
# flag — it will smooth your already-soft distribution.

# ── Manual computation (useful for understanding) ──────────────
K = logits.size(-1)
log_probs = F.log_softmax(logits, dim=-1)                                 # (B, K)
nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)  # (B,)
smooth_loss = -log_probs.mean(dim=-1)                                     # (B,)
loss = (1 - 0.1) * nll + 0.1 * smooth_loss                                # (B,)
loss = loss.mean()                                                        # scalar
```

Manual Implementation
```python
import numpy as np

def label_smoothing_cross_entropy(logits, targets, alpha=0.1):
    """
    Cross-entropy with label smoothing.
    logits:  (B, K) raw scores
    targets: (B,) integer class indices
    alpha:   smoothing factor (0.0 = standard CE)
    """
    B, K = logits.shape

    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)              # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp                                 # (B, K)

    # NLL on correct class
    nll = -log_probs[np.arange(B), targets]                           # (B,)

    # Uniform penalty: mean of all log-probs
    smooth_loss = -log_probs.mean(axis=1)                             # (B,)

    return ((1 - alpha) * nll + alpha * smooth_loss).mean()
```

Popular Uses
- Image classification (Inception v3, EfficientNet): the original application; α = 0.1 is the default in most vision recipes
- LLM pretraining: some models use light smoothing to improve calibration of next-token predictions
- Machine translation (Transformer, “Attention Is All You Need”): was part of the original Transformer recipe and remains standard
- Knowledge distillation: soft targets from a teacher already provide implicit smoothing; additional label smoothing is usually unnecessary
- Speech recognition: commonly applied with α = 0.1 in CTC and attention-based models
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Temperature scaling | Post-hoc calibration | Applied after training, doesn’t affect training dynamics; fixes calibration but doesn’t regularise |
| Mixup / CutMix | When you want stronger regularisation + data augmentation | Interpolates entire inputs and labels; more powerful but changes the data distribution |
| Focal loss | Class-imbalanced settings | Down-weights easy/confident examples; addresses imbalance rather than overconfidence |
| Knowledge distillation | When a teacher model is available | Soft targets from the teacher are a richer form of smoothing tailored to the data |
| Confidence penalty | Explicit entropy regularisation | Adds a negative-entropy term −β·H(p) to the loss directly; more flexible but another hyperparameter |
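As an illustration of the last row, the confidence penalty subtracts β times the prediction entropy from the loss, directly rewarding higher-entropy outputs (a minimal numpy sketch; the function name and β default are assumptions, not a reference implementation):

```python
import numpy as np

def confidence_penalty_ce(logits, target, beta=0.1):
    # Standard cross-entropy minus beta * entropy of the predictions:
    # overconfident (low-entropy) outputs are penalised directly.
    m = logits.max()
    log_p = logits - (m + np.log(np.exp(logits - m).sum()))
    p = np.exp(log_p)
    ce = -log_p[target]
    entropy = -(p * log_p).sum()
    return ce - beta * entropy

z = np.array([2.0, 0.0, 0.0])
# With beta = 0 this is plain cross-entropy; beta > 0 lowers the loss
# for any non-degenerate prediction, since entropy is positive.
assert confidence_penalty_ce(z, 0, beta=0.1) < confidence_penalty_ce(z, 0, beta=0.0)
```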
Historical Context
Label smoothing was introduced by Szegedy et al. (2016, “Rethinking the Inception Architecture”) as one of several training tricks for the Inception v3 model. It was a minor note in that paper but became ubiquitous after Vaswani et al. (2017, “Attention Is All You Need”) included it in the Transformer training recipe.
Müller et al. (2019, “When Does Label Smoothing Help?”) provided deeper analysis, showing that label smoothing makes representations of different classes more tightly clustered and equidistant in embedding space. They also noted a surprising downside: label smoothing can hurt knowledge distillation because the teacher’s softened outputs carry less information about inter-class relationships. Despite this edge case, α = 0.1 remains a near-universal default — one of the cheapest regularisers to apply, with consistent benefits.