
Representation Collapse

In self-supervised learning, representation collapse occurs when the encoder maps all inputs to the same (or nearly the same) representation — the trivial constant solution. The loss reaches a low value not because the model has learned useful features, but because identical representations trivially satisfy similarity objectives.

Imagine you’re grading essays by similarity. If every student submits a blank page, every pair of essays is perfectly similar — full marks for everyone. That’s representation collapse: the encoder learns to output the same vector regardless of input, and any objective that rewards similarity between views of the same image is trivially satisfied.

The core issue is that contrastive and similarity-based objectives have a degenerate solution built into them. If f(x) = c for all x, then the similarity between any two augmented views is maximal (they’re identical). Without negative examples, collapse is the global minimum. Negative examples help, but don’t always prevent subtler dimensional collapse, where the representations cluster into a low-dimensional subspace rather than spanning the full embedding space.
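The degenerate solution is easy to demonstrate. A toy sketch (the constant encoder and helper names below are illustrative, using NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

def collapsed_encoder(x):
    """Degenerate encoder: ignores its input and returns a constant vector."""
    return np.ones(8)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two augmented views of *different* images...
view_a = rng.normal(size=32)
view_b = rng.normal(size=32)

# ...are maximally similar under the collapsed encoder, so any
# similarity-only objective is already at its optimum.
sim = cosine(collapsed_encoder(view_a), collapsed_encoder(view_b))
print(sim)  # 1.0
```

No gradient signal remains once this point is reached, which is why the fixes below change the objective or the architecture rather than the optimiser.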

This is structurally similar to posterior collapse in VAEs and mode collapse in GANs — in each case, the model finds a degenerate solution that locally satisfies the training objective without learning anything useful. The difference is the mechanism: VAEs collapse the latent to the prior, GANs collapse the output to a few modes, and self-supervised methods collapse the representation to a point or subspace.

Telltale signs of collapse:

  • All representations have near-identical values — compute the standard deviation of embeddings across a batch; if it’s near zero, you’ve collapsed
  • Loss drops quickly to a low value early in training and stays flat — suspiciously good, suspiciously fast
  • Linear probe accuracy is at chance despite low training loss — the representations carry no discriminative information
  • Embedding covariance matrix is near-singular — most eigenvalues are near zero, meaning the representations live in a tiny subspace
  • Nearest-neighbour retrieval returns random images — if all embeddings are the same, every image is equally “close” to every other
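These checks are straightforward to script. A minimal sketch, assuming a `(batch, dim)` NumPy array of embeddings; the `collapse_diagnostics` helper and its 1% eigenvalue threshold are illustrative choices, not a standard API:

```python
import numpy as np

def collapse_diagnostics(z):
    """z: (batch, dim) embeddings. Returns mean per-dimension std and the
    fraction of covariance eigenvalues carrying more than 1% of total variance."""
    std = z.std(axis=0)                  # near-zero everywhere => full collapse
    cov = np.cov(z, rowvar=False)
    eig = np.linalg.eigvalsh(cov)
    total = eig.sum()
    active = float((eig > 0.01 * total).mean()) if total > 0 else 0.0
    return float(std.mean()), active

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 64))                          # spans the full space
constant = np.ones((256, 64))                                 # full collapse
rank1 = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 64))  # dimensional collapse

h_std, h_active = collapse_diagnostics(healthy)
c_std, c_active = collapse_diagnostics(constant)
r_std, r_active = collapse_diagnostics(rank1)
print(h_std, h_active)  # healthy: std near 1, most eigenvalue mass spread out
print(c_std, c_active)  # constant: std is zero, no active directions
print(r_std, r_active)  # rank-1: variance survives, but in a single direction
```

The two statistics catch different failure modes: mean std flags full collapse to a point, while the active-eigenvalue fraction flags dimensional collapse to a subspace.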
Related concepts:

  • Contrastive self-supervised learning (contrastive-self-supervising/): SimCLR and MoCo use negative examples to prevent collapse (pushing apart non-matching pairs); BYOL avoids negatives entirely but uses a stop-gradient + EMA teacher instead; VICReg adds explicit variance and covariance regularisation
  • GANs (gans/): mode collapse is the generative analogue — the generator’s output space collapses to a few points
  • VAE (variational-inference-vae/): posterior collapse is the latent-space analogue — the encoder ignores input and matches the prior
  • Transformer (transformer/): without proper training, attention heads can collapse to uniform attention (every token attends equally to all tokens), a subtler form of the same phenomenon
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| Contrastive negatives (SimCLR, MoCo) | Explicitly push apart representations of different images | contrastive-self-supervising/ |
| Stop-gradient + EMA teacher (BYOL) | Asymmetric architecture prevents both branches from collapsing together | contrastive-self-supervising/, atomic-concepts/mathematical-tricks/stop-gradient.md |
| Variance/covariance regularisation (VICReg) | Directly penalise low variance and high correlation in the embedding dimensions | (Bardes et al., 2022) |
| Large batch + temperature (SimCLR) | More negatives per batch makes the contrastive signal stronger | atomic-concepts/loss-functions/infonce-loss.md |
| Momentum encoder (MoCo) | Maintains a large, consistent dictionary of negatives via EMA | contrastive-self-supervising/, atomic-concepts/optimisation-primitives/exponential-moving-average.md |
| Predictor head (BYOL, SimSiam) | An extra MLP on one branch breaks the symmetry that allows collapse | contrastive-self-supervising/ |
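The VICReg-style variance and covariance terms can be sketched directly. The function below is an illustrative simplification, not the paper’s tuned loss (which also includes an invariance term and specific weights):

```python
import numpy as np

def vicreg_reg(z, gamma=1.0, eps=1e-4):
    """Variance + covariance regularisers in the spirit of VICReg.

    z: (batch, dim) embeddings. A hinge keeps each dimension's std above
    gamma; off-diagonal covariance is pushed toward zero so dimensions
    decorrelate. Weights and thresholds here are illustrative.
    """
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.maximum(0.0, gamma - std).mean()   # penalise low variance
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]    # penalise correlated dims
    return float(var_loss + cov_loss)

rng = np.random.default_rng(0)
spread = rng.normal(size=(256, 32))   # well-spread embeddings: small penalty
collapsed = np.ones((256, 32))        # identical embeddings: hinge saturates
print(vicreg_reg(spread), vicreg_reg(collapsed))
```

Note that the variance hinge directly targets full collapse (zero std) while the covariance term targets dimensional collapse (embeddings correlated into a subspace).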

Representation collapse was understood as a risk from the early days of self-supervised learning (2018–2019), but it was BYOL (Grill et al., 2020) that made it a central topic of study. BYOL showed that contrastive negatives are not strictly necessary to avoid collapse — stop-gradient and an EMA teacher suffice — which surprised the community and sparked intense investigation into why these methods don’t collapse. The theoretical understanding is still evolving, but the practical toolkit (negatives, EMA, stop-gradient, variance regularisation) is now well established. With SimSiam, Chen & He (2021) simplified the picture further by showing that stop-gradient alone, without an EMA teacher, can prevent collapse — pointing to the optimisation dynamics, rather than any single architectural trick, as the key factor.