InfoNCE Loss

A contrastive loss that treats representation learning as a classification problem: “which of these N candidates is the true positive?” The core training objective for SimCLR, CLIP, MoCo, and most modern contrastive learning methods. Also called NT-Xent (normalised temperature-scaled cross-entropy) in the SimCLR paper — they are the same thing.

You have an anchor (e.g. an augmented image), one positive (a different augmentation of the same image), and many negatives (augmentations of other images). InfoNCE computes the similarity of the anchor to every candidate and asks: “can you pick the positive out of the lineup?” The loss is literally cross-entropy over this N-way classification problem, where the “classes” are the candidates and the “correct class” is the positive.

Why does this work for learning representations? Because the only way to reliably pick the positive from a large set of negatives is to encode the semantic content that makes the positive similar to the anchor. Surface-level features (colour, orientation) vary across augmentations, so they can’t help. The model must learn features that are invariant to augmentation but discriminative across different inputs.

Temperature $\tau$ controls how "sharp" the classification is. Low temperature (e.g. 0.07) makes the softmax peakier, forcing the model to produce very distinct embeddings. High temperature (e.g. 1.0) is more forgiving. SimCLR and CLIP both use learned or tuned temperatures around 0.07-0.1. Setting the temperature too low causes training instability; too high makes the task too easy and the representations become less discriminative.
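A quick numeric sketch of the sharpening effect (the similarity values below are made up for illustration): at $\tau = 0.07$ the softmax puts roughly 93% of the mass on the top candidate, while at $\tau = 1.0$ the distribution stays close to uniform.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical cosine similarities of one anchor to four candidates;
# index 0 is the positive.
sims = np.array([0.8, 0.6, 0.5, 0.3])

for tau in (0.07, 1.0):
    print(tau, softmax(sims / tau).round(3))
```

The same similarity gap (0.8 vs 0.6) is nearly decisive at low temperature and almost irrelevant at high temperature, which is why $\tau$ directly controls how hard the discrimination task is.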

General form (anchor $q$, positive $k^+$, negatives $\{k^-_i\}$):

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, k^+) / \tau)}{\exp(\text{sim}(q, k^+) / \tau) + \sum_{i} \exp(\text{sim}(q, k^-_i) / \tau)}$$

where $\text{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$ (cosine similarity) and $\tau$ is the temperature.
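The general form translates almost line-for-line into code. A minimal NumPy sketch for a single anchor with an explicit negative set (the function name and shapes are illustrative, not from any library):

```python
import numpy as np

def infonce_single(q, k_pos, k_negs, temperature=0.07):
    """InfoNCE for one anchor q (D,), one positive k_pos (D,),
    and M negatives k_negs (M, D). Everything is L2-normalised here."""
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, k_pos, k_negs = norm(q), norm(k_pos), norm(k_negs)
    # Candidate 0 is the positive; the rest are the negatives
    logits = np.concatenate([[q @ k_pos], k_negs @ q]) / temperature
    # -log softmax of the positive, numerically stable
    shifted = logits - logits.max()
    return np.log(np.exp(shifted).sum()) - shifted[0]
```

A perfect embedding (positive identical to the anchor, negatives orthogonal) drives the loss toward zero; if all candidates look alike, the softmax is uniform and the loss approaches $\log(1 + M)$.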

Batch form (SimCLR-style, all pairs in a batch of $N$ pairs):

For each pair $(i, j)$ from the same source, with the remaining $2(N-1)$ samples as negatives:

$$\mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}$$

The total loss averages over all $2N$ anchors.
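The in-batch implementations later on this page score each anchor only against the other view's $N$ embeddings; the full SimCLR form uses all $2N$ embeddings as candidates, excluding only the anchor itself. A NumPy sketch of that version (the name `nt_xent_full` is illustrative):

```python
import numpy as np

def nt_xent_full(z1, z2, temperature=0.5):
    """Full SimCLR NT-Xent: every one of the 2N embeddings is an anchor,
    scored against the remaining 2N-1 candidates (self excluded).
    z1, z2: (N, D) embeddings of the two views."""
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                   # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature                            # (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                         # drop k = i
    # Anchor i's positive is its other view: i+N (first half) or i-N
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])
    shifted = sim - sim.max(axis=1, keepdims=True)
    log_probs = (shifted[np.arange(2 * N), pos]
                 - np.log(np.exp(shifted).sum(axis=1)))
    return -log_probs.mean()
```

Compared with the cross-view-only version, each anchor here also treats the other $N-1$ samples from its own view as negatives, which is exactly the $\sum_{k \neq i}$ in the denominator above.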

Connection to mutual information: InfoNCE is a lower bound on the mutual information between the anchor and positive representations: $I(q; k^+) \geq \log(N) - \mathcal{L}$. More negatives give a tighter bound, which is why larger batch sizes help.
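A rough empirical sketch of the bound, rearranged as the CPC-style MI estimate $\log(N) - \mathcal{L}$: independent views carry zero mutual information, so the loss sits around $\log N$ or above and the estimate lands at or below zero, while identical views drive the loss toward zero and the estimate saturates at its cap of $\log N$. The helper below is an illustrative in-batch InfoNCE in NumPy, not taken from any library:

```python
import numpy as np

def infonce_batch(z1, z2, temperature=0.07):
    """In-batch InfoNCE: row i of z2 is row i of z1's positive."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    lp = shifted.diagonal() - np.log(np.exp(shifted).sum(axis=1))
    return -lp.mean()

rng = np.random.default_rng(0)
N = 512
a = rng.normal(size=(N, 64))
b = rng.normal(size=(N, 64))            # independent of a: true MI is 0
print(np.log(N) - infonce_batch(a, b))  # well below log(N)
print(np.log(N) - infonce_batch(a, a))  # close to log(N) = log(512)
```

Since the loss is a cross-entropy and hence non-negative, the estimate can never exceed $\log N$: certifying high mutual information requires many negatives, which is the formal version of "larger batches help".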

```python
import torch
import torch.nn.functional as F

# ── SimCLR-style InfoNCE (in-batch negatives) ────────────────────
def infonce_loss(z1, z2, temperature=0.07):
    """
    z1, z2: (B, D) embeddings of two augmentations (L2-normalised below).
    Returns scalar loss.
    """
    z1 = F.normalize(z1, dim=-1)  # (B, D)
    z2 = F.normalize(z2, dim=-1)  # (B, D)
    # All pairwise cosine similarities, scaled by temperature.
    # Each row i should match column i (the positive pair).
    logits = z1 @ z2.T / temperature  # (B, B)
    # Labels: sample i's positive is at index i
    labels = torch.arange(z1.size(0), device=z1.device)  # (B,)
    loss = F.cross_entropy(logits, labels)  # scalar
    return loss
```
```python
# ── CLIP-style (symmetric) ──────────────────────────────────────
# CLIP averages the image→text and text→image directions:
def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalised embeddings."""
    logits = image_emb @ text_emb.T / temperature        # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
```
The same loss in plain NumPy, for reference:

```python
import numpy as np

def infonce_loss(z1, z2, temperature=0.07):
    """
    InfoNCE / NT-Xent loss for a batch of positive pairs.
    Equivalent to SimCLR's contrastive loss.
    z1: (B, D) embeddings (view 1), L2-normalised below
    z2: (B, D) embeddings (view 2), L2-normalised below
    """
    B = z1.shape[0]
    # L2 normalise
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # (B, D)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # (B, D)
    # Cosine similarity matrix, scaled by temperature
    logits = z1 @ z2.T / temperature  # (B, B)
    # Numerically stable log-softmax along rows
    shifted = logits - logits.max(axis=1, keepdims=True)           # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))              # (B,)
    log_probs = shifted[np.arange(B), np.arange(B)] - log_sum_exp  # (B,)
    return -log_probs.mean()
```
  • SimCLR (see contrastive-self-supervising/): InfoNCE with in-batch negatives, requires large batch sizes (4096+) to provide enough negatives
  • CLIP (OpenAI): symmetric InfoNCE between image and text embeddings — the loss that enables zero-shot image classification
  • MoCo (see contrastive-self-supervising/): InfoNCE with a momentum-updated queue of negatives, decoupling batch size from negative count
  • Audio-visual learning (AudioCLIP, ImageBind): InfoNCE across modalities — aligning audio, image, and text in a shared embedding space
  • Dense retrieval (DPR, ColBERT): InfoNCE between query and document embeddings for search
| Alternative | When to use | Tradeoff |
|---|---|---|
| Triplet loss | Small number of negatives, fine-grained retrieval | Only considers one negative at a time; less efficient use of the batch |
| Contrastive loss (pairwise) | Binary same/different pairs, simple setup | No temperature, no softmax, just a margin on pairs. Less effective with many negatives |
| BYOL / non-contrastive | Want to avoid needing large batches of negatives | No negatives at all; uses EMA and a prediction head instead. Risk of collapse without careful design |
| SupCon (supervised contrastive) | Have labels, want to leverage them | Sums over multiple positives per class; generally outperforms InfoNCE when labels are available |
| VICReg | Want explicit control over representation properties | Variance/invariance/covariance terms replace the implicit pressure from negatives. No temperature to tune |

InfoNCE was introduced by van den Oord et al. (2018) in "Representation Learning with Contrastive Predictive Coding" (CPC). The name combines "Info" (the loss lower-bounds the mutual information between the anchor and positive) with "NCE" for noise-contrastive estimation (Gutmann & Hyvärinen, 2010), which provided the theoretical grounding.

The loss became dominant in self-supervised learning through SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), which showed that InfoNCE with the right data augmentations could match supervised pretraining on ImageNet. CLIP (Radford et al., 2021) extended it to multimodal learning, applying InfoNCE between image and text to create the most widely-used vision-language model. The key practical discovery was the importance of temperature — both SimCLR and CLIP found that a very low temperature (0.07) was critical for learning good representations.