Cosine Annealing

Decaying the learning rate following a cosine curve from a maximum to a minimum over $T$ steps. Provides a smooth, gradual decay that spends more time at moderate learning rates than linear or step schedules. Often combined with warm restarts (SGDR) for cyclic training.

Think of training as exploring a loss landscape. Early on, you want a high learning rate to cross ridges and escape bad basins. Late in training, you want a low learning rate to settle into a sharp minimum. The question is: how do you transition?

Step decay (divide the LR by 10 at epochs 30, 60, and 90) creates jarring transitions — the model is suddenly learning 10x slower. Linear decay wastes time at very low learning rates where progress is negligible. Cosine annealing is the sweet spot: it decays slowly at first (spending time near the peak where learning is fast), accelerates through the middle, then slows again near zero (giving the model a long, gentle landing).

The cosine shape is not derived from any optimality principle — it just works well empirically. The key property is that it’s smooth and concave in the first half: you stay near the high learning rate longer than linear decay would, squeezing more useful optimization out of those steps.
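To make the concavity concrete: at any step in the first half of training, the cosine schedule keeps the learning rate above what linear decay would give at the same step. A minimal numerical sketch (the step count and peak LR here are illustrative, not from the text):

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    # standard cosine decay from lr_max to lr_min over T steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def linear_lr(t, T, lr_max, lr_min=0.0):
    # straight-line decay from lr_max to lr_min over T steps
    return lr_max - (lr_max - lr_min) * t / T

T, lr_max = 100, 3e-4
# At 25% of training, cosine is still at ~85% of peak; linear is at 75%.
print(cosine_lr(25, T, lr_max))   # ~2.56e-4
print(linear_lr(25, T, lr_max))   # 2.25e-4
```

Those extra high-LR steps in the first half are exactly where the "more useful optimization" claim comes from; in the second half the situation reverses and cosine drops below linear for the gentle landing.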

Warm restarts take this further: instead of one long cosine, use several shorter cosines back-to-back, snapping the learning rate back to the maximum each time. Each restart lets the optimizer escape a local minimum it may have settled into, then re-converge to a (potentially better) one.

Standard cosine annealing (step $t$, total steps $T$):

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

At $t = 0$: $\eta = \eta_{\max}$. At $t = T$: $\eta = \eta_{\min}$. The curve is symmetric around $t = T/2$.
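The symmetry follows from the identity $\cos(\tfrac{\pi}{2} \mp x) = \pm\sin x$: evaluating the schedule at mirrored points $t = T/2 \mp d$ gives

```latex
\eta\!\left(\tfrac{T}{2} - d\right) + \eta\!\left(\tfrac{T}{2} + d\right)
  = 2\eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})
    \left(2 + \sin\tfrac{\pi d}{T} - \sin\tfrac{\pi d}{T}\right)
  = \eta_{\max} + \eta_{\min}
```

so any two learning rates taken equally far before and after the midpoint average to $(\eta_{\max} + \eta_{\min})/2$.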

With warm restarts (SGDR, Loshchilov & Hutter, 2017) — restart period $T_i$ for the $i$-th cycle:

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi\, t_{\text{cur}}}{T_i}\right)$$

where $t_{\text{cur}}$ is the number of steps since the last restart. It is common to double the period each cycle: $T_{i+1} = 2T_i$.
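With period doubling, $n$ full cycles cover $T_0(2^n - 1)$ steps, so the restart points can be computed ahead of time. A small sketch (the step counts are illustrative):

```python
def restart_steps(t_0, t_mult, total_steps):
    """Return the global step numbers at which warm restarts occur."""
    restarts, t, t_i = [], 0, t_0
    while t + t_i < total_steps:
        t += t_i              # end of the current cycle = restart point
        restarts.append(t)
        t_i *= t_mult         # next cycle is t_mult times longer
    return restarts

# T_0 = 10k, doubling, 100k total steps → restarts at 10k, 30k, 70k
print(restart_steps(10_000, 2, 100_000))  # [10000, 30000, 70000]
```

Note that the final (truncated) cycle here runs for the last 30k steps; in practice people often choose $T_0$ and $T_{\text{mult}}$ so the last cycle ends exactly at the training budget.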

Common defaults: $\eta_{\min} = 0$ or $\eta_{\min} = 0.1\,\eta_{\max}$. LLaMA uses $\eta_{\min} = 0.1\,\eta_{\max}$.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# ── Standard cosine decay over total_steps ──────────────────────
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100000, eta_min=1e-5  # decay from 3e-4 → 1e-5
)

# ── With warm restarts ──────────────────────────────────────────
# T_0 = first cycle length, T_mult = cycle length multiplier
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10000, T_mult=2, eta_min=1e-5
)
# Cycle lengths: 10k, 20k, 40k, ...

# ── Warmup + cosine (the standard LLM recipe) ───────────────────
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-8 / 3e-4, total_iters=2000
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=98000, eta_min=3e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[2000]
)

# ── In training loop ────────────────────────────────────────────
for step, batch in enumerate(dataloader):
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # call AFTER optimizer.step()
```

```python
import numpy as np

def cosine_annealing_lr(step, lr_max, lr_min, total_steps):
    """
    Standard cosine decay from lr_max to lr_min over total_steps.

    step: current step (0-indexed)
    lr_max: peak learning rate
    lr_min: floor learning rate
    total_steps: total number of training steps
    """
    # Clamp to total_steps so the LR stays at lr_min past the end
    # instead of rising again
    t = min(step, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t / total_steps))

def cosine_warm_restarts_lr(step, lr_max, lr_min, t_0, t_mult=2):
    """
    Cosine annealing with warm restarts (SGDR).

    t_0: first cycle length in steps
    t_mult: multiply cycle length by this after each restart
    """
    t_cur = step
    t_i = t_0
    while t_cur >= t_i:          # find which cycle we're in
        t_cur -= t_i
        t_i = int(t_i * t_mult)  # next cycle is longer
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t_cur / t_i))

# Example: 100k steps, cosine from 3e-4 → 1e-5
lrs = [cosine_annealing_lr(t, 3e-4, 1e-5, 100000) for t in range(100000)]
# lrs[0] = 3e-4, lrs[50000] ≈ 1.55e-4, lrs[99999] ≈ 1e-5
```
  • LLM pre-training (GPT-3, LLaMA, Chinchilla): warmup + cosine is the universal schedule; Chinchilla specifically validated that cosine to near-zero is optimal
  • Vision transformers (ViT, DeiT): cosine annealing replaced step decay as the default for image classification
  • Diffusion model training (Stable Diffusion): long cosine schedules over millions of steps
  • Fine-tuning (LoRA, full fine-tune): cosine over a short training run keeps the final LR low for stable convergence
  • SGDR / snapshot ensembles: warm restarts produce multiple converged models (one per cycle) that can be ensembled cheaply
Alternative | When to use | Tradeoff
Step decay | Legacy CNN training (ResNet-style) | Simpler to implement; jarring LR drops can cause training instability
Linear decay | Short fine-tuning runs | Predictable, but spends too much time at very low LRs on long runs
Inverse square root | Original transformer schedule | Used in “Attention Is All You Need”; largely replaced by cosine
Constant LR | Quick experiments, RL | No schedule overhead; leaves performance on the table for long runs
One-cycle policy | Fast training (super-convergence) | Warmup + cosine decay in a single cycle tuned for max speed; needs careful LR range finding
WSD (warmup-stable-decay) | Continual pre-training (MiniCPM) | Warmup, hold constant for most of training, then rapid decay; allows extending training without knowing total steps upfront
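The WSD schedule in the last row has no single canonical form; a minimal sketch, assuming linear warmup, a constant hold, and a linear final decay (the phase lengths and decay shape are assumptions, not from the text):

```python
def wsd_lr(step, lr_max, warmup_steps, decay_start, decay_steps, lr_min=0.0):
    """Warmup-stable-decay: linear warmup, constant hold, linear decay."""
    if step < warmup_steps:            # warmup phase: ramp 0 → lr_max
        return lr_max * step / warmup_steps
    if step < decay_start:             # stable phase: hold at lr_max
        return lr_max
    # decay phase: linear ramp lr_max → lr_min over decay_steps
    frac = min((step - decay_start) / decay_steps, 1.0)
    return lr_max - (lr_max - lr_min) * frac

# e.g. 2k warmup, hold until step 90k, decay over the last 10k steps
print(wsd_lr(1_000, 3e-4, 2_000, 90_000, 10_000))   # 1.5e-4 (mid-warmup)
print(wsd_lr(50_000, 3e-4, 2_000, 90_000, 10_000))  # 3e-4  (stable)
print(wsd_lr(95_000, 3e-4, 2_000, 90_000, 10_000))  # 1.5e-4 (mid-decay)
```

The key property is visible in the signature: only `decay_start` needs to move to extend training, since the stable phase can be prolonged indefinitely before committing to the decay.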

Cosine annealing was introduced by Loshchilov and Hutter (2017, “SGDR: Stochastic Gradient Descent with Warm Restarts”), originally as part of the warm restarts scheme. The cosine shape was chosen as a smooth alternative to step decay, and the restarts were inspired by simulated annealing in combinatorial optimization.

The warm restarts idea fell somewhat out of favour for large-scale training (the restart can waste compute), but the bare cosine schedule became the dominant choice. Its adoption was cemented by GPT-2/GPT-3 and later by the Chinchilla scaling laws paper, which used cosine annealing in all experiments. Today, “warmup + cosine decay to ~10% of peak LR” is effectively the default recipe for any transformer-based training run.