Cosine Annealing

Decaying the learning rate following a cosine curve from a maximum to a minimum over $T$ steps. Provides a smooth, gradual decay that spends more time at moderate learning rates than linear or step schedules. Often combined with warm restarts (SGDR) for cyclic training.

Think of training as exploring a loss landscape. Early on, you want a high learning rate to cross ridges and escape bad basins. Late in training, you want a low learning rate to settle into a sharp minimum. The question is: how do you transition?

Step decay (divide the LR by 10 at epochs 30, 60, and 90) creates jarring transitions — the model is suddenly learning 10x slower. Linear decay wastes time at very low learning rates where progress is negligible. Cosine annealing is the sweet spot: it decays slowly at first (spending time near the peak where learning is fast), accelerates through the middle, then slows again near zero (giving the model a long, gentle landing).

The cosine shape is not derived from any optimality principle — it just works well empirically. The key property is that it’s smooth and concave in the first half: you stay near the high learning rate longer than linear decay would, squeezing more useful optimization out of those steps.
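To make the concavity concrete: at any step in the first half of training, the cosine schedule keeps the learning rate above what linear decay would give at the same step. A minimal numerical sketch (the step count and peak LR here are illustrative, not from the text):

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    # standard cosine decay from lr_max to lr_min over T steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def linear_lr(t, T, lr_max, lr_min=0.0):
    # straight-line decay from lr_max to lr_min over T steps
    return lr_max - (lr_max - lr_min) * t / T

T, lr_max = 100, 3e-4
# At 25% of training, cosine is still at ~85% of peak; linear is at 75%.
print(cosine_lr(25, T, lr_max))   # ~2.56e-4
print(linear_lr(25, T, lr_max))   # 2.25e-4
```

Those extra high-LR steps in the first half are exactly where the "more useful optimization" claim comes from; in the second half the situation reverses and cosine drops below linear for the gentle landing.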

Warm restarts take this further: instead of one long cosine, use several shorter cosines back-to-back, snapping the learning rate back to the maximum each time. Each restart lets the optimizer escape a local minimum it may have settled into, then re-converge to a (potentially better) one.

Standard cosine annealing (step $t$, total steps $T$):

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$

At $t = 0$: $\eta = \eta_{\max}$. At $t = T$: $\eta = \eta_{\min}$. The curve is symmetric around $t = T/2$.
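The symmetry follows from the identity $\cos(\tfrac{\pi}{2} \mp x) = \pm\sin x$: evaluating the schedule at mirrored points $t = T/2 \mp d$ gives

```latex
\eta\!\left(\tfrac{T}{2} - d\right) + \eta\!\left(\tfrac{T}{2} + d\right)
  = 2\eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})
    \left(2 + \sin\tfrac{\pi d}{T} - \sin\tfrac{\pi d}{T}\right)
  = \eta_{\max} + \eta_{\min}
```

so any two learning rates taken equally far before and after the midpoint average to $(\eta_{\max} + \eta_{\min})/2$.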

With warm restarts (SGDR, Loshchilov & Hutter, 2017) — restart period $T_i$ for the $i$-th cycle:

$$\eta(t) = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi\, t_{\text{cur}}}{T_i}\right)$$

where $t_{\text{cur}}$ is the number of steps since the last restart. It is common to double the period each cycle: $T_{i+1} = 2T_i$.
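With period doubling, $n$ full cycles cover $T_0(2^n - 1)$ steps, so the restart points can be computed ahead of time. A small sketch (the step counts are illustrative):

```python
def restart_steps(t_0, t_mult, total_steps):
    """Return the global step numbers at which warm restarts occur."""
    restarts, t, t_i = [], 0, t_0
    while t + t_i < total_steps:
        t += t_i              # end of the current cycle = restart point
        restarts.append(t)
        t_i *= t_mult         # next cycle is t_mult times longer
    return restarts

# T_0 = 10k, doubling, 100k total steps → restarts at 10k, 30k, 70k
print(restart_steps(10_000, 2, 100_000))  # [10000, 30000, 70000]
```

Note that the final (truncated) cycle here runs for the last 30k steps; in practice people often choose $T_0$ and $T_{\text{mult}}$ so the last cycle ends exactly at the training budget.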

Common defaults: $\eta_{\min} = 0$ or $\eta_{\min} = 0.1\,\eta_{\max}$. LLaMA uses $\eta_{\min} = 0.1\,\eta_{\max}$.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# ── Standard cosine decay over total_steps ──────────────────────
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100000, eta_min=1e-5  # decay from 3e-4 → 1e-5
)

# ── With warm restarts ──────────────────────────────────────────
# T_0 = first cycle length, T_mult = cycle length multiplier
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10000, T_mult=2, eta_min=1e-5
)
# Cycle lengths: 10k, 20k, 40k, ...

# ── Warmup + cosine (the standard LLM recipe) ───────────────────
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-8 / 3e-4, total_iters=2000
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=98000, eta_min=3e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[2000]
)

# ── In training loop ────────────────────────────────────────────
for step, batch in enumerate(dataloader):
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # call AFTER optimizer.step()
```

```python
import numpy as np

def cosine_annealing_lr(step, lr_max, lr_min, total_steps):
    """
    Standard cosine decay from lr_max to lr_min over total_steps.

    step: current step (0-indexed)
    lr_max: peak learning rate
    lr_min: floor learning rate
    total_steps: total number of training steps
    """
    # Clamp to total_steps so the LR stays at lr_min past the end
    # instead of rising again
    t = min(step, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t / total_steps))

def cosine_warm_restarts_lr(step, lr_max, lr_min, t_0, t_mult=2):
    """
    Cosine annealing with warm restarts (SGDR).

    t_0: first cycle length in steps
    t_mult: multiply cycle length by this after each restart
    """
    t_cur = step
    t_i = t_0
    while t_cur >= t_i:          # find which cycle we're in
        t_cur -= t_i
        t_i = int(t_i * t_mult)  # next cycle is longer
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t_cur / t_i))

# Example: 100k steps, cosine from 3e-4 → 1e-5
lrs = [cosine_annealing_lr(t, 3e-4, 1e-5, 100000) for t in range(100000)]
# lrs[0] = 3e-4, lrs[50000] ≈ 1.55e-4, lrs[99999] ≈ 1e-5
```
  • LLM pre-training (GPT-3, LLaMA, Chinchilla): warmup + cosine is the universal schedule; Chinchilla specifically validated that cosine to near-zero is optimal
  • Vision transformers (ViT, DeiT): cosine annealing replaced step decay as the default for image classification
  • Diffusion model training (Stable Diffusion): long cosine schedules over millions of steps
  • Fine-tuning (LoRA, full fine-tune): cosine over a short training run keeps the final LR low for stable convergence
  • SGDR / snapshot ensembles: warm restarts produce multiple converged models (one per cycle) that can be ensembled cheaply
Alternative | When to use | Tradeoff
Step decay | Legacy CNN training (ResNet-style) | Simpler to implement; jarring LR drops can cause training instability
Linear decay | Short fine-tuning runs | Predictable, but spends too much time at very low LRs on long runs
Inverse square root | Original transformer schedule | Used in “Attention Is All You Need”; largely replaced by cosine
Constant LR | Quick experiments, RL | No schedule overhead; leaves performance on the table for long runs
One-cycle policy | Fast training (super-convergence) | Warmup + cosine decay in a single cycle tuned for max speed; needs careful LR range finding
WSD (warmup-stable-decay) | Continual pre-training (MiniCPM) | Warmup, hold constant for most of training, then rapid decay; allows extending training without knowing total steps upfront
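The WSD schedule in the last row has no single canonical form; a minimal sketch, assuming linear warmup, a constant hold, and a linear final decay (the phase lengths and decay shape are assumptions, not from the text):

```python
def wsd_lr(step, lr_max, warmup_steps, decay_start, decay_steps, lr_min=0.0):
    """Warmup-stable-decay: linear warmup, constant hold, linear decay."""
    if step < warmup_steps:            # warmup phase: ramp 0 → lr_max
        return lr_max * step / warmup_steps
    if step < decay_start:             # stable phase: hold at lr_max
        return lr_max
    # decay phase: linear ramp lr_max → lr_min over decay_steps
    frac = min((step - decay_start) / decay_steps, 1.0)
    return lr_max - (lr_max - lr_min) * frac

# e.g. 2k warmup, hold until step 90k, decay over the last 10k steps
print(wsd_lr(1_000, 3e-4, 2_000, 90_000, 10_000))   # 1.5e-4 (mid-warmup)
print(wsd_lr(50_000, 3e-4, 2_000, 90_000, 10_000))  # 3e-4  (stable)
print(wsd_lr(95_000, 3e-4, 2_000, 90_000, 10_000))  # 1.5e-4 (mid-decay)
```

The key property is visible in the signature: only `decay_start` needs to move to extend training, since the stable phase can be prolonged indefinitely before committing to the decay.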

Cosine annealing was introduced by Loshchilov and Hutter (2017, “SGDR: Stochastic Gradient Descent with Warm Restarts”), originally as part of the warm restarts scheme. The cosine shape was chosen as a smooth alternative to step decay, and the restarts were inspired by simulated annealing in combinatorial optimization.

The warm restarts idea fell somewhat out of favour for large-scale training (the restart can waste compute), but the bare cosine schedule became the dominant choice. Its adoption was cemented by GPT-2/GPT-3 and later by the Chinchilla scaling laws paper, which used cosine annealing in all experiments. Today, “warmup + cosine decay to ~10% of peak LR” is effectively the default recipe for any transformer-based training run.