# Positional Encoding

Injecting position information into transformer inputs. Self-attention is permutation-invariant: without positional encoding, “the cat sat on the mat” and “mat the on sat cat the” produce identical outputs. The three main approaches are sinusoidal (original transformer), learned (BERT, GPT-2), and rotary/RoPE (the modern standard, used in LLaMA, Mistral, and Gemma).
## Intuition

Self-attention computes dot products between all pairs of tokens. Dot products don’t care about order: swapping two tokens in the input just swaps the corresponding rows in the output. But language (and most sequences) is fundamentally ordered: “dog bites man” and “man bites dog” mean different things. Positional encoding solves this by adding position-dependent information to each token before attention sees it.
Sinusoidal encoding uses fixed sine/cosine waves at different frequencies — think of it like a clock with many hands spinning at different speeds. Position 0 has one pattern of hand positions, position 1 has a slightly different pattern, and so on. The key property is that the encoding of any position can be expressed as a linear transformation of any other position’s encoding, which in theory lets attention learn relative position patterns.
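This linearity is easy to verify numerically: for each sin/cos pair, advancing by a fixed offset $k$ is exactly a 2×2 rotation that does not depend on the starting position. A minimal NumPy check (standalone sketch, separate from the implementations below):

```python
import numpy as np

d, k = 8, 5                              # embedding dim, fixed offset
dims = np.arange(0, d, 2)
freqs = 1.0 / (10000 ** (dims / d))      # one frequency per sin/cos pair

def pe(pos):
    """Sinusoidal encoding of a single position, shape (d,)."""
    out = np.zeros(d)
    out[0::2] = np.sin(pos * freqs)
    out[1::2] = np.cos(pos * freqs)
    return out

# Build the block-diagonal matrix that maps PE(pos) -> PE(pos + k).
# Each 2x2 block rotates one sin/cos pair by angle k * freq.
M = np.zeros((d, d))
for i, f in enumerate(freqs):
    a = k * f
    M[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), np.sin(a)],
                               [-np.sin(a), np.cos(a)]]

for pos in [0, 3, 17]:
    assert np.allclose(M @ pe(pos), pe(pos + k))  # same M for every pos
```

The same matrix `M` works at every position, which is the property the original authors hoped attention could exploit.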
RoPE (Rotary Position Embeddings) takes a fundamentally different approach: instead of adding position information to the token embeddings, it rotates the query and key vectors by an angle proportional to their position. When you take the dot product of a rotated query with a rotated key, the result depends only on the relative position between them — not on the absolute positions. This is elegant because relative position is what usually matters (“the word two positions back”) and it naturally extends to longer sequences than seen during training.
Sinusoidal (Vaswani et al., 2017):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$

where $pos$ is the position index and $i$ is the dimension-pair index. Each dimension oscillates at a different frequency, from $1$ down to $1/10000$.
Learned: Simply a trainable matrix $E \in \mathbb{R}^{T_{\max} \times d}$ where row $pos$ is looked up and added to the token embedding at position $pos$.
RoPE (Su et al., 2021):

For each pair of dimensions $(2i, 2i+1)$, rotate by angle $m\theta_i$:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

where $m$ is the position and $\theta_i = 10000^{-2i/d}$. The key property: $\langle \mathrm{RoPE}(q, m), \mathrm{RoPE}(k, n) \rangle$ depends only on $q$, $k$, and $m - n$ (relative position).
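This property can be checked directly: rotating a query at position $m$ and a key at position $n$ gives the same dot product as rotating them at $m+s$ and $n+s$ for any shift $s$. A small NumPy sketch using a single 2-D dimension pair for clarity (not the full implementation below):

```python
import numpy as np

def rope_2d(x, pos, theta_i=1.0):
    """Rotate a 2-D vector by angle pos * theta_i (one RoPE dimension pair)."""
    a = pos * theta_i
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ x

q = np.array([0.3, -1.2])
k = np.array([0.7,  0.5])

m, n, shift = 10, 4, 100
score         = rope_2d(q, m) @ rope_2d(k, n)
score_shifted = rope_2d(q, m + shift) @ rope_2d(k, n + shift)
assert np.isclose(score, score_shifted)          # depends only on m - n
assert np.isclose(score, rope_2d(q, m - n) @ k)  # equivalent relative form
```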
```python
import torch
import torch.nn as nn

# ── Sinusoidal (fixed, no parameters) ───────────────────────────
def sinusoidal_encoding(T: int, d: int) -> torch.Tensor:
    pos = torch.arange(T).unsqueeze(1).float()   # (T, 1)
    dim = torch.arange(0, d, 2).float()          # (d/2,)
    freq = 1.0 / (10000 ** (dim / d))            # (d/2,)
    pe = torch.zeros(T, d)                       # (T, d)
    pe[:, 0::2] = torch.sin(pos * freq)          # even dims
    pe[:, 1::2] = torch.cos(pos * freq)          # odd dims
    return pe
# Add to token embeddings: x = x + pe[:T]

# ── Learned (trainable) ─────────────────────────────────────────
max_seq_len, d_model = 1024, 512                 # example sizes
pos_embed = nn.Embedding(max_seq_len, d_model)
# Usage: x = token_embed(ids) + pos_embed(positions)
# where positions = torch.arange(T)  # (T,)
# LIMITATION: cannot handle positions > max_seq_len at inference.

# ── RoPE (applied to Q and K, NOT V) ────────────────────────────
def apply_rope(x, freqs_cis):
    """x: (B, T, n_heads, d_head), freqs_cis: (T, d_head/2) complex"""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # (T, 1, d_head/2) broadcasts over batch and heads
    x_rotated = x_complex * freqs_cis.unsqueeze(1)
    return torch.view_as_real(x_rotated).reshape(x.shape).type_as(x)

def precompute_rope_freqs(d_head: int, max_T: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(max_T)
    angles = torch.outer(t, freqs)                       # (T, d_head/2)
    return torch.polar(torch.ones_like(angles), angles)  # (T, d_head/2) complex

# Apply to Q and K only — V has no position information.
# NEVER apply RoPE to V. Position is encoded via Q-K interaction.
```

## Manual Implementation
```python
import numpy as np

def sinusoidal_encoding_np(T, d):
    """Equivalent to the sinusoidal encoding above. Returns (T, d)."""
    pos = np.arange(T)[:, None]           # (T, 1)
    dim = np.arange(0, d, 2)[None, :]     # (1, d/2)
    freq = 1.0 / (10000 ** (dim / d))     # (1, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(pos * freq)      # (T, d/2)
    pe[:, 1::2] = np.cos(pos * freq)      # (T, d/2)
    return pe

def apply_rope_np(q, k, theta=10000.0):
    """
    Apply RoPE to query and key vectors.
    q, k: (B, T, d) where d must be even.
    Returns: rotated q, k of same shape.
    """
    B, T, d = q.shape
    positions = np.arange(T)[:, None]             # (T, 1)
    dims = np.arange(0, d, 2)[None, :]            # (1, d/2)
    angles = positions / (theta ** (dims / d))    # (T, d/2)

    cos_a = np.cos(angles)   # (T, d/2)
    sin_a = np.sin(angles)   # (T, d/2)

    def rotate(x):
        x1 = x[:, :, 0::2]   # (B, T, d/2) even dims
        x2 = x[:, :, 1::2]   # (B, T, d/2) odd dims
        out = np.empty_like(x)
        out[:, :, 0::2] = x1 * cos_a - x2 * sin_a
        out[:, :, 1::2] = x1 * sin_a + x2 * cos_a
        return out

    return rotate(q), rotate(k)
```

## Popular Uses
- Original transformer (Vaswani et al., 2017): sinusoidal encoding added to input embeddings (see `transformer/`)
- BERT, GPT-2: learned positional embeddings up to 512 / 1024 tokens
- LLaMA, Mistral, Gemma, Qwen (modern LLMs): RoPE applied to Q and K in every attention layer; the modern standard for autoregressive models
- Vision Transformers (ViT): learned 2D positional embeddings for image patches
- Long-context models (LLaMA 3 128K, Gemini 1M): RoPE with adjusted base frequency (NTK-aware scaling or YaRN) for length extrapolation
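NTK-aware scaling amounts to replacing RoPE's base $\theta$ with a larger one so the low frequencies stretch to cover the longer context. A rough sketch under the commonly used formulation (the exponent $d/(d-2)$ is the standard NTK-aware choice; function name and defaults are illustrative):

```python
import numpy as np

def ntk_scaled_freqs(d_head, scale, theta=10000.0):
    """RoPE frequencies with NTK-aware base scaling.

    scale = target_context / trained_context (e.g. 4.0 for 8K -> 32K).
    """
    theta_scaled = theta * scale ** (d_head / (d_head - 2))
    dims = np.arange(0, d_head, 2)
    return 1.0 / (theta_scaled ** (dims / d_head))

base = ntk_scaled_freqs(64, scale=1.0)   # unscaled frequencies
ext  = ntk_scaled_freqs(64, scale=4.0)   # 4x context extension
# Larger base => lower frequencies => slower rotation, so positions
# remain distinguishable over a longer range without retraining.
```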
## Alternatives

| Alternative | When to use | Tradeoff |
|---|---|---|
| Sinusoidal (fixed) | Simple baselines, no extra parameters | No learning; works well but slightly worse than learned/RoPE in practice |
| Learned absolute | Fixed-length tasks (BERT-style) | Simple; cannot extrapolate beyond training length |
| RoPE (rotary) | Autoregressive LLMs (the modern default) | Elegant relative position; requires careful frequency scaling for long contexts |
| ALiBi (Press et al., 2022) | Length extrapolation without fine-tuning | Adds linear bias to attention scores; simpler than RoPE but less expressive |
| Relative position bias (T5, Transformer-XL) | When you want explicit learned relative offsets | Adds a learned bias matrix indexed by relative position; more parameters |
| No positional encoding | When position doesn’t matter (set-based inputs) | Works for tasks like point cloud processing where input order is irrelevant |
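For comparison with RoPE, ALiBi's mechanism fits in a few lines: it skips positional embeddings entirely and subtracts a head-specific slope times the query-key distance from each attention score. A minimal sketch (the geometric slope schedule $2^{-8i/n}$ follows the paper; names are illustrative):

```python
import numpy as np

def alibi_bias(n_heads, T):
    """ALiBi bias added to attention scores, shape (n_heads, T, T)."""
    # Geometric slopes: head i gets 2^(-8 * i / n_heads), i = 1..n_heads
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # Signed offset of each key j from each query i (negative for past keys)
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]   # j - i
    dist = np.minimum(dist, 0)          # future positions masked elsewhere
    return slopes[:, None, None] * dist[None, :, :]        # (H, T, T), <= 0

bias = alibi_bias(n_heads=8, T=5)
# scores = q @ k.T / sqrt(d) + bias[h]   # then softmax as usual
```

Distant keys get a larger penalty, and each head decays at a different rate, which is what lets ALiBi extrapolate to unseen lengths.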
## Historical Context

Positional encoding was introduced as part of the original transformer (Vaswani et al., 2017). The sinusoidal formulation was chosen because the authors theorised it would allow the model to learn relative positions through linear projections, and because it required no additional parameters. Learned embeddings (BERT, 2018; GPT-2, 2019) quickly became popular as they matched or exceeded sinusoidal performance on fixed-length tasks.
The major innovation was RoPE (Su et al., 2021, “RoFormer”), which encoded position directly into the attention computation through rotation rather than as an additive input. RoPE’s adoption by LLaMA (2023) made it the de facto standard for modern LLMs. The challenge of extending RoPE to longer contexts than seen during training has spawned several techniques: NTK-aware scaling (adjusting the base frequency), YaRN (combining frequency scaling with attention scaling), and dynamic NTK, all of which modify RoPE’s frequency schedule to handle longer sequences without fine-tuning.