# Positional Encoding

Injecting position information into transformer inputs. Self-attention is permutation-invariant: without positional encoding, “the cat sat on the mat” and “mat the on sat cat the” produce identical outputs. The three main approaches are sinusoidal (original transformer), learned (BERT, GPT-2), and rotary/RoPE (the modern standard, used in LLaMA, Mistral, and Gemma).
## Intuition

Self-attention computes dot products between all pairs of tokens. Dot products don’t care about order: swapping two tokens in the input just swaps the corresponding rows in the output. But language (and most sequences) is fundamentally ordered: “dog bites man” and “man bites dog” mean different things. Positional encoding solves this by adding position-dependent information to each token before attention sees it.
Sinusoidal encoding uses fixed sine/cosine waves at different frequencies — think of it like a clock with many hands spinning at different speeds. Position 0 has one pattern of hand positions, position 1 has a slightly different pattern, and so on. The key property is that the encoding of any position can be expressed as a linear transformation of any other position’s encoding, which in theory lets attention learn relative position patterns.
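This linearity is easy to verify numerically: for each sin/cos pair, advancing by a fixed offset $k$ is exactly a 2×2 rotation that does not depend on the starting position. A minimal NumPy check (standalone sketch, separate from the implementations below):

```python
import numpy as np

d, k = 8, 5                              # embedding dim, fixed offset
dims = np.arange(0, d, 2)
freqs = 1.0 / (10000 ** (dims / d))      # one frequency per sin/cos pair

def pe(pos):
    """Sinusoidal encoding of a single position, shape (d,)."""
    out = np.zeros(d)
    out[0::2] = np.sin(pos * freqs)
    out[1::2] = np.cos(pos * freqs)
    return out

# Build the block-diagonal matrix that maps PE(pos) -> PE(pos + k).
# Each 2x2 block rotates one sin/cos pair by angle k * freq.
M = np.zeros((d, d))
for i, f in enumerate(freqs):
    a = k * f
    M[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(a), np.sin(a)],
                               [-np.sin(a), np.cos(a)]]

for pos in [0, 3, 17]:
    assert np.allclose(M @ pe(pos), pe(pos + k))  # same M for every pos
```

The same matrix `M` works at every position, which is the property the original authors hoped attention could exploit.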
RoPE (Rotary Position Embeddings) takes a fundamentally different approach: instead of adding position information to the token embeddings, it rotates the query and key vectors by an angle proportional to their position. When you take the dot product of a rotated query with a rotated key, the result depends only on the relative position between them — not on the absolute positions. This is elegant because relative position is what usually matters (“the word two positions back”) and it naturally extends to longer sequences than seen during training.
Sinusoidal (Vaswani et al., 2017):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$

where $pos$ is the position index and $i$ is the dimension-pair index. Each dimension oscillates at a different frequency, from $1$ down to $1/10000$.
Learned: Simply a trainable matrix $E \in \mathbb{R}^{T_{\max} \times d}$ where row $pos$ is looked up and added to the token embedding at position $pos$.
RoPE (Su et al., 2021):

For each pair of dimensions $(2i, 2i+1)$, rotate by angle $m\theta_i$:

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

where $m$ is the position and $\theta_i = 10000^{-2i/d}$. The key property: $\langle \mathrm{RoPE}(q, m), \mathrm{RoPE}(k, n) \rangle$ depends only on $q$, $k$, and $m - n$ (relative position).
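This property can be checked directly: rotating a query at position $m$ and a key at position $n$ gives the same dot product as rotating them at $m+s$ and $n+s$ for any shift $s$. A small NumPy sketch using a single 2-D dimension pair for clarity (not the full implementation below):

```python
import numpy as np

def rope_2d(x, pos, theta_i=1.0):
    """Rotate a 2-D vector by angle pos * theta_i (one RoPE dimension pair)."""
    a = pos * theta_i
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ x

q = np.array([0.3, -1.2])
k = np.array([0.7,  0.5])

m, n, shift = 10, 4, 100
score         = rope_2d(q, m) @ rope_2d(k, n)
score_shifted = rope_2d(q, m + shift) @ rope_2d(k, n + shift)
assert np.isclose(score, score_shifted)          # depends only on m - n
assert np.isclose(score, rope_2d(q, m - n) @ k)  # equivalent relative form
```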
```python
import torch
import torch.nn as nn

# ── Sinusoidal (fixed, no parameters) ───────────────────────────
def sinusoidal_encoding(T: int, d: int) -> torch.Tensor:
    pos = torch.arange(T).unsqueeze(1).float()   # (T, 1)
    dim = torch.arange(0, d, 2).float()          # (d/2,)
    freq = 1.0 / (10000 ** (dim / d))            # (d/2,)
    pe = torch.zeros(T, d)                       # (T, d)
    pe[:, 0::2] = torch.sin(pos * freq)          # even dims
    pe[:, 1::2] = torch.cos(pos * freq)          # odd dims
    return pe
# Add to token embeddings: x = x + pe[:T]

# ── Learned (trainable) ─────────────────────────────────────────
max_seq_len, d_model = 1024, 512                 # example sizes
pos_embed = nn.Embedding(max_seq_len, d_model)
# Usage: x = token_embed(ids) + pos_embed(positions)
# where positions = torch.arange(T)  # (T,)
# LIMITATION: cannot handle positions > max_seq_len at inference.

# ── RoPE (applied to Q and K, NOT V) ────────────────────────────
def apply_rope(x, freqs_cis):
    """x: (B, T, n_heads, d_head), freqs_cis: (T, d_head/2) complex"""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # (T, 1, d_head/2) broadcasts over batch and heads
    x_rotated = x_complex * freqs_cis.unsqueeze(1)
    return torch.view_as_real(x_rotated).reshape(x.shape).type_as(x)

def precompute_rope_freqs(d_head: int, max_T: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(max_T)
    angles = torch.outer(t, freqs)                       # (T, d_head/2)
    return torch.polar(torch.ones_like(angles), angles)  # (T, d_head/2) complex

# Apply to Q and K only — V has no position information.
# NEVER apply RoPE to V. Position is encoded via Q-K interaction.
```

## Manual Implementation
```python
import numpy as np

def sinusoidal_encoding_np(T, d):
    """Equivalent to the sinusoidal encoding above. Returns (T, d)."""
    pos = np.arange(T)[:, None]           # (T, 1)
    dim = np.arange(0, d, 2)[None, :]     # (1, d/2)
    freq = 1.0 / (10000 ** (dim / d))     # (1, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(pos * freq)      # (T, d/2)
    pe[:, 1::2] = np.cos(pos * freq)      # (T, d/2)
    return pe

def apply_rope_np(q, k, theta=10000.0):
    """
    Apply RoPE to query and key vectors.
    q, k: (B, T, d) where d must be even.
    Returns: rotated q, k of same shape.
    """
    B, T, d = q.shape
    positions = np.arange(T)[:, None]             # (T, 1)
    dims = np.arange(0, d, 2)[None, :]            # (1, d/2)
    angles = positions / (theta ** (dims / d))    # (T, d/2)

    cos_a = np.cos(angles)   # (T, d/2)
    sin_a = np.sin(angles)   # (T, d/2)

    def rotate(x):
        x1 = x[:, :, 0::2]   # (B, T, d/2) even dims
        x2 = x[:, :, 1::2]   # (B, T, d/2) odd dims
        out = np.empty_like(x)
        out[:, :, 0::2] = x1 * cos_a - x2 * sin_a
        out[:, :, 1::2] = x1 * sin_a + x2 * cos_a
        return out

    return rotate(q), rotate(k)
```

## Popular Uses
- Original transformer (Vaswani et al., 2017): sinusoidal encoding added to input embeddings (see `transformer/`)
- BERT, GPT-2: learned positional embeddings up to 512 / 1024 tokens
- LLaMA, Mistral, Gemma, Qwen (modern LLMs): RoPE applied to Q and K in every attention layer; the modern standard for autoregressive models
- Vision Transformers (ViT): learned 2D positional embeddings for image patches
- Long-context models (LLaMA 3 128K, Gemini 1M): RoPE with adjusted base frequency (NTK-aware scaling or YaRN) for length extrapolation
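NTK-aware scaling amounts to replacing RoPE's base $\theta$ with a larger one so the low frequencies stretch to cover the longer context. A rough sketch under the commonly used formulation (the exponent $d/(d-2)$ is the standard NTK-aware choice; function name and defaults are illustrative):

```python
import numpy as np

def ntk_scaled_freqs(d_head, scale, theta=10000.0):
    """RoPE frequencies with NTK-aware base scaling.

    scale = target_context / trained_context (e.g. 4.0 for 8K -> 32K).
    """
    theta_scaled = theta * scale ** (d_head / (d_head - 2))
    dims = np.arange(0, d_head, 2)
    return 1.0 / (theta_scaled ** (dims / d_head))

base = ntk_scaled_freqs(64, scale=1.0)   # unscaled frequencies
ext  = ntk_scaled_freqs(64, scale=4.0)   # 4x context extension
# Larger base => lower frequencies => slower rotation, so positions
# remain distinguishable over a longer range without retraining.
```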
## Alternatives

| Alternative | When to use | Tradeoff |
|---|---|---|
| Sinusoidal (fixed) | Simple baselines, no extra parameters | No learning; works well but slightly worse than learned/RoPE in practice |
| Learned absolute | Fixed-length tasks (BERT-style) | Simple; cannot extrapolate beyond training length |
| RoPE (rotary) | Autoregressive LLMs (the modern default) | Elegant relative position; requires careful frequency scaling for long contexts |
| ALiBi (Press et al., 2022) | Length extrapolation without fine-tuning | Adds linear bias to attention scores; simpler than RoPE but less expressive |
| Relative position bias (T5, Transformer-XL) | When you want explicit learned relative offsets | Adds a learned bias matrix indexed by relative position; more parameters |
| No positional encoding | When position doesn’t matter (set-based inputs) | Works for tasks like point cloud processing where input order is irrelevant |
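For comparison with RoPE, ALiBi's mechanism fits in a few lines: it skips positional embeddings entirely and subtracts a head-specific slope times the query-key distance from each attention score. A minimal sketch (the geometric slope schedule $2^{-8i/n}$ follows the paper; names are illustrative):

```python
import numpy as np

def alibi_bias(n_heads, T):
    """ALiBi bias added to attention scores, shape (n_heads, T, T)."""
    # Geometric slopes: head i gets 2^(-8 * i / n_heads), i = 1..n_heads
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # Signed offset of each key j from each query i (negative for past keys)
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]   # j - i
    dist = np.minimum(dist, 0)          # future positions masked elsewhere
    return slopes[:, None, None] * dist[None, :, :]        # (H, T, T), <= 0

bias = alibi_bias(n_heads=8, T=5)
# scores = q @ k.T / sqrt(d) + bias[h]   # then softmax as usual
```

Distant keys get a larger penalty, and each head decays at a different rate, which is what lets ALiBi extrapolate to unseen lengths.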
## Historical Context

Positional encoding was introduced as part of the original transformer (Vaswani et al., 2017). The sinusoidal formulation was chosen because the authors theorised it would allow the model to learn relative positions through linear projections, and because it required no additional parameters. Learned embeddings (BERT, 2018; GPT-2, 2019) quickly became popular as they matched or exceeded sinusoidal performance on fixed-length tasks.
The major innovation was RoPE (Su et al., 2021, “RoFormer”), which encoded position directly into the attention computation through rotation rather than as an additive input. RoPE’s adoption by LLaMA (2023) made it the de facto standard for modern LLMs. The challenge of extending RoPE to longer contexts than seen during training has spawned several techniques: NTK-aware scaling (adjusting the base frequency), YaRN (combining frequency scaling with attention scaling), and dynamic NTK, all of which modify RoPE’s frequency schedule to handle longer sequences without fine-tuning.