Embedding Layers

A lookup table mapping discrete tokens (integers) to dense vectors. The input layer for all language models and any neural network that processes categorical data. In language models, the embedding matrix is often tied with the output projection (weight tying), so the same matrix maps tokens to vectors and vectors back to tokens.

A neural network operates on continuous vectors, but language is made of discrete tokens — words, subwords, or characters represented as integers. An embedding layer is simply a matrix where row $i$ contains the vector for token $i$. “Looking up” token 42 means indexing into row 42 of this matrix. There is no multiplication, no activation — just a table lookup. The matrix is learned end-to-end through backpropagation like any other parameter.

The beauty is in what the network learns to put in these vectors. After training, similar tokens end up with similar vectors. The classic example is word2vec’s “king - man + woman = queen,” but modern embeddings capture far richer structure. In a trained GPT, the embedding for “Paris” is close to “London” and “Berlin” because they appear in similar contexts. The embedding is the network’s entire learned representation of what a token means.
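A minimal sketch of measuring that closeness with cosine similarity. The toy vocabulary and random initialisation here are purely illustrative — only after training would the pattern described above (Paris near London, far from unrelated words) actually appear:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy vocabulary; a real model would use a tokeniser's vocab.
vocab = {"paris": 0, "london": 1, "berlin": 2, "banana": 3}
embed = nn.Embedding(len(vocab), 16)  # randomly initialised here

def similarity(a: str, b: str) -> float:
    """Cosine similarity between the embedding rows of two tokens."""
    va = embed.weight[vocab[a]]
    vb = embed.weight[vocab[b]]
    return F.cosine_similarity(va, vb, dim=0).item()

# With random weights this is near 0; after training on text,
# similarity("paris", "london") would exceed similarity("paris", "banana").
print(similarity("paris", "london"))
```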

Weight tying is worth understanding. The output layer of a language model is a linear projection from hidden states to vocabulary logits — that’s a matrix multiply with a (d_model, vocab_size) matrix. If you reuse the embedding matrix (vocab_size, d_model) transposed as this output projection, you halve the parameters in the vocabulary layers (which can be hundreds of millions for large vocabularies) and force the model to use a single consistent representation for each token as both input and output.

Forward pass — pure indexing, no multiplication:

\text{embed}(x) = W_e[x] \quad \text{where } W_e \in \mathbb{R}^{V \times d},\; x \in \{0, 1, \ldots, V-1\}

For a sequence of tokens $x = [x_1, x_2, \ldots, x_T]$, the output is the matrix $[W_e[x_1]; W_e[x_2]; \ldots; W_e[x_T]] \in \mathbb{R}^{T \times d}$.

Gradient: The gradient with respect to $W_e$ is sparse — only the rows corresponding to tokens in the current batch receive nonzero gradients. This is why embedding layers benefit from special optimiser handling (e.g. sparse gradients).
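PyTorch exposes this directly — a sketch assuming `nn.Embedding(sparse=True)` paired with `torch.optim.SparseAdam`, which is designed for exactly this access pattern (dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

V, d = 1000, 8
embed = nn.Embedding(V, d, sparse=True)   # emit sparse gradients
opt = torch.optim.SparseAdam(embed.parameters())

ids = torch.tensor([3, 3, 7])             # only rows 3 and 7 are touched
loss = embed(ids).sum()
loss.backward()

assert embed.weight.grad.is_sparse        # gradient stored as a sparse tensor
dense = embed.weight.grad.to_dense()
# Row 3 appears twice, so its gradient accumulates twice; untouched rows stay zero.
assert dense[3].abs().sum() > 0 and dense[0].abs().sum() == 0
opt.step()                                # only touched rows are updated
```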

Weight tying (output projection):

\text{logits} = h \cdot W_e^T \quad \in \mathbb{R}^{V}

where $h \in \mathbb{R}^{d}$ is the final hidden state and $W_e^T$ reuses the transposed embedding matrix.

```python
import torch
import torch.nn as nn

# ── Basic embedding ─────────────────────────────────────────────
vocab_size = 32000
d_model = 768
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([0, 42, 1337, 5])   # (4,) — integer indices
vectors = embed(token_ids)                   # (4, d_model)

# In a transformer: batch of sequences
B, T = 8, 128                                # batch size, sequence length
ids = torch.randint(0, vocab_size, (B, T))   # (B, T)
x = embed(ids)                               # (B, T, d_model)

# ── With weight tying ───────────────────────────────────────────
class LM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: output projection shares embedding matrix
        self.head.weight = self.embed.weight  # same Parameter object

    def forward(self, ids):
        x = self.embed(ids)        # (B, T, d_model)
        # ... transformer layers ...
        logits = self.head(x)      # (B, T, vocab_size)
        return logits

# ── Padding token (ignore index 0 in loss) ──────────────────────
embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
# Token 0's embedding is initialised to zeros and NOT updated.
```
```python
import numpy as np

def embedding_forward(token_ids, weight):
    """
    Equivalent to nn.Embedding forward pass.
    token_ids: (B, T) integer token indices
    weight: (V, d) embedding matrix
    Returns: (B, T, d)
    """
    return weight[token_ids]  # pure integer indexing — that's it

def embedding_backward(token_ids, grad_output, vocab_size):
    """
    Gradient of embedding lookup. Only touched rows get gradients.
    token_ids: (B, T) indices
    grad_output: (B, T, d) upstream gradient
    Returns: (V, d) sparse gradient for the weight matrix
    """
    B, T, d = grad_output.shape
    grad_weight = np.zeros((vocab_size, d))
    # Scatter-add: accumulate gradients for each token
    for b in range(B):
        for t in range(T):
            grad_weight[token_ids[b, t]] += grad_output[b, t]
    return grad_weight
```
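A quick sanity check of the scatter-add logic against PyTorch autograd (self-contained sketch; the dimensions are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

# Cross-check: manual scatter-add vs. what autograd computes.
V, d, B, T = 50, 4, 2, 3
embed = nn.Embedding(V, d)
ids = torch.randint(0, V, (B, T))

out = embed(ids)                     # (B, T, d)
out.backward(torch.ones_like(out))   # upstream gradient of all ones

# Manual scatter-add of the same upstream gradient.
grad = np.zeros((V, d))
for b in range(B):
    for t in range(T):
        grad[ids[b, t]] += 1.0

assert np.allclose(embed.weight.grad.numpy(), grad)
```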
  • Language models (GPT, LLaMA, BERT): map subword tokens (BPE or SentencePiece) to dense vectors. Vocabulary sizes range from 32K (LLaMA) to 256K (Gemini)
  • Recommendation systems: embed user IDs and item IDs into a shared space, compute relevance via dot product
  • Reinforcement learning (see q-learning/): embed discrete actions or discrete state components (e.g. Atari game screens use CNNs, but board game states use embeddings for piece types)
  • Vision transformers (ViT): while image patches use linear projection (not strictly an embedding), the class token and position indices use learned embeddings
  • Contrastive learning (see contrastive-self-supervising/): CLIP embeds text tokens then pools to get sentence vectors; the embedding layer is the entry point
| Alternative | When to use | Tradeoff |
|---|---|---|
| One-hot encoding | Very small vocabulary, linear model | No learned representation; dimensionality equals vocab size (impractical for V > 1000) |
| Feature hashing | Huge or open vocabulary, memory-constrained | Fixed hash function maps tokens to buckets; collisions lose information but require no storage |
| Pre-trained embeddings (word2vec, GloVe) | Small dataset, transfer learning | Fixed representations from unsupervised pre-training; may not adapt to task-specific semantics |
| Character-level models | Morphologically rich languages, no tokeniser needed | Much longer sequences; harder to learn long-range dependencies |
| Continuous inputs (linear projection) | Data is already continuous (images, audio) | Not an embedding — directly projects features; no discrete lookup needed |
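For concreteness, the feature-hashing alternative can be sketched in a few lines. `hash_bucket` is a hypothetical helper and MD5 is just one choice of hash function — the point is that no vocabulary needs to be stored, at the cost of collisions:

```python
import hashlib

def hash_bucket(token: str, num_buckets: int = 1024) -> int:
    """Hashing trick: map any string to a fixed bucket, no vocabulary stored."""
    h = hashlib.md5(token.encode()).hexdigest()
    return int(h, 16) % num_buckets

# Known and never-seen tokens both get a bucket deterministically;
# two different tokens may collide into the same bucket.
print(hash_bucket("paris"), hash_bucket("never_seen_before_token"))
```

The bucket index can then feed an ordinary `nn.Embedding(num_buckets, d)` in place of a vocabulary-sized one.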

Word embeddings became a major focus after Bengio et al. (2003) introduced neural language models that learned distributed word representations. The field exploded with word2vec (Mikolov et al., 2013), which showed that simple models trained on massive corpora learned embeddings with remarkable algebraic properties (the famous “king - man + woman” analogy). GloVe (Pennington et al., 2014) provided a matrix factorisation perspective on the same idea.

Weight tying between input embeddings and output projections was proposed by Press & Wolf (2017, “Using the Output Embedding to Improve Language Models”) and independently by Inan et al. (2017). It became standard practice in transformers starting with the original paper (Vaswani et al., 2017) and remains the default in most modern LLMs. The practical significance is large: for a 100K-token vocabulary with d_model = 4096, the embedding matrix alone is 400M parameters — weight tying eliminates the duplicate.
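The arithmetic behind that figure:

```python
# Parameter savings from weight tying at the scale quoted above.
vocab_size, d_model = 100_000, 4096
embed_params = vocab_size * d_model   # one (V, d) matrix ≈ 410M parameters
untied = 2 * embed_params             # separate input and output matrices
tied = embed_params                   # single shared matrix
print(f"{embed_params:,} params per matrix; tying saves {untied - tied:,}")
```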