Embedding Layers

A lookup table mapping discrete tokens (integers) to dense vectors. The input layer for all language models and any neural network that processes categorical data. In language models, the embedding matrix is often tied with the output projection (weight tying), so the same matrix maps tokens to vectors and vectors back to tokens.

A neural network operates on continuous vectors, but language is made of discrete tokens — words, subwords, or characters represented as integers. An embedding layer is simply a matrix where row $i$ contains the vector for token $i$. “Looking up” token 42 means indexing into row 42 of this matrix. There is no multiplication, no activation — just a table lookup. The matrix is learned end-to-end through backpropagation like any other parameter.

The beauty is in what the network learns to put in these vectors. After training, similar tokens end up with similar vectors. The classic example is word2vec’s “king - man + woman = queen,” but modern embeddings capture far richer structure. In a trained GPT, the embedding for “Paris” is close to “London” and “Berlin” because they appear in similar contexts. The embedding is the network’s entire learned representation of what a token means.
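A minimal sketch of measuring that closeness with cosine similarity. The toy vocabulary and random initialisation here are purely illustrative — only after training would the pattern described above (Paris near London, far from unrelated words) actually appear:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy vocabulary; a real model would use a tokeniser's vocab.
vocab = {"paris": 0, "london": 1, "berlin": 2, "banana": 3}
embed = nn.Embedding(len(vocab), 16)  # randomly initialised here

def similarity(a: str, b: str) -> float:
    """Cosine similarity between the embedding rows of two tokens."""
    va = embed.weight[vocab[a]]
    vb = embed.weight[vocab[b]]
    return F.cosine_similarity(va, vb, dim=0).item()

# With random weights this is near 0; after training on text,
# similarity("paris", "london") would exceed similarity("paris", "banana").
print(similarity("paris", "london"))
```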

Weight tying is worth understanding. The output layer of a language model is a linear projection from hidden states to vocabulary logits — that’s a matrix multiply with a (d_model, vocab_size) matrix. If you reuse the embedding matrix (vocab_size, d_model) transposed as this output projection, you halve the parameters in the vocabulary layers (which can be hundreds of millions for large vocabularies) and force the model to use a single consistent representation for each token as both input and output.

Forward pass — pure indexing, no multiplication:

\text{embed}(x) = W_e[x] \quad \text{where } W_e \in \mathbb{R}^{V \times d},\; x \in \{0, 1, \ldots, V-1\}

For a sequence of tokens $x = [x_1, x_2, \ldots, x_T]$, the output is the matrix $[W_e[x_1]; W_e[x_2]; \ldots; W_e[x_T]] \in \mathbb{R}^{T \times d}$.

Gradient: The gradient with respect to $W_e$ is sparse — only the rows corresponding to tokens in the current batch receive nonzero gradients. This is why embedding layers benefit from special optimiser handling (e.g. sparse gradients).
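PyTorch exposes this directly — a sketch assuming `nn.Embedding(sparse=True)` paired with `torch.optim.SparseAdam`, which is designed for exactly this access pattern (dimensions here are arbitrary):

```python
import torch
import torch.nn as nn

V, d = 1000, 8
embed = nn.Embedding(V, d, sparse=True)   # emit sparse gradients
opt = torch.optim.SparseAdam(embed.parameters())

ids = torch.tensor([3, 3, 7])             # only rows 3 and 7 are touched
loss = embed(ids).sum()
loss.backward()

assert embed.weight.grad.is_sparse        # gradient stored as a sparse tensor
dense = embed.weight.grad.to_dense()
# Row 3 appears twice, so its gradient accumulates twice; untouched rows stay zero.
assert dense[3].abs().sum() > 0 and dense[0].abs().sum() == 0
opt.step()                                # only touched rows are updated
```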

Weight tying (output projection):

\text{logits} = h \cdot W_e^T \quad \in \mathbb{R}^{V}

where $h \in \mathbb{R}^{d}$ is the final hidden state and $W_e^T$ reuses the transposed embedding matrix.

```python
import torch
import torch.nn as nn

# ── Basic embedding ─────────────────────────────────────────────
vocab_size = 32000
d_model = 768
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([0, 42, 1337, 5])   # (4,) — integer indices
vectors = embed(token_ids)                   # (4, d_model)

# In a transformer: batch of sequences
B, T = 8, 128                                # batch size, sequence length
ids = torch.randint(0, vocab_size, (B, T))   # (B, T)
x = embed(ids)                               # (B, T, d_model)

# ── With weight tying ───────────────────────────────────────────
class LM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: output projection shares embedding matrix
        self.head.weight = self.embed.weight  # same Parameter object

    def forward(self, ids):
        x = self.embed(ids)        # (B, T, d_model)
        # ... transformer layers ...
        logits = self.head(x)      # (B, T, vocab_size)
        return logits

# ── Padding token (ignore index 0 in loss) ──────────────────────
embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
# Token 0's embedding is initialised to zeros and NOT updated.
```
```python
import numpy as np

def embedding_forward(token_ids, weight):
    """
    Equivalent to nn.Embedding forward pass.
    token_ids: (B, T) integer token indices
    weight: (V, d) embedding matrix
    Returns: (B, T, d)
    """
    return weight[token_ids]  # pure integer indexing — that's it

def embedding_backward(token_ids, grad_output, vocab_size):
    """
    Gradient of embedding lookup. Only touched rows get gradients.
    token_ids: (B, T) indices
    grad_output: (B, T, d) upstream gradient
    Returns: (V, d) sparse gradient for the weight matrix
    """
    B, T, d = grad_output.shape
    grad_weight = np.zeros((vocab_size, d))
    # Scatter-add: accumulate gradients for each token
    for b in range(B):
        for t in range(T):
            grad_weight[token_ids[b, t]] += grad_output[b, t]
    return grad_weight
```
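A quick sanity check of the scatter-add logic against PyTorch autograd (self-contained sketch; the dimensions are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

# Cross-check: manual scatter-add vs. what autograd computes.
V, d, B, T = 50, 4, 2, 3
embed = nn.Embedding(V, d)
ids = torch.randint(0, V, (B, T))

out = embed(ids)                     # (B, T, d)
out.backward(torch.ones_like(out))   # upstream gradient of all ones

# Manual scatter-add of the same upstream gradient.
grad = np.zeros((V, d))
for b in range(B):
    for t in range(T):
        grad[ids[b, t]] += 1.0

assert np.allclose(embed.weight.grad.numpy(), grad)
```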
  • Language models (GPT, LLaMA, BERT): map subword tokens (BPE or SentencePiece) to dense vectors. Vocabulary sizes range from 32K (LLaMA) to 256K (Gemini)
  • Recommendation systems: embed user IDs and item IDs into a shared space, compute relevance via dot product
  • Reinforcement learning (see q-learning/): embed discrete actions or discrete state components (e.g. Atari game screens use CNNs, but board game states use embeddings for piece types)
  • Vision transformers (ViT): while image patches use linear projection (not strictly an embedding), the class token and position indices use learned embeddings
  • Contrastive learning (see contrastive-self-supervising/): CLIP embeds text tokens then pools to get sentence vectors; the embedding layer is the entry point
| Alternative | When to use | Tradeoff |
|---|---|---|
| One-hot encoding | Very small vocabulary, linear model | No learned representation; dimensionality equals vocab size (impractical for V > 1000) |
| Feature hashing | Huge or open vocabulary, memory-constrained | Fixed hash function maps tokens to buckets; collisions lose information but require no storage |
| Pre-trained embeddings (word2vec, GloVe) | Small dataset, transfer learning | Fixed representations from unsupervised pre-training; may not adapt to task-specific semantics |
| Character-level models | Morphologically rich languages, no tokeniser needed | Much longer sequences; harder to learn long-range dependencies |
| Continuous inputs (linear projection) | Data is already continuous (images, audio) | Not an embedding — directly projects features; no discrete lookup needed |
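For concreteness, the feature-hashing alternative can be sketched in a few lines. `hash_bucket` is a hypothetical helper and MD5 is just one choice of hash function — the point is that no vocabulary needs to be stored, at the cost of collisions:

```python
import hashlib

def hash_bucket(token: str, num_buckets: int = 1024) -> int:
    """Hashing trick: map any string to a fixed bucket, no vocabulary stored."""
    h = hashlib.md5(token.encode()).hexdigest()
    return int(h, 16) % num_buckets

# Known and never-seen tokens both get a bucket deterministically;
# two different tokens may collide into the same bucket.
print(hash_bucket("paris"), hash_bucket("never_seen_before_token"))
```

The bucket index can then feed an ordinary `nn.Embedding(num_buckets, d)` in place of a vocabulary-sized one.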

Word embeddings became a major focus after Bengio et al. (2003) introduced neural language models that learned distributed word representations. The field exploded with word2vec (Mikolov et al., 2013), which showed that simple models trained on massive corpora learned embeddings with remarkable algebraic properties (the famous “king - man + woman” analogy). GloVe (Pennington et al., 2014) provided a matrix factorisation perspective on the same idea.

Weight tying between input embeddings and output projections was proposed by Press & Wolf (2017, “Using the Output Embedding to Improve Language Models”) and independently by Inan et al. (2017). It became standard practice in transformers starting with the original paper (Vaswani et al., 2017) and remains the default in most modern LLMs. The practical significance is large: for a 100K-token vocabulary with d_model = 4096, the embedding matrix alone is 400M parameters — weight tying eliminates the duplicate.
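The arithmetic behind that figure:

```python
# Parameter savings from weight tying at the scale quoted above.
vocab_size, d_model = 100_000, 4096
embed_params = vocab_size * d_model   # one (V, d) matrix ≈ 410M parameters
untied = 2 * embed_params             # separate input and output matrices
tied = embed_params                   # single shared matrix
print(f"{embed_params:,} params per matrix; tying saves {untied - tied:,}")
```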