Embedding Layers
A lookup table mapping discrete tokens (integers) to dense vectors. The input layer for all language models and any neural network that processes categorical data. In language models, the embedding matrix is often tied with the output projection (weight tying), so the same matrix maps tokens to vectors and vectors back to tokens.
Intuition
A neural network operates on continuous vectors, but language is made of discrete tokens — words, subwords, or characters represented as integers. An embedding layer is simply a matrix $E \in \mathbb{R}^{V \times d}$ ($V$ is the vocabulary size, $d$ the embedding dimension) where row $i$ contains the vector for token $i$. “Looking up” token 42 means indexing into row 42 of this matrix. There is no multiplication, no activation — just a table lookup. The matrix is learned end-to-end through backpropagation like any other parameter.
The beauty is in what the network learns to put in these vectors. After training, similar tokens end up with similar vectors. The classic example is word2vec’s “king - man + woman = queen,” but modern embeddings capture far richer structure. In a trained GPT, the embedding for “Paris” is close to “London” and “Berlin” because they appear in similar contexts. The embedding is the network’s entire learned representation of what a token means.
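To make “similar tokens end up with similar vectors” concrete, here is a minimal sketch that ranks a token's nearest neighbours by cosine similarity between rows of the embedding matrix. The helper `nearest_tokens` is made up for illustration; on a randomly initialised matrix the neighbours are meaningless, but run on a trained model's input embedding it surfaces the Paris/London/Berlin kind of structure described above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_tokens(embed_weight: torch.Tensor, token_id: int, k: int = 5):
    """Return ids of the k tokens whose embedding rows are closest (by cosine) to token_id's row."""
    query = embed_weight[token_id].unsqueeze(0)               # (1, d)
    sims = F.cosine_similarity(embed_weight, query, dim=-1)   # (V,)
    sims[token_id] = -1.0                                     # exclude the query token itself
    return sims.topk(k).indices

# Illustration only: a random matrix has no semantic structure; pass a trained
# model's (V, d) input embedding weight to see meaningful neighbours.
weight = torch.randn(32_000, 768)
print(nearest_tokens(weight, token_id=42))
```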
Weight tying is worth understanding. The output layer of a language model is a linear projection from hidden states to vocabulary logits — that’s a matrix multiply with a (d_model, vocab_size) matrix. If you reuse the embedding matrix (vocab_size, d_model) transposed as this output projection, you halve the parameters in the vocabulary layers (which can be hundreds of millions for large vocabularies) and force the model to use a single consistent representation for each token as both input and output.
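A detail that makes tying painless in practice: PyTorch's `nn.Linear(d_model, vocab_size)` already stores its weight as a `(vocab_size, d_model)` matrix (out_features first) and applies the transpose internally, so the output head's weight has exactly the same shape as the embedding table and can be tied by plain assignment. A minimal sketch of the shape check and the parameter saving, using the sizes from the code below:

```python
import torch.nn as nn

vocab_size, d_model = 32_000, 768
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)

# nn.Linear keeps its weight as (out_features, in_features) = (vocab_size, d_model),
# i.e. already "transposed", so it matches the embedding matrix and can be tied
# by assignment with no explicit .T.
assert head.weight.shape == embed.weight.shape == (vocab_size, d_model)

head.weight = embed.weight  # now a single shared (vocab_size, d_model) Parameter
saved = vocab_size * d_model
print(f"parameters saved by tying: {saved / 1e6:.1f}M")  # ~24.6M at this size
```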
Forward pass — pure indexing, no multiplication:
For a sequence of tokens $(t_1, \dots, t_T)$, the output is the matrix $X \in \mathbb{R}^{T \times d}$ whose row $j$ is $E_{t_j}$, i.e. row $t_j$ of the embedding matrix.

Gradient: The gradient with respect to $E$ is sparse — only the rows corresponding to tokens in the current batch receive nonzero gradients. This is why embedding layers need special optimiser handling (e.g. sparse gradients).
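One way to see (and exploit) this sparsity in PyTorch is to ask the embedding to return sparse gradients and pair it with an optimiser that accepts them. This is a sketch of that one option, not the only way to handle it; dense gradients with a regular optimiser also work, they just materialise a full (V, d) gradient every step.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(32_000, 768, sparse=True)   # gradients come back as sparse tensors
opt = torch.optim.SparseAdam(embed.parameters())

ids = torch.randint(0, 32_000, (8, 128))         # (B, T) batch of token ids
loss = embed(ids).sum()
loss.backward()

print(embed.weight.grad.is_sparse)  # True: only looked-up rows are stored, not all 32,000
opt.step()                          # updates only the touched rows
```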
Weight tying (output projection):
$$\text{logits} = h E^\top$$

where $h \in \mathbb{R}^{d}$ is the final hidden state and $E^\top$ reuses the transposed embedding matrix.
```python
import torch
import torch.nn as nn

# ── Basic embedding ─────────────────────────────────────────────
vocab_size = 32000
d_model = 768
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([0, 42, 1337, 5])  # (4,) — integer indices
vectors = embed(token_ids)                  # (4, d_model)

# In a transformer: batch of sequences
B, T = 8, 128                               # batch size and sequence length for the example
ids = torch.randint(0, vocab_size, (B, T))  # (B, T)
x = embed(ids)                              # (B, T, d_model)

# ── With weight tying ───────────────────────────────────────────
class LM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie weights: output projection shares embedding matrix
        self.head.weight = self.embed.weight  # same Parameter object

    def forward(self, ids):
        x = self.embed(ids)     # (B, T, d_model)
        # ... transformer layers ...
        logits = self.head(x)   # (B, T, vocab_size)
        return logits

# ── Padding token (ignore index 0 in loss) ──────────────────────
embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
# Token 0's embedding is initialised to zeros and NOT updated.
```

Manual Implementation
```python
import numpy as np

def embedding_forward(token_ids, weight):
    """
    Equivalent to nn.Embedding forward pass.
    token_ids: (B, T) integer token indices
    weight:    (V, d) embedding matrix
    Returns:   (B, T, d)
    """
    return weight[token_ids]  # pure integer indexing — that's it

def embedding_backward(token_ids, grad_output, vocab_size):
    """
    Gradient of embedding lookup. Only touched rows get gradients.
    token_ids:   (B, T) indices
    grad_output: (B, T, d) upstream gradient
    Returns:     (V, d) sparse gradient for the weight matrix
    """
    B, T, d = grad_output.shape
    grad_weight = np.zeros((vocab_size, d))
    # Scatter-add: accumulate gradients for each token
    for b in range(B):
        for t in range(T):
            grad_weight[token_ids[b, t]] += grad_output[b, t]
    return grad_weight
```
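The double loop above is written for clarity. The same scatter-add can be done in one call with `np.add.at`, which accumulates correctly when a token appears more than once in the batch (a plain `grad_weight[ids] += ...` would silently drop the duplicates). A sketch, reusing `embedding_backward` from above as the reference:

```python
import numpy as np

def embedding_backward_vectorised(token_ids, grad_output, vocab_size):
    """Same gradient as the loop version, via an unbuffered scatter-add."""
    d = grad_output.shape[-1]
    grad_weight = np.zeros((vocab_size, d))
    # np.add.at accumulates into repeated rows instead of overwriting them
    np.add.at(grad_weight, token_ids.reshape(-1), grad_output.reshape(-1, d))
    return grad_weight

# Quick consistency check against the loop implementation above
B, T, d, V = 2, 5, 8, 50
ids = np.random.randint(0, V, (B, T))
grad = np.random.randn(B, T, d)
assert np.allclose(embedding_backward(ids, grad, V),
                   embedding_backward_vectorised(ids, grad, V))
```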
Popular Uses

- Language models (GPT, LLaMA, BERT): map subword tokens (BPE or SentencePiece) to dense vectors. Vocabulary sizes range from 32K (LLaMA) to 256K (Gemini)
- Recommendation systems: embed user IDs and item IDs into a shared space, compute relevance via dot product (see the sketch after this list)
- Reinforcement learning (see q-learning/): embed discrete actions or discrete state components (e.g. Atari game screens use CNNs, but board game states use embeddings for piece types)
- Vision transformers (ViT): while image patches use linear projection (not strictly an embedding), the class token and position indices use learned embeddings
- Contrastive learning (see contrastive-self-supervising/): CLIP embeds text tokens then pools to get sentence vectors; the embedding layer is the entry point
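For the recommendation-system bullet, here is a minimal two-tower-style sketch (the class name `UserItemScorer` and the sizes are made up for illustration): user IDs and item IDs get separate embedding tables, and relevance is the dot product of the two looked-up vectors.

```python
import torch
import torch.nn as nn

class UserItemScorer(nn.Module):
    """Hypothetical minimal recommender: score = <user embedding, item embedding>."""
    def __init__(self, n_users, n_items, d=64):
        super().__init__()
        self.user_embed = nn.Embedding(n_users, d)
        self.item_embed = nn.Embedding(n_items, d)

    def forward(self, user_ids, item_ids):
        u = self.user_embed(user_ids)   # (B, d)
        v = self.item_embed(item_ids)   # (B, d)
        return (u * v).sum(dim=-1)      # (B,) relevance scores

scorer = UserItemScorer(n_users=10_000, n_items=50_000)
scores = scorer(torch.tensor([3, 3, 7]), torch.tensor([120, 98, 5]))
print(scores.shape)  # torch.Size([3])
```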
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| One-hot encoding | Very small vocabulary, linear model | No learned representation; dimensionality equals vocab size (impractical for V > 1000) |
| Feature hashing | Huge or open vocabulary, memory-constrained | Fixed hash function maps tokens to buckets (sketched after the table); collisions lose information, but no token-to-id dictionary needs to be stored |
| Pre-trained embeddings (word2vec, GloVe) | Small dataset, transfer learning | Fixed representations from unsupervised pre-training; may not adapt to task-specific semantics |
| Character-level models | Morphologically rich languages, no tokeniser needed | Much longer sequences; harder to learn long-range dependencies |
| Continuous inputs (linear projection) | Data is already continuous (images, audio) | Not an embedding — directly projects features; no discrete lookup needed |
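The feature-hashing row is the least familiar alternative, so here is a minimal sketch of the hashing trick, with a hypothetical `bucket` helper: arbitrary strings are hashed into a fixed number of embedding rows, so no token-to-id dictionary is stored, at the cost of unrelated tokens occasionally sharing a row.

```python
import hashlib
import torch
import torch.nn as nn

n_buckets, d_model = 2**16, 128
hashed_embed = nn.Embedding(n_buckets, d_model)

def bucket(token: str) -> int:
    """Map an arbitrary string to a fixed bucket id, with no vocabulary file."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

ids = torch.tensor([bucket(t) for t in ["embedding", "layer", "zyzzyva"]])
vectors = hashed_embed(ids)  # (3, d_model); colliding tokens share a row
```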
Historical Context
Word embeddings became a major focus after Bengio et al. (2003) introduced neural language models that learned distributed word representations. The field exploded with word2vec (Mikolov et al., 2013), which showed that simple models trained on massive corpora learned embeddings with remarkable algebraic properties (the famous “king - man + woman” analogy). GloVe (Pennington et al., 2014) provided a matrix factorisation perspective on the same idea.
Weight tying between input embeddings and output projections was proposed by Press & Wolf (2017, “Using the Output Embedding to Improve Language Models”) and independently by Inan et al. (2017). It became standard practice in transformers starting with the original paper (Vaswani et al., 2017) and remains the default in most modern LLMs. The practical significance is large: for a 100K-token vocabulary with d_model = 4096, the embedding matrix alone is 400M parameters — weight tying eliminates the duplicate.