Residual Connections

Adding the input of a block to its output: y = F(x) + x. Also called “skip connections.” The single most important architectural innovation for training deep networks — used in every modern architecture from ResNet to GPT to Stable Diffusion.

Without residual connections, a 100-layer network must learn the entire transformation from input to output as one long chain of matrix multiplications. Gradients must flow backwards through every layer, and they tend to either vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). By the time the gradient reaches early layers, it’s either negligibly small or catastrophically large.

Residual connections fix this by giving the gradient a highway. The addition y = F(x) + x means the gradient of y with respect to x is dF/dx + I, where I is the identity matrix. That “+ I” term means the gradient always has a direct path back to earlier layers, regardless of what F does. Even if dF/dx vanishes entirely, the gradient still flows through the identity branch. This is why you can train networks with hundreds or thousands of layers.
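You can watch this happen. The sketch below (not from the original; the depth, width, and the deliberate 0.1 weight scaling are arbitrary choices made to exaggerate the contrast) measures how much gradient reaches the first layer of a 50-layer stack, with and without skip connections:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, d = 50, 32

def first_layer_grad_norm(use_residual: bool) -> float:
    layers = nn.ModuleList([nn.Linear(d, d) for _ in range(depth)])
    for layer in layers:
        layer.weight.data.mul_(0.1)   # deliberately small weights: dF/dx << 1
    x = torch.randn(1, d)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out   # the only difference
    h.sum().backward()
    return layers[0].weight.grad.norm().item()

plain = first_layer_grad_norm(use_residual=False)
skip = first_layer_grad_norm(use_residual=True)
# Without the skip, the gradient is a product of ~50 small Jacobians and all
# but vanishes; with it, the identity branch delivers an O(1) gradient.
```

In the plain stack the first-layer gradient underflows toward zero; in the residual stack it stays at a usable magnitude.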

There’s a deeper insight: residual connections change what each layer learns. Instead of learning the full mapping H(x), each layer only needs to learn the residual F(x) = H(x) - x, i.e. the delta from the identity. If the optimal transformation is close to identity (which it often is in deep networks), learning a small residual is much easier than learning the full mapping from scratch. This is why the paper is called “Deep Residual Learning.”

Residual block:

$$y = F(x) + x$$

where $F$ is any parameterised function (one or more layers). The gradient flows as:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(\frac{\partial F}{\partial x} + I\right)$$

The $+\, I$ term guarantees gradient flow regardless of $\frac{\partial F}{\partial x}$.
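The identity $\partial y / \partial x = \partial F / \partial x + I$ can be checked numerically. A minimal sketch (the inner block $F$ and the dimensions here are arbitrary choices):

```python
import torch

torch.manual_seed(0)
d = 4
W = torch.randn(d, d)

def F(v):                         # an arbitrary differentiable block
    return torch.tanh(v @ W)

x = torch.randn(d)
J_block = torch.autograd.functional.jacobian(lambda v: F(v) + v, x)
J_F = torch.autograd.functional.jacobian(F, x)

# The residual block's Jacobian is dF/dx plus the identity matrix
assert torch.allclose(J_block, J_F + torch.eye(d), atol=1e-6)
```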

With a projection (when input and output dimensions differ):

$$y = F(x) + W_s x$$

where $W_s$ is a linear projection to match dimensions. In practice this is a 1×1 convolution (ResNet) or a linear layer (transformers).

Pre-norm variant (the modern standard, used in GPT, LLaMA):

$$y = x + F(\text{LayerNorm}(x))$$

Normalisation is applied before the block rather than after the addition, so the residual stream itself is never normalised. This is more stable for very deep networks: the skip path carries the signal through the whole stack without being repeatedly rescaled.
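The contrast with the original post-norm arrangement fits in a few lines. A sketch (the dimensions and the inner block are arbitrary choices, not from any particular model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
norm, block = nn.LayerNorm(d), nn.Linear(d, d)
x = torch.randn(2, d)

# Post-norm (original transformer): normalise AFTER the addition,
# so the skip path itself passes through LayerNorm at every layer.
y_post = norm(x + block(x))

# Pre-norm (GPT-2 onward): normalise only the block's input;
# the skip path x is carried through untouched.
y_pre = x + block(norm(x))
```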

```python
import torch
import torch.nn as nn

# ── Basic residual block ────────────────────────────────────────
class ResidualBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back
        )

    def forward(self, x):                  # x: (B, T, d_model)
        return x + self.mlp(self.norm(x))  # (B, T, d_model)

# That's it. The entire point is the "+ x".

# ── With dimension mismatch (e.g. ResNet downsampling) ──────────
class ResidualBlockWithProjection(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )
        # Project the skip path to match the output dimension
        self.shortcut = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x):                        # x: (B, in_dim)
        return self.block(x) + self.shortcut(x)  # (B, out_dim)
```
```python
import numpy as np

def residual_block_forward(x, W1, b1, W2, b2):
    """
    Pre-norm residual MLP block, numpy only.

    x:  (B, D) input
    W1: (D, 4D), b1: (4D,), W2: (4D, D), b2: (D,)
    Returns: (B, D)
    """
    # Layer norm (simplified: per sample over features, no learnable scale/shift)
    mu = x.mean(axis=-1, keepdims=True)       # (B, 1)
    var = x.var(axis=-1, keepdims=True)       # (B, 1)
    x_norm = (x - mu) / np.sqrt(var + 1e-5)   # (B, D)

    # MLP: expand -> GELU -> project back
    h = x_norm @ W1 + b1                      # (B, 4D)
    h = h * 0.5 * (1.0 + np.tanh(             # tanh approximation of GELU
        np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)
    ))                                        # (B, 4D)
    out = h @ W2 + b2                         # (B, D)
    return x + out   # <-- the residual connection: input + block output
```
  • Transformers (GPT, LLaMA, BERT, ViT): every attention and MLP sub-layer uses a residual connection. The “residual stream” is the backbone of transformer computation (see transformer/)
  • ResNet (He et al., 2015): the original application — enabled training 152-layer CNNs for ImageNet, up from ~20 layers without skip connections
  • Diffusion U-Nets (see diffusion/): skip connections between encoder and decoder at matching resolutions, plus residual blocks within each resolution level
  • Policy and value networks (see policy-gradient/, q-learning/): deeper RL networks use residual blocks to stabilise training
  • GANs (see gans/): both generators and discriminators in modern GANs (StyleGAN, BigGAN) use residual blocks extensively
| Alternative | When to use | Tradeoff |
|---|---|---|
| Dense connections (DenseNet) | Need maximum feature reuse in CNNs | Concatenates all previous outputs instead of adding; much higher memory cost |
| Highway networks | Predecessor to ResNets; historical interest | Learned gating T(x) controls how much signal passes through; more parameters, no clear benefit over simple addition |
| No skip connection | Very shallow networks (< 5 layers) | Simpler, but gradient flow degrades rapidly with depth |
| Weighted residual | Fine-grained control over skip strength | y = F(x) + alpha * x with learnable alpha; used in some diffusion architectures (e.g. progressive training) |
| ReZero | Faster early training | y = x + alpha * F(x), with alpha initialised to 0; each layer starts as identity |
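The ReZero variant is simple enough to sketch directly (layer sizes here are arbitrary; the inner MLP is one plausible choice of F). With alpha initialised to zero, the block is exactly the identity at initialisation:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """y = x + alpha * F(x), with alpha learnable and initialised to 0."""
    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.alpha = nn.Parameter(torch.zeros(1))  # 0 at init -> identity map

    def forward(self, x):
        return x + self.alpha * self.f(x)

x = torch.randn(2, 8)
block = ReZeroBlock(8)
assert torch.allclose(block(x), x)  # at init, the block passes x through unchanged
```

Training then learns how much of each layer's contribution to mix in, starting from "none".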

Residual connections were introduced by He et al. in “Deep Residual Learning for Image Recognition” (2015), which won the ImageNet competition by training a 152-layer CNN — dramatically deeper than anything before. The key insight was reframing each layer as learning a residual delta rather than a full transformation, which they showed eliminated the “degradation problem” where adding more layers to a sufficiently deep network actually increased training error.

The idea was quickly adopted by the transformer architecture (Vaswani et al., 2017), where it became even more critical — transformers stack dozens of attention and MLP blocks, and without residual connections, training diverges. The shift from “post-norm” (original transformer: normalise after the addition) to “pre-norm” (GPT-2 onward: normalise before the block) further improved training stability and is now the universal default. The concept has become so fundamental that modern architectures are often described in terms of their “residual stream” — the data flowing through the skip connections, modified incrementally by each block.