ReLU (Rectified Linear Unit)
The simplest modern activation function: output the input if positive, zero otherwise. ReLU made deep networks trainable by solving the vanishing gradient problem that plagued sigmoid/tanh networks. It remains the default hidden-layer activation for CNNs and a strong baseline everywhere else.
Intuition
Before ReLU, deep networks used sigmoid or tanh activations. Both squash their inputs into bounded ranges, and their gradients approach zero for large or small inputs. Stack 10 layers of these tiny gradients together (via the chain rule) and the gradient vanishes to nothing — early layers stop learning.
ReLU fixes this with a brutally simple idea: for positive inputs, the gradient is exactly 1. No matter how deep the network, the gradient flows straight through any active ReLU. For negative inputs, the output is zero and the gradient is zero — the neuron is “off” and contributes nothing.
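The chain-rule arithmetic is easy to check numerically. A minimal sketch (illustrative, not from the original text; it assumes ten layers, each contributing only its activation's local derivative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid's local derivative sigma(x) * (1 - sigma(x)) peaks at 0.25 (at x = 0);
# the chain rule multiplies one such factor per layer.
sig_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25, the best possible case
print(sig_grad ** 10)   # ~9.5e-07: effectively vanished after 10 layers

# An active ReLU has local derivative exactly 1, so the product
# through 10 active layers is still 1: no shrinkage.
relu_grad = 1.0
print(relu_grad ** 10)  # 1.0
```

Even in sigmoid's best case, ten layers shrink the gradient by six orders of magnitude; any active ReLU path passes it through unchanged.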
This on/off behaviour creates sparse activations: typically 50% of neurons output zero for any given input. Sparsity is computationally efficient (multiply by zero is free) and acts as a soft form of regularisation. It also means each input activates a different subset of neurons, giving the network a form of conditional computation for free.
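The roughly-50% figure is easy to verify for zero-mean inputs (an assumption that holds after sensible initialisation; this sketch is illustrative, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 256))  # zero-mean inputs
out = np.maximum(0, x)                # ReLU

sparsity = (out == 0).mean()          # fraction of activations that are exactly zero
print(f"{sparsity:.1%} of activations are zero")
```

For symmetric zero-mean inputs, about half the pre-activations are negative, so about half the outputs are exact zeros.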
The downside: if a neuron’s weights shift so that its input is always negative, it outputs zero forever and receives zero gradient — it’s permanently dead. This “dying ReLU” problem is real in practice, especially with high learning rates or poor initialisation.
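A minimal numpy sketch of the failure mode (not from the original text; the bias of -100 is an artificially extreme stand-in for a bad weight update):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.standard_normal(8), -100.0   # bias pushed far negative

grads = []
for _ in range(100):
    x = rng.standard_normal(8)
    z = w @ x + b                        # pre-activation: always well below 0 here
    local_grad = 1.0 if z > 0 else 0.0   # dReLU/dz
    grads.append(local_grad)

# The neuron never fires, so every gradient that flows through it is 0:
# its weights can never update, and it can never recover.
print(sum(grads))  # 0.0
```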
Definition:

ReLU(x) = max(0, x)

Gradient:

ReLU′(x) = 1 if x > 0, 0 if x < 0 (undefined at x = 0)

In practice, the gradient at x = 0 is set to 0 (or sometimes 1); the choice doesn't matter because x = 0 is a measure-zero event.

Leaky ReLU (variant that avoids dying neurons):

LeakyReLU(x) = x if x > 0, else αx (α is a small constant, typically 0.01)
```python
import torch
import torch.nn.functional as F

B, d_model = 4, 8              # example batch size and width
x = torch.randn(B, d_model)    # (B, d_model)

# Functional form
out = F.relu(x)                # (B, d_model) — zeros out negatives

# As a module (for nn.Sequential)
layer = torch.nn.ReLU(inplace=True)  # inplace saves memory but can break autograd
out = layer(x)                       # if you need to reuse x later

# WARNING: inplace=True is a common source of bugs. It modifies the tensor
# in-place, which can cause "variable has been modified by an inplace operation"
# errors during backward(). Use inplace=False when debugging.

# Leaky ReLU
out = F.leaky_relu(x, negative_slope=0.01)  # (B, d_model)
```
Manual Implementation

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x). x: any shape array."""
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    """
    Gradient of ReLU.
    x: input to the forward pass (any shape)
    grad_output: upstream gradient (same shape as x)
    """
    return grad_output * (x > 0).astype(grad_output.dtype)  # 1 where x > 0, else 0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: allows small gradient for negative inputs."""
    return np.where(x > 0, x, alpha * x)
```
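A finite-difference comparison is a quick way to sanity-check relu_backward; this sketch restates the functions so it runs standalone (the step size eps and the sum-loss setup are arbitrary illustrative choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    return grad_output * (x > 0).astype(grad_output.dtype)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
x = x[np.abs(x) > 1e-3]   # stay away from the kink at 0
g = np.ones_like(x)       # upstream gradient of loss = sum(relu(x))

analytic = relu_backward(x, g)
eps = 1e-6
numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~0: the two gradients agree
```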
Popular Uses

- CNNs (ResNet, VGG, EfficientNet): the standard activation between conv layers — fast and effective
- MLPs in reinforcement learning (DQN, PPO): hidden layer activation in Q-networks and policy networks
- Feedforward blocks in Transformers: older Transformer architectures (BERT-base, original GPT) used ReLU before GELU became the default
- Generative models: discriminator networks in GANs commonly use LeakyReLU to avoid sparse gradients
- Neural network training (nn-training/): typically the first activation function taught and the default choice for basic MLP training
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| GELU | Transformer models (BERT, GPT, ViT) | Smoother; slightly better empirical results in NLP/vision transformers but ~2x slower to compute |
| SiLU/Swish | Modern architectures (SwiGLU in LLaMA) | Smooth, non-monotonic; better gradient flow but more compute |
| LeakyReLU | When dying ReLU is a problem (GANs, deep plain networks) | Prevents dead neurons at the cost of never fully zeroing out noise |
| PReLU | When you want the leak slope to be learnable | One extra parameter per channel; marginal gains, rarely worth the complexity |
| ELU | When you want zero-centred outputs with saturation for negatives | Smoother negative region; uses exp() which is slower |
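The smooth variants in the table differ from ReLU mainly near zero, where ReLU has its kink. A standalone sketch of the underlying formulas (GELU as x·Φ(x) via erf, SiLU as x·σ(x)); the sample points are arbitrary:

```python
import math

def relu(x):        return max(0.0, x)
def leaky_relu(x):  return x if x > 0 else 0.01 * x
def gelu(x):        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # x * Phi(x)
def silu(x):        return x / (1.0 + math.exp(-x))                        # x * sigmoid(x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  leaky={leaky_relu(x):+.3f}  "
          f"gelu={gelu(x):+.3f}  silu={silu(x):+.3f}")
```

All four agree for large positive inputs; the differences are confined to the negative side and a small neighbourhood of zero.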
Historical Context
ReLU was introduced to deep learning by Nair & Hinton (2010, “Rectified Linear Units Improve Restricted Boltzmann Machines”), though the function itself (half-wave rectification) was well known in signal processing and neuroscience. The breakthrough moment was Krizhevsky et al. (2012, AlexNet), which showed that ReLU made training deep CNNs on ImageNet practical — training was 6x faster than with tanh.
The simplicity of ReLU — no exp(), no division, just a comparison and a conditional — made it not just mathematically convenient but computationally fast, which mattered enormously as networks scaled. Glorot, Bordes & Bengio (2011, “Deep Sparse Rectifier Neural Networks”) provided analysis of why ReLU avoids vanishing gradients. Variants like LeakyReLU (Maas et al., 2013) and PReLU (He et al., 2015) addressed the dying neuron problem but never fully displaced plain ReLU for CNNs. In Transformers, ReLU has been largely superseded by GELU and SiLU/Swish.