ReLU (Rectified Linear Unit)
The simplest modern activation function: output the input if positive, zero otherwise. ReLU made deep networks trainable by solving the vanishing gradient problem that plagued sigmoid/tanh networks. It remains the default hidden-layer activation for CNNs and a strong baseline everywhere else.
Intuition
Before ReLU, deep networks used sigmoid or tanh activations. Both squash their inputs into bounded ranges, and their gradients approach zero for large or small inputs. Stack 10 layers of these tiny gradients together (via the chain rule) and the gradient vanishes to nothing — early layers stop learning.
ReLU fixes this with a brutally simple idea: for positive inputs, the gradient is exactly 1. No matter how deep the network, the gradient flows straight through any active ReLU. For negative inputs, the output is zero and the gradient is zero — the neuron is “off” and contributes nothing.
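The chain-rule arithmetic is easy to check numerically. A minimal sketch (illustrative, not from the original text; it assumes ten layers, each contributing only its activation's local derivative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sigmoid's local derivative sigma(x) * (1 - sigma(x)) peaks at 0.25 (at x = 0);
# the chain rule multiplies one such factor per layer.
sig_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25, the best possible case
print(sig_grad ** 10)   # ~9.5e-07: effectively vanished after 10 layers

# An active ReLU has local derivative exactly 1, so the product
# through 10 active layers is still 1: no shrinkage.
relu_grad = 1.0
print(relu_grad ** 10)  # 1.0
```

Even in sigmoid's best case, ten layers shrink the gradient by six orders of magnitude; any active ReLU path passes it through unchanged.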
This on/off behaviour creates sparse activations: typically 50% of neurons output zero for any given input. Sparsity is computationally efficient (multiply by zero is free) and acts as a soft form of regularisation. It also means each input activates a different subset of neurons, giving the network a form of conditional computation for free.
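The roughly-50% figure is easy to verify for zero-mean inputs (an assumption that holds after sensible initialisation; this sketch is illustrative, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 256))  # zero-mean inputs
out = np.maximum(0, x)                # ReLU

sparsity = (out == 0).mean()          # fraction of activations that are exactly zero
print(f"{sparsity:.1%} of activations are zero")
```

For symmetric zero-mean inputs, about half the pre-activations are negative, so about half the outputs are exact zeros.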
The downside: if a neuron’s weights shift so that its input is always negative, it outputs zero forever and receives zero gradient — it’s permanently dead. This “dying ReLU” problem is real in practice, especially with high learning rates or poor initialisation.
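A minimal numpy sketch of the failure mode (not from the original text; the bias of -100 is an artificially extreme stand-in for a bad weight update):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.standard_normal(8), -100.0   # bias pushed far negative

grads = []
for _ in range(100):
    x = rng.standard_normal(8)
    z = w @ x + b                        # pre-activation: always well below 0 here
    local_grad = 1.0 if z > 0 else 0.0   # dReLU/dz
    grads.append(local_grad)

# The neuron never fires, so every gradient that flows through it is 0:
# its weights can never update, and it can never recover.
print(sum(grads))  # 0.0
```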
Definition:

ReLU(x) = max(0, x)

Gradient:

ReLU′(x) = 1 if x > 0, 0 if x < 0 (undefined at x = 0)

In practice, the gradient at x = 0 is set to 0 (or sometimes 1); the choice doesn't matter because x = 0 is a measure-zero event.

Leaky ReLU (variant that avoids dying neurons):

LeakyReLU(x) = x if x > 0, else αx (α is a small constant, typically 0.01)
```python
import torch
import torch.nn.functional as F

B, d_model = 4, 8              # example batch size and width
x = torch.randn(B, d_model)    # (B, d_model)

# Functional form
out = F.relu(x)                # (B, d_model) — zeros out negatives

# As a module (for nn.Sequential)
layer = torch.nn.ReLU(inplace=True)  # inplace saves memory but can break autograd
out = layer(x)                       # if you need to reuse x later

# WARNING: inplace=True is a common source of bugs. It modifies the tensor
# in-place, which can cause "variable has been modified by an inplace operation"
# errors during backward(). Use inplace=False when debugging.

# Leaky ReLU
out = F.leaky_relu(x, negative_slope=0.01)  # (B, d_model)
```
Manual Implementation

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x). x: any shape array."""
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    """
    Gradient of ReLU.
    x: input to the forward pass (any shape)
    grad_output: upstream gradient (same shape as x)
    """
    return grad_output * (x > 0).astype(grad_output.dtype)  # 1 where x > 0, else 0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: allows small gradient for negative inputs."""
    return np.where(x > 0, x, alpha * x)
```
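A finite-difference comparison is a quick way to sanity-check relu_backward; this sketch restates the functions so it runs standalone (the step size eps and the sum-loss setup are arbitrary illustrative choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_backward(x, grad_output):
    return grad_output * (x > 0).astype(grad_output.dtype)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
x = x[np.abs(x) > 1e-3]   # stay away from the kink at 0
g = np.ones_like(x)       # upstream gradient of loss = sum(relu(x))

analytic = relu_backward(x, g)
eps = 1e-6
numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # ~0: the two gradients agree
```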
Popular Uses

- CNNs (ResNet, VGG, EfficientNet): the standard activation between conv layers — fast and effective
- MLPs in reinforcement learning (DQN, PPO): hidden layer activation in Q-networks and policy networks
- Feedforward blocks in Transformers: older Transformer architectures (BERT-base, original GPT) used ReLU before GELU became the default
- Generative models: discriminator networks in GANs commonly use LeakyReLU to avoid sparse gradients
- Neural network training (nn-training/): typically the first activation function taught and the default choice for basic MLP training
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| GELU | Transformer models (BERT, GPT, ViT) | Smoother; slightly better empirical results in NLP/vision transformers but ~2x slower to compute |
| SiLU/Swish | Modern architectures (SwiGLU in LLaMA) | Smooth, non-monotonic; better gradient flow but more compute |
| LeakyReLU | When dying ReLU is a problem (GANs, deep plain networks) | Prevents dead neurons at the cost of never fully zeroing out noise |
| PReLU | When you want the leak slope to be learnable | One extra parameter per channel; marginal gains, rarely worth the complexity |
| ELU | When you want zero-centred outputs with saturation for negatives | Smoother negative region; uses exp() which is slower |
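The smooth variants in the table differ from ReLU mainly near zero, where ReLU has its kink. A standalone sketch of the underlying formulas (GELU as x·Φ(x) via erf, SiLU as x·σ(x)); the sample points are arbitrary:

```python
import math

def relu(x):        return max(0.0, x)
def leaky_relu(x):  return x if x > 0 else 0.01 * x
def gelu(x):        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # x * Phi(x)
def silu(x):        return x / (1.0 + math.exp(-x))                        # x * sigmoid(x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):+.3f}  leaky={leaky_relu(x):+.3f}  "
          f"gelu={gelu(x):+.3f}  silu={silu(x):+.3f}")
```

All four agree for large positive inputs; the differences are confined to the negative side and a small neighbourhood of zero.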
Historical Context
ReLU was introduced to deep learning by Nair & Hinton (2010, “Rectified Linear Units Improve Restricted Boltzmann Machines”), though the function itself (half-wave rectification) was well known in signal processing and neuroscience. The breakthrough moment was Krizhevsky et al. (2012, AlexNet), which showed that ReLU made training deep CNNs on ImageNet practical — training was 6x faster than with tanh.
The simplicity of ReLU — no exp(), no division, just a comparison and a conditional — made it not just mathematically convenient but computationally fast, which mattered enormously as networks scaled. Glorot, Bordes & Bengio (2011, “Deep Sparse Rectifier Neural Networks”) provided analysis of why ReLU avoids vanishing gradients. Variants like LeakyReLU (Maas et al., 2013) and PReLU (He et al., 2015) addressed the dying neuron problem but never fully displaced plain ReLU for CNNs. In Transformers, ReLU has been largely superseded by GELU and SiLU/Swish.