GELU (Gaussian Error Linear Unit)
Smooth activation function defined as $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. Weights each input by “how likely it is to be positive” under a Gaussian assumption. The default activation in Transformer models: BERT, GPT-2/3/4, ViT, and most modern NLP architectures.
Intuition
ReLU makes a hard binary decision: positive inputs pass through, negative inputs get zeroed. GELU asks a softer question: “given that neural network pre-activations are roughly normally distributed, what’s the probability this input is greater than other inputs?”
Concretely, $\Phi(x)$ is the probability that a standard normal variable is less than $x$. When $x$ is large and positive, $\Phi(x) \approx 1$ and GELU behaves like the identity. When $x$ is large and negative, $\Phi(x) \approx 0$ and the output is near zero. In between, there’s a smooth transition: inputs near zero are partially passed through, weighted by their percentile rank.
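A quick way to see this soft gating is to tabulate the gate $\Phi(x)$ and the output $x \cdot \Phi(x)$ for a few inputs. A minimal sketch using only the Python standard library (the `gelu` helper is just for illustration):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    gate = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Phi(x): fraction passed through
    print(f"x = {x:+.1f}   gate = {gate:.3f}   GELU(x) = {gelu(x):+.3f}")
```

Large positive inputs are passed almost unchanged (gate near 1), large negative inputs are nearly zeroed (gate near 0), and inputs near zero are attenuated smoothly.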
This smoothness matters for optimisation. ReLU’s sharp corner at zero makes the gradient discontinuous, which can cause oscillations during training. GELU is smooth everywhere, meaning the loss landscape has smoother curvature and gradient-based optimisers can take more reliable steps. Empirically, this translates to slightly better performance in Transformers, where the activation is applied at every position in every layer.
The practical difference from ReLU is small but consistent: GELU tends to produce marginally better validation metrics across NLP and vision transformer benchmarks. It has become the standard not because it’s dramatically better, but because it’s never worse and the computational overhead is negligible on modern hardware.
Definition:

$$\mathrm{GELU}(x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi$ is the standard normal CDF and $\mathrm{erf}$ is the error function.

Tanh approximation (used in many implementations):

$$\mathrm{GELU}(x) \approx 0.5\,x\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right]$$

Gradient:

$$\frac{d}{dx}\,\mathrm{GELU}(x) = \Phi(x) + x\,\varphi(x)$$

where $\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}$ is the standard normal PDF.
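As a sanity check on the gradient formula, a central finite difference should closely match $\Phi(x) + x\,\varphi(x)$. A minimal sketch (helper names are illustrative):

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """Analytic gradient: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

eps = 1e-5
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    numeric = (gelu(x + eps) - gelu(x - eps)) / (2.0 * eps)  # central difference
    print(f"x = {x:+.1f}   analytic = {gelu_grad(x):.6f}   numeric = {numeric:.6f}")
```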
Key values: GELU(0) = 0, GELU(1) ≈ 0.841, GELU(-1) ≈ -0.159 (note: slightly negative; GELU is non-monotonic for small negative inputs).
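The dip for negative inputs is easy to confirm numerically; the sketch below checks the key values and brute-forces the location of the minimum over a grid rather than quoting a figure:

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(0.0), gelu(1.0), gelu(-1.0))  # 0.0, ~0.841, ~-0.159

# Scan the negative axis for the minimum of GELU (the non-monotonic dip).
xs = [i / 1000.0 for i in range(-3000, 1)]
x_min = min(xs, key=gelu)
print(f"minimum ~= {gelu(x_min):.4f} at x ~= {x_min:.2f}")
```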
```python
import torch
import torch.nn.functional as F

B, T, d_model = 2, 128, 768  # example batch size, sequence length, model width
x = torch.randn(B, T, d_model)  # (B, T, d_model)

# Exact GELU (uses erf; slightly slower but exact)
out = F.gelu(x)  # (B, T, d_model)

# Tanh approximation (matches GPT-2's original implementation)
out = F.gelu(x, approximate='tanh')  # (B, T, d_model)

# WARNING: The 'tanh' approximation and exact GELU give slightly different
# values. If loading pretrained weights, match the variant used during
# training. GPT-2 used the tanh approx; most modern code uses exact.

# As a module
layer = torch.nn.GELU(approximate='none')  # 'none' = exact, 'tanh' = approx
```
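To quantify how far the two variants drift apart, a quick comparison (assuming PyTorch 1.12+, where the `approximate` argument is available):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=10_001)
exact = F.gelu(x)                      # erf-based GELU
approx = F.gelu(x, approximate='tanh')  # GPT-2-style tanh approximation

# The difference is small but nonzero; it matters when reproducing a model
# bit-for-bit or when loading weights trained with the other variant.
print(f"max |exact - tanh| = {(exact - approx).abs().max().item():.2e}")
```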
Manual Implementation

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    """Exact GELU using the error function. x: any shape array."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    """
    Tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    This is what GPT-2 uses. Avoids the scipy dependency.
    x: any shape array.
    """
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_backward(x, grad_output):
    """Gradient of exact GELU: grad_output * (Phi(x) + x * phi(x))."""
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))          # CDF Φ(x)
    pdf = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # PDF φ(x)
    return grad_output * (phi + x * pdf)
```
Popular Uses
- BERT and GPT-2/3/4: GELU is the activation in the feedforward blocks of virtually all Transformer language models since BERT (2018); see the sketch after this list
- Vision Transformers (ViT, DeiT, Swin): adopted GELU from the NLP convention
- Diffusion model backbones (U-Net in Stable Diffusion): GELU in attention and feedforward blocks
- Contrastive learning (CLIP): the text and image encoders both use GELU in their Transformer backbones
- Modern MLPs (MLP-Mixer): GELU as the default non-linearity in non-attention architectures
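To make the feedforward-block usage concrete, here is a minimal sketch of a Transformer-style MLP block in PyTorch; the 4× expansion and the dimensions are illustrative defaults, not taken from any particular model:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block as used in BERT/GPT-style Transformers."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, expansion * d_model)  # expand
        self.act = nn.GELU()                                # GELU non-linearity
        self.fc2 = nn.Linear(expansion * d_model, d_model)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

ffn = FeedForward()
out = ffn(torch.randn(2, 128, 768))  # (B, T, d_model) -> (B, T, d_model)
```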
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| ReLU | CNNs, simple MLPs, speed-critical inference | Faster to compute; sharp corner can cause optimisation issues in Transformers |
| SiLU/Swish | SwiGLU feedforward blocks (LLaMA, Mistral) | Nearly identical shape to GELU; paired with gated linear units in modern LLMs |
| ReLU squared | PaLM, some efficiency-focused models | $\max(0, x)^2$: sparser activations, but amplifies large values |
| Softplus | When you need a smooth ReLU but not the stochastic interpretation | $\log(1 + e^{x})$: smooth like GELU but monotonic; lacks the non-monotonic “dip” |
| Mish | Some computer vision models (YOLOv4) | $x\tanh(\mathrm{softplus}(x))$: similar shape to GELU/SiLU, marginal gains |
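For reference, the alternatives above can be sketched in a few lines of NumPy (standard textbook definitions, not tied to any library's exact implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_squared(x):               # ReLU squared: max(0, x)^2
    return np.maximum(0.0, x) ** 2

def silu(x):                       # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softplus(x):                   # log(1 + e^x), numerically stable form
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def mish(x):                       # x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

x = np.linspace(-4.0, 4.0, 9)
print(np.round(silu(x), 3))        # compare with gelu_exact(x) above: very similar shape
```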
Historical Context
GELU was proposed by Hendrycks & Gimpel (2016, “Gaussian Error Linear Units”) with a stochastic motivation: GELU can be interpreted as the expected value of a stochastic regulariser that randomly multiplies inputs by 0 or 1, where the probability depends on the input’s value. This connects it to dropout (multiply by 0 with fixed probability) and zoneout.
GELU gained widespread adoption when BERT (Devlin et al., 2018) chose it as the feedforward activation, and GPT-2 (Radford et al., 2019) followed suit. Once two of the most influential Transformer models used it, GELU became the de facto standard for Transformers. The tanh approximation was used initially because exact erf() was slow on GPUs, but modern hardware and libraries now support exact GELU efficiently. Despite being largely superseded by SiLU/Swish in the latest LLMs (LLaMA, Mistral), GELU remains the most widely deployed activation in production Transformer models.