GELU (Gaussian Error Linear Unit)
Smooth activation function defined as $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. Weights each input by “how likely it is to be positive” under a Gaussian assumption. The default activation in Transformer models: BERT, GPT-2/3/4, ViT, and most modern NLP architectures.
Intuition
ReLU makes a hard binary decision: positive inputs pass through, negative inputs get zeroed. GELU asks a softer question: “given that neural network pre-activations are roughly normally distributed, what’s the probability this input is greater than other inputs?”
Concretely, $\Phi(x)$ is the probability that a standard normal variable is less than $x$. When $x$ is large and positive, $\Phi(x) \approx 1$ and GELU behaves like the identity. When $x$ is large and negative, $\Phi(x) \approx 0$ and the output is near zero. In between, there’s a smooth transition: inputs near zero are partially passed through, weighted by their percentile rank.
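A quick way to see this soft gating is to tabulate the gate $\Phi(x)$ and the output $x \cdot \Phi(x)$ for a few inputs. A minimal sketch using only the Python standard library (the `gelu` helper is just for illustration):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    gate = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Phi(x): fraction passed through
    print(f"x = {x:+.1f}   gate = {gate:.3f}   GELU(x) = {gelu(x):+.3f}")
```

Large positive inputs are passed almost unchanged (gate near 1), large negative inputs are nearly zeroed (gate near 0), and inputs near zero are attenuated smoothly.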
This smoothness matters for optimisation. ReLU’s sharp corner at zero makes the gradient discontinuous, which can cause oscillations during training. GELU is smooth everywhere, meaning the loss landscape has smoother curvature and gradient-based optimisers can take more reliable steps. Empirically, this translates to slightly better performance in Transformers, where the activation is applied at every position in every layer.
The practical difference from ReLU is small but consistent: GELU tends to produce marginally better validation metrics across NLP and vision transformer benchmarks. It has become the standard not because it’s dramatically better, but because it’s never worse and the computational overhead is negligible on modern hardware.
Definition:

$$\mathrm{GELU}(x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi$ is the standard normal CDF and $\mathrm{erf}$ is the error function.

Tanh approximation (used in many implementations):

$$\mathrm{GELU}(x) \approx 0.5\,x\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right]$$

Gradient:

$$\frac{d}{dx}\,\mathrm{GELU}(x) = \Phi(x) + x\,\varphi(x)$$

where $\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^{2}/2}$ is the standard normal PDF.
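As a sanity check on the gradient formula, a central finite difference should closely match $\Phi(x) + x\,\varphi(x)$. A minimal sketch (helper names are illustrative):

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """Analytic gradient: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

eps = 1e-5
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    numeric = (gelu(x + eps) - gelu(x - eps)) / (2.0 * eps)  # central difference
    print(f"x = {x:+.1f}   analytic = {gelu_grad(x):.6f}   numeric = {numeric:.6f}")
```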
Key values: GELU(0) = 0, GELU(1) ≈ 0.841, GELU(-1) ≈ -0.159 (note: slightly negative; GELU is non-monotonic for small negative inputs).
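The dip for negative inputs is easy to confirm numerically; the sketch below checks the key values and brute-forces the location of the minimum over a grid rather than quoting a figure:

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(0.0), gelu(1.0), gelu(-1.0))  # 0.0, ~0.841, ~-0.159

# Scan the negative axis for the minimum of GELU (the non-monotonic dip).
xs = [i / 1000.0 for i in range(-3000, 1)]
x_min = min(xs, key=gelu)
print(f"minimum ~= {gelu(x_min):.4f} at x ~= {x_min:.2f}")
```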
```python
import torch
import torch.nn.functional as F

B, T, d_model = 2, 128, 768  # example batch size, sequence length, model width
x = torch.randn(B, T, d_model)  # (B, T, d_model)

# Exact GELU (uses erf; slightly slower but exact)
out = F.gelu(x)  # (B, T, d_model)

# Tanh approximation (matches GPT-2's original implementation)
out = F.gelu(x, approximate='tanh')  # (B, T, d_model)

# WARNING: The 'tanh' approximation and exact GELU give slightly different
# values. If loading pretrained weights, match the variant used during
# training. GPT-2 used the tanh approx; most modern code uses exact.

# As a module
layer = torch.nn.GELU(approximate='none')  # 'none' = exact, 'tanh' = approx
```
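To quantify how far the two variants drift apart, a quick comparison (assuming PyTorch 1.12+, where the `approximate` argument is available):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=10_001)
exact = F.gelu(x)                      # erf-based GELU
approx = F.gelu(x, approximate='tanh')  # GPT-2-style tanh approximation

# The difference is small but nonzero; it matters when reproducing a model
# bit-for-bit or when loading weights trained with the other variant.
print(f"max |exact - tanh| = {(exact - approx).abs().max().item():.2e}")
```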
Manual Implementation

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    """Exact GELU using the error function. x: any shape array."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    """
    Tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    This is what GPT-2 uses. Avoids the scipy dependency.
    x: any shape array.
    """
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_backward(x, grad_output):
    """Gradient of exact GELU: grad_output * (Phi(x) + x * phi(x))."""
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))          # CDF Φ(x)
    pdf = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # PDF φ(x)
    return grad_output * (phi + x * pdf)
```
Popular Uses
- BERT and GPT-2/3/4: GELU is the activation in the feedforward blocks of virtually all Transformer language models since BERT (2018); see the sketch after this list
- Vision Transformers (ViT, DeiT, Swin): adopted GELU from the NLP convention
- Diffusion model backbones (U-Net in Stable Diffusion): GELU in attention and feedforward blocks
- Contrastive learning (CLIP): the text and image encoders both use GELU in their Transformer backbones
- Modern MLPs (MLP-Mixer): GELU as the default non-linearity in non-attention architectures
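To make the feedforward-block usage concrete, here is a minimal sketch of a Transformer-style MLP block in PyTorch; the 4× expansion and the dimensions are illustrative defaults, not taken from any particular model:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block as used in BERT/GPT-style Transformers."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, expansion * d_model)  # expand
        self.act = nn.GELU()                                # GELU non-linearity
        self.fc2 = nn.Linear(expansion * d_model, d_model)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

ffn = FeedForward()
out = ffn(torch.randn(2, 128, 768))  # (B, T, d_model) -> (B, T, d_model)
```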
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| ReLU | CNNs, simple MLPs, speed-critical inference | Faster to compute; sharp corner can cause optimisation issues in Transformers |
| SiLU/Swish | SwiGLU feedforward blocks (LLaMA, Mistral) | Nearly identical shape to GELU; paired with gated linear units in modern LLMs |
| ReLU squared | PaLM, some efficiency-focused models | $\max(0, x)^2$: sparser activations, but amplifies large values |
| Softplus | When you need a smooth ReLU but not the stochastic interpretation | $\log(1 + e^{x})$: smooth like GELU but monotonic; lacks the non-monotonic “dip” |
| Mish | Some computer vision models (YOLOv4) | $x\tanh(\mathrm{softplus}(x))$: similar shape to GELU/SiLU, marginal gains |
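For reference, the alternatives above can be sketched in a few lines of NumPy (standard textbook definitions, not tied to any library's exact implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_squared(x):               # ReLU squared: max(0, x)^2
    return np.maximum(0.0, x) ** 2

def silu(x):                       # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softplus(x):                   # log(1 + e^x), numerically stable form
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def mish(x):                       # x * tanh(softplus(x))
    return x * np.tanh(softplus(x))

x = np.linspace(-4.0, 4.0, 9)
print(np.round(silu(x), 3))        # compare with gelu_exact(x) above: very similar shape
```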
Historical Context
GELU was proposed by Hendrycks & Gimpel (2016, “Gaussian Error Linear Units”) with a stochastic motivation: GELU can be interpreted as the expected value of a stochastic regulariser that randomly multiplies inputs by 0 or 1, where the probability depends on the input’s value. This connects it to dropout (multiply by 0 with fixed probability) and zoneout.
GELU gained widespread adoption when BERT (Devlin et al., 2018) chose it as the feedforward activation, and GPT-2 (Radford et al., 2019) followed suit. Once two of the most influential Transformer models used it, GELU became the de facto standard for Transformers. The tanh approximation was used initially because exact erf() was slow on GPUs, but modern hardware and libraries now support exact GELU efficiently. Despite being largely superseded by SiLU/Swish in the latest LLMs (LLaMA, Mistral), GELU remains the most widely deployed activation in production Transformer models.