GELU (Gaussian Error Linear Unit)

Smooth activation function defined as x · Φ(x), where Φ is the standard normal CDF. Weights each input by “how likely it is to be positive” under a Gaussian assumption. The default activation in Transformer models: BERT, GPT-2/3/4, ViT, and most modern NLP architectures.

ReLU makes a hard binary decision: positive inputs pass through, negative inputs are zeroed. GELU asks a softer question: “given that neural network pre-activations are roughly normally distributed, what’s the probability that this input is larger than a typical input?”

Concretely, Φ(x) is the probability that a standard normal variable is less than x. When x is large and positive, Φ(x) ≈ 1 and GELU behaves like the identity. When x is large and negative, Φ(x) ≈ 0 and the output is near zero. In between, there’s a smooth transition — inputs near zero are partially passed through, weighted by their percentile rank.
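This gating behaviour is easy to see numerically. A minimal stdlib-only sketch (the helper name `phi` is ours, not from any library) that evaluates the CDF and the resulting GELU at a few points:

```python
import math

def phi(x):
    """Standard normal CDF, via the stdlib error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.1f}  Phi(x)={phi(x):.3f}  GELU(x)={x * phi(x):+.3f}")
```

At x = −3 almost nothing passes (Φ ≈ 0.001), at x = +3 the input passes essentially unchanged (Φ ≈ 0.999), and at x = 0 the gate is exactly ½.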

This smoothness matters for optimisation. ReLU’s sharp corner at zero makes the gradient jump discontinuously from 0 to 1, which can cause oscillations during training. GELU is smooth everywhere, so the loss landscape has smoother curvature and gradient-based optimisers can take more reliable steps. Empirically, this translates to slightly better performance in Transformers, where the activation is applied at every position in every layer.
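The contrast at the origin can be made concrete. A small stdlib-only sketch (using the analytic GELU gradient Φ(x) + x·φ(x) given later in this page; `relu_grad` adopts the usual subgradient convention of 0 at x = 0):

```python
import math

def gelu_grad(x):
    """Analytic gradient of exact GELU: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x):
    """ReLU gradient: jumps from 0 to 1 at the origin."""
    return 1.0 if x > 0 else 0.0

for x in [-0.1, -0.01, 0.0, 0.01, 0.1]:
    print(f"x={x:+.2f}  relu'={relu_grad(x):.0f}  gelu'={gelu_grad(x):.4f}")
```

ReLU’s gradient flips from 0 to 1 across x = 0, while GELU’s gradient passes smoothly through 0.5.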

The practical difference from ReLU is small but consistent: GELU tends to produce marginally better validation metrics across NLP and vision transformer benchmarks. It has become the standard not because it’s dramatically better, but because it is rarely worse and the computational overhead is negligible on modern hardware.

Definition:

\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

where Φ is the standard normal CDF and erf is the error function.

Tanh approximation (used in many implementations):

\text{GELU}(x) \approx 0.5 \, x \left[1 + \tanh\left(\sqrt{2/\pi}\left(x + 0.044715 \, x^3\right)\right)\right]

Gradient:

\frac{d}{dx}\text{GELU}(x) = \Phi(x) + x \cdot \phi(x)

where \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} is the standard normal PDF.

Key values: GELU(0) = 0, GELU(1) ≈ 0.841, GELU(−1) ≈ −0.159 (note: slightly negative — GELU is non-monotonic, reaching a minimum of about −0.17 near x ≈ −0.75).

```python
import torch
import torch.nn.functional as F

B, T, d_model = 8, 128, 768        # batch size, sequence length, model width
x = torch.randn(B, T, d_model)     # (B, T, d_model)

# Exact GELU (uses erf — slightly slower but exact)
out = F.gelu(x)                    # (B, T, d_model)

# Tanh approximation (matches GPT-2's original implementation)
out = F.gelu(x, approximate='tanh')  # (B, T, d_model)

# WARNING: the 'tanh' approximation and exact GELU give slightly different
# values. If loading pretrained weights, match the variant used during
# training. GPT-2 used the tanh approx; most modern code uses exact.

# As a module
layer = torch.nn.GELU(approximate='none')  # 'none' = exact, 'tanh' = approx
```
```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    """Exact GELU using the error function. x: any-shape array."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh_approx(x):
    """
    Tanh approximation of GELU (Hendrycks & Gimpel, 2016).
    This is what GPT-2 uses. Avoids the scipy dependency.
    x: any-shape array.
    """
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_backward(x, grad_output):
    """Gradient of exact GELU."""
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))          # CDF Φ(x)
    pdf = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)   # PDF φ(x)
    return grad_output * (phi + x * pdf)
```
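Two sanity checks worth running against implementations like the ones above: how far the tanh approximation drifts from exact GELU, and whether the analytic gradient matches finite differences. A self-contained stdlib-only sketch (no NumPy/SciPy needed):

```python
import math

def gelu(x):
    """Exact GELU via the stdlib error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation, same constants as above."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_grad(x):
    """Analytic gradient: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

xs = [i / 100.0 for i in range(-600, 601)]  # grid over [-6, 6]

# The tanh approximation stays within roughly 1e-3 of exact GELU.
approx_err = max(abs(gelu(x) - gelu_tanh(x)) for x in xs)

# Central finite differences should agree with the analytic gradient.
h = 1e-5
grad_err = max(abs((gelu(x + h) - gelu(x - h)) / (2 * h) - gelu_grad(x))
               for x in xs)
print(f"max tanh-approx error: {approx_err:.2e}")
print(f"max gradient check error: {grad_err:.2e}")
```

The small but nonzero approximation error is exactly why the variant must match when loading pretrained weights.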
  • BERT and GPT-2/3/4: GELU is the activation in the feedforward blocks of virtually all Transformer language models since BERT (2018)
  • Vision Transformers (ViT, DeiT, Swin): adopted GELU from the NLP convention
  • Diffusion model backbones (U-Net in Stable Diffusion): GELU in attention and feedforward blocks
  • Contrastive learning (CLIP): the text and image encoders both use GELU in their Transformer backbones
  • Modern MLPs (MLP-Mixer): GELU as the default non-linearity in non-attention architectures
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| ReLU | CNNs, simple MLPs, speed-critical inference | Faster to compute; sharp corner can cause optimisation issues in Transformers |
| SiLU/Swish | SwiGLU feedforward blocks (LLaMA, Mistral) | Nearly identical shape to GELU; paired with gated linear units in modern LLMs |
| ReLU squared | PaLM, some efficiency-focused models | ReLU(x)²: sparser activations, but amplifies large values |
| Softplus | When you need a smooth ReLU but not the stochastic interpretation | log(1 + eˣ): smooth like GELU but monotonic; lacks the non-monotonic “dip” |
| Mish | Some computer vision models (YOLOv4) | x · tanh(softplus(x)): similar shape to GELU/SiLU, marginal gains |
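The “nearly identical shape” claim for SiLU/Swish can be quantified: Hendrycks & Gimpel’s paper also gives a sigmoid approximation, GELU(x) ≈ x · σ(1.702x), which is just SiLU with a rescaled input. A stdlib-only sketch (the `beta` parameter is our illustrative knob):

```python
import math

def gelu(x):
    """Exact GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x, beta=1.0):
    """SiLU/Swish: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

xs = [i / 100.0 for i in range(-600, 601)]  # grid over [-6, 6]

# beta = 1.702 is the sigmoid approximation of GELU from
# Hendrycks & Gimpel (2016); it tracks exact GELU closely.
diff = max(abs(gelu(x) - silu(x, beta=1.702)) for x in xs)
print(f"max |GELU - SiLU(beta=1.702)|: {diff:.3f}")
```

With β = 1.702 the two curves differ by only a few hundredths over [−6, 6], which is why swapping between them in practice changes little beyond convention.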

GELU was proposed by Hendrycks & Gimpel (2016, “Gaussian Error Linear Units”) with a stochastic motivation: GELU can be interpreted as the expected value of a stochastic regulariser that randomly multiplies inputs by 0 or 1, where the probability depends on the input’s magnitude. This connects it to dropout (multiply by 0 with fixed probability) and zoneout.

GELU gained widespread adoption when BERT (Devlin et al., 2018) chose it as the feedforward activation, and GPT-2 (Radford et al., 2019) followed suit. Once two of the most influential Transformer models used it, GELU became the de facto standard for Transformers. The tanh approximation was used initially because exact erf() was slow on GPUs, but modern hardware and libraries now support exact GELU efficiently. Despite being largely superseded by SiLU/Swish in the latest LLMs (LLaMA, Mistral), GELU remains the most widely deployed activation in production Transformer models.