SiLU / Swish
Self-gated activation function: SiLU(x) = x · σ(x), where σ is the sigmoid function. Smooth and non-monotonic like GELU, but uses sigmoid gating instead of the Gaussian CDF. It is the activation inside SwiGLU, the feedforward block used in LLaMA, Mistral, Gemma, and most post-2023 LLMs.
Intuition
SiLU is “self-gated”: the input itself controls how much of itself passes through. The sigmoid σ(x) acts as a gate that outputs a value between 0 and 1. For large positive x, the gate is fully open (σ(x) ≈ 1) and SiLU behaves like the identity. For large negative x, the gate is nearly closed (σ(x) ≈ 0) and the output is near zero. So far, this sounds like ReLU.
The interesting part is what happens near zero. SiLU has a small negative region: it dips below zero, reaching a minimum of about −0.278 at x ≈ −1.278, before coming back up. This non-monotonicity means the function gives moderately negative inputs a small negative “push” before suppressing them. Empirically, this seems to help optimisation by creating a smoother loss landscape than ReLU’s sharp corner.
SiLU and GELU are nearly identical in shape — GELU is well approximated by x · σ(1.702x), i.e. Swish with β ≈ 1.702, to within about 0.02. The practical choice between them is driven more by convention and how they compose with other components. SiLU’s claim to fame is its pairing with Gated Linear Units: in SwiGLU (Shazeer, 2020), the feedforward block computes (SiLU(xW1) ⊙ xW2) W_out, where the element-wise product with a second linear projection gives the network an additional multiplicative interaction. SwiGLU consistently outperforms plain GELU feedforward blocks at matched parameter counts.
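The closeness of GELU and Swish with β ≈ 1.702 can be checked numerically. A quick sketch, using the exact erf-based GELU (grid range chosen arbitrarily for illustration):

```python
import math
import numpy as np

xs = np.linspace(-6.0, 6.0, 2001)

# Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf
gelu = 0.5 * xs * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in xs]))

# Swish with beta = 1.702, the classic sigmoid approximation of GELU
swish_1702 = xs / (1.0 + np.exp(-1.702 * xs))

max_diff = np.max(np.abs(gelu - swish_1702))  # roughly 0.02 over this range
```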
Definition (SiLU/Swish): SiLU(x) = x · σ(x) = x / (1 + e^(−x))
Generalised Swish (with learnable β, rarely used): Swish_β(x) = x · σ(βx)
When β = 1, this is SiLU. As β → ∞, it converges to ReLU.
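A small numeric sketch of this limit (the β values and grid are chosen arbitrarily for illustration):

```python
import numpy as np

def swish(x, beta):
    # Swish_beta(x) = x * sigmoid(beta * x)
    with np.errstate(over="ignore"):  # exp can overflow for very negative beta*x; result is still 0
        return x / (1.0 + np.exp(-beta * x))

xs = np.linspace(-5.0, 5.0, 1001)
relu = np.maximum(xs, 0.0)

err_silu = np.max(np.abs(swish(xs, 1.0) - relu))    # beta = 1: plain SiLU, clearly differs from ReLU
err_sharp = np.max(np.abs(swish(xs, 50.0) - relu))  # large beta: gate sharpens toward a step
```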
Gradient: SiLU′(x) = σ(x)(1 + x(1 − σ(x)))
Key values: SiLU(0) = 0; minimum ≈ −0.278 at x ≈ −1.278.
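Both the gradient formula and the key values can be verified numerically. A minimal check (grid ranges and step sizes chosen arbitrarily):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_grad(x):
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Central finite differences agree with the analytic gradient
x = np.linspace(-4.0, 4.0, 101)
eps = 1e-5
fd = (silu(x + eps) - silu(x - eps)) / (2 * eps)
assert np.allclose(fd, silu_grad(x), atol=1e-6)

# Locate the minimum on a fine grid
xs = np.linspace(-3.0, 0.0, 300001)
i = np.argmin(silu(xs))
x_min, y_min = xs[i], silu(xs)[i]  # x_min ≈ -1.278, y_min ≈ -0.278
```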
SwiGLU feedforward block: SwiGLU(x) = (SiLU(xW1) ⊙ xW2) W_out
Note: SwiGLU uses 3 weight matrices (W1, W2, W_out) instead of the standard FFN’s 2, so the hidden dimension is typically reduced by a factor of 2/3 (from 4·d_model to about (8/3)·d_model) to keep the parameter count constant.
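The arithmetic behind that note, sketched with an illustrative width (4096 is LLaMA-7B’s d_model, used here only as an example; LLaMA-7B actually rounds d_ff up to 11008):

```python
d_model = 4096

# Standard 2-matrix FFN: W_in (d_model x 4*d_model) + W_out (4*d_model x d_model)
ffn_params = 2 * d_model * (4 * d_model)

# SwiGLU: three matrices touching a hidden size reduced by 2/3
d_ff = int((2 / 3) * 4 * d_model)  # 10922 here, in practice rounded to a multiple of 256
swiglu_params = 3 * d_model * d_ff

# Parameter counts match to within a fraction of a percent
ratio = swiglu_params / ffn_params
```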
```python
import torch
import torch.nn.functional as F

x = torch.randn(B, T, d_model)  # (B, T, d_model)

# Functional form
out = F.silu(x)  # (B, T, d_model)

# As a module
layer = torch.nn.SiLU()

# ── SwiGLU feedforward block (as in LLaMA) ─────────────────────
# This is the main reason SiLU matters in modern LLMs.
class SwiGLU(torch.nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w2 = torch.nn.Linear(d_model, d_ff, bias=False)
        self.w_out = torch.nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):             # (B, T, d_model)
        gate = F.silu(self.w1(x))     # (B, T, d_ff) — SiLU-gated projection
        up = self.w2(x)               # (B, T, d_ff) — ungated projection
        return self.w_out(gate * up)  # (B, T, d_model) — element-wise gate, then project back

# WARNING: SwiGLU has 3 matrices instead of 2. When matching parameter counts
# with a standard FFN (d_ff = 4 * d_model), use d_ff = (2/3) * 4 * d_model,
# often rounded to a multiple of 256 for hardware efficiency.
```
Manual Implementation
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1.0 / (1.0 + np.exp(-x)),
                    np.exp(x) / (1.0 + np.exp(x)))

def silu(x):
    """SiLU/Swish: x * sigmoid(x). x: any shape array."""
    return x * sigmoid(x)

def silu_backward(x, grad_output):
    """Gradient of SiLU."""
    s = sigmoid(x)
    return grad_output * (s + x * s * (1 - s))  # σ(x)(1 + x(1 − σ(x)))

def swiglu(x, W1, W2, W_out):
    """
    SwiGLU feedforward block.
    x: (B, d_model), W1/W2: (d_model, d_ff), W_out: (d_ff, d_model)
    """
    gate = silu(x @ W1)         # (B, d_ff)
    up = x @ W2                 # (B, d_ff)
    return (gate * up) @ W_out  # (B, d_model)
```
Popular Uses
- SwiGLU in LLMs (LLaMA 1/2/3, Mistral, Gemma, PaLM): SiLU is the gating activation in the SwiGLU feedforward block, now the standard FFN design for large language models
- EfficientNet / EfficientNetV2: Swish activation throughout the convolutional backbone, one of the first large-scale uses
- Transformer feedforward variants (transformer/): SwiGLU is the modern replacement for ReLU/GELU FFN blocks, covered in the SwiGLU section
- Diffusion U-Nets (Stable Diffusion XL): SiLU activations in the residual blocks of the denoising backbone
- Mobile architectures (MobileNetV3): hard-swish, x · ReLU6(x + 3)/6, is a hardware-friendly approximation
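A hard-swish sketch in NumPy, showing how close the piecewise-linear version stays to SiLU (the 0.2 bound below is illustrative; the true maximum gap is about 0.14, at x = ±3):

```python
import numpy as np

def hard_swish(x):
    # x * ReLU6(x + 3) / 6 — piecewise-linear stand-in for SiLU, no exp() needed
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def silu(x):
    return x / (1.0 + np.exp(-x))

xs = np.linspace(-6.0, 6.0, 1001)
gap = np.max(np.abs(hard_swish(xs) - silu(xs)))

# The three linear pieces: identity above 3, zero below -3, quadratic between
hard_swish(3.0), hard_swish(-3.0), hard_swish(0.0)
```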
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| GELU | Transformer models without gated FFN (BERT, GPT-2, ViT) | Nearly identical shape; standard when using plain 2-matrix FFN blocks |
| ReLU | CNNs, speed-critical inference, simple MLPs | Fastest to compute; no smooth gating but works well in non-Transformer architectures |
| Hard Swish | Mobile/edge deployment | x · ReLU6(x + 3)/6: piecewise-linear approximation, no exp() needed |
| GeGLU | Alternative gated FFN (some research models) | Uses GELU instead of SiLU in the gate; similar performance, less widely adopted |
| ReGLU | Simpler gated FFN | Uses ReLU in the gate; slightly worse than SwiGLU but faster |
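The gated variants in the table differ only in the gate activation. One way to sketch the shared structure (function names and shapes here are illustrative, not from any library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_variant(x, W1, W2, W_out, act):
    """Generic gated FFN: (act(x @ W1) * (x @ W2)) @ W_out."""
    return (act(x @ W1) * (x @ W2)) @ W_out

silu = lambda z: z * sigmoid(z)        # gate for SwiGLU
relu = lambda z: np.maximum(z, 0.0)    # gate for ReGLU

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))        # (batch, d_model)
W1 = rng.standard_normal((8, 16))      # (d_model, d_ff)
W2 = rng.standard_normal((8, 16))
W_out = rng.standard_normal((16, 8))   # (d_ff, d_model)

out_swiglu = glu_variant(x, W1, W2, W_out, silu)  # SwiGLU
out_reglu = glu_variant(x, W1, W2, W_out, relu)   # ReGLU
```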
Historical Context
Swish was proposed by Ramachandran et al. (2017, “Searching for Activation Functions”) at Google Brain, discovered through automated search over activation function design spaces using reinforcement learning. The same function was independently proposed as SiLU (Sigmoid Linear Unit) by Elfwing et al. (2018). The name “SiLU” is now the standard in PyTorch.
Swish gained traction when EfficientNet (Tan & Le, 2019) used it throughout, showing consistent gains over ReLU in image classification. The function’s real impact came when Shazeer (2020, “GLU Variants Improve Transformer”) showed that pairing SiLU with Gated Linear Units (SwiGLU) substantially improved Transformer feedforward blocks. This combination was adopted by PaLM (Chowdhery et al., 2022) and then LLaMA (Touvron et al., 2023), making SwiGLU the de facto standard for LLM architectures. The learnable β parameter from the original Swish paper is almost never used — β = 1 (plain SiLU) works fine.