Epsilon-Greedy Exploration

With probability epsilon, take a random action; otherwise take the greedy (highest Q-value) action. It is the simplest exploration strategy for discrete action spaces, used in DQN and most tabular RL. Usually epsilon decays from a high value (1.0) to a small floor (0.01-0.1) over the course of training.

Imagine you always eat at the restaurant you think is best. You’ll never discover the new place around the corner that might be even better. Epsilon-greedy fixes this by occasionally (with probability epsilon) choosing a random restaurant instead. Early in training, epsilon is high — you explore aggressively because your estimates are unreliable anyway. As you learn, epsilon decays — you exploit your knowledge more, but never stop exploring entirely.

The key property is simplicity: it requires no learned exploration model, no uncertainty estimates, no special architecture. You compute Q-values, pick the best one most of the time, and randomly deviate the rest. This is “dithering” exploration — completely undirected randomness. It works surprisingly well in practice for discrete action spaces, especially when combined with a replay buffer that re-uses past exploratory experiences.

The limitation is equally clear: epsilon-greedy explores uniformly over actions, not intelligently. In a 1000-action space, the random action is almost certainly useless. For continuous action spaces, random uniform noise is even worse — methods like Gaussian noise (DDPG) or entropy bonuses (SAC) are strictly better. But for Atari-scale discrete action spaces (4-18 actions), epsilon-greedy remains the standard.

Action selection:

$$
a_t = \begin{cases} \text{random action from } \mathcal{A} & \text{with probability } \varepsilon \\ \arg\max_{a} Q(s_t, a) & \text{with probability } 1 - \varepsilon \end{cases}
$$

Linear epsilon decay (most common):

$$
\varepsilon_t = \max\bigl(\varepsilon_{\text{end}}, \; \varepsilon_{\text{start}} - \frac{t}{T_{\text{decay}}} \cdot (\varepsilon_{\text{start}} - \varepsilon_{\text{end}})\bigr)
$$

Typical values: $\varepsilon_{\text{start}} = 1.0$, $\varepsilon_{\text{end}} = 0.01$, $T_{\text{decay}} = 100{,}000$ steps.
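As a quick check with these typical values, halfway through the decay ($t = 50{,}000$) the schedule gives:

$$
\varepsilon_{50{,}000} = \max\bigl(0.01, \; 1.0 - \tfrac{50{,}000}{100{,}000} \cdot (1.0 - 0.01)\bigr) = 0.505
$$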

Probability of each action under epsilon-greedy:

$$
\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}
\end{cases}" is wrong; corrected below
$$

This means even the greedy action gets a small boost from the random component.
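A quick sanity check that these probabilities form a valid distribution (the epsilon and action-count values here are illustrative, not from the text):

```python
# Epsilon-greedy action distribution for a small discrete action space.
epsilon = 0.1
n_actions = 4

greedy_prob = 1 - epsilon + epsilon / n_actions  # argmax action: 1 - eps + eps/|A|
other_prob = epsilon / n_actions                 # every other action: eps/|A|

probs = [greedy_prob] + [other_prob] * (n_actions - 1)
assert abs(sum(probs) - 1.0) < 1e-12  # probabilities sum to 1
```

Note that the greedy action's probability is `1 - epsilon + epsilon/|A|`, not `1 - epsilon`, because the uniform random draw can also land on the greedy action.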

```python
import random

import torch


def epsilon_greedy_action(q_values, epsilon):
    """
    Select an action using epsilon-greedy.

    q_values: (n_actions,) Q-values for the current state
    epsilon: float, exploration probability
    Returns: int, selected action index
    """
    if random.random() < epsilon:
        return random.randrange(q_values.shape[0])  # random action
    return q_values.argmax().item()  # greedy action


# ── Batched version (for vectorised envs) ───────────────────────
def epsilon_greedy_batch(q_values, epsilon):
    """
    q_values: (B, n_actions)
    Returns: (B,) action indices
    """
    B, n_actions = q_values.shape
    greedy = q_values.argmax(dim=1)  # (B,)
    # Create the random actions and mask on the same device as q_values,
    # so this also works when Q-values live on the GPU.
    random_actions = torch.randint(0, n_actions, (B,), device=q_values.device)
    mask = torch.rand(B, device=q_values.device) < epsilon  # (B,) True = explore
    return torch.where(mask, random_actions, greedy)


# ── Linear decay schedule ───────────────────────────────────────
def get_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=100_000):
    return max(eps_end, eps_start - step * (eps_start - eps_end) / decay_steps)
```
A pure-numpy variant with random tie-breaking, plus two decay schedules:

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng=None):
    """
    Epsilon-greedy action selection, pure numpy.

    q_values: (n_actions,) estimated Q-values for the current state
    epsilon: float in [0, 1], exploration probability
    rng: optional numpy random Generator for reproducibility
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.integers(n_actions)  # uniform random action
    # Break ties randomly (important early in training when Q ≈ 0)
    max_q = q_values.max()
    best_actions = np.where(q_values == max_q)[0]  # all actions with max Q
    return rng.choice(best_actions)


def linear_decay(step, start=1.0, end=0.01, decay_steps=100_000):
    """Linear epsilon decay, clamped to [end, start]."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def exponential_decay(step, start=1.0, end=0.01, half_life=10_000):
    """Exponential decay: epsilon halves every half_life steps."""
    return max(end, start * 0.5 ** (step / half_life))
```
  • DQN (see q-learning/): the original DQN paper uses epsilon-greedy with linear decay from 1.0 to 0.1 over 1M frames, then holds at 0.1
  • Double DQN, Dueling DQN: all DQN variants use the same epsilon-greedy exploration
  • Tabular Q-learning and SARSA: the classic RL textbook algorithms (Sutton & Barto) use epsilon-greedy as the default exploration strategy
  • CQL (see q-learning/): offline RL — during the original data collection phase, epsilon-greedy with high epsilon generates the exploratory dataset
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Boltzmann (softmax) exploration | Want action probabilities proportional to Q-values | More directed than uniform random; temperature is another hyperparameter |
| Gaussian noise (DDPG, TD3) | Continuous action spaces | Adds N(0, sigma) to continuous actions; not applicable to discrete spaces |
| Entropy bonus (SAC) | Want principled exploration in continuous spaces | Maximises entropy alongside reward; requires a modified objective |
| UCB (Upper Confidence Bound) | Bandit problems, tree search (MCTS) | Explores actions with high uncertainty; needs visit counts or uncertainty estimates |
| Noisy Networks (NoisyNet) | Want state-dependent exploration | Learned noise in network weights; replaces epsilon entirely; more compute |
| Intrinsic motivation (RND, ICM) | Sparse reward environments | Explores based on novelty/surprise; adds complexity and a second model |
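For contrast with epsilon-greedy's uniform randomness, here is a minimal sketch of the first alternative, Boltzmann (softmax) exploration (the temperature values are illustrative):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature).

    Lower temperature -> closer to greedy; higher -> closer to uniform.
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(q_values) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs)
```

Unlike epsilon-greedy, a clearly bad action (very low Q) is almost never sampled, while near-ties are explored roughly equally, which is what "more directed than uniform random" means in the table above.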

Epsilon-greedy is one of the oldest exploration strategies, originating from the multi-armed bandit literature in the 1950s-60s (Robbins, 1952). It was adopted into RL through Watkins’ Q-learning (1989) and became the standard exploration method in the tabular RL era.

The DQN paper (Mnih et al., 2015) used epsilon-greedy for Atari games and cemented it as the default for deep Q-learning. Despite being theoretically suboptimal (it explores uniformly rather than intelligently), its simplicity and reliability kept it dominant. More sophisticated methods like NoisyNets (Fortunato et al., 2018) and intrinsic motivation approaches exist but are rarely worth the added complexity for standard benchmarks. The trend in modern RL is toward entropy-based exploration (SAC) for continuous control, while epsilon-greedy remains standard for discrete action spaces.