Epsilon-Greedy Exploration

With probability epsilon, take a random action; otherwise take the greedy (highest Q-value) action. It is the simplest exploration strategy for discrete action spaces, used in DQN and most tabular RL. Usually epsilon decays from a high value (1.0) to a small floor (0.01-0.1) over the course of training.

Imagine you always eat at the restaurant you think is best. You’ll never discover the new place around the corner that might be even better. Epsilon-greedy fixes this by occasionally (with probability epsilon) choosing a random restaurant instead. Early in training, epsilon is high — you explore aggressively because your estimates are unreliable anyway. As you learn, epsilon decays — you exploit your knowledge more, but never stop exploring entirely.

The key property is simplicity: it requires no learned exploration model, no uncertainty estimates, no special architecture. You compute Q-values, pick the best one most of the time, and randomly deviate the rest. This is “dithering” exploration — completely undirected randomness. It works surprisingly well in practice for discrete action spaces, especially when combined with a replay buffer that re-uses past exploratory experiences.

The limitation is equally clear: epsilon-greedy explores uniformly over actions, not intelligently. In a 1000-action space, the random action is almost certainly useless. For continuous action spaces, random uniform noise is even worse — methods like Gaussian noise (DDPG) or entropy bonuses (SAC) are strictly better. But for Atari-scale discrete action spaces (4-18 actions), epsilon-greedy remains the standard.

Action selection:

$$
a_t = \begin{cases} \text{random action from } \mathcal{A} & \text{with probability } \varepsilon \\ \arg\max_{a} Q(s_t, a) & \text{with probability } 1 - \varepsilon \end{cases}
$$

Linear epsilon decay (most common):

$$
\varepsilon_t = \max\bigl(\varepsilon_{\text{end}}, \; \varepsilon_{\text{start}} - \frac{t}{T_{\text{decay}}} \cdot (\varepsilon_{\text{start}} - \varepsilon_{\text{end}})\bigr)
$$

Typical values: $\varepsilon_{\text{start}} = 1.0$, $\varepsilon_{\text{end}} = 0.01$, $T_{\text{decay}} = 100{,}000$ steps.
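As a quick check with these typical values, halfway through the decay ($t = 50{,}000$) the schedule gives:

$$
\varepsilon_{50{,}000} = \max\bigl(0.01, \; 1.0 - \tfrac{50{,}000}{100{,}000} \cdot (1.0 - 0.01)\bigr) = 0.505
$$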

Probability of each action under epsilon-greedy:

$$
\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}
\end{cases}" is wrong; corrected below
$$

This means even the greedy action gets a small boost from the random component.
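A quick sanity check that these probabilities form a valid distribution (the epsilon and action-count values here are illustrative, not from the text):

```python
# Epsilon-greedy action distribution for a small discrete action space.
epsilon = 0.1
n_actions = 4

greedy_prob = 1 - epsilon + epsilon / n_actions  # argmax action: 1 - eps + eps/|A|
other_prob = epsilon / n_actions                 # every other action: eps/|A|

probs = [greedy_prob] + [other_prob] * (n_actions - 1)
assert abs(sum(probs) - 1.0) < 1e-12  # probabilities sum to 1
```

Note that the greedy action's probability is `1 - epsilon + epsilon/|A|`, not `1 - epsilon`, because the uniform random draw can also land on the greedy action.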

```python
import random

import torch


def epsilon_greedy_action(q_values, epsilon):
    """
    Select an action using epsilon-greedy.

    q_values: (n_actions,) Q-values for the current state
    epsilon: float, exploration probability
    Returns: int, selected action index
    """
    if random.random() < epsilon:
        return random.randrange(q_values.shape[0])  # random action
    return q_values.argmax().item()  # greedy action


# ── Batched version (for vectorised envs) ───────────────────────
def epsilon_greedy_batch(q_values, epsilon):
    """
    q_values: (B, n_actions)
    Returns: (B,) action indices
    """
    B, n_actions = q_values.shape
    greedy = q_values.argmax(dim=1)  # (B,)
    # Create the random actions and mask on the same device as q_values,
    # so this also works when Q-values live on the GPU.
    random_actions = torch.randint(0, n_actions, (B,), device=q_values.device)
    mask = torch.rand(B, device=q_values.device) < epsilon  # (B,) True = explore
    return torch.where(mask, random_actions, greedy)


# ── Linear decay schedule ───────────────────────────────────────
def get_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=100_000):
    return max(eps_end, eps_start - step * (eps_start - eps_end) / decay_steps)
```
A pure-numpy variant with random tie-breaking, plus two decay schedules:

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng=None):
    """
    Epsilon-greedy action selection, pure numpy.

    q_values: (n_actions,) estimated Q-values for the current state
    epsilon: float in [0, 1], exploration probability
    rng: optional numpy random Generator for reproducibility
    """
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.integers(n_actions)  # uniform random action
    # Break ties randomly (important early in training when Q ≈ 0)
    max_q = q_values.max()
    best_actions = np.where(q_values == max_q)[0]  # all actions with max Q
    return rng.choice(best_actions)


def linear_decay(step, start=1.0, end=0.01, decay_steps=100_000):
    """Linear epsilon decay, clamped to [end, start]."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def exponential_decay(step, start=1.0, end=0.01, half_life=10_000):
    """Exponential decay: epsilon halves every half_life steps."""
    return max(end, start * 0.5 ** (step / half_life))
```
  • DQN (see q-learning/): the original DQN paper uses epsilon-greedy with linear decay from 1.0 to 0.1 over 1M frames, then holds at 0.1
  • Double DQN, Dueling DQN: all DQN variants use the same epsilon-greedy exploration
  • Tabular Q-learning and SARSA: the classic RL textbook algorithms (Sutton & Barto) use epsilon-greedy as the default exploration strategy
  • CQL (see q-learning/): offline RL — during the original data collection phase, epsilon-greedy with high epsilon generates the exploratory dataset
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Boltzmann (softmax) exploration | Want action probabilities proportional to Q-values | More directed than uniform random; temperature is another hyperparameter |
| Gaussian noise (DDPG, TD3) | Continuous action spaces | Adds N(0, sigma) to continuous actions; not applicable to discrete spaces |
| Entropy bonus (SAC) | Want principled exploration in continuous spaces | Maximises entropy alongside reward; requires a modified objective |
| UCB (Upper Confidence Bound) | Bandit problems, tree search (MCTS) | Explores actions with high uncertainty; needs visit counts or uncertainty estimates |
| Noisy Networks (NoisyNet) | Want state-dependent exploration | Learned noise in network weights; replaces epsilon entirely; more compute |
| Intrinsic motivation (RND, ICM) | Sparse reward environments | Explores based on novelty/surprise; adds complexity and a second model |
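For contrast with epsilon-greedy's uniform randomness, here is a minimal sketch of the first alternative, Boltzmann (softmax) exploration (the temperature values are illustrative):

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature).

    Lower temperature -> closer to greedy; higher -> closer to uniform.
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(q_values) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs)
```

Unlike epsilon-greedy, a clearly bad action (very low Q) is almost never sampled, while near-ties are explored roughly equally, which is what "more directed than uniform random" means in the table above.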

Epsilon-greedy is one of the oldest exploration strategies, originating from the multi-armed bandit literature in the 1950s-60s (Robbins, 1952). It was adopted into RL through Watkins’ Q-learning (1989) and became the standard exploration method in the tabular RL era.

The DQN paper (Mnih et al., 2015) used epsilon-greedy for Atari games and cemented it as the default for deep Q-learning. Despite being theoretically suboptimal (it explores uniformly rather than intelligently), its simplicity and reliability kept it dominant. More sophisticated methods like NoisyNets (Fortunato et al., 2018) and intrinsic motivation approaches exist but are rarely worth the added complexity for standard benchmarks. The trend in modern RL is toward entropy-based exploration (SAC) for continuous control, while epsilon-greedy remains standard for discrete action spaces.