# Unified GAN Algorithm

## Introduction
This is the one file in the series where the core algorithm is fundamentally different — it’s two competing optimization steps, not one. Every other algorithm has a single loss to minimize; GANs have a minimax game.
GANs are less dominant now that diffusion has taken over generation, but the adversarial training idea (two networks with opposing objectives) is a genuinely different training paradigm that still appears in other contexts.
The key narrative:
Vanilla GAN introduces the adversarial idea: sharp outputs because any blur is a signal D can exploit. But the JS divergence underlying the BCE loss is flat when distributions don’t overlap (which is almost always in high dimensions), causing vanishing gradients.
WGAN swaps the divergence measure. The Wasserstein distance provides gradients even when distributions are disjoint — and D’s loss actually correlates with sample quality for the first time (you can look at the loss curve and know if training is going well). The catch is you need D to be Lipschitz, enforced crudely by weight clipping.
WGAN-GP fixes the clipping with a gradient penalty — same Wasserstein loss, but the Lipschitz constraint is enforced softly by penalising deviations of ‖∇D‖ from 1 at points interpolated between real and fake samples. This is the version that actually works well in practice.
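The gradient penalty is short enough to sketch in full. Below is a minimal PyTorch version, assuming `D` is a critic mapping a batch of image tensors to one score per sample; the weight `lambda_gp = 10` is the standard WGAN-GP default, but the function name and signature here are illustrative, not from any particular codebase.

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """WGAN-GP: push ||grad D|| toward 1 at random real/fake interpolates."""
    # One interpolation coefficient per sample, broadcast over (C, H, W).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = D(interp)
    # Gradient of the critic's output w.r.t. the interpolated inputs;
    # create_graph=True so the penalty itself is differentiable for D's update.
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=interp, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)     # per-sample L2 norm
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```

This term is simply added to the Wasserstein D loss each step; note the penalty is applied at interpolated points, not at the real or fake samples themselves.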
Hinge GAN takes a different approach to the D-too-strong problem: the loss saturates once D is confident enough, so it can’t keep sharpening endlessly. Paired with spectral normalization (an architectural constraint, not a loss term), this powered BigGAN and SAGAN.
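The saturation is easy to see in code. A minimal PyTorch sketch of the hinge losses (function names are illustrative); in practice D’s layers would additionally be wrapped with `torch.nn.utils.spectral_norm` to control the Lipschitz constant:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(real_scores, fake_scores):
    # Saturates once real > 1 and fake < -1: D gets no reward
    # for pushing its confidence beyond the margin.
    return (F.relu(1.0 - real_scores).mean()
            + F.relu(1.0 + fake_scores).mean())

def g_hinge_loss(fake_scores):
    # G's side stays linear (non-saturating): just raise D's score on fakes.
    return -fake_scores.mean()
```

Once D’s scores clear the ±1 margin, both `relu` terms are exactly zero — the flat region is what stops D from sharpening endlessly.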
cGAN and Pix2Pix show that conditioning is orthogonal to the loss choice — you can bolt it onto any variant. Pix2Pix is interesting because G takes an image as input, not noise, which bends the skeleton (hence the separate train_step_paired method).
The training instability section is arguably the most important part. GANs need a Nash equilibrium, not a minimum — and gradient descent isn’t designed for that. This is the fundamental reason diffusion won: stable MSE regression vs. a delicate adversarial balancing act.
## Summary: What changes vs. what stays the same

### Always the same (core loop)

- Sample noise z, generate fake = G(z)
- D scores both real and fake
- Update D to better distinguish real from fake (PLUGGABLE loss)
- Generate new fakes, D scores them
- Update G to better fool D (PLUGGABLE loss)
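The five steps above can be sketched as one unified training step, with the two pluggable losses passed in as functions. This is a hedged PyTorch sketch under assumed names and signatures, not a specific codebase:

```python
import torch

def train_step(G, D, d_loss_fn, g_loss_fn, opt_d, opt_g, real, z_dim=128):
    """One GAN step; only d_loss_fn / g_loss_fn change across variants."""
    # --- Discriminator update ---
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()                  # don't backprop into G here
    d_loss = d_loss_fn(D(real), D(fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: fresh fakes, gradients flow through D into G ---
    z = torch.randn(real.size(0), z_dim)
    g_loss = g_loss_fn(D(G(z)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

The `.detach()` in the D step and the fresh `G(z)` in the G step are the two places the skeleton enforces "two competing optimisations, not one".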
### What varies by variant

| Variant | D loss | G loss | Regularisation |
|---|---|---|---|
| Vanilla | BCE (real/fake) | −log D(G(z)) | — |
| WGAN | Wasserstein | −E[D(G(z))] | Weight clipping |
| WGAN-GP | Wasserstein | −E[D(G(z))] | Gradient penalty |
| Hinge | Hinge | −E[D(G(z))] | Spectral norm (arch.) |
| cGAN | Any + condition | Any + condition | Any |
| Pix2Pix | BCE (paired) | BCE + L1 recon. | — |
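To make "pluggable" concrete, here is a sketch of two of the loss pairs from the table — vanilla BCE-on-logits and Wasserstein — in PyTorch (function names are mine, not from a reference implementation):

```python
import torch
import torch.nn.functional as F

# Vanilla: BCE on D's logits; non-saturating G loss, i.e. -log D(G(z)).
def d_vanilla(real_logits, fake_logits):
    return (F.binary_cross_entropy_with_logits(
                real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(
                fake_logits, torch.zeros_like(fake_logits)))

def g_vanilla(fake_logits):
    # "Fakes labelled as real" is exactly -log D(G(z)).
    return F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))

# WGAN / WGAN-GP: raw scores, no sigmoid; D maximises the score gap.
def d_wasserstein(real_scores, fake_scores):
    return fake_scores.mean() - real_scores.mean()

def g_wasserstein(fake_scores):
    return -fake_scores.mean()
```

Note the Wasserstein pair takes unbounded critic scores, not probabilities — swapping the pair also changes what D's output head must look like.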
## Motives for each variant

| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| Vanilla GAN | Generative models before GANs (VAEs, PixelRNN) produce blurry outputs because they average over modes to minimise pixel error | Frame generation as a game: G tries to fool D, D tries to catch G. The adversarial loss forces G to produce SHARP outputs — any blur or artifact is a signal D can exploit |
| WGAN | Vanilla GAN training is unstable — mode collapse, oscillation, vanishing D gradients. The JS divergence is flat when distributions don’t overlap | Replace JS divergence with Wasserstein distance, which provides smooth, meaningful gradients even when the real and fake distributions don’t overlap. D loss now correlates with sample quality |
| WGAN-GP | WGAN’s weight clipping is crude: it pushes D weights toward ±c, under-using capacity. D becomes a simple function regardless of network size | Instead of clamping weights, directly penalise D’s gradient norm at random interpolations between real and fake. ‖∇D‖ ≈ 1 is the Lipschitz constraint enforced softly, preserving D’s capacity |
| Hinge | D can become “too confident” — it classifies everything correctly with extreme logits, leaving no useful gradient signal for G | Hinge loss saturates once D is “good enough” (score > 1 for real, < −1 for fake). Beyond that threshold, D gets no further reward, preventing it from dominating G. Paired with spectral norm to control D’s Lipschitz constant |
| cGAN | Unconditional GANs generate random samples — no control over what class or attribute appears | Feed the condition c (class, text, etc.) to BOTH G and D. D now checks “is this a real cat?” not just “is this real?” G must produce outputs consistent with c |
| Pix2Pix | Paired image translation (edges → photo, day → night) with just L1 loss produces blurry results because L1 averages over all plausible outputs | G takes an input image instead of noise. D sees (input, output) pairs. Adversarial loss ensures sharp, realistic output; L1 loss ensures the output matches the input’s structure (not just any realistic image) |
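Pix2Pix's G objective from the table (adversarial BCE plus weighted L1) can be sketched as follows; `lambda_l1 = 100` is the weight from the original paper, while the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def pix2pix_g_loss(d_fake_logits, fake_img, target_img, lambda_l1=100.0):
    """Adversarial term keeps the output sharp; L1 ties it to the target."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    recon = F.l1_loss(fake_img, target_img)
    return adv + lambda_l1 * recon
```

The large L1 weight reflects the division of labour: L1 carries the structure, and the adversarial term only has to clean up the blur L1 would otherwise leave.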
## The training instability problem (why GANs are notoriously hard)

GANs are a two-player game, not an optimisation problem. Standard gradient descent finds MINIMA of a loss — but GANs need a SADDLE POINT (Nash equilibrium). This causes three signature failure modes:
- Mode collapse: G finds a few outputs that fool D and only produces those. D adapts, G finds new modes, D adapts again — G cycles through modes but never covers the full distribution.
  - WGAN/WGAN-GP help by providing smoother gradients
  - Diffusion models sidestep this entirely (no adversarial training)
- Oscillation: G and D chase each other endlessly without converging. D gets good → G adjusts → D adjusts → … The loss curves oscillate rather than converge.
  - Hinge loss / spectral norm help by limiting D’s aggressiveness
  - Learning rate balance (lower LR for D) is a common hack
- Vanishing gradients: When D is too good, it outputs 0/1 with certainty. The gradient of log(1 − D(G(z))) vanishes — G gets no learning signal at all.
  - Non-saturating loss (−log D) helps early in training
  - WGAN’s Wasserstein distance provides gradients everywhere
  - Gradient penalty keeps D from becoming too sharp
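The vanishing-gradient point can be checked numerically: when D confidently rejects a fake, the original saturating G loss log(1 − D(G(z))) has a near-zero gradient with respect to D’s logit, while the non-saturating −log D(G(z)) does not. A small PyTorch demonstration (the logit value −8 is arbitrary, just "D is very confident the sample is fake"):

```python
import torch

logit = torch.tensor([-8.0], requires_grad=True)
d_out = torch.sigmoid(logit)            # D(G(z)) ≈ 0.0003: confident "fake"

sat = torch.log(1 - d_out)              # original minimax G loss term
g_sat, = torch.autograd.grad(sat, logit, retain_graph=True)

nonsat = -torch.log(d_out)              # non-saturating trick
g_nonsat, = torch.autograd.grad(nonsat, logit)

print(g_sat.item(), g_nonsat.item())    # tiny vs. ≈ -1
```

Analytically the gradients are −σ(logit) and −(1 − σ(logit)) respectively, so exactly when D is confident the first goes to zero and the second stays near ±1.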
This instability is the fundamental reason diffusion models replaced GANs for most generation tasks: diffusion training is stable MSE regression, with no adversarial dynamics to balance. GANs produce sharper single samples, but diffusion covers more modes and trains reliably.
## How GANs connect to other algorithms in this series

- GAN D → VAE decoder loss: The KL-AE (Stable Diffusion’s VAE) uses a GAN discriminator as part of its reconstruction loss. The adversarial signal forces sharp reconstructions that pure MSE or perceptual loss alone can’t achieve.
- GAN → Diffusion: Diffusion replaced GANs for generation, but some hybrid approaches (e.g. consistency models, GANs for single-step distillation of diffusion) combine both.
- cGAN → CLIP / Contrastive: Both condition on semantic info. CLIP aligns image-text in embedding space; cGAN conditions generation on text/labels directly.
- GAN → RL: The adversarial setup (G improves against D’s criticism) is structurally similar to actor-critic RL (policy improves against value function’s criticism). GAIL (Generative Adversarial Imitation Learning) makes this connection explicit.