Core Neural Network Training

A minimal, single-file example covering the fundamentals of training a neural network in PyTorch: architecture, weight initialisation, training loop, and TensorBoard logging.

Requires Python 3.10+ and PyTorch 2.0+. Run:

python nn_training_core.py

This trains a small MLP on synthetic classification data for 20 epochs. You should see output like:

Epoch 1/20 │ train loss 2.3841 acc 0.102 │ val loss 2.3377 acc 0.097 │ lr 2.97e-04
Epoch 2/20 │ train loss 2.2996 acc 0.116 │ val loss 2.3049 acc 0.108 │ lr 2.79e-04
...

Training logs are written to runs/mlp_demo/.

tensorboard --logdir runs

Then open http://localhost:6006 in your browser. You’ll find:

  • train/loss_epoch and val/loss — training and validation loss curves per epoch. These should both decrease; a growing gap between them indicates overfitting.
  • train/accuracy and val/accuracy — classification accuracy per epoch.
  • lr — learning rate over time. Confirms the cosine schedule is decaying as expected.
  • Graphs tab — a visual representation of the model architecture. Useful for verifying the layer stack matches what you intended.
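The cosine decay mentioned above can be sketched in plain Python. This is the generic cosine-annealing formula, not necessarily the exact scheduler call the script uses; the base LR of 3e-4 is an assumption inferred from the sample output.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine annealing: decay smoothly from base_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Plotting `cosine_lr(epoch, 20)` for epochs 0..20 should reproduce the shape of the `lr` curve in TensorBoard.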

TensorBoard reads tfevents files — binary files whose records use TensorFlow's TFRecord framing, with Protocol Buffers payloads. Each file is named something like events.out.tfevents.1773602131.hostname.pid.0 (start timestamp, hostname, process id, suffix). When you point tensorboard --logdir runs at a directory, it recursively scans for any file whose name contains tfevents and streams the records inside.

Each record in the file is a serialized Event protobuf containing a wall-clock timestamp, a training step number, and a payload (a scalar value, an image, a histogram, a model graph, etc.). The SummaryWriter from torch.utils.tensorboard handles serialization: calls like writer.add_scalar("train/loss_epoch", loss, epoch) create an Event with a Summary payload holding one Value tagged train/loss_epoch, then append that record to the open tfevents file and flush it to disk.

The hierarchical tag names (e.g. train/loss_epoch, val/loss) are just slash-separated strings — TensorBoard uses the prefix before the first / to group curves into collapsible sections in the UI.
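That grouping rule is easy to state precisely — split each tag on the first slash and bucket by the prefix (a tag with no slash forms its own section). A minimal sketch:

```python
def group_tags(tags):
    """Group scalar tags by the prefix before the first '/', as TensorBoard's UI does."""
    sections = {}
    for tag in tags:
        section = tag.split("/", 1)[0]  # "lr" has no slash, so it is its own section
        sections.setdefault(section, []).append(tag)
    return sections
```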

Because the format is append-only, you can watch metrics update live while training is still running. If you restart a run into the same log_dir without clearing old files, TensorBoard will see overlapping step numbers and the plots can look garbled — so either delete the old runs/ directory or use a fresh sub-directory for each experiment.

.
├── nn_training_core.py   # all code: model, init, training loop, logging
├── runs/                 # created at runtime, contains TensorBoard logs
│   └── mlp_demo/
└── README.md
| Stage | What / Why |
| --- | --- |
| Architecture | Stack of Linear → LayerNorm → ReLU. No norm/act after the final layer (it outputs raw logits). |
| Initialisation | Kaiming Normal for ReLU layers. Prevents signals from vanishing or exploding on the first forward pass. |
| Optimiser | AdamW: adaptive per-parameter LR + decoupled weight decay. Cosine schedule smoothly anneals the LR to zero. |
| Forward pass | model(x) → logits. No softmax here — cross_entropy does log-softmax + NLL internally for stability. |
| Backward pass | loss.backward() computes ∂loss/∂θ for every parameter. Gradient clipping prevents any single update from being too large. |
| Validation | model.eval() + torch.no_grad(). Disables dropout, saves memory by not building the computation graph. |
| Logging | TensorBoard: per-epoch loss & accuracy (track convergence), LR curve (verify schedule), model graph (sanity-check the architecture). |
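The forward-pass stage notes that cross_entropy fuses log-softmax and NLL for stability. The trick it relies on — subtracting the max logit before exponentiating — can be shown for a single example in plain Python (PyTorch's implementation is batched and vectorised, but computes the same quantity):

```python
import math

def cross_entropy_from_logits(logits, target):
    """Numerically stable cross-entropy for one example:
    log-sum-exp of the logits (with the max trick) minus the target logit.
    Equivalent to log-softmax + NLL, so the model never needs its own softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

Without the max trick, `math.exp(1000.0)` would overflow; with it, even extreme logits stay finite.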

The synthetic dataset is a placeholder. To use your own data, replace the X and Y tensors in main() with real data, or swap in a proper Dataset / DataLoader pair. Everything else — the model, training loop, and logging — stays the same.