Core Neural Network Training

A minimal, single-file example covering the fundamentals of training a neural network in PyTorch: architecture, weight initialisation, training loop, and TensorBoard logging.

Requires Python 3.10+ and PyTorch 2.0+. Run:

python nn_training_core.py

This trains a small MLP on synthetic classification data for 20 epochs. You should see output like:

Epoch 1/20 │ train loss 2.3841 acc 0.102 │ val loss 2.3377 acc 0.097 │ lr 2.97e-04
Epoch 2/20 │ train loss 2.2996 acc 0.116 │ val loss 2.3049 acc 0.108 │ lr 2.79e-04
...

Training logs are written to runs/mlp_demo/.

tensorboard --logdir runs

Then open http://localhost:6006 in your browser. You’ll find:

  • train/loss_epoch and val/loss — training and validation loss curves per epoch. These should both decrease; a growing gap between them indicates overfitting.
  • train/accuracy and val/accuracy — classification accuracy per epoch.
  • lr — learning rate over time. Confirms the cosine schedule is decaying as expected.
  • Graphs tab — a visual representation of the model architecture. Useful for verifying the layer stack matches what you intended.
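The cosine decay mentioned above can be sketched in plain Python. This is the generic cosine-annealing formula, not necessarily the exact scheduler call the script uses; the base LR of 3e-4 is an assumption inferred from the sample output.

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine annealing: decay smoothly from base_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Plotting `cosine_lr(epoch, 20)` for epochs 0..20 should reproduce the shape of the `lr` curve in TensorBoard.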

TensorBoard reads tfevents files — binary files whose records use TensorFlow's TFRecord framing, with Protocol Buffers payloads. Each file is named something like events.out.tfevents.1773602131.hostname.pid.0 (start timestamp, hostname, process id, suffix). When you point tensorboard --logdir runs at a directory, it recursively scans for any file whose name contains tfevents and streams the records inside.

Each record in the file is a serialized Event protobuf containing a wall-clock timestamp, a training step number, and a payload (a scalar value, an image, a histogram, a model graph, etc.). The SummaryWriter from torch.utils.tensorboard handles serialization: calls like writer.add_scalar("train/loss_epoch", loss, epoch) create an Event with a Summary payload holding one Value tagged train/loss_epoch, then append that record to the open tfevents file and flush it to disk.

The hierarchical tag names (e.g. train/loss_epoch, val/loss) are just slash-separated strings — TensorBoard uses the prefix before the first / to group curves into collapsible sections in the UI.
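That grouping rule is easy to state precisely — split each tag on the first slash and bucket by the prefix (a tag with no slash forms its own section). A minimal sketch:

```python
def group_tags(tags):
    """Group scalar tags by the prefix before the first '/', as TensorBoard's UI does."""
    sections = {}
    for tag in tags:
        section = tag.split("/", 1)[0]  # "lr" has no slash, so it is its own section
        sections.setdefault(section, []).append(tag)
    return sections
```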

Because the format is append-only, you can watch metrics update live while training is still running. If you restart a run into the same log_dir without clearing old files, TensorBoard will see overlapping step numbers and the plots can look garbled — so either delete the old runs/ directory or use a fresh sub-directory for each experiment.

.
├── nn_training_core.py   # all code: model, init, training loop, logging
├── runs/                 # created at runtime, contains TensorBoard logs
│   └── mlp_demo/
└── README.md
| Stage | What / Why |
| --- | --- |
| Architecture | Stack of Linear → LayerNorm → ReLU. No norm/act after the final layer (it outputs raw logits). |
| Initialisation | Kaiming Normal for ReLU layers. Prevents signals from vanishing or exploding on the first forward pass. |
| Optimiser | AdamW: adaptive per-parameter LR + decoupled weight decay. Cosine schedule smoothly anneals the LR to zero. |
| Forward pass | model(x) → logits. No softmax here — cross_entropy does log-softmax + NLL internally for stability. |
| Backward pass | loss.backward() computes ∂loss/∂θ for every parameter. Gradient clipping prevents any single update from being too large. |
| Validation | model.eval() + torch.no_grad(). Disables dropout, saves memory by not building the computation graph. |
| Logging | TensorBoard: per-epoch loss & accuracy (track convergence), LR curve (verify schedule), model graph (sanity-check the architecture). |
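The forward-pass stage notes that cross_entropy fuses log-softmax and NLL for stability. The trick it relies on — subtracting the max logit before exponentiating — can be shown for a single example in plain Python (PyTorch's implementation is batched and vectorised, but computes the same quantity):

```python
import math

def cross_entropy_from_logits(logits, target):
    """Numerically stable cross-entropy for one example:
    log-sum-exp of the logits (with the max trick) minus the target logit.
    Equivalent to log-softmax + NLL, so the model never needs its own softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

Without the max trick, `math.exp(1000.0)` would overflow; with it, even extreme logits stay finite.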

The synthetic dataset is a placeholder. To use your own data, replace the X and Y tensors in main() with real data, or swap in a proper Dataset / DataLoader pair. Everything else — the model, training loop, and logging — stays the same.