# Core Neural Network Training
A minimal, single-file example covering the fundamentals of training a neural network in PyTorch: architecture, weight initialisation, training loop, and TensorBoard logging.
## Requirements
Python 3.10+ and PyTorch 2.0+.
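If you still need the dependencies (assuming a pip-based environment), `pip install torch tensorboard` covers both; the `tensorboard` package provides the viewer used below and is required for `torch.utils.tensorboard` to work.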
## Run training
```sh
python nn_training_core.py
```

This trains a small MLP on synthetic classification data for 20 epochs. You should see output like:
```text
Epoch 1/20 │ train loss 2.3841 acc 0.102 │ val loss 2.3377 acc 0.097 │ lr 2.97e-04
Epoch 2/20 │ train loss 2.2996 acc 0.116 │ val loss 2.3049 acc 0.108 │ lr 2.79e-04
...
```

Training logs are written to `runs/mlp_demo/`.
## View results in TensorBoard
```sh
tensorboard --logdir runs
```

Then open http://localhost:6006 in your browser. You’ll find:
- `train/loss_epoch` and `val/loss` — training and validation loss curves per epoch. These should both decrease; a growing gap between them indicates overfitting.
- `train/accuracy` and `val/accuracy` — classification accuracy per epoch.
- `lr` — learning rate over time. Confirms the cosine schedule is decaying as expected.
- Graphs tab — a visual representation of the model architecture. Useful for verifying the layer stack matches what you intended (see the sketch below for how a graph gets logged).
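The Graphs tab is only populated if the script logs the model. A minimal sketch of how that is typically done with `torch.utils.tensorboard` (the stand-in model and input width here are assumptions, not the demo's actual values):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

model = torch.nn.Linear(32, 10)       # stand-in for the demo's MLP
example_input = torch.randn(1, 32)    # one representative input batch

writer = SummaryWriter(log_dir="runs/mlp_demo")
writer.add_graph(model, example_input)  # traces the model and fills the Graphs tab
writer.close()
```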
## How TensorBoard logging works
TensorBoard reads tfevents files: binary files in TensorFlow’s TFRecord wire format, where each record is a length-prefixed, checksummed Protocol Buffers message. Each file is named something like `events.out.tfevents.1773602131.hostname.pid.0`. When you point `tensorboard --logdir runs` at a directory, it recursively scans for any file whose name contains `tfevents` and streams the records inside.
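You can inspect these files yourself with the reader that ships in the `tensorboard` package; a minimal sketch, assuming the demo's log directory already exists:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("runs/mlp_demo")  # accepts a run directory or one tfevents file
acc.Reload()                             # scan the file(s) and load all records

print(acc.Tags()["scalars"])             # e.g. ['train/loss_epoch', 'val/loss', ...]
for e in acc.Scalars("train/loss_epoch"):
    print(e.step, e.wall_time, e.value)  # one entry per logged Event
```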
Each record in the file is a serialized `Event` protobuf containing a wall-clock timestamp, a training step number, and a payload (a scalar value, an image, a histogram, a model graph, etc.). The `SummaryWriter` from `torch.utils.tensorboard` handles serialization: a call like `writer.add_scalar("train/loss_epoch", loss, epoch)` creates an `Event` with a `Summary` payload holding one `Value` tagged `train/loss_epoch`, then appends that record to the open tfevents file and flushes it to disk.
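In the training script this boils down to a few calls per epoch. A sketch using the same tag names as the demo (the actual loop body is elided; the metric values are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mlp_demo")  # creates the tfevents file

for epoch in range(20):
    train_loss, val_loss = 0.5, 0.6  # placeholders for the real per-epoch metrics
    writer.add_scalar("train/loss_epoch", train_loss, epoch)  # one Event per call
    writer.add_scalar("val/loss", val_loss, epoch)

writer.close()  # flush any buffered records to disk
```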
The hierarchical tag names (e.g. `train/loss_epoch`, `val/loss`) are just slash-separated strings; TensorBoard uses the prefix before the first `/` to group curves into collapsible sections in the UI.
Because the format is append-only, you can watch metrics update live while training is still running. If you restart a run into the same `log_dir` without clearing old files, TensorBoard will see overlapping step numbers and the plots can look garbled, so either delete the old `runs/` directory or use a fresh sub-directory for each experiment, as sketched below.
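One common pattern (an assumption here, not something the demo script does) is to timestamp the sub-directory so every run gets its own set of curves in the UI:

```python
import time
from torch.utils.tensorboard import SummaryWriter

# e.g. runs/mlp_demo_20250114-093041; never collides with an earlier run
log_dir = f"runs/mlp_demo_{time.strftime('%Y%m%d-%H%M%S')}"
writer = SummaryWriter(log_dir=log_dir)
```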
## Project structure
```text
.
├── nn_training_core.py   # all code: model, init, training loop, logging
├── runs/                 # created at runtime, contains TensorBoard logs
│   └── mlp_demo/
└── README.md
```

## Summary: what happens and why
| Stage | What / Why |
|---|---|
| Architecture | Stack of `Linear → LayerNorm → ReLU` blocks. No norm/act after the final layer (it outputs raw logits). |
| Initialisation | Kaiming Normal for ReLU layers. Prevents signals from vanishing or exploding on the first forward pass. |
| Optimiser | AdamW: adaptive per-param LR + decoupled weight decay. Cosine schedule smoothly anneals the LR to zero. |
| Forward pass | `model(x)` → logits. No softmax here; `cross_entropy` does log-softmax + NLL internally for stability. |
| Backward pass | `loss.backward()` computes ∂loss/∂θ for every param. Grad clipping prevents any single update from being too large. |
| Validation | `model.eval()` + `torch.no_grad()`. The former puts layers such as dropout into inference mode; the latter skips building the autograd graph, saving memory. |
| Logging | TensorBoard: per-epoch loss & accuracy (track convergence), LR curve (verify schedule), model graph (sanity check on the architecture). |
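A condensed sketch of how those stages fit together. The layer widths, weight decay, and the 3e-4 initial LR are assumptions chosen to match the logs above, not values read from the script:

```python
import torch
from torch import nn

def make_mlp(in_dim: int, hidden: int, num_classes: int) -> nn.Sequential:
    model = nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),  # raw logits: no norm/act on the output
    )
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)
    return model

model = make_mlp(in_dim=32, hidden=128, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    model.train()
    logits = model(x)                                  # forward pass: raw logits
    loss = nn.functional.cross_entropy(logits, y)      # log-softmax + NLL inside
    optimizer.zero_grad()
    loss.backward()                                    # fills .grad for every param
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # cap the update magnitude
    optimizer.step()
    return loss.item()

@torch.no_grad()                                       # no autograd graph built
def val_step(x: torch.Tensor, y: torch.Tensor) -> float:
    model.eval()                                       # inference-mode layers
    return nn.functional.cross_entropy(model(x), y).item()

# scheduler.step() is then called once per epoch to advance the cosine decay.
```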
## Adapting to real data
The synthetic dataset is a placeholder. To use your own data, replace the `X` and `Y` tensors in `main()` with real data, or swap in a proper `Dataset` / `DataLoader` pair, as sketched below. Everything else (the model, training loop, and logging) stays the same.
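A minimal sketch of that swap, assuming your features and labels already live in tensors (`X_train` and `y_train` are hypothetical names, and the random tensors below only stand in for real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical real data standing in for the synthetic X / Y tensors.
X_train = torch.randn(1000, 32)           # your feature matrix
y_train = torch.randint(0, 10, (1000,))   # your integer class labels

train_ds = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

for x, y in train_loader:
    loss = train_step(x, y)  # train_step from the sketch above, unchanged
```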