There is a particular kind of code that machine-learning teams ship without ever quite deciding to: the training loop. Nobody sits down to write a thousand-line orchestration layer. You write twenty lines to get a first model training, then you add early stopping, then a checkpoint every few epochs because the box rebooted once, then a second metric, then a branch for the multiclass run, then a try/except around the optimiser step because something threw at 3am. None of these were decisions. They were patches. And the result, in our first pass at the raster well-log digitisation model, was a loop that ran without complaint and that none of us could read.
This is not a deployment story. Apart from the one configuration discussed further down, the model that came out of the hand-rolled loop is numerically the same model that came out of the Lightning one. The run shape never moved: batch size 1 on the binary segmentation set, batch size 16 on the multiclass set through a custom collate function, 50 epochs in both cases, 110 minutes of wall clock on the 2,000-instance binary run and 550 minutes on the 15,000-instance multiclass run. Migrating to PyTorch Lightning [2] bought us exactly zero seconds of training time. What it bought was the ability to see what the loop was doing, which turned out to be worth more than any speedup, because one of the things the loop was doing was quietly wrong.
A loop that runs is not a loop you can read
Bare PyTorch [1] is the reason hand-rolled loops exist. Its eager, imperative style is a genuine pleasure to prototype in: there is no graph to compile, no session to manage, you just write the maths and it runs. The same property that makes it lovely to start with is the property that lets a training loop rot. Because nothing forces a structure on you, the structure you end up with is the accreted history of every quick fix, and that history is invisible the moment the loop works.
Ours had reached a familiar state. The epoch sweep, the forward pass, the loss, the backward call, the optimiser step, the gradient zeroing, the running-metric accumulators, the checkpoint logic, the device placement, the early-stopping counter, and the variable-width batch handling were all interleaved in one function. Reading it meant holding all of that in your head at once. Reviewing a change to it meant reasoning about whether a tweak to the metric accumulator could perturb the optimiser step, because they lived eleven lines apart in the same scope. The model architecture, our encoder-decoder segmenter in the lineage of U-Net [3], was the cleanest part of the codebase precisely because it had a clear job. The loop had every job, so it had no shape.
The hygiene argument for migrating is the same one Martin Fowler made for refactoring in general: restructuring that preserves behaviour is a deliverable in its own right [4]. You are not adding a feature. You are changing the code so a person can understand it, while the observable result stays byte-for-byte the same. That is an unglamorous thing to spend a sprint on, and it is also the single highest-leverage thing you can do to a training loop that has stopped being legible.
What Lightning actually takes off your hands
The mental model that made the migration click for us is simple. A training loop is a mix of two things: the science, which is yours and nobody else's, and the engineering boilerplate, which is the same in every project on earth and which you have no business hand-writing. Lightning [2] draws that line for you. The science goes into a LightningModule (the model, the forward pass, what a training step computes, which optimiser to configure). Everything else (the epoch and step loop, calling backward and stepping the optimiser, zeroing gradients in the right order, moving tensors to the device, checkpointing, resuming, logging) is handled by a Trainer that the whole community reviews.
The ledger below is the migration, concern by concern, on our real run shape. Each row is one responsibility the bespoke loop carried in its body, and what happened to it after the refactor.
Read the teal rows first, because they are the boring win, and the boring win is the point. The epoch and step loop, thirty-odd lines of for-and-while scaffolding in our code, simply disappears: the Trainer drives it and there is nothing left for us to maintain. The metric accumulators, which we had been summing by hand and dividing at epoch end (a place we had already shipped one off-by-one), collapse into a per-step log call. Checkpoint-and-resume, an eighteen-line tangle of conditional torch.save calls, becomes a single callback. None of this is clever. All of it is code we no longer own, no longer review, and can no longer break.
The orange row: the concern the loop hid
The row that matters is the orange one, and it is the whole reason this is a story rather than a chore. Somewhere in the accretion, the order of the optimiser step and the gradient zeroing in our hand-rolled loop had drifted out of the intended sequence on one of the two run configurations. It did not crash. It did not throw. The loss still went down, the model still trained, the validation numbers were still plausible, because slightly wrong gradient bookkeeping on a network this over-parameterised degrades gracefully into looking fine. That is among the most dangerous failure modes in machine learning: the bug that costs you a few points of quality and announces itself with absolutely nothing.
We did not find it by staring at the loop. We found it because Lightning's configure_optimizers and its prescribed step ordering gave the optimiser exactly one correct place to live, and when we moved our logic into that shape the discrepancy with the old behaviour became a thing we had to explain. The framework did not fix the bug. The framework made the bug impossible to keep hidden, by replacing a free-form region of our code, where the step could go anywhere, with a contract that says where it goes. A hand-rolled loop has no opinion about whether you zeroed the gradients before or after you stepped. A Trainer does, and its opinion is the correct one, written down once and reviewed by thousands of people who are not us.
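To show how a bug of this family can stay silent, the sketch below trains the same tiny regression twice in bare PyTorch: once with the conventional zero, backward, step order, and once with a hypothetical drift in which the zeroing is skipped on every other step. This is an illustration of the failure class only, not a reconstruction of the defect in our loop; the function name `train`, the toy data, and the learning rate are all made up for the example.

```python
import torch
from torch import nn


def train(drifted: bool, steps: int = 20) -> list[float]:
    """Return the loss per step for a toy linear regression."""
    torch.manual_seed(0)  # identical init and data for both variants
    model = nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    x = torch.randn(32, 4)
    y = x.sum(dim=1, keepdim=True)

    losses = []
    for step in range(steps):
        # Intended order: zero the gradients, backward, then step.
        # Hypothetical drift: a branch skips the zeroing on odd steps,
        # so stale gradients from the previous step are added in.
        if not drifted or step % 2 == 0:
            opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses


clean = train(drifted=False)
stale = train(drifted=True)
print(f"clean: {clean[0]:.4f} -> {clean[-1]:.4f}")
print(f"stale: {stale[0]:.4f} -> {stale[-1]:.4f}")
```

Both runs start from the same loss and both losses fall, so neither looks broken from a dashboard. They simply end up in different places, which is all the evidence a drifted ordering tends to leave.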
This is the hygiene thesis stated plainly. Boilerplate is not just tedious to write. Boilerplate is where bugs go to hide, because boilerplate is the code nobody reviews on the assumption that it is too simple to be wrong. Lifting it into a framework does not only save typing. It moves an entire category of silent defect out of your repository and into a place where it has already been found.
The one piece we kept by hand, on purpose
Migrating to a framework is not the same as surrendering to it, and the dashed grey row in the ledger is where we drew the line. Our logs are not tidy fixed-size tensors. The synthetic rasters we train on span 3,200 to 12,800 pixels wide and 480 to 640 pixels tall, deliberately ragged because real scanned logs are ragged, and a single batch has to hold images of different sizes. That is why the binary run sits at batch size 1 in the first place: a single wide raster is already most of a GPU, and there is no naive way to stack a 3,200-pixel log and a 12,800-pixel log into one tensor. The multiclass run reaches batch size 16 only because of a custom collate function that pads and groups the ragged rasters into a batch the model can consume.
No framework ships that collate function, because no framework knows that our images are variable-width well logs. So we kept it, unchanged, twenty-seven lines of our own code, and handed it to Lightning's data path exactly as we would have handed it to a bare DataLoader. This is the part of the migration teams get wrong in the other direction: they treat the framework as all-or-nothing and either reject it because it cannot do their one weird thing, or contort their one weird thing to fit a generic mould. The right boundary is the one the ledger draws. Give the framework every concern that is the same as everyone else's, and guard jealously the one concern that is genuinely yours. Our science is reading curves off ragged rasters; the ragged-raster batching is science, so it stays ours. The epoch loop is not science, so it goes.
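For readers who have not written one, a padding collate function can be sketched as below. This is not our twenty-seven lines: it only pads every sample to the largest height and width in the batch and does no grouping by size, and the names `pad_collate` and `IGNORE_INDEX`, the `(image, mask)` sample layout, and the returned `sizes` tensor are assumptions for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical fill label for padded mask pixels; -100 matches the default
# ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def pad_collate(samples):
    """Pad ragged (image [C, H, W], mask [H, W]) pairs to one common size."""
    max_h = max(img.shape[-2] for img, _ in samples)
    max_w = max(img.shape[-1] for img, _ in samples)

    images, masks, sizes = [], [], []
    for img, mask in samples:
        h, w = img.shape[-2:]
        # F.pad order for the last two dims: (left, right, top, bottom).
        pad = (0, max_w - w, 0, max_h - h)
        images.append(F.pad(img, pad, value=0.0))
        masks.append(F.pad(mask, pad, value=IGNORE_INDEX))
        sizes.append((h, w))  # original sizes, so padding can be cropped later

    return torch.stack(images), torch.stack(masks), torch.tensor(sizes)
```

It plugs in the same way with or without a framework: `DataLoader(dataset, batch_size=16, collate_fn=pad_collate)`, and the resulting loader is what gets passed to `Trainer.fit`.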
What the run shape tells you, and what it does not
It is worth being precise about what migrating did and did not change in the numbers, because the temptation after any refactor is to claim a performance story that is not there. The two runs are the same after the migration as before it. The binary segmentation model trains on 2,000 synthetic instances at batch size 1 for 50 epochs in 110 minutes. The multiclass model trains on 15,000 instances at batch size 16, through the collate function we kept, for the same 50 epochs in 550 minutes. Those figures are the real cost of each run, and that cost is identical whether the loop driving it is ours or Lightning's. We did not make training faster. We would be lying if we said we had.
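The per-epoch and per-instance costs implied by those figures are simple arithmetic, worked through in the sketch below. The `Run` dataclass is a hypothetical helper, the step count assumes the last partial batch is kept, and every output is derived from the numbers above rather than measured separately.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    instances: int
    batch_size: int
    epochs: int
    minutes: float

    @property
    def steps_per_epoch(self) -> int:
        # Assumes the final partial batch is kept (drop_last=False).
        return math.ceil(self.instances / self.batch_size)

    @property
    def minutes_per_epoch(self) -> float:
        return self.minutes / self.epochs

    @property
    def seconds_per_instance(self) -> float:
        # Wall clock divided by the total number of instance passes.
        return self.minutes * 60 / (self.instances * self.epochs)


BINARY = Run("binary", instances=2_000, batch_size=1, epochs=50, minutes=110)
MULTICLASS = Run("multiclass", instances=15_000, batch_size=16, epochs=50, minutes=550)

for run in (BINARY, MULTICLASS):
    print(
        f"{run.name}: {run.steps_per_epoch} steps/epoch, "
        f"{run.minutes_per_epoch:.1f} min/epoch, "
        f"{run.seconds_per_instance:.3f} s/instance"
    )
# binary: 2000 steps/epoch, 2.2 min/epoch, 0.066 s/instance
# multiclass: 938 steps/epoch, 11.0 min/epoch, 0.044 s/instance
```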
What changed is everything around the run. A new engineer can read the LightningModule and understand what the model computes without first decoding a thousand lines of orchestration. A change to the metric logging cannot reach into the optimiser step, because they no longer share a scope. Resuming a 550-minute run that died at minute 400 is a flag, not an archaeology project. And the class of silent gradient-bookkeeping bug that hid in the old loop cannot recur, because the place it lived no longer exists. The wall clock is the same; the cost of working on the code fell sharply, and the floor under the code's correctness rose.
The boundary, in one line
Hand the framework every concern that is identical across all projects: the epoch loop, the optimiser step, gradient zeroing, checkpointing, logging, device placement. Keep by hand only the concern that is genuinely yours and that no framework can know about, which here is the custom collate function that batches variable-width log rasters. The migration is exactly the act of sorting your loop into those two piles.
If you are sitting on a loop like ours
The advice is not "use Lightning." The advice is to treat a training loop that runs and that you can no longer read as a defect, even when nothing is failing, because a loop you cannot read is a loop hiding something from you, and ours was. Pick the moment after a model is working and before the next feature lands, sort every line of the loop into science you own and boilerplate you do not, push the boilerplate into a reviewed framework, and keep the one or two pieces that are truly yours visible and unchanged. The wall clock will not move. The bug you have been carrying without knowing it might.
Key takeaways
- The migration to PyTorch Lightning bought zero training speedup. The run shape was identical before and after: batch 1 on the 2,000-instance binary set, batch 16 on the 15,000-instance multiclass set via a custom collate_fn, 50 epochs, 110 minutes and 550 minutes of wall clock respectively. The value was legibility, not performance.
- A hand-rolled PyTorch loop rots because nothing forces structure on it: early stopping, checkpointing, metric accumulation, and a multiclass branch accrete as patches until the model architecture is the only readable part of the codebase and the loop has every job and no shape.
- Lightning draws a clean line between science (the LightningModule: model, forward pass, training step, optimiser choice) and boilerplate (the Trainer: epoch loop, backward, optimiser step, gradient zeroing, checkpointing, logging, device placement). The boilerplate is code you stop owning and stop being able to break.
- The refactor surfaced a silent bug: the order of the optimiser step and gradient zeroing had drifted on one run configuration. It never crashed and the loss still fell, which is among the most dangerous failure modes there is. Lightning did not fix it; by replacing a free-form region with a contract, it made the bug impossible to keep hidden.
- Framework adoption is not all-or-nothing. We kept our 27-line custom collate_fn by hand because batching ragged 3,200-to-12,800-pixel log rasters is genuinely our science and no framework can know about it. Give the framework what is the same as everyone else's; guard the one concern that is truly yours.
References
[1] A. Paszke, S. Gross, F. Massa, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019. https://arxiv.org/abs/1912.01703
[2] W. Falcon and the PyTorch Lightning team. PyTorch Lightning. Open-source project, first released 2019. https://github.com/Lightning-AI/pytorch-lightning
[3] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
[4] M. Fowler. Refactoring: Improving the Design of Existing Code, 2nd edition. Addison-Wesley, 2018. https://martinfowler.com/books/refactoring.html