The model was not the problem. By the time we picked up the trainer, VeerNet's multiclass segmenter (the network that reads a scanned well-log and assigns every pixel to background, curve one, or curve two) was a known quantity. Its architecture was frozen, its synthetic corpus built, its loss function chosen. What we did not have was a training loop we trusted to run the same way twice. The loop was ours, written line by line over the life of the engagement, and it had accreted the way bespoke loops always do: a custom collate function welded to the iteration, hand-written device placement scattered through the forward pass, a checkpoint writer that saved whatever we had remembered to pass it that week. It trained, it converged, and every time we changed one thing we held our breath. This is the story of moving that loop onto PyTorch Lightning, deleting the parts we should never have owned, and keeping the model untouched while the runs around it finally settled down.
The loop we had outgrown
Our trainer for the operator's raster archive had started small and stayed in our hands long after it should have. The forward pass mixed model logic with infrastructure logic freely. Tensors got pushed to the GPU with explicit calls sprinkled wherever a new tensor appeared, so adding an input meant remembering to move it by hand or watching the run die on a device mismatch at epoch three. The checkpoint logic was a function we maintained that decided, by our own conventions, what to serialize and when. The collate function pads our variable-size logs into a single batch tensor; we had written it because the scans run from 3,200 to 12,800 pixels wide and refuse to stack otherwise. It was threaded directly into the iteration rather than handed cleanly to a data loader.
None of this was wrong, exactly. It was the imperative-PyTorch way of doing things, and the underlying library makes all of it possible and fast [2]. The trouble was that we were carrying it. Every one of those hand-rolled pieces was infrastructure we had to keep correct ourselves, on a project where the thing we were actually paid to get right was the segmentation, not the device-placement boilerplate. The custom code was load-bearing and invisible, the worst combination, because when a run behaved oddly we could never be sure whether the model had moved or whether our own plumbing had shifted under it.
What we set out to change, and what we refused to touch
The goal was narrow and we wrote it down so we would not drift: migrate the training loop onto a standard trainer scaffold, and change the model by exactly nothing. Same encoder-decoder, with five stride-2 stages down and five upsampling stages back up (the contracting-then-expanding shape that U-Net established for dense prediction [3]). Same group normalization, same multiclass softmax over background and two curves. Same synthetic data, same loss, same hyperparameters. If the migration altered a single weight tensor's meaning, we would not be able to tell a real change from a refactor artifact, and the whole point was to be able to tell.
The constraint cut the other way too. We were not allowed to keep the parts of the loop that were ours by accident rather than by need. The device placement, the checkpoint bookkeeping, the loop-level batch assembly: those were candidates for deletion, not preservation. The test of the migration was simple. Anything that PyTorch Lightning [1] could own correctly on our behalf, we would let it own, and anything left in our hands afterward had to be there because it was specific to reading well-logs, not because we had once written it and never stopped.
Pulling the model out of its plumbing
The migration was, in the end, an act of separation. PyTorch Lightning organizes a training run around a module that holds the model and a small set of named steps, with the loop itself, the device movement, the gradient bookkeeping, and the checkpointing handled by a trainer object you configure rather than write. So the work was to take our one tangled loop and sort each line into one of two piles: this belongs to the model, or this belongs to the scaffold.
The device placement went first, and it was the most satisfying deletion. Every explicit call that pushed a tensor onto the GPU came out. The trainer moves the module and its batches to the configured device, so the scattered hand-placement that had caused our intermittent device-mismatch failures simply stopped existing. There was nothing to get wrong because there was nothing left to write. The checkpoint writer went next. Our hand-maintained serialization logic was replaced by the trainer's checkpointing, which saves and restores the module state on a schedule we declare instead of one we hand-code, so the question of whether we had remembered to checkpoint the right thing this week stopped being a question.
The collate function was the interesting one, because it was the piece we could not delete and should not have wanted to. Padding variable-width logs to a common tensor is not generic infrastructure; it is the specific thing reading this archive requires, and no framework was going to know that our scans need rounding up to a 32-pixel alignment before they batch. So the collate function survived. But it moved. Instead of being threaded into the loop body, it became what it always should have been: the collate argument on a data loader, a clean function with one job, handed to the data side of the scaffold rather than smeared across the training side. The migration did not remove our domain logic. It put our domain logic in the one place that was unmistakably ours, and let the framework have everything that was not.
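A sketch of a collate function in that position, under stated assumptions: each sample is an `(image, mask)` pair with image shape `[C, H, W]` and mask shape `[H, W]`, padding goes on the right and bottom, images pad with zeros, and masks pad with class 0 (background). The names `pad_collate` and `round_up` are hypothetical. The 32-pixel alignment is taken from the text; it is consistent with five stride-2 stages (2^5 = 32), though the source does not spell out that reasoning.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

ALIGN = 32  # alignment named in the text


def round_up(n: int, multiple: int = ALIGN) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_collate(samples):
    """Pad every (image, mask) pair to one aligned size, then stack."""
    target_h = round_up(max(img.shape[-2] for img, _ in samples))
    target_w = round_up(max(img.shape[-1] for img, _ in samples))
    images, masks = [], []
    for img, mask in samples:
        pad_h = target_h - img.shape[-2]
        pad_w = target_w - img.shape[-1]
        # F.pad order for the last two dims: (left, right, top, bottom)
        images.append(F.pad(img, (0, pad_w, 0, pad_h), value=0.0))
        masks.append(F.pad(mask, (0, pad_w, 0, pad_h), value=0))
    return torch.stack(images), torch.stack(masks)


# Tiny stand-in dataset with two different widths and heights.
toy_dataset = [
    (torch.rand(1, 40, 100), torch.zeros(40, 100, dtype=torch.long)),
    (torch.rand(1, 33, 70), torch.zeros(33, 70, dtype=torch.long)),
]
loader = DataLoader(toy_dataset, batch_size=2, collate_fn=pad_collate)
```

Handing the function to `collate_fn` is the whole of the move described above: the training side never sees an unpadded sample.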
How the runs behaved once the scaffold was standard
The honest surprise was how little the headline numbers moved and how much the experience of getting them changed. The model was identical, so its convergence behavior was identical. What changed was that the runs stopped having a second, infrastructural way to fail. Before the migration, a multiclass run could die for two unrelated reasons: the model could fail to learn, or our plumbing could misplace a tensor or botch a checkpoint. After the migration, only the first kind of failure was ours to cause, because the second kind no longer had any of our code to live in. That is the stabilization we were after, and it is not visible in an accuracy curve. It is visible in the fact that a run that started finished, and finished the same way the last one did.
The crosswalk we used to reason about the two regimes the trainer had to serve, before and after the refactor, is built on the real run figures. Its wall-clock arithmetic is the part the migration deliberately left alone.
Read the two regimes off it honestly. The binary stage trained 2,000 instances at a batch size of 1, forced there by the variable image dimensions, and took 110 minutes, just under 2 hours, for 50 epochs. The multiclass stage trained 15,000 instances at a batch size of 16, made possible by the custom collate function padding the variable dims into one tensor, and took 550 minutes, a little over 9 hours, for the same 50 epochs. Those figures are the same on both sides of the migration, and that is the point. The refactor was not a speed play. It changed nothing about how long the network takes per epoch, because it changed nothing about the network. It changed what we owned around the network, and the per-epoch arithmetic was never ours to begin with.
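The arithmetic behind those figures is small enough to write down. The sketch below uses only the numbers quoted above; the helper names are hypothetical, and the projection to other epoch counts assumes wall-clock time scales linearly with epochs, which the source does not state.

```python
import math

# Run figures as quoted in the text.
RUNS = {
    "binary": {"instances": 2_000, "batch_size": 1, "epochs": 50, "minutes": 110},
    "multiclass": {"instances": 15_000, "batch_size": 16, "epochs": 50, "minutes": 550},
}


def per_epoch_minutes(run):
    """Average wall-clock minutes per epoch over the reported run."""
    return run["minutes"] / run["epochs"]


def steps_per_epoch(run):
    """Optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(run["instances"] / run["batch_size"])


def projected_minutes(run, epochs):
    """Projected wall-clock for another epoch count (assumes linear scaling)."""
    return per_epoch_minutes(run) * epochs


for name, run in RUNS.items():
    print(
        f"{name}: {per_epoch_minutes(run):.1f} min/epoch, "
        f"{steps_per_epoch(run)} steps/epoch, "
        f"{projected_minutes(run, 100):.0f} min for 100 epochs"
    )
```

That works out to 2.2 minutes per epoch for the binary stage and 11 for the multiclass stage, with 2,000 and 938 steps per epoch respectively.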
What the migration deleted and what it kept
- Hand-written device placement came out entirely: the trainer moves the module and its batches to the configured device, so the intermittent device-mismatch failures that used to surface mid-run had no code left to occur in.
- The custom checkpoint writer was replaced by the framework's checkpointing on a declared schedule, removing a piece of bespoke serialization bookkeeping we had been maintaining by hand and getting subtly wrong.
- The collate function that pads variable-width logs to a batch tensor was kept, because it is domain-specific to reading well-logs, but it moved from inside the loop body to the data loader's collate argument, where it is unmistakably our logic and nothing else's.
Why a standard scaffold was the reproducibility win
The reflection we kept coming back to is that the migration bought us reproducibility, and reproducibility was the thing the bespoke loop had been quietly costing us. A training run is only trustworthy if you can say what produced it, and a hand-rolled loop spreads that answer across dozens of lines you have to read correctly every time. The field has been blunt about this: a meaningful share of what looks like model variance is actually loop variance, the run-to-run drift that comes from infrastructure you wrote yourself behaving slightly differently than you remember [4]. Our custom device placement and checkpoint logic were exactly that kind of surface. They were small, they were ours, and they were unaudited.
Moving onto a standard trainer scaffold did not make our model better. It made our claims about our model defensible. When the loop is a configured object rather than a script we maintain, the things that could silently differ between two runs collapse to the things we explicitly set, and the rest is owned by a library that thousands of other runs exercise daily. The bespoke loop had given us total control and, with it, total responsibility for a set of problems that had nothing to do with reading curves off a scan. Handing that responsibility to the scaffold was not a loss of control. It was a narrowing of what we had to be careful about, down to the model and the data, which were the only two things on this engagement we were ever the right people to own.
The shape of the win, restated for the next migration
What we will carry forward from this is a sharper instinct for the difference between code that is specific to the problem and code that is merely specific to us. The collate function was the former and we kept it. The device juggling and the checkpoint bookkeeping were the latter, written by us only because we had once needed something and never asked whether we still needed to own it. A trainer that reads scanned well-logs has exactly one piece of genuinely bespoke training-loop logic, the padding, and on the day we started it had a half-dozen, the other five hiding as habit. The migration's real product was telling those two categories apart and letting the framework take the second one back, so that the next time a multiclass run surprised us we would know, without a second thought, that the surprise was in the model and not in the scaffold.