We had settled the hard modelling questions before this run and were left with a plainer one that turned out to matter just as much: could we train the multiclass segmenter fast enough to iterate on it? The dataset was fifteen thousand synthetic log instances, the target was three classes, and the schedule was fifty epochs. None of that was in question here. The problem was that the first version of the training loop treated the fifteen thousand instances more or less one at a time, and on images this large that is not a training run, it is a vigil. This piece is only about the speed. Not the loss, which was decided elsewhere, and not the choice to move from binary masks to a three-class target, which was its own decision with its own reasons. Just the wall-clock, and the handful of levers that pulled it down to about ten hours.
Why one image at a time was the whole problem
The images are the reason the naive loop was slow, and they are worth describing precisely because they dictate everything that follows. A synthetic log instance is tall and narrow and, crucially, not a fixed size. Widths ran from 3,200 to 12,800 pixels and heights from 480 to 640, so two images picked at random almost never shared a shape. That variability is fatal to the ordinary way you accelerate training. The standard speedup is to stack many samples into one batch tensor and let the GPU work on all of them in a single pass, but you cannot stack tensors of different widths and heights into one rectangular array. The path of least resistance is to give up on batching and feed the network one image per step, and that is exactly where the binary run had lived: batch size one, pinned there not by choice but by the fact that variable image sizes left no obvious way to group them.
Batch size one is the slowest place a training loop can be. Every one of the fifteen thousand instances becomes its own optimiser step, so a single epoch is fifteen thousand steps, and fifty epochs is three quarters of a million of them. Each step carries fixed overhead that has nothing to do with the arithmetic of the forward and backward pass: the launch of the kernels, the synchronisation, the gradient update. At batch one you pay that overhead fifteen thousand times per epoch instead of amortising it across a group. The GPU spends much of its time waiting rather than computing. We had a concrete reference point for how this felt: the earlier binary run, on two thousand instances at batch one for the same fifty epochs, took 110 minutes. Scale that per-instance cost up to fifteen thousand instances and the multiclass run was heading somewhere we could not plan a week around.
The lever that mattered was the batch, and the collate_fn is what unlocked it
The single relationship that governs this whole story is simple. Total wall-clock is roughly the number of optimiser steps times the cost per step, and the number of steps per epoch is the instance count divided by the batch size. Every fixed-overhead cost you pay per step gets divided by the batch. Take the batch from one to sixteen and you have cut the number of steps per epoch by a factor of sixteen, from fifteen thousand down to under a thousand, and you have handed the GPU sixteen images to work on in parallel where before it had one. That is the entire mechanism. Everything else we did was in service of making a batch of sixteen possible at all on images this awkward.
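The step-count side of that relationship is plain arithmetic, and a few lines make it concrete. This is a minimal sketch, not code from the run: the function names are ours for illustration, and we assume the last partial batch of an epoch still costs a full optimiser step, hence the ceiling. It counts steps both for all fifteen thousand instances and for the twelve thousand that an 80/20 split leaves for training.

```python
import math


def steps_per_epoch(num_instances: int, batch_size: int) -> int:
    """Optimiser steps in one epoch; a final partial batch still costs one step."""
    return math.ceil(num_instances / batch_size)


def total_steps(num_instances: int, batch_size: int, epochs: int) -> int:
    """Optimiser steps over the whole schedule."""
    return steps_per_epoch(num_instances, batch_size) * epochs


EPOCHS = 50

# Instance counts from the text: the full set, and the 80% training share of it.
for label, n in [("all instances", 15_000), ("80% training split", 12_000)]:
    for batch_size in (1, 16):
        per_epoch = steps_per_epoch(n, batch_size)
        print(f"{label:>18} | batch {batch_size:>2} | "
              f"{per_epoch:>6} steps/epoch | {per_epoch * EPOCHS:>7} steps total")
```

Step count is only half of the product. The cost per step rises as the batch grows, so the wall-clock saving is smaller than the sixteen-fold drop in steps suggests.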
Making it possible came down to a custom collate_fn. A dataloader's collate function is the step that takes a list of individual samples and assembles them into one batched tensor, and the default one assumes every sample is the same shape. Ours could not assume that, so we wrote a collate_fn that took a group of sixteen variable-dimension images and padded them to a common shape before stacking, recording enough to keep the padding from being mistaken for signal downstream. With that in place the dataloader could actually return batches of sixteen instead of surrendering to batch one, and the throughput relationship above stopped being theoretical.
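A padding collate function of this kind can be sketched in a few lines. What follows assumes PyTorch and is an illustration, not the collate_fn from the run: the name pad_collate, the zero fill for image pixels, the boolean validity mask, and the -100 fill for padded label pixels are all our assumptions. The -100 value is chosen because it is the default ignore_index of torch.nn.CrossEntropyLoss, so padded pixels drop out of the loss without extra handling.

```python
import torch
import torch.nn.functional as F


def pad_collate(samples, pad_value=0.0, ignore_index=-100):
    """Pad (image, target) pairs to the largest height and width in the batch, then stack.

    samples: list of (image, target), image a float tensor [C, H, W],
             target a long tensor [H, W] of class indices.
    Returns images [B, C, H_max, W_max], targets [B, H_max, W_max],
    and a bool mask [B, H_max, W_max] that is True on real pixels.
    """
    max_h = max(img.shape[-2] for img, _ in samples)
    max_w = max(img.shape[-1] for img, _ in samples)

    images, targets, valid = [], [], []
    for img, tgt in samples:
        h, w = img.shape[-2:]
        # F.pad orders the last two dims as (left, right, top, bottom);
        # padding only right and bottom keeps real pixels at the origin.
        pad = (0, max_w - w, 0, max_h - h)
        images.append(F.pad(img, pad, value=pad_value))
        targets.append(F.pad(tgt, pad, value=ignore_index))
        mask = torch.zeros(max_h, max_w, dtype=torch.bool)
        mask[:h, :w] = True
        valid.append(mask)

    return torch.stack(images), torch.stack(targets), torch.stack(valid)
```

It would be passed to the loader as DataLoader(dataset, batch_size=16, collate_fn=pad_collate). Padding to the batch maximum rather than a global maximum keeps the wasted pixels proportional to the spread of sizes inside each batch.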
The three levers that kept a batch of sixteen inside the memory budget
A batch of sixteen large images is a lot of memory, and the reason batch one had been the default was never really the collate logic alone; it was that bigger batches did not fit. So the collate_fn came with company. Mixed precision let us hold activations in half the space by using sixteen-bit floats where full precision was not needed, which roughly halved the activation footprint and, on the right hardware, sped up the math as a bonus rather than a cost. Gradient checkpointing bought more headroom by refusing to keep every intermediate activation from the forward pass in memory, instead recomputing the cheap ones during the backward pass, a deliberate trade of a little extra compute for a lot less memory. And the dataloader itself was tuned to prepare the next batch while the current one trained, so the card was not left idle waiting on the CPU to pad and stack the next sixteen images.
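The three levers fit together roughly as follows. This is a hedged sketch assuming PyTorch, not the configuration from the run: TinySegmenter is a hypothetical stand-in for the real network, bfloat16 is one possible sixteen-bit format (float16 on CUDA would normally be paired with a gradient scaler), and the worker and prefetch counts in make_loader are placeholder values.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint
from torch.utils.data import DataLoader


class TinySegmenter(nn.Module):
    """Hypothetical stand-in for the real network: two conv stages and a 3-class head."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(8, num_classes, 1)

    def forward(self, x):
        # Gradient checkpointing: activations inside each stage are not kept
        # after the forward pass and are recomputed during backward.
        x = checkpoint(self.stage1, x, use_reentrant=False)
        x = checkpoint(self.stage2, x, use_reentrant=False)
        return self.head(x)


def train_step(model, optimiser, images, targets, device_type="cpu"):
    """One optimiser step with the forward pass under sixteen-bit autocast."""
    optimiser.zero_grad(set_to_none=True)
    # Mixed precision: eligible ops run in bfloat16, weights stay in float32.
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits = model(images)
        # The loss is computed in float32; -100 marks padded pixels to skip.
        loss = nn.functional.cross_entropy(logits.float(), targets, ignore_index=-100)
    loss.backward()
    optimiser.step()
    return loss.item()


def make_loader(dataset, collate_fn, batch_size=16):
    # Prefetching: worker processes pad and stack upcoming batches while the
    # current one trains. The counts below are assumed, not tuned values.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      collate_fn=collate_fn, num_workers=4,
                      pin_memory=True, prefetch_factor=2)
```

On a GPU the same step would be called with device_type="cuda" after moving the model and batch to the device. Which stages to checkpoint is a tuning choice: wrapping more of the network saves more memory and costs more recomputation.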
None of these three changed the model's output. Mixed precision, checkpointing, and prefetching are all invisible to the loss and the weights: the network sees the same images, computes the same gradients to within numerical tolerance, and arrives at the same place. What they changed was whether a batch of sixteen fit in the memory budget, and that is the only thing that let the batch lever move. Take any one of them away and the batch that fits shrinks back toward one, and the wall-clock climbs back up the curve the instrument above traces.
What the clock actually read
With the collate_fn, batch sixteen, mixed precision, checkpointing, and a fed dataloader all in place, the multiclass run over fifteen thousand instances for fifty epochs took 550 minutes, a little over nine hours, which we rounded to about ten hours when we talked about it as an overnight job. The comparison that made the point internally was against the binary anchor: 110 minutes for two thousand instances at batch one. The multiclass set was seven and a half times larger and the run took five times as long, not the seven and a half times or more you would expect if per-instance cost had held constant. The batch had absorbed the difference. We used an 80/20 train-validation split throughout, so the epoch was training on twelve thousand of the fifteen thousand instances, and that is the number the throughput math above is really dividing.
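The comparison is easy to check by hand, and a short script keeps the ratios honest. It uses only the four figures quoted above; the variable names are ours, and the linear projection is simply the binary anchor scaled by dataset size, which is the constant per-instance cost case the text argues against.

```python
binary = {"instances": 2_000, "minutes": 110}       # batch one, 50 epochs
multiclass = {"instances": 15_000, "minutes": 550}  # batch sixteen, 50 epochs

size_ratio = multiclass["instances"] / binary["instances"]   # 7.5
time_ratio = multiclass["minutes"] / binary["minutes"]       # 5.0

# What the run would have taken had per-instance cost held at the binary rate.
linear_projection_min = binary["minutes"] * size_ratio       # 825.0

# How much cheaper each instance became relative to the binary anchor.
per_instance_gain = size_ratio / time_ratio                  # 1.5

print(f"dataset grew {size_ratio:.1f}x, wall-clock grew {time_ratio:.1f}x")
print(f"linear projection: {linear_projection_min:.0f} min "
      f"({linear_projection_min / 60:.2f} h)")
print(f"measured: {multiclass['minutes']} min ({multiclass['minutes'] / 60:.2f} h)")
print(f"per-instance cost fell by a factor of {per_instance_gain:.1f}")
```

The two runs differ in more than batch size, since the multiclass model has a three-class target, so the factor of 1.5 is a lower bound on what batching bought rather than a clean measurement of it.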
The reason ten hours mattered was not the ten hours. It was that ten hours is one night. A run that finishes overnight is a run you can start at the end of a day, read in the morning, and change something about before the next evening, which turns a training loop into something you can iterate against on a daily cadence. Two days per run does not give you that. It gives you a couple of attempts a week and a strong incentive to guess rather than measure. Pulling the wall-clock under a working night was the difference between a model we tuned by experiment and one we would have tuned by superstition.
Limitations
The specific 550-minute figure is one number from one engagement on one machine, and it should be read as evidence that these levers move wall-clock by roughly this much on a workload like this, not as a benchmark to port. The projection between the two measured points in the exhibit is a step-count model calibrated to reproduce the measured floor, not a per-batch timing; the real curve would bend where memory bandwidth, kernel efficiency, or dataloader throughput become the binding constraint rather than step count, and we did not sweep the intermediate batch sizes to map that bend. Larger batches also change the effective gradient noise, so beyond a point raising the batch is no longer free with respect to what the model learns, even though it stays cheap with respect to the clock; on this run sixteen was the ceiling memory allowed and we did not need to push against the learning question. Finally, everything here rests on the synthetic image dimensions we describe, 3,200 to 12,800 pixels wide and 480 to 640 tall. Field scans with a wider spread of sizes would put more pressure on the padding in the collate_fn, and the memory headroom that made batch sixteen fit is not guaranteed to survive a heavier tail of very wide images.
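The pressure on the padding can be put in numbers. The sketch below is illustrative only: padding_fraction is a hypothetical helper, and it assumes each batch is padded to its own largest height and width, which the text does not spell out. It reports the share of pixels in a padded batch that are padding rather than image.

```python
def padding_fraction(shapes):
    """Share of pixels in the padded batch that are padding.

    shapes: list of (height, width) for the images in one batch.
    """
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    real = sum(h * w for h, w in shapes)
    padded = len(shapes) * max_h * max_w
    return 1 - real / padded


# The two extremes of the synthetic size range, placed in the same batch.
extremes = [(480, 3_200), (640, 12_800)]
print(f"smallest and largest together: {padding_fraction(extremes):.1%} padding")

# A batch of identical shapes wastes nothing.
print(f"uniform batch: {padding_fraction([(640, 12_800)] * 16):.1%} padding")
```

A heavier tail of very wide images raises the batch maximum, and every other image in the batch then pays for it in padded pixels and in memory.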