The box went down at three in the morning, which is the only time it ever went down, and by the time anyone looked, the run had been dead for hours. This was a recurring fact of life on the project that produced VeerNet, the encoder-decoder EarthScan built to lift well-log curves off scanned paper, and it shaped how we trained more than any architecture decision did. We had one good GPU. It was not reliably ours, it shared a host with other work, and it had a habit of disappearing partway through long jobs. The multiclass segmenter needed 50 epochs over 15,000 synthetic logs to converge, and that run took 550 minutes of wall-clock time when it got to run uninterrupted. Nine hours is a long time to ask a flaky machine to stay up, and ours would not.
When the hardware will not cooperate, throughput stops being a hardware question and becomes a scheduling one. You cannot make the card faster or more stable, so the only levers left are the ones in your training loop: how many images you push through per step, and what each interruption actually costs you. This post is about those two levers on a real, constrained run. None of the underlying machinery is ours, and we want to be clear about that. The dataloader, the batching hook, the checkpoint format are all standard, documented tools [1]. What we are claiming is the reasoning: which lever to pull, by how much, and why the answer was different for the two datasets we were training.
Why the naive batch size was one
The problem starts with the data, and it is worth being concrete about why, because it is the constraint everything else routes around. A raster well log is a tall, narrow scan, and the logs in our training set are not a uniform shape. Their widths run from 3,200 to 12,800 pixels and their heights from 480 to 640, because real logs come in every aspect ratio a scanner and a paper size can produce, and our synthetic generator reproduced that spread on purpose so the model would not learn a single canvas size.
That variability is fatal to naive batching. The standard way a dataloader assembles a batch is to take a handful of samples and stack them into one tensor, which requires every sample in the batch to have identical dimensions [1]. Two logs of different widths cannot be stacked; the operation has no meaning. The path of least resistance, and the one we started on, is to set the batch size to one. A batch of one never has to reconcile two shapes because there is only ever one shape in the batch. It works, it is correct, and it is slow, because the GPU spends most of each step underfed, processing a single image when it has the memory to process many.
On the binary segmentation set this is exactly where we stayed, and on purpose. That set is 2,000 instances, the run is 110 minutes, and at that scale the batch of one is not the bottleneck worth fighting. The math is unforgiving in the other direction: a run that already fits comfortably inside any window we cared about does not earn the engineering it would take to batch it, and the binary logs were the most extreme in their size spread, which made a collate for them the hardest to write for the least reward. We measured it, decided the throughput we would buy was not worth the code, and left the binary batch at one. That is a scheduling decision too. The cheapest optimisation is the one you correctly decide not to do.
Lifting the multiclass batch from one to sixteen
The multiclass run was the opposite case. 15,000 instances at 550 minutes is the job that has to survive the night, so it is the one where buying throughput pays for itself. The move that bought it was a custom collate function, the small hook a dataloader exposes for deciding how a list of samples becomes a batch [1]. The default collate stacks; ours does not insist on it. Instead of demanding identical shapes, it pads each log in a group up to the largest dimensions in that group and records where the real pixels end, so the network can ignore the padding. With that one function in place the batch size on the multiclass set went from one to 16, and the GPU went from processing a single tall sliver per step to processing sixteen of them packed together.
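A minimal sketch of a padding collate of the kind described above, not the function used on this run. The sample layout (an image of shape (C, H, W) paired with an integer mask of shape (H, W)), the name pad_collate, and the choice to mark padded mask pixels with an ignore label are all assumptions made for illustration; -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
import torch
from torch.utils.data import DataLoader

IGNORE_INDEX = -100  # assumed label for padded mask pixels


def pad_collate(samples):
    """Pad each (image, mask) pair to the largest height and width in the group.

    Returns the stacked images, the stacked masks, and the original (H, W)
    of every sample so downstream code knows where the real pixels end.
    """
    n = len(samples)
    channels = samples[0][0].shape[0]
    max_h = max(img.shape[-2] for img, _ in samples)
    max_w = max(img.shape[-1] for img, _ in samples)

    # Pre-fill with padding values, then copy each sample into the top-left corner.
    images = samples[0][0].new_zeros((n, channels, max_h, max_w))
    masks = samples[0][1].new_full((n, max_h, max_w), IGNORE_INDEX)
    sizes = []
    for i, (img, mask) in enumerate(samples):
        h, w = img.shape[-2:]
        images[i, :, :h, :w] = img
        masks[i, :h, :w] = mask
        sizes.append((h, w))
    return images, masks, torch.tensor(sizes)


# Toy stand-in for the log dataset: three samples with different shapes.
toy_dataset = [
    (torch.rand(1, 4, 6), torch.zeros(4, 6, dtype=torch.long)),
    (torch.rand(1, 5, 3), torch.zeros(5, 3, dtype=torch.long)),
    (torch.rand(1, 4, 8), torch.zeros(4, 8, dtype=torch.long)),
]
loader = DataLoader(toy_dataset, batch_size=3, collate_fn=pad_collate)
images, masks, sizes = next(iter(loader))
```

Padding to the largest sample in each group, not to a global maximum, keeps the wasted pixels proportional to the spread inside one batch. Grouping logs of similar width before batching would cut that waste further, though nothing in this post says the run did so.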
The reason that helps is throughput arithmetic, and the instrument below makes it visible. A batch of one over 15,000 instances is 15,000 steps per epoch; a batch of 16 is roughly 938. The per-step overhead, the kernel launches, the synchronisations, the fixed cost of touching the card at all, is paid far fewer times, and the card spends more of each step doing the matrix work it is actually good at. The gain is real but it is not linear, because a larger batch does more work per step and padding wastes some of the capacity you gain. Drag the batch lever in the planner and watch the compute curve fall steeply at first and then flatten: the first doublings of batch size buy the most, and the returns taper as the step count stops being the thing that dominates.
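The step-count arithmetic can be written out as a toy cost model. Only the 15,000-instance set size comes from the run; the per-step overhead, per-image compute time, and padding waste below are hypothetical constants chosen to show the shape of the curve, not measurements.

```python
import math

N_SAMPLES = 15_000  # multiclass set size, from the post

# Hypothetical constants, for illustration only.
STEP_OVERHEAD_S = 0.05  # fixed cost per step: kernel launches, synchronisation
PER_IMAGE_S = 0.03      # compute time per unpadded image
PADDING_WASTE = 0.25    # extra compute spent on padded pixels once batch > 1


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of optimiser steps needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def epoch_seconds(batch_size: int) -> float:
    """Fixed overhead paid once per step, plus compute that scales with images."""
    steps = steps_per_epoch(N_SAMPLES, batch_size)
    waste = PADDING_WASTE if batch_size > 1 else 0.0
    return steps * STEP_OVERHEAD_S + N_SAMPLES * PER_IMAGE_S * (1.0 + waste)


for batch in (1, 2, 4, 8, 16, 32):
    print(batch, steps_per_epoch(N_SAMPLES, batch), round(epoch_seconds(batch), 1))
```

In this model the overhead term halves with every doubling of the batch while the compute term stays put, which is why the curve flattens.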
We did not have to retune the whole run to make this change safe, which is the part that is easy to get wrong. Raising the batch size changes the gradient noise the optimiser sees, and the literature on large-batch training is essentially a catalogue of how to compensate, with linear learning-rate scaling and warmup the standard recipe [2]. The cleaner framing, and the one we leaned on, is that batch size and learning rate are two views of the same underlying noise scale, so a change in one can often be absorbed by the other rather than treated as a separate crisis [3]. Going from one to sixteen is a modest move on that scale, and with a short warmup the run stayed stable. The point is that batch size is a throughput lever you are allowed to pull, not a sacred hyperparameter, as long as you understand what you are trading against on the optimisation side.
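For reference, the linear scaling rule and warmup from [2] look like this in plain Python. The base learning rate and warmup length are hypothetical, and the post does not say that the full sixteen-fold scaling was what this run used, so read it as a sketch of the recipe and not a record of the settings.

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the batch."""
    return base_lr * new_batch / base_batch


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr, then hold at target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


BASE_LR = 1e-4       # hypothetical learning rate tuned at batch size 1
WARMUP_STEPS = 500   # hypothetical warmup length

target = scaled_lr(BASE_LR, base_batch=1, new_batch=16)
schedule = [warmup_lr(s, WARMUP_STEPS, BASE_LR, target) for s in range(0, 1000, 100)]
```

Whether the full linear factor is appropriate depends on the optimiser and the model; the warmup is there so that an over-estimate does less damage in the first few hundred steps.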
What each crash actually cost
The collate solved the throughput half of the problem. It did nothing for the other half, which is that the run still had to survive a GPU that dropped out of it. A 550-minute job on a machine that fails every few hours will, in expectation, fail before it finishes, and if a failure means starting over then no amount of throughput saves you: you would simply pay the 550 minutes again, plus whatever you had already burned.
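To put rough numbers on that claim, assume failures are memoryless (exponentially distributed) with some mean time between failures. Both the model and the three-hour figure below are assumptions standing in for "every few hours"; only the 550-minute run length comes from the project.

```python
import math

RUN_MINUTES = 550.0   # uninterrupted multiclass run, from the post
MTBF_MINUTES = 180.0  # hypothetical mean time between GPU drops


def p_no_failure(run_minutes: float, mtbf_minutes: float) -> float:
    """Probability that a run of this length sees no failure at all."""
    return math.exp(-run_minutes / mtbf_minutes)


def expected_minutes_restart_from_zero(run_minutes: float, mtbf_minutes: float) -> float:
    """Expected wall-clock time to finish when every failure restarts the run.

    Standard result for memoryless failures with zero restart delay:
    E[T] = MTBF * (exp(run / MTBF) - 1).
    """
    return mtbf_minutes * math.expm1(run_minutes / mtbf_minutes)


print(p_no_failure(RUN_MINUTES, MTBF_MINUTES))
print(expected_minutes_restart_from_zero(RUN_MINUTES, MTBF_MINUTES))
```

Under these assumptions the chance of a clean run is a few percent, and the expected time to completion with restarts from zero is several times the run length. The expected cost grows exponentially in run length over MTBF, which is why shaving the run helps less than making failures cheap.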
The fix is the oldest one in long-running computation, and again it is not ours: checkpoint the state often enough that a crash costs you a tax, not the whole run. We wrote the model weights, the optimiser state, and the dataloader position to disk at the end of every epoch, so that when the box came back the run could reload the last good checkpoint and continue from there rather than from zero. The cost of an interruption collapses from the entire elapsed run to the time it takes to reload state plus the fraction of an epoch you have to recompute since the last save. In the planner that is the orange segment stacked on top of the teal compute bar: each GPU drop adds a flat resume tax, and you can see the finish time creep upward as you add interruptions without the catastrophe of the bar resetting to the full run length. The flat tax is illustrative, but the shape of the relationship is the real lesson. With checkpointing, interruptions are additive and survivable; without it, the first interruption past the halfway mark is very likely to be fatal to your window.
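A minimal sketch of epoch-level save and resume using PyTorch's state_dict mechanism. The function names, the dictionary keys, and the temp-file-then-rename step are assumptions of this sketch and not the project's code. Because saves happen only at epoch boundaries, the dataloader position reduces here to the index of the last completed epoch.

```python
import os
import tempfile

import torch


def save_checkpoint(path, model, optimizer, epoch):
    """Write weights, optimiser state and the finished epoch index to disk."""
    state = {
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    # Rename last, so a crash during the write never clobbers the previous checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer):
    """Restore state if a checkpoint exists; return the epoch to start from."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1


# Toy run: a stand-in model and three "epochs" with no real training.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_path = os.path.join(tempfile.mkdtemp(), "last.pt")

start_epoch = load_checkpoint(ckpt_path, model, optimizer)  # fresh run starts at 0
for epoch in range(start_epoch, 3):
    # ... one epoch of training would go here ...
    save_checkpoint(ckpt_path, model, optimizer, epoch)

resumed_from = load_checkpoint(ckpt_path, model, optimizer)  # next epoch to run
```

If the dataloader shuffles, resuming exactly also needs the shuffle to be reproducible per epoch, for example by seeding it from the epoch index; that detail is left out here.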
There is a tuning knob inside checkpointing that the instrument does not draw but is worth naming, because we got it wrong before we got it right. Checkpoint too rarely and a late crash throws away a lot of work that then has to be recomputed; checkpoint too often and the I/O of writing large state to disk starts eating into the throughput the collate just bought you. Per-epoch was the right granularity for us because an epoch on the 15,000-set (about 11 minutes, taking the 550-minute run over 50 epochs) is short enough that losing one to a crash is cheap and long enough that writing state once per epoch is negligible against it. On a job with much longer epochs we would have checkpointed mid-epoch; on one with very short epochs we would have checkpointed every few. The right interval is the one that makes the expected recompute cost and the I/O cost roughly equal, and that depends entirely on your epoch length and your crash rate.
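The balance point described above has a classic first-order form, usually attributed to Young: if writing a checkpoint costs C and the mean time between failures is M, the overhead per unit of work is roughly C/T + T/(2M) for an interval T, and it is smallest at T = sqrt(2CM), where the two terms are equal. The checkpoint cost and MTBF below are hypothetical values, not measurements from this run.

```python
import math


def overhead_fraction(interval_min: float, checkpoint_cost_min: float, mtbf_min: float) -> float:
    """First-order overhead: time spent writing state plus expected recompute."""
    io_cost = checkpoint_cost_min / interval_min
    recompute_cost = interval_min / (2.0 * mtbf_min)  # lose half an interval on average
    return io_cost + recompute_cost


def young_interval(checkpoint_cost_min: float, mtbf_min: float) -> float:
    """Interval that minimises overhead_fraction: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_min * mtbf_min)


CHECKPOINT_COST_MIN = 0.2  # hypothetical time to write one checkpoint
MTBF_MIN = 180.0           # hypothetical mean time between GPU drops

best = young_interval(CHECKPOINT_COST_MIN, MTBF_MIN)
for interval in (best / 4, best / 2, best, best * 2, best * 4):
    print(round(interval, 2), round(overhead_fraction(interval, CHECKPOINT_COST_MIN, MTBF_MIN), 4))
```

The approximation assumes the checkpoint cost is small next to the MTBF. In practice the result is rounded to the nearest convenient boundary, which for a run like this one is an epoch.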
The scheduling view, in one picture
Put the two levers together and the planner is really a single question with two inputs: does this run finish inside the window I have? The window for us was an overnight slot, and the answer depended jointly on the batch size that set the compute floor and the interruption count that stacked tax on top of it. The collate pushed the floor down far enough that the multiclass run cleared the window with room to spare at batch 16, and checkpointing kept the room to spare from being eaten by the crashes we knew were coming. Neither move alone was sufficient. A faster run that died and restarted would have blown the window; a crash-proof run still pinned at batch one would have been too slow to fit it in the first place. It was the combination, throughput from the collate and survivability from the checkpoints, that made a flaky single card finish a nine-hour job on schedule.
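The planner's question reduces to a few lines of arithmetic. In this sketch the 550-minute compute time is the measured figure from the run, while the window length, the interruption count, and the flat resume tax (state reload plus half an epoch of recompute on average) are hypothetical inputs.

```python
def finish_minutes(compute_minutes: float, interruptions: int, resume_tax_minutes: float) -> float:
    """Compute floor plus a flat tax for every interruption, as in the planner."""
    return compute_minutes + interruptions * resume_tax_minutes


def fits_window(window_minutes: float, compute_minutes: float,
                interruptions: int, resume_tax_minutes: float) -> bool:
    """True if the run, with its interruptions, finishes inside the window."""
    return finish_minutes(compute_minutes, interruptions, resume_tax_minutes) <= window_minutes


COMPUTE_MINUTES = 550.0  # measured multiclass run at batch 16, from the post
WINDOW_MINUTES = 600.0   # hypothetical overnight slot
RESUME_TAX = 2.0 + 5.5   # hypothetical: 2 min reload + half of an 11 min epoch

for drops in range(0, 8):
    total = finish_minutes(COMPUTE_MINUTES, drops, RESUME_TAX)
    print(drops, total, fits_window(WINDOW_MINUTES, COMPUTE_MINUTES, drops, RESUME_TAX))
```

The number of drops a window can absorb is the slack divided by the tax, so shrinking either the compute floor or the resume tax buys tolerance for more failures.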
That is also why we keep insisting this was scheduling and not just optimisation. We never improved the hardware. We never even improved the model. We changed how the existing work was packed onto the existing card and how the existing run absorbed the failures the card was always going to throw, and that was enough. The memory-for-compute trades that would have let us push the batch even higher, gradient checkpointing inside the model [4] and mixed-precision storage [5], were on the table and would have bought more headroom; we did not need them because the collate plus epoch-level checkpointing already cleared the window, and the cheapest optimisation remained the one we did not have to do.
What carried the run
- Variable image dimensions, log widths from 3,200 to 12,800 pixels, collapse the naive batch size to one because a dataloader can only stack identically shaped samples. That is the constraint every throughput move on this run routes around.
- A custom collate that pads each log up to the largest in its group lifted the multiclass batch from one to sixteen, cutting steps per epoch from 15,000 to roughly 938 and bringing the 50-epoch run on the 15,000-instance set in at the measured 550 minutes of wall-clock time.
- Throughput from larger batches is real but sublinear: the first doublings buy the most and the returns taper as step count stops dominating, and a bigger batch is safe to pull as a lever as long as the learning rate absorbs the changed gradient noise.
- On the 2,000-instance binary set at 110 minutes we kept the batch at one on purpose. A run that already fits its window does not earn the collate it would take to batch it. Correctly declining an optimisation is itself a scheduling decision.
- Epoch-level checkpointing of weights, optimiser state, and dataloader position turns each GPU drop from a full restart into a flat resume tax, so interruptions become additive and survivable. Without it, the first late crash is very likely to blow the window.
The habit this left us with is to ask, before any long job on shaky hardware, two questions in order: how few times can I afford to touch the card per epoch, and how cheaply can I make a crash recoverable. Get those two right and the run finishes whether or not the box behaves, which on this project it reliably did not. The collate hook and the checkpoint format were both written by other people and documented far better than we could [1]; our contribution was knowing, on a specific run with a specific window and a specific failure rate, exactly how hard to lean on each one.
References
[1] Paszke, A., Gross, S., Massa, F., and colleagues. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS (2019). The DataLoader, the collate_fn batching hook this work overrides, and the imperative training loop it all runs inside. https://arxiv.org/abs/1912.01703
[2] Goyal, P., Dollár, P., Girshick, R., and colleagues. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2017). How batch size interacts with throughput and optimisation, with linear learning-rate scaling and warmup. https://arxiv.org/abs/1706.02677
[3] Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. Don't Decay the Learning Rate, Increase the Batch Size. ICLR (2018). Batch size and learning rate as two views of the same noise scale, the framing behind treating batch as a throughput lever. https://arxiv.org/abs/1711.00489
[4] Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training Deep Nets with Sublinear Memory Cost (2016). Gradient checkpointing, the memory-for-compute trade that buys headroom for a larger effective batch on a fixed card. https://arxiv.org/abs/1604.06174
[5] Micikevicius, P., Narang, S., Alben, J., and colleagues. Mixed Precision Training. ICLR (2018). Half-precision storage and compute, the other standard way to fit more images per step on one GPU. https://arxiv.org/abs/1710.03740