The First Three Hundred Steps: Why Batch-of-One Training Needs a Warmup

Most learning-rate advice is about the long game: how fast to decay, when to drop by a factor of ten, whether to cosine the whole thing down to zero. The opening of the schedule gets far less attention, and for runs with a healthy batch size that neglect is mostly harmless. For the run this piece is about, it was not. We were training a segmentation network on raster well logs at a batch size of one, and at that batch size the first few hundred steps are the most dangerous stretch of the entire run. This is a narrow primer on one fix for that danger, the linear warmup ramp, kept deliberately to the opening segment of the schedule rather than the schedule as a whole.

The warmup ramp is not our invention and we want to be exact about that before we lean on it. The gradual, linear warmup as a named technique comes from Goyal and colleagues, who used it to stabilise the opening of large-minibatch ImageNet training [1]. The transformer schedule that ramps the rate up over a fixed number of warmup_steps and then anneals it is from the original attention paper [2]. The cleanest explanation of why the ramp helps, that it suppresses the wild variance of an adaptive learning rate in the first iterations, is from Liu and colleagues [3]. What we add is none of that machinery. It is a worked account of how the ramp behaves in the one regime where it stops being optional, a batch of a single example, illustrated on VeerNet, the encoder-decoder EarthScan uses to digitise raster logs for an onshore Texas operator.

Why a batch of one is a special case

A gradient computed on a minibatch is an average. Average sixteen examples and the noise in any single one is damped by the other fifteen; the direction the optimiser steps in is a reasonably stable estimate of the true gradient over the data. That averaging is doing quiet work on every well-behaved training run, and it is exactly the work you lose when the batch is one.
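The damping can be seen in a toy model. The sketch below is not the VeerNet pipeline; it assumes a scalar "gradient" whose per-example noise is Gaussian, with a made-up true value and noise level, and measures how much the minibatch estimate scatters at batch sizes 1 and 16.

```python
import random
import statistics

def minibatch_gradient(batch_size, rng, true_grad=1.0, noise_std=2.0):
    """Average of `batch_size` noisy per-example gradients (toy scalar model).

    `true_grad` and `noise_std` are illustrative values, not measurements.
    """
    samples = [true_grad + rng.gauss(0.0, noise_std) for _ in range(batch_size)]
    return sum(samples) / batch_size

def gradient_std(batch_size, trials=4000, seed=0):
    """Empirical standard deviation of the minibatch gradient estimate."""
    rng = random.Random(seed)
    estimates = [minibatch_gradient(batch_size, rng) for _ in range(trials)]
    return statistics.pstdev(estimates)

std_b1 = gradient_std(1)
std_b16 = gradient_std(16)
print(f"batch 1:  std of gradient estimate = {std_b1:.2f}")
print(f"batch 16: std of gradient estimate = {std_b16:.2f}")
```

For independent noise the scatter of an average over B examples shrinks by a factor of the square root of B, so the batch of sixteen should come out roughly four times quieter than the batch of one.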

We did not choose a batch of one for its statistical properties. We were forced into it. The synthetic raster logs in the binary segmentation set are wildly variable in size, from a few thousand pixels wide to nearly thirteen thousand, and stacking images of such different dimensions into a tensor is awkward enough that, for the binary stage, the practical batch size was 1. So every optimiser step on that run saw the gradient of a single log: not an average over a batch, but one noisy sample, with all of that one image's idiosyncrasies, its particular curve shapes and grid artefacts, passed straight through to the parameter update undiluted. The textbook framing is that stochastic gradient descent already trades gradient accuracy for speed [6], and a batch of one sits at the far, noisiest end of that trade.

That noise is survivable once training has found its footing. It is most dangerous at the very start, when the weights are random, the loss surface under them is steep and badly conditioned, and any running statistics the network keeps have not yet converged to anything meaningful. Hand a full learning rate to a noisy single-sample gradient in that opening regime and you are asking the optimiser to take a large, confident step in a direction estimated from one example on a surface it has not begun to map. Sometimes nothing happens. Sometimes the loss launches off a cliff and the run is dead in the first dozen steps.

What the ramp actually changes

The warmup ramp is almost embarrassingly simple. Instead of starting at the full learning rate, you start near zero and increase the rate linearly over a set number of opening steps until it reaches the target value, after which the normal schedule takes over [1] [2]. That is the whole mechanism. The opening steps are taken at a deliberately small rate, so even a badly aimed single-sample gradient can only move the weights a little, and by the time the rate has climbed to full the network has taken enough small steps that its statistics have settled and the surface beneath it is better conditioned.
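As a minimal sketch of that mechanism, the function below returns the rate for a given optimiser step. The function name, the 1-indexed step convention, and the example values of `target_lr` and `warmup_steps` are assumptions made for illustration; the exact ramps in [1] and [2] differ in their details.

```python
def warmup_lr(step, target_lr, warmup_steps):
    """Learning rate at a 1-indexed optimiser step under a linear warmup.

    The rate climbs in equal increments from target_lr / warmup_steps at
    step 1 to target_lr at step `warmup_steps`, then stays at target_lr
    (any later decay is left to a separate schedule).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps

# Illustrative values only: a 300-step ramp to a target rate of 1e-3.
for step in (1, 150, 300, 1000):
    print(step, warmup_lr(step, target_lr=1e-3, warmup_steps=300))
```

Setting `warmup_steps` to zero recovers the cold start, which makes the two cases easy to compare in a sweep.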

Liu and colleagues give the sharper version of the intuition for adaptive optimisers: in the first handful of iterations the adaptive per-parameter rate has seen too few gradients to estimate its scaling reliably, so its variance is enormous, and the warmup simply keeps the effective step small while that estimate stabilises [3]. Whether you reach for the variance argument or the cruder one about noisy steps on a steep surface, the conclusion is the same. At a batch of one both arguments apply with more force than usual, because the per-step noise the ramp protects against is at its maximum.
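The variance argument can be illustrated with a toy simulation. The sketch assumes an Adam-style second-moment estimate with bias correction and unit-variance Gaussian gradients; the article does not say which optimiser the run used, so this is an illustration of the argument in [3], not a reconstruction of the run. It compares how widely the adaptive scaling factor scatters across independent runs at step 1 and at step 200.

```python
import random
import statistics

def adaptive_scale_samples(t, runs=2000, beta2=0.999, eps=1e-8, seed=0):
    """Adam-style scaling factor 1 / (sqrt(v_hat) + eps) after t steps.

    Each run feeds t independent unit-variance gradients into the
    exponential moving average of squared gradients, then bias-corrects it.
    """
    rng = random.Random(seed)
    scales = []
    for _ in range(runs):
        v = 0.0
        for _ in range(t):
            g = rng.gauss(0.0, 1.0)
            v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = v / (1.0 - beta2 ** t)  # bias correction
        scales.append(1.0 / (v_hat ** 0.5 + eps))
    return scales

def spread(samples):
    """Width of the 5th-to-95th percentile band (robust to heavy tails)."""
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
    return cuts[-1] - cuts[0]

early = spread(adaptive_scale_samples(1))
late = spread(adaptive_scale_samples(200))
print(f"spread of adaptive scale at step 1:   {early:.2f}")
print(f"spread of adaptive scale at step 200: {late:.2f}")
```

At step 1 the estimate rests on a single squared gradient, so the scaling factor is essentially one over the magnitude of that gradient and can be very large; after a couple of hundred gradients it is pinned close to one. A percentile band is used instead of a variance because the step-1 distribution is heavy-tailed.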

Figure: Two opening-loss traces over the first 300 optimiser steps. A cold start runs at the full learning rate from step one; a warmed start ramps the rate linearly from near zero. At batch size 1 a single raster-log gradient is noisy, so the opening full-rate steps can launch the loss into a spike before the running statistics settle. Drag the warmup length from 0 to 400 steps: the warmed trace loses its early excursion once the ramp covers the opening window, while the cold trace keeps its spike, and both traces settle to the same floor. The warmup stays on the opening ramp rather than reshaping the full schedule. The batch size of 1, the 50-epoch budget, the roughly 110-minute binary run on 2,000 instances, and the two transformer attention layers on the bottleneck are the engagement's own; the two loss curves and the divergence band are an illustrative model of the warmup effect and are flagged as such.

The instrument above is the picture we kept in our heads during that run. Drag the warmup length from zero and watch the two opening-loss traces. The cold start, at the full rate from step one, throws an early spike, the signature of an overshoot the noisy single-sample gradient was always at risk of producing. The warmed start, with the rate ramped in, loses that spike once the ramp covers the opening window, and then both traces settle toward the same floor. That last detail is the point of the next section, so hold onto it: the warmup changes the opening and nothing else.

Keep it on the ramp, not the whole schedule

It is tempting, once a warmup rescues a run, to credit it with more than it did and to start treating the entire learning-rate trajectory as the lever. That conflation is worth resisting, because the warmup is doing one specific job in one specific place. It governs how the rate gets up to its target in the opening steps. What the rate does after that, how it decays across the remaining epochs, whether it cosines down or steps or holds, is a separate question with separate answers, and the literature treats it as such: Loshchilov and Hutter's warm restarts and Smith's cyclical rates are both about the long trajectory, the part that comes after the opening is already safe [4] [5].
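One way to keep the two segments separate in code is to make the decay a pluggable function that only sees the post-warmup part of the run. The sketch below is an assumed layout, not the schedule used on the run: the function names are hypothetical, the target rate and the 300-step ramp are illustrative, and the total of 100,000 steps assumes 50 epochs over 2,000 instances at one instance per step.

```python
import math

def cosine_decay(progress):
    """Multiplier that falls from 1 to 0 as progress goes from 0 to 1."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def hold(progress):
    """Multiplier for no decay at all: stay at the target rate."""
    return 1.0

def learning_rate(step, target_lr, warmup_steps, total_steps, decay=cosine_decay):
    """Linear warmup for the opening steps, then a separately chosen decay."""
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    # Decay progress is measured from the end of the ramp, so the decay
    # function never influences the opening segment.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return target_lr * decay(min(1.0, progress))

# Assumed values: see the note above on where 100_000 comes from.
TARGET_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-3, 300, 100_000
opening_cosine = [learning_rate(s, TARGET_LR, WARMUP_STEPS, TOTAL_STEPS, cosine_decay)
                  for s in range(1, WARMUP_STEPS + 1)]
opening_hold = [learning_rate(s, TARGET_LR, WARMUP_STEPS, TOTAL_STEPS, hold)
                for s in range(1, WARMUP_STEPS + 1)]
print("opening segment identical under both decays:", opening_cosine == opening_hold)
```

Swapping `cosine_decay` for `hold` or any other decay leaves the first 300 rates untouched, which is the property that lets the ramp be tuned on its own.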

We make the distinction because the two segments fail differently and are tuned differently. The warmup's failure mode is early divergence, visible in the first few hundred steps, fixed by lengthening the ramp until the opening spike is gone. The decay's failure modes are late: a rate left too high so the loss never settles, or dropped too early so the model stops improving before it has to. Tuning one tells you almost nothing about the other. On the binary run, the 50-epoch budget took about 110 minutes for the 2,000 instances, which is cheap enough to sweep the warmup length on its own, watch only the opening of the loss curve, and fix the ramp before touching the decay at all. Bundling the two into a single hyperparameter would have hidden which half was actually doing the work.
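The sweep described above can be reduced to a small check on the opening of each loss trace. The helpers below are hypothetical names and the toy traces are invented numbers used only to exercise the code; they are not measurements from the binary run. The idea is to score each candidate warmup length by how far the loss rises above its starting value inside the opening window, and to pick the shortest ramp whose excursion stays within a tolerance.

```python
def opening_excursion(losses, window=300):
    """How far the loss rises above its starting value in the opening window."""
    opening = losses[:window]
    if not opening:
        raise ValueError("empty loss trace")
    return max(opening) - opening[0]

def shortest_safe_warmup(traces, window=300, tolerance=0.0):
    """Smallest warmup length whose opening excursion is within `tolerance`.

    `traces` maps a warmup length in steps to the loss trace recorded with
    that length. Returns None if every candidate spikes.
    """
    for warmup_steps in sorted(traces):
        if opening_excursion(traces[warmup_steps], window) <= tolerance:
            return warmup_steps
    return None

# Invented five-step traces, purely to show the call pattern.
toy_traces = {
    0:   [1.0, 4.0, 9.0, 2.0, 0.8],     # cold start: early spike
    100: [1.0, 1.6, 1.1, 0.9, 0.8],     # short ramp: smaller spike
    300: [1.0, 0.97, 0.93, 0.9, 0.8],   # ramp covers the opening: no spike
}
print(shortest_safe_warmup(toy_traces, window=5, tolerance=0.05))
```

Because a batch-of-one loss trace is noisy even when training is healthy, a real sweep would need a tolerance above zero, or a smoothed trace, to avoid rejecting every candidate.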

There is a second reason to keep the warmup scoped narrowly, which is that our network had components that are themselves sensitive to early instability. The bottleneck carried two transformer attention layers, and attention blocks are part of why transformer training reached for warmup in the first place [2] [3]; the convolutional stack used LeakyReLU with a negative slope of 0.2, which keeps a gradient flowing through inactive units and so makes early steps a little livelier, for better and worse. A warmup that protects the opening protects all of that at once, without our having to reason about each piece's early dynamics separately. Stretch the warmup into a full-schedule philosophy and you lose that clean accounting.

What the ramp does not buy you

A warmup is insurance against one failure, not a tonic for the run. It will not fix a learning rate that is simply wrong, only soften the entrance to it; a target rate ten times too high is still ten times too high once the ramp delivers you there. It will not rescue a model from bad data, broken labels, or synthetic logs that fail to cover a real failure mode, because none of those are opening-step problems and the ramp only touches the opening. And it does not make a batch of one behave like a batch of sixteen; the per-step gradient is just as noisy at step five hundred as at step five, so the value of the ramp is bounded to the window where that noise is most likely to be fatal, the start. Past that window the noise is something you live with, or address by other means such as accumulating gradients toward a larger effective batch, which is a different lever for a different problem.
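For contrast with the ramp, gradient accumulation can be sketched on a toy problem. The model here is a one-parameter linear fit with squared loss, chosen only so the arithmetic is easy to follow; the data, the learning rate, and the function names are illustrative assumptions. Each example is still processed alone, which is why the approach remains available when images of different sizes cannot be stacked, but the weight only moves once every `accumulation_steps` examples, using the average of their gradients.

```python
def example_gradient(w, x, y):
    """Gradient with respect to w of 0.5 * (w * x - y) ** 2 for one example."""
    return (w * x - y) * x

def train_with_accumulation(data, w=0.0, lr=0.1, accumulation_steps=4):
    """One pass over `data`, updating w once per `accumulation_steps` examples.

    Examples left over at the end that do not fill a group are not applied.
    """
    grad_sum, pending = 0.0, 0
    for x, y in data:
        grad_sum += example_gradient(w, x, y)  # one example at a time
        pending += 1
        if pending == accumulation_steps:
            w -= lr * grad_sum / pending       # step on the averaged gradient
            grad_sum, pending = 0.0, 0
    return w

toy_data = [(1.0, 2.0), (2.0, 4.0), (1.0, 2.0), (2.0, 4.0)]  # generated by y = 2x
w_accumulated = train_with_accumulation(toy_data, accumulation_steps=4)
w_single = train_with_accumulation(toy_data, accumulation_steps=1)
print("one averaged step over four examples:", w_accumulated)
print("four single-example steps:           ", w_single)
```

With `accumulation_steps=1` the loop reduces to plain batch-of-one training, so the same function covers both regimes and makes the trade explicit: fewer, steadier updates against more, noisier ones.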

So the honest scope is small and worth stating plainly. On a single-sample-batch run the opening few hundred steps are the part most likely to kill training outright, the cause is the full learning rate meeting a noisy lone gradient on an unconditioned surface, and a linear warmup over those steps is the cheapest, most reliable guard against it. The recipe belongs to Goyal and colleagues, to the transformer schedule, and to the variance analysis that explains it [1] [2] [3]; the only thing we can claim is the reading, that at a batch of one this opening segment stopped being a nicety and became the difference between a run that diverged in the first minute and one that reached its floor.

Key takeaways

A batch of one removes the gradient averaging that normally damps per-step noise, so every optimiser step sees one noisy raster-log gradient undiluted. We were forced into batch size 1 on the binary stage because the logs vary from a few thousand to nearly thirteen thousand pixels wide and would not batch cleanly.
The opening steps are the dangerous ones: random weights, a steep ill-conditioned surface, and unsettled running statistics mean a full-rate step from a noisy lone gradient can launch the loss off a cliff in the first dozen steps.
A linear warmup starts the rate near zero and ramps it to target over a few hundred steps, holding the step small while the surface and statistics settle. For adaptive optimisers the sharper reason is that it suppresses the huge variance of the adaptive rate in the first iterations (Liu et al.).
Keep the warmup on the opening ramp, not the whole schedule. The ramp governs how the rate gets up to target and its failure mode is early divergence; the decay that follows is a separate lever with late failure modes, and the cheap 50-epoch, ~110-minute binary run let us tune the ramp alone on the opening loss before touching decay.
The warmup is not ours and we credit it (Goyal et al., the transformer schedule, the variance analysis of Liu et al.). It does not fix a wrong target rate, bad data, or the per-step noise of a batch of one past the opening window; what is ours is the worked reading at batch size 1 on VeerNet.

References

[1] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint (2017). The source of the gradual linear warmup we lean on for the opening ramp. https://arxiv.org/abs/1706.02677

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention Is All You Need. NeurIPS (2017). The transformer schedule that ramps the rate over warmup_steps before decaying, relevant to our two bottleneck attention layers. https://arxiv.org/abs/1706.03762

[3] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the Variance of the Adaptive Learning Rate and Beyond. ICLR (2020). The variance argument for why warmup works in the opening iterations of adaptive optimisers. https://arxiv.org/abs/1908.03265

[4] Loshchilov, I., and Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR (2017). Treats the post-warmup trajectory as a designed schedule, the part this piece keeps separate from the ramp. https://arxiv.org/abs/1608.03983

[5] Smith, L. N. Cyclical Learning Rates for Training Neural Networks. WACV (2017). The range-test intuition for how small the opening rate should be and how fast it can climb. https://arxiv.org/abs/1506.01186

[6] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press (2016). Chapter 8 sets out stochastic gradient variance and why the step size has to respect it. https://www.deeplearningbook.org/
