Reading Loss Curves: What Adam and SGD Told Us Early

Before a confusion matrix resolves, before a validation R-squared settles, before any number you would put in a report exists, there is a curve. Every training run draws one: loss against epoch, a line that falls, flattens, occasionally jumps, and eventually either keeps improving or quietly stops. It is the cheapest diagnostic in deep learning and the one practitioners most often glance at and move past. We want to make the case for the opposite habit, because on the early runs of VeerNet, the encoder-decoder EarthScan uses to digitise raster well logs, the loss curve told us how training was going long before any downstream metric did. This piece is a primer on reading that curve as a diagnostic instrument in its own right, illustrated with the Adam and SGD runs on our first binary segmentation set.

None of the reading techniques below are ours to claim, and we want to be clear about that up front. The optimisers are the canonical ones, Adam from Kingma and Ba [1] and stochastic gradient descent in the form Bottou catalogues [2]. The discipline of watching the gap between training and validation loss and stopping when it widens is textbook, set out in the deep-learning literature [3] and given its careful treatment by Prechelt [4]. What we add is not a method but a worked reading: a concrete account of what three named curve shapes meant on a real, small, imbalanced subsurface dataset, and why we trusted the curve before we trusted the metrics.

The run that taught us to look

The dataset was deliberately modest: 2,000 instances for binary segmentation, foreground against background, on synthetic raster logs. Memory constraints from variable image sizes pinned us to a batch size of 1, and a full 50-epoch run took about 110 minutes. That is a small enough budget that you can afford to run the same configuration twice with two different optimisers and read the curves side by side, which is exactly what we did with Adam and plain SGD.

The point of that comparison was never to crown an optimiser. It was to learn to read. Two optimisers on identical data, identical architecture, and an identical epoch budget produce two curves whose differences are attributable to the optimiser alone, and that controlled contrast is what turns the loss curve from a vague reassurance into a legible instrument. The shapes we saw, a plateau, a spike, and a slow upward drift in validation, are the three morphologies every practitioner should be able to name on sight, so the rest of this piece takes them one at a time.

An annotated loss-curve reader for the 50-epoch binary segmentation run (2,000 instances, batch size 1, about 110 minutes of training). Two validation traces, Adam and SGD, sit over the Adam training trace so the gap between training and validation is visible directly. The reader marks the three named morphologies the curve shape exposes: an early plateau, an SGD divergence spike, and the overfitting onset where validation lifts away from training. The orange marker flags the overfitting onset, the point where Adam's validation loss begins to climb while training keeps falling. The run setup numbers are sourced from the engagement archive; the per-epoch loss values are illustrative curve shapes that teach the morphology rather than logged values, and no scalar accuracy is claimed from them.

A plateau is not convergence

The first thing a fast optimiser does is fall quickly, and the first thing a beginner does is mistake the flat stretch that follows for the end of training. It usually is not. A plateau is a region where the gradient the optimiser is following has gone nearly flat, which can mean the model has found a good basin and is about to settle, or it can mean the model is stuck on a saddle or a shelf and will resume improving once it finds the downhill direction again. The curve alone does not always tell you which, but a comparison against a second run often does.

On our run Adam dropped fast, paused for a handful of epochs while the epoch count was still in the single digits, then resumed its descent toward a low floor. SGD, with the same data and the same budget, lingered on its plateau noticeably longer before it began to move again. Read together, those two shapes said something specific: Adam's adaptive per-parameter step sizes were carrying it off the shelf faster than SGD's single global rate could [1] [2]. That is the diagnostic value of the contrast. A plateau in isolation is ambiguous; a plateau that one optimiser escapes quickly and another sits on tells you the escape, not the floor, is where the optimisers differ. The practical rule we took from it is never to call a run converged from a flat stretch alone. Give it more epochs, or change the optimiser, and watch whether the floor was real or just a ledge.
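One way to make the plateau comparison concrete is a small detector that reports flat stretches in each trace. The sketch below is illustrative only: the function name `find_plateaus`, the thresholds `min_flat_steps` and `rel_tol`, and both loss lists are assumptions invented for the example, not logged VeerNet values or part of the original tooling.

```python
def find_plateaus(losses, min_flat_steps=3, rel_tol=0.01):
    """Return (start_epoch, end_epoch) pairs for flat stretches of a loss curve.

    A step between consecutive epochs counts as flat when the relative change
    in loss is below rel_tol. A stretch is reported only if it contains at
    least min_flat_steps flat steps in a row. Epochs are zero-indexed.
    """
    plateaus = []
    start = None  # epoch at which the current flat stretch began
    for i in range(1, len(losses)):
        prev = losses[i - 1]
        rel_change = abs(prev - losses[i]) / max(abs(prev), 1e-12)
        if rel_change < rel_tol:
            if start is None:
                start = i - 1
        else:
            if start is not None and (i - 1) - start >= min_flat_steps:
                plateaus.append((start, i - 1))
            start = None
    # A flat stretch that runs to the last epoch is only a candidate floor.
    if start is not None and (len(losses) - 1) - start >= min_flat_steps:
        plateaus.append((start, len(losses) - 1))
    return plateaus


# Hypothetical curve shapes (assumption): a short ledge and a longer one.
adam_val = [1.0, 0.6, 0.4, 0.398, 0.397, 0.396, 0.395, 0.3, 0.2, 0.15]
sgd_val = [1.0, 0.8, 0.7, 0.699, 0.698, 0.697, 0.696, 0.695, 0.694, 0.693, 0.6, 0.5]

adam_plateaus = find_plateaus(adam_val)
sgd_plateaus = find_plateaus(sgd_val)
print("Adam-like plateaus:", adam_plateaus)
print("SGD-like plateaus:", sgd_plateaus)
```

On these made-up curves both traces flatten at the same epoch, and the SGD-like one takes three epochs longer to leave the ledge. The detector cannot say whether a flat stretch is a floor; it only makes the length of each stretch comparable across runs, which is the contrast described above.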

A spike is the learning rate talking

The second morphology is the one that looks alarming and usually is not a data problem. Partway through the SGD run the loss threw a sharp spike, a single sudden excursion upward, and then came back down and carried on. A clean up-then-down spike like that, with recovery, is almost always the optimiser taking a step that was too large for the local curvature of the loss surface, overshooting a minimum, and then clawing its way back. It is the learning rate talking, not the dataset failing.

The tell is in the shape and the recovery. A loss that jumps and stays elevated, or climbs without ever coming back, points at something structural: a corrupted batch, an exploded gradient, a label error feeding the same poison every epoch. A loss that spikes once and recovers within an epoch or two is the signature of a step size that is slightly too hot for the region the optimiser wandered into, and the standard responses are to lower the rate, add warmup, clip the gradient, or adopt a schedule that anneals the rate over training. The literature on learning-rate behaviour, including Smith's work on cyclical schedules, is essentially a catalogue of how to keep these excursions from happening or how to use controlled versions of them deliberately [5]. Adam's per-parameter scaling damps a lot of this automatically, which is part of why its trace was smoother than SGD's on the same run [1]. The reading discipline is simple: before you blame your data for a loss spike, ask the curve whether it recovered. If it did, look at your learning rate first.
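The "did it recover?" question can be asked of a logged curve mechanically. The following is a minimal sketch under stated assumptions: `classify_excursions`, its thresholds (`jump_ratio`, `recovery_epochs`, `slack`) and the example curves are all hypothetical choices for illustration, and sensible thresholds will depend on the loss scale of a given run.

```python
def classify_excursions(losses, jump_ratio=1.5, recovery_epochs=2, slack=1.1):
    """Label sudden upward jumps in a loss curve.

    An epoch is flagged when its loss exceeds jump_ratio times the previous
    epoch's loss. The jump is 'recovered' if, within recovery_epochs, the loss
    returns to within slack times the pre-jump level; 'sustained' if a full
    recovery window passes without that; 'unresolved' if the curve ends first.
    Returns a list of (epoch, label) pairs with zero-indexed epochs.
    """
    findings = []
    for i in range(1, len(losses)):
        baseline = losses[i - 1]
        if losses[i] <= jump_ratio * baseline:
            continue  # no jump at this epoch
        window = losses[i + 1 : i + 1 + recovery_epochs]
        if any(value <= slack * baseline for value in window):
            label = "recovered"   # look at the learning rate first
        elif len(window) < recovery_epochs:
            label = "unresolved"  # not enough epochs after the jump to judge
        else:
            label = "sustained"   # look for something structural
        findings.append((i, label))
    return findings


# Hypothetical curves (assumption): one clean spike, one jump that stays up.
spiky = [0.9, 0.7, 0.6, 1.4, 0.62, 0.55, 0.5]
stuck = [0.9, 0.7, 0.6, 1.4, 1.5, 1.45, 1.5]

print(classify_excursions(spiky))
print(classify_excursions(stuck))
```

The label is a prompt for where to look first, not a diagnosis. A "recovered" jump suggests checking step size, warmup, or clipping. A "sustained" one suggests checking batches, gradients, and labels.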

Overfitting announces itself in the gap, not the metric

The third and most consequential morphology is the one worth training your eye on hardest, because it is the one that decides when to stop. For the first stretch of a healthy run, training loss and validation loss fall together. The moment that matters is when they part: training loss keeps sliding toward zero while validation loss flattens and then begins to climb. That divergence is overfitting, made visible, and it shows up in the curve well before it shows up in any headline accuracy number, because accuracy is a coarse, thresholded summary while the validation loss is a continuous one that registers the model getting worse on held-out data the instant it starts to.

On the Adam run the two traces tracked each other down for the early epochs, then the validation loss bottomed out and began to lift away while training kept falling. The widening band between them is the generalisation gap, and watching it is the whole content of early stopping as Prechelt frames it: you are not waiting for training loss to stop falling, because on a model with enough capacity it never will; you are waiting for validation loss to stop falling and start rising, and you keep the weights from just before that turn [4] [3]. The orange marker in the instrument above sits at that onset for exactly this reason. It is the single most actionable point on the entire curve, and it is invisible to a metric that only reports the best training-set fit.
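The "keep the weights from just before the turn" rule is commonly implemented as a patience counter over validation loss. The sketch below is one such rule, written as an assumption for illustration: it is not one of the specific criteria Prechelt analyses, nor necessarily what was used on the VeerNet runs, and `early_stop`, `patience`, `min_delta` and the loss values are hypothetical.

```python
def early_stop(val_losses, patience=3, min_delta=0.0):
    """Patience-style early stopping over a validation-loss history.

    Returns (best_epoch, stop_epoch). best_epoch is the zero-indexed epoch
    with the lowest validation loss seen so far, i.e. the checkpoint to keep.
    stop_epoch is the epoch at which `patience` epochs have passed without an
    improvement of more than min_delta, or None if that never happens.
    """
    if not val_losses:
        raise ValueError("val_losses must contain at least one epoch")
    best_epoch = 0
    best_loss = val_losses[0]
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] < best_loss - min_delta:
            best_loss = val_losses[epoch]
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Hypothetical validation trace (assumption): falls, bottoms out, then lifts.
val = [0.9, 0.7, 0.55, 0.5, 0.48, 0.49, 0.51, 0.54, 0.58]
best_epoch, stop_epoch = early_stop(val)
print("keep checkpoint from epoch", best_epoch, "and stop at epoch", stop_epoch)
```

Here the rule keeps the checkpoint from epoch 4 and halts at epoch 7. Training loss never enters the decision, which matches the point above: only the validation turn matters for stopping.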

This is also why the gap, not the floor, is the number to watch. A low training loss with a rising validation loss is a worse model than a higher training loss with the two curves still locked together, because the first has begun to memorise the small dataset and the second is still learning the geology. On 2,000 instances, with the foreground a thin minority of every image, that temptation to memorise is strong and the gap opens early, which is precisely the regime in which reading the curve pays for itself.
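To watch the gap rather than the floor, it helps to compute the gap per epoch and locate where the two traces part. The sketch below assumes a simple definition of onset, namely validation rising while training falls for `run` consecutive epochs; that definition, the function names and the numbers are illustrative assumptions rather than the method behind the marker described earlier.

```python
def generalisation_gap(train_losses, val_losses):
    """Per-epoch gap between validation and training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


def divergence_onset(train_losses, val_losses, run=2):
    """First zero-indexed epoch of a streak of `run` consecutive epochs in which
    validation loss rose while training loss fell. Returns None if no such
    streak exists."""
    streak = 0
    for epoch in range(1, min(len(train_losses), len(val_losses))):
        val_rose = val_losses[epoch] > val_losses[epoch - 1]
        train_fell = train_losses[epoch] < train_losses[epoch - 1]
        if val_rose and train_fell:
            streak += 1
            if streak == run:
                return epoch - run + 1
        else:
            streak = 0
    return None


# Hypothetical traces (assumption): locked together early, parting later.
train = [0.9, 0.6, 0.4, 0.3, 0.22, 0.16, 0.11]
val = [0.95, 0.68, 0.5, 0.45, 0.47, 0.5, 0.55]

gaps = generalisation_gap(train, val)
onset = divergence_onset(train, val)
print("gap per epoch:", [round(g, 2) for g in gaps])
print("divergence onset at epoch", onset)
```

In this made-up run the final epoch has the lowest training loss (0.11) but a higher validation loss (0.55) than epoch 3 (training 0.30, validation 0.45), so the earlier checkpoint is the better model despite the worse training fit.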

What the curve is good for, and what it is not

It would be dishonest to oversell this. The loss curve is a fast, early, continuous signal, and that is its whole value, but it is not the deliverable and it cannot tell you everything. It does not tell you whether your validation split leaked, whether your labels are right, or whether the segmentation you are optimising actually produces a usable digitised curve downstream, which on a digitisation task is the only thing a petrophysicist cares about. The curve is a diagnostic for the training process, not a certificate of the product. A run can show a textbook-clean loss curve and still ship a model that fails on real logs because the synthetic data did not cover the failure mode, and no shape on the loss axis would have warned you.

So the curve sits at the front of the pipeline, not the end of it. It is the thing you watch live, the thing that tells you to lower a learning rate at epoch 23 or to stop at epoch 31 rather than burning the full 50, and the thing that lets a 110-minute run end early when it has already told you what it is going to tell you. The metrics still arbitrate. The curve just gets there first, and on a small, imbalanced, expensive-to-relabel dataset, getting there first is worth a great deal.
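The saving from stopping early is simple arithmetic on the run budget. The sketch below assumes every epoch costs the same wall-clock time and that stopping "at epoch 31" means 31 epochs were run; both are simplifying assumptions, and `minutes_saved` is a hypothetical helper.

```python
TOTAL_MINUTES = 110  # full run, from the setup described above
TOTAL_EPOCHS = 50


def minutes_saved(epochs_run, total_minutes=TOTAL_MINUTES, total_epochs=TOTAL_EPOCHS):
    """Wall-clock minutes saved by stopping after epochs_run epochs,
    assuming a uniform cost per epoch."""
    if not 0 <= epochs_run <= total_epochs:
        raise ValueError("epochs_run must be between 0 and total_epochs")
    per_epoch = total_minutes / total_epochs
    return (total_epochs - epochs_run) * per_epoch


print(f"per epoch: {TOTAL_MINUTES / TOTAL_EPOCHS:.1f} min")
print(f"saved by stopping after 31 epochs: {minutes_saved(31):.1f} min")
```

Under those assumptions each epoch costs about 2.2 minutes, so stopping after 31 epochs returns roughly 42 of the 110 minutes.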

The habit worth building

The summary is a single instruction: read the curve before you read the numbers, and learn the three shapes by name. A plateau is ambiguous on its own and is best read against a second run, so do not call it convergence. A spike that recovers is the learning rate, not the data, so look at your step size before you go hunting for bad batches. A widening gap between training and validation loss is overfitting, and the moment it opens is the moment to stop, regardless of what the training loss is still doing.

We did not invent any of that. The optimisers are Kingma and Ba's and Bottou's [1] [2], the early-stopping discipline is Prechelt's and the textbook's [4] [3], and the learning-rate intuition rests on a field's worth of work, of which Smith's schedules are one part [5]. What we can claim as ours is the reading: on VeerNet's first binary run, the Adam and SGD curves told us how training was going, where it was about to go wrong, and when to stop, before any metric we would have reported had resolved. The cheapest instrument on the screen turned out to be the first one worth trusting.

Key takeaways

The loss curve is the earliest, cheapest training diagnostic, and on VeerNet's first binary run (2,000 instances, batch size 1, about 110 minutes for 50 epochs) the Adam and SGD curves told us how training was going before any scalar metric resolved.
A plateau is not convergence. It is ambiguous on its own; the value comes from contrast. Adam left the early plateau faster than SGD on identical data and budget, which is the curve showing the optimiser difference, not the floor.
A loss spike that goes up and then recovers is the learning rate, not the data. A clean excursion with recovery points at too large a step for the local curvature; a loss that climbs and stays up points at something structural like a corrupted batch or label error.
Overfitting announces itself in the gap, not the metric. When training loss keeps falling but validation loss flattens and lifts, the widening generalisation gap is the early-stopping signal, visible before any thresholded accuracy number moves.
The reading techniques are not ours and we credit them: Adam (Kingma and Ba), SGD (Bottou), early stopping (Prechelt and the deep-learning textbook), learning-rate behaviour (Smith). What is ours is the worked reading on a real small imbalanced subsurface dataset.

References

[1] Kingma, D. P., and Ba, J. Adam: A Method for Stochastic Optimization. ICLR (2015). The adaptive per-parameter optimiser whose smoother trace we read against SGD. https://arxiv.org/abs/1412.6980

[2] Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 2nd ed., Springer (2012). The practical account of SGD behaviour the slower trace illustrates. https://leon.bottou.org/papers/bottou-tricks-2012

[3] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press (2016). Chapters 7 and 8 set out monitoring training curves, regularization, and the generalisation gap. https://www.deeplearningbook.org/

[4] Prechelt, L. Early Stopping - But When? In Neural Networks: Tricks of the Trade, Springer (1998, reissued 2012). The careful treatment of when the validation-loss turn means stop. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5

[5] Smith, L. N. Cyclical Learning Rates for Training Neural Networks. WACV (2017). The learning-rate behaviour and range-test intuition behind reading spikes as step-size signals. https://arxiv.org/abs/1506.01186
