A fifty-epoch cap is a budget, not a plan. It tells a run how long it is allowed to go if nothing else stops it first, and on a shared GPU box that ceiling is worth having, because it bounds the damage a forgotten job can do overnight. What the cap does not do is tell you when the run is finished. Those are different questions, and conflating them is how good box-time gets spent watching a model that already peaked drift quietly downhill on held-out data. The discipline that separates the two is early stopping, and the part of it that actually does the work is not the cap at all. It is the criterion: the rule that decides, mid-run, that there is no longer any point continuing. This note is about how we set that rule for the curve-segmentation runs behind VeerNet, the encoder-decoder EarthScan uses to lift well-log curves off scanned paper, and specifically about the two dials a criterion really is.
We should say at the start what this is not. It is not a reprise of reading the morphology of an optimiser's loss curve, the plateau-spike-divergence vocabulary that tells you whether Adam or SGD is behaving on a given run. That reading is a live diagnostic of the training process. A stopping criterion is a different instrument bolted next to it: a small, mechanical decision procedure with two parameters that you set before the run and that fires without you. You can read loss curves beautifully and still pick a bad criterion, and you can pick a good criterion while barely glancing at the curves. The literature treats them separately and so do we, and none of the machinery below is ours to claim. The systematic account of stopping criteria is Prechelt's [1], the capacity-and-overfitting case for early stopping is Caruana, Lawrence, and Giles's [2], the regularization framing is Bishop's [3], and the modern engineering description of patience and best-checkpoint restoration is in Goodfellow, Bengio, and Courville's deep-learning textbook [5]. What we add is a worked setting of those dials on two real, oddly shaped, expensive-to-rerun subsurface datasets.
The cap bounds the worst case; the criterion finds the best case
Start with why the cap alone is not enough, because it is tempting to think a tight epoch budget makes a stopping criterion redundant. It does not, and the reason is the shape of a held-out metric over training. For the first stretch the model is still learning the signal, and the validation metric improves epoch over epoch. At some point it reaches its best value, and after that, on a model with enough capacity to memorise a finite training set, the validation metric does not hold steady. It degrades, slowly, as the network keeps fitting the training data past the point where that fit still transfers. Caruana and colleagues made the careful version of this argument: a large network trained with early stopping generalises about as well as a smaller one, because the stopping point, not the parameter count, is what bounds the overfitting [2]. Bishop frames the same effect as regularization, with the number of epochs acting as a knob on the effective complexity the network is allowed to reach [3].
If that is the shape, then the cap and the criterion answer different questions. The cap asks how much time the run may consume in the worst case. The criterion asks where, inside that window, the run actually stopped getting better, so that the rest of the budget does not have to be spent confirming the model can get worse. On the 15,000-log multiclass run a full fifty epochs is 550 minutes, so the question is not academic. If the best held-out checkpoint arrives well before epoch fifty, every epoch after it is GPU box-time spent producing a checkpoint we will not ship. The cap cannot reclaim that time because the cap does not know the peak has passed. Only a criterion that watches the metric can.
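The box-time arithmetic behind that paragraph can be sketched in a few lines. This is a minimal illustration, not our logged accounting: it assumes every epoch costs an equal share of the measured full-run time, and both the function name `reclaimed_minutes` and the stop epoch of 30 are hypothetical placeholders chosen only to make the calculation concrete.

```python
def reclaimed_minutes(full_run_minutes: float, cap_epochs: int, stop_epoch: int) -> float:
    """Box-minutes handed back by halting at stop_epoch instead of riding to the cap.

    Assumption: each epoch costs an equal share of the full-run time.
    """
    if not 1 <= stop_epoch <= cap_epochs:
        raise ValueError("stop_epoch must lie between 1 and cap_epochs")
    return full_run_minutes * (cap_epochs - stop_epoch) / cap_epochs


CAP = 50
HYPOTHETICAL_STOP = 30  # placeholder, not a logged stopping epoch

# Full-run costs are the measured figures from the text: 550 and 110 minutes.
multiclass_saved = reclaimed_minutes(550, CAP, HYPOTHETICAL_STOP)
binary_saved = reclaimed_minutes(110, CAP, HYPOTHETICAL_STOP)

print(f"multiclass: {multiclass_saved:.0f} min reclaimed")
print(f"binary:     {binary_saved:.0f} min reclaimed")
```

Whatever the real stop epoch turns out to be, the same stop hands back five times as many minutes on the multiclass run as on the binary one, because the full-run cost is five times higher.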
Dial one: what you choose to monitor
The first dial is the monitored quantity, and it matters more than it first appears, because early stopping is only as good as the number it is watching. The curve-segmentation task gives a regression read-out after the masks are turned back into depth-indexed curves, and that read-out can be summarised several ways. We tracked three on the validation split: the coefficient of determination (R-squared), which we want to push up, and mean absolute error (MAE) and mean squared error (MSE), both of which we want to push down. Their best values on our runs are not interchangeable headline numbers. They are genuinely different lenses: the best validation R-squared we recorded was 0.9891, the lowest MAE 0.0132, and the lowest MSE 0.0004.
The trap is assuming these three always agree about which epoch was best. They usually move together, but they do not have to, and the reason is exactly the one Willmott and Matsuura set out for model evaluation generally: a squared-error summary is not a clean measure of average error, because it folds in the variance of the error distribution and is pulled around by a handful of large residuals, while an absolute-error summary tracks the typical miss [4]. Translated to a stopping criterion, that means an MSE-monitored run can keep improving for an extra epoch or two because it is shaving down a few big outlier depths, while the MAE-monitored view already considers the model best, since the typical depth stopped improving earlier. Monitor the wrong one for what you care about and you stop at the wrong epoch: too early if the metric is noisier than your real objective, too late if it keeps rewarding cosmetic gains on the tail. We monitored the metric that matched what a petrophysicist would actually complain about, the typical depth error, and treated the squared-error view as a secondary check rather than the trigger.
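A toy case makes the disagreement concrete. The residuals below are invented for illustration and are not drawn from our validation logs: four typical depths and one outlier per epoch, arranged so that the outlier keeps shrinking after the typical miss has bottomed out. The helper names are hypothetical.

```python
def mae(residuals):
    """Mean absolute error: tracks the typical miss."""
    return sum(abs(r) for r in residuals) / len(residuals)


def mse(residuals):
    """Mean squared error: dominated by the largest residuals."""
    return sum(r * r for r in residuals) / len(residuals)


def best_epoch(per_epoch_residuals, summary):
    """1-indexed epoch at which the given error summary is lowest."""
    scores = [summary(r) for r in per_epoch_residuals]
    return scores.index(min(scores)) + 1


# Invented validation residuals: four typical depths, then one outlier depth.
per_epoch_residuals = [
    [0.30, 0.30, 0.30, 0.30, 1.00],  # epoch 1: still learning
    [0.10, 0.10, 0.10, 0.10, 0.90],  # epoch 2: typical miss at its lowest
    [0.16, 0.16, 0.16, 0.16, 0.70],  # epoch 3: outlier shrinks, typical miss worsens
    [0.20, 0.20, 0.20, 0.20, 0.75],  # epoch 4: both views degrade
]

mae_best = best_epoch(per_epoch_residuals, mae)
mse_best = best_epoch(per_epoch_residuals, mse)
print(f"MAE picks epoch {mae_best}, MSE picks epoch {mse_best}")
```

On these numbers the absolute-error view prefers epoch 2 and the squared-error view prefers epoch 3, so a criterion watching MSE would keep a checkpoint whose typical depth error is already worse than the one MAE would have kept.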
Dial two: how much patience the metric earns
The second dial is patience: the number of consecutive epochs you let pass with no new best before you call the run. Patience exists because a validation metric is not monotone even while the model is still genuinely improving. It wobbles from epoch to epoch on the noise of a finite held-out set, so a naive rule that stops the first time the metric fails to improve would fire constantly, killing runs on a single unlucky epoch long before they peaked. Prechelt's whole contribution is to make this choice systematic rather than a vibe, and to quantify what it costs: slower, more patient criteria buy a little more generalisation for a lot more training time, on the order of a few percent of metric for several times the epochs in his experiments [1]. Patience is the knob that sits on that tradeoff.
Set patience too low and you are at the mercy of noise. The criterion trips on the first downward wobble, the run halts before the metric has actually peaked, and because you stopped early the best checkpoint you keep is whatever the best-so-far was at that premature stop, which is worse than the peak you would have reached. Set patience too high and you have re-created the problem the criterion was supposed to solve: the run rides most of the way to the cap, the metric peaked twenty epochs ago, and you spent the difference in box-time learning nothing. The right value is the smallest patience that comfortably survives the normal epoch-to-epoch noise of your validation metric on your dataset, and no larger. On the binary set at 110 minutes the cost of an over-patient criterion is small, so a slightly generous window is cheap insurance; on the multiclass set at 550 minutes the same generosity is expensive, and it is worth tightening patience to the point where it still clears the noise but stops promptly after the peak.
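The patience rule itself is small enough to write down. The sketch below is one common formulation, not a transcript of our training code: it halts once `patience` epochs have passed since the last new best, and reports both the halting epoch and the epoch of the best value seen. The function name `early_stop` and the validation series are hypothetical, with the series shaped to wobble once before its true low at epoch 7.

```python
def early_stop(history, patience, mode="min"):
    """Apply a patience rule to a per-epoch metric series.

    Returns (stop_epoch, best_epoch), both 1-indexed. The run halts as soon as
    `patience` epochs have passed since the last new best, or when the series
    runs out (the epoch cap).
    """
    best_value, best_epoch = None, 0
    for epoch, value in enumerate(history, start=1):
        if mode == "min":
            improved = best_value is None or value < best_value
        else:
            improved = best_value is None or value > best_value
        if improved:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(history), best_epoch


# Hypothetical validation MAE: a wobble at epoch 3, the true low at epoch 7.
val_mae = [0.050, 0.040, 0.042, 0.033, 0.030, 0.031,
           0.025, 0.026, 0.027, 0.028, 0.029, 0.030]

impatient = early_stop(val_mae, patience=1)  # trips on the epoch-3 wobble
patient = early_stop(val_mae, patience=3)    # survives the wobble, finds epoch 7
print("patience=1 -> stop, best:", impatient)
print("patience=3 -> stop, best:", patient)
```

With a patience of one the rule halts at epoch 3 holding the epoch-2 checkpoint, well before the low at epoch 7. With a patience of three it rides through both wobbles, keeps epoch 7, and halts at epoch 10.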
The exhibit below is the two dials made tangible. Pick the monitored metric, drag the patience window, and the orange marker snaps to where that criterion actually halts the run, which is the best epoch seen so far plus the patience you granted. It then reports the epochs and the GPU box-minutes that stop reclaims against riding the full fifty, and whether the best checkpoint survived. Push patience down to one and watch the marker jump in front of the true peak and the badge turn red: that is the criterion stopping on noise and throwing away the best model. Give it a little more room and the run still ends well short of fifty with the peak checkpoint intact, which is the entire point.
The checkpoint you keep is not the checkpoint you stop on
There is a detail inside early stopping that is easy to get wrong and that the instrument is built to make obvious, because getting it wrong quietly poisons everything above. When a patience-based criterion fires, the weights currently in memory are not the ones you want. By construction the criterion only stops after several epochs of no improvement, so the live weights are several epochs past the peak, slightly overfit. The thing you keep is the best-so-far checkpoint, saved back when the monitored metric was at its best value (its lowest error, or its highest R-squared if that is what you watch), not the weights at the stopping step. The textbook is explicit that this restoration is part of the method, not an optional nicety [5], and it is the reason a generous patience does not cost you model quality, only time. You are paying extra epochs to be more sure the peak was really the peak; you are not shipping the degraded post-peak weights.
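The distinction between the weights you stop on and the weights you keep is easy to see in a toy loop. This is a sketch under stated assumptions rather than our training harness: the class name `BestCheckpointStopper` is hypothetical, the "model" is a plain dictionary whose single weight is overwritten each epoch as a stand-in for a training step, and the metric series is invented.

```python
import copy


class BestCheckpointStopper:
    """Patience-based stopper that snapshots the best-so-far state."""

    def __init__(self, patience, mode="min"):
        self.patience = patience
        self.mode = mode
        self.best_value = None
        self.best_epoch = 0
        self.best_state = None

    def step(self, epoch, value, state):
        """Record one epoch; return True when the run should halt."""
        if self.mode == "min":
            better = self.best_value is None or value < self.best_value
        else:
            better = self.best_value is None or value > self.best_value
        if better:
            self.best_value, self.best_epoch = value, epoch
            # Deep copy: a bare reference would keep changing as training continues.
            self.best_state = copy.deepcopy(state)
            return False
        return epoch - self.best_epoch >= self.patience


# Invented validation MAE with its low at epoch 4.
val_mae = [0.050, 0.040, 0.030, 0.025, 0.026, 0.027, 0.028]
state = {"weights": [0.0]}  # toy stand-in for model parameters
stopper = BestCheckpointStopper(patience=3)

for epoch, value in enumerate(val_mae, start=1):
    state["weights"][0] = float(epoch)  # stand-in for one epoch of training
    if stopper.step(epoch, value, state):
        break

live_weights = list(state["weights"])  # what is in memory at the stop step
restored = stopper.best_state          # what you actually keep
print("stopped on:", live_weights, "kept:", restored["weights"])
```

The loop halts at epoch 7 with epoch-7 weights in memory, while the snapshot holds the epoch-4 weights from the metric's low. The deep copy is the part that is easy to omit: without it the "best" state is just another name for the live one.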
This is also what makes the failure mode of too-little patience so sharp rather than gradual. If patience is high enough to clear the peak, the kept checkpoint is the true best regardless of how much extra time you burned confirming it, so the only thing at stake is wasted box-time. But if patience is so low that the criterion fires before the metric has actually peaked, the best-so-far you restore is a pre-peak checkpoint, and now you have lost real quality, not just time. That asymmetry is why, when we were unsure, we erred toward slightly more patience rather than less, and tightened it only on the expensive multiclass run where the box-time saved was worth the closer call. Losing minutes is recoverable. Shipping a checkpoint from before the model finished learning is not, at least not without paying the full run again.
What this bought on two runs
Put both dials together and the criterion stops being an abstraction and becomes a line on the box-time budget. On the multiclass run, monitoring the depth-error metric with a patience tuned just above the validation noise, the criterion halts the run a healthy margin before epoch fifty and hands back a slice of the 550 minutes while keeping the peak checkpoint. On a shared box that we were not guaranteed to keep all night, that is the difference between a job that finishes and a job that gets evicted before it does. On the binary run the absolute saving in minutes is smaller because the whole run is only 110 minutes, but the same criterion applies unchanged, and the marginal cost of running it is nil. The criterion is not a different tool for the two datasets. It is the same two dials, set tighter where the box-time is dear and looser where it is cheap.
None of this is exotic, and that is rather the point. We did not invent a stopping rule; we used the standard patience-based criterion with best-checkpoint restoration that Prechelt formalised and the textbook describes [1] [5], chose the monitored metric to match the error a geoscientist would actually feel rather than the one that looked best on a slide [4], and set patience to the smallest window that survived our validation noise. The capacity argument that makes early stopping safe to lean on is Caruana, Lawrence, and Giles's [2] and the regularization reading is Bishop's [3]. What we contributed was the calibration on two specific runs with two specific budgets, and the discipline to treat the cap and the criterion as the separate decisions they are.
The two dials, in brief
- A fifty-epoch cap only bounds the worst case. It does not know when the run actually peaked, so on the 550-minute multiclass run and the 110-minute binary run it cannot reclaim the box-time spent after the best held-out checkpoint. The stopping criterion is what finds the best case inside the cap.
- A criterion is two dials, not one. The monitored quantity decides what counts as better, and the patience window decides how much epoch-to-epoch noise to tolerate before quitting. Both are set before the run and fire without you.
- The monitored metric is a real choice because the summaries disagree. Best validation R-squared 0.9891, lowest MAE 0.0132, and lowest MSE 0.0004 are different lenses, and a squared-error view can keep improving on a few outlier depths after the typical-depth view already considers the model best.
- Patience too low stops on noise and keeps a pre-peak checkpoint, which loses model quality, not just time. Patience too high rides toward the cap and wastes box-time. The right value is the smallest window that clears your validation noise, tightened where the run is expensive.
- Early stopping restores the best-so-far checkpoint, not the weights at the stop step. That asymmetry is why generous patience costs only time while stingy patience costs quality, and why we erred toward patience and tightened it only on the costly multiclass run.
Limitations
This account is calibration on two runs from one engagement, not a benchmark, and it should be read that way. The monitored metrics and their best values are the real archive numbers, and the full-run costs of 550 and 110 minutes are measured, but the per-epoch trajectory the instrument draws toward those end-points, including the slight post-peak drift that gives patience something to do, is illustrative geometry rather than a logged epoch-by-epoch series; the patience rule itself is applied to that path exactly. The right patience and the right monitored metric are properties of a dataset's validation noise and a team's actual objective, so our settings do not transfer as constants to a different operator's logs or a different curve count. We also held the cap fixed at fifty throughout and only varied where inside it to stop; whether fifty was itself the right ceiling, as opposed to a budget we inherited and never had reason to move, is a separate question this note does not try to answer. And early stopping governs only when training ends. It says nothing about whether the validation split was representative, whether the synthetic logs covered the field failure modes, or whether a checkpoint that scores well on held-out depths actually produces a usable digitised curve, which remain the questions that decide whether the model is any good.
Where the cap and the criterion part ways
The habit this left us with is to refuse to let the epoch cap stand in for a decision it cannot make. The cap is a guardrail for the worst case and we keep it for exactly that reason, but the question of when a run is done belongs to a criterion that watches the metric we actually care about and waits exactly long enough to be sure the peak was real. Set those two dials with some care and a fifty-epoch budget stops being a number the run drifts up against and becomes a ceiling the run rarely needs, because it has already quit, on purpose, at the point where continuing stopped paying.
References
[1] Prechelt, L. Early Stopping - But When? In Neural Networks: Tricks of the Trade, 2nd ed., Lecture Notes in Computer Science 7700, Springer (2012), pp. 53-67. The systematic account of turning validation-based stopping into a criterion, and the tradeoff between patience and training time. https://link.springer.com/chapter/10.1007/978-3-642-35289-8_5
[2] Caruana, R., Lawrence, S., and Giles, C. L. Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping. Advances in Neural Information Processing Systems 13 (NIPS 2000). The argument that the stopping point, not the parameter count, is what bounds overfitting. https://proceedings.neurips.cc/paper_files/paper/2000/hash/059fdcd96baeb75112f09fa1dcc740cc-Abstract.html
[3] Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press (1995). The framing of validation-monitored early stopping as implicit regularization that limits the effective complexity the network reaches. https://global.oup.com/academic/product/neural-networks-for-pattern-recognition-9780198538646
[4] Willmott, C. J., and Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research 30 (2005), pp. 79-82. Why squared-error summaries conflate average error with error variance, the reason an absolute-error and a squared-error criterion can disagree on the best epoch. https://www.int-res.com/abstracts/cr/v30/cr030079
[5] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press (2016), Chapter 7. The modern engineering description of patience, of monitoring a held-out metric, and of restoring the best-so-far parameters when the run halts. https://www.deeplearningbook.org/