There is a comfortable lie that circulates in machine-learning teams: that the optimizer is a settled question. Reach for Adam, set the learning rate to something in the 1e-3 neighbourhood, and move on to the parts that "actually matter" — the architecture, the data, the loss. On a large dataset that complacency is mostly harmless; with enough gradient steps over enough examples, most reasonable optimizers converge to roughly the same place. But move to the regime where this kind of subsurface work actually lives — a Detection Transformer trained on the image logs of fourteen wells — and the optimizer stops being a footnote. It becomes one of the few knobs that decides whether the model generalises or memorises. In our work with a mid-sized Middle East carbonate operator, building a transformer to pick fracture and bedding sinusoids on two different microresistivity imaging tools, we swept SGD, Adam, and AdamW under otherwise identical conditions. AdamW won — a clean lesson in why optimizer choice is a tuning lever precisely when data is thin.
The setup: a small model, deliberately starved
The engineering object is a DETR-derived set-prediction transformer — a ResNet-10 backbone feeding four encoder and four decoder layers, with a Hungarian bipartite matching loss assigning predicted queries to ground-truth sinusoids. The point here is not the architecture; it is the scale. Fourteen vertical wells is not a dataset you brute-force. It is a dataset you respect, because every regularisation decision either helps the model see the geology or lets it overfit the handful of wells it has.
That respect shows up in a training recipe that is conservative by design. Batch size sits at 128. The learning rate was not guessed; it came out of a sweep from 0.001 down to 0.0005, with 0.0004 emerging as the optimal value. Dropout is held at 0.2. And rather than training to a fixed epoch budget, the run is governed by early stopping — training halts after 40 epochs with no improvement, so the model is not given the rope to keep grinding the training wells into memorisation long after validation has plateaued. The backbone is trained from scratch, with no pretrained ImageNet weights, because the texture statistics of a carbonate image log share almost nothing with natural photographs.
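As a minimal sketch, the recipe and its stopping rule might be written down as follows. The numeric values are the ones quoted above; the class names, the choice of validation loss as the monitored quantity, and the `min_delta` tolerance are illustrative assumptions rather than details of the actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    # Values quoted in the text; field names are hypothetical.
    batch_size: int = 128
    learning_rate: float = 4e-4
    dropout: float = 0.2
    patience: int = 40                # epochs without improvement before halting
    pretrained_backbone: bool = False  # backbone is trained from scratch


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs with no improvement."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        # min_delta is an assumption, not a value from the text.
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should halt."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `EarlyStopping(Recipe().patience).step(val_loss)` would be called once per epoch, and the loop would break the first time it returns `True`.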
Every one of those choices is a regulariser pointing the same direction: keep the effective capacity low, keep the model honest about how little data it has. The optimizer is the last and most consequential member of that family — and the one teams most often leave on autopilot.
Why Adam quietly breaks weight decay
To see why AdamW beats Adam in this regime, you have to look at what "weight decay" actually does inside Adam, and the answer is: not what most people think.
Classical L2 regularisation and weight decay are equivalent for plain SGD, up to a rescaling of the decay coefficient by the learning rate. Add a penalty proportional to the squared norm of the weights to the loss, take the gradient, and you get a term that shrinks every weight toward zero by a fixed fraction each step. The folklore that "L2 regularisation equals weight decay" is true, but only for SGD.
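The SGD case can be checked in three lines. The notation is ours, not the article's: w_t for the weights, η for the learning rate, λ for the penalty coefficient, and L for the unregularised loss.

```latex
\begin{aligned}
\tilde{L}(w) &= L(w) + \tfrac{\lambda}{2}\,\lVert w \rVert_2^2 \\
w_{t+1} &= w_t - \eta\,\nabla \tilde{L}(w_t)
         = w_t - \eta\,\nabla L(w_t) - \eta\lambda\, w_t \\
        &= (1 - \eta\lambda)\, w_t - \eta\,\nabla L(w_t)
\end{aligned}
```

Every weight is multiplied by the same factor (1 − ηλ) before the gradient step, whatever its gradient history looks like.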
It stops being true the moment you put an adaptive optimizer underneath it. Adam rescales each parameter's update by a running estimate of that parameter's gradient magnitude. When you implement weight decay the naive way — by folding the L2 penalty into the loss and letting Adam differentiate it — that penalty gets pushed through the same per-parameter rescaling as everything else. The decay a weight actually receives is no longer a clean fraction of its value; it is divided by the square root of that weight's accumulated second moment. Parameters with large, noisy gradients get their decay scaled down; parameters with small, stable gradients get it scaled up. The regularisation you thought you dialled in is silently warped, per-parameter, by the optimizer's own adaptive machinery.
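The distortion is easy to reproduce on a toy problem. The sketch below is an illustration under stated assumptions, not the experiment from our sweep: a single scalar weight, a synthetic loss gradient that alternates in sign so that it averages to zero, textbook Adam defaults, and an L2 coefficient `lam` chosen for visibility. The function name and all hyperparameter values are hypothetical.

```python
import math


def adam_l2_shrinkage(grad_scale, steps=1000, lr=1e-3, lam=1e-2,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam with an L2 penalty folded into the gradient; return how far w fell.

    The synthetic loss gradient alternates +grad_scale, -grad_scale, so it
    averages to zero. Apart from a small start-up transient, the net drift of
    w toward zero is therefore driven by the penalty term.
    """
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        loss_grad = grad_scale if t % 2 else -grad_scale
        g = loss_grad + lam * w                 # coupled: penalty enters the gradient
        m = beta1 * m + (1 - beta1) * g         # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate
        m_hat = m / (1 - beta1 ** t)            # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return 1.0 - w


# Same starting weight, same penalty coefficient, different gradient magnitude.
shrink_large_grads = adam_l2_shrinkage(grad_scale=1.0)
shrink_small_grads = adam_l2_shrinkage(grad_scale=0.01)
```

Both weights start at 1.0 with the same nominal penalty, yet the weight with small gradients is pulled toward zero far more strongly than the one with large gradients, because the penalty term is divided by a much smaller second-moment estimate.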
AdamW, which is Adam with decoupled weight decay [1], fixes exactly this. It removes the penalty from the loss and applies the decay directly to the weights as a separate, post-gradient step: shrink every weight by a fixed fraction, untouched by the adaptive scaling. The decay becomes a true, uniform pull toward zero again, the way it is under SGD, while the adaptive step still handles the gradient. The two jobs (follow the gradient, and shrink the weights) are cleanly separated, which is the whole point of the name.
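A decoupled update can be sketched in a few lines. This is a toy under stated assumptions rather than a reference implementation: a single scalar weight, a synthetic zero-mean loss gradient that alternates in sign, and a decay term scaled by `lr * weight_decay`, which is one common convention. The function name and the hyperparameter values are hypothetical.

```python
import math


def adamw_shrinkage(grad_scale, steps=1000, lr=1e-3, weight_decay=1e-2,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """Run a decoupled-decay Adam update on one weight; return how far w fell."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 else -grad_scale   # loss gradient only, no penalty
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
        decay_step = lr * weight_decay * w         # decoupled: never rescaled by v_hat
        w = w - adaptive_step - decay_step
    return 1.0 - w


# Same starting weight, same decay coefficient, different gradient magnitude.
shrink_large_grads = adamw_shrinkage(grad_scale=1.0)
shrink_small_grads = adamw_shrinkage(grad_scale=0.01)
```

Here the two shrinkages agree to within floating-point noise, even though the gradient magnitudes differ by a factor of one hundred: the decay each weight receives no longer depends on its gradient history.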
Decoupled weight decay, meaning the weight-shrink term is applied directly to the parameters as a separate step rather than as an L2 penalty inside the loss, sounds like a fine implementation detail. On a large dataset it nearly is. On fourteen wells it is the difference between a regulariser that does its job and one that has been quietly defeated by the optimizer it was supposed to ride on.
The three-way sweep, and why each loses or wins
Treat the optimizer as a hyperparameter and the comparison falls out cleanly. We ran SGD, Adam, and AdamW through the same architecture, the same batch size of 128, the same 0.0004 learning rate, the same dropout of 0.2, and the same early-stopping rule. AdamW was the best of the three. Here is the engineering intuition for why each landed where it did.
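Structurally, "optimizer as a hyperparameter" means a loop in which the optimizer name is the only thing that varies. A minimal sketch, assuming a hypothetical `train_and_validate` callback that trains one model and returns a validation loss (lower is better); the dictionary keys and function names are ours, while the shared values are the ones quoted above.

```python
from typing import Callable, Dict, Tuple

# Shared settings quoted in the text; everything except the optimizer is held fixed.
SHARED = {"batch_size": 128, "learning_rate": 4e-4, "dropout": 0.2, "patience": 40}
OPTIMIZERS = ("sgd", "adam", "adamw")


def sweep_optimizers(
    train_and_validate: Callable[[str, dict], float],
    optimizers: Tuple[str, ...] = OPTIMIZERS,
    shared: dict = SHARED,
) -> Tuple[str, Dict[str, float]]:
    """Train once per optimizer with identical settings.

    Returns the name of the optimizer with the lowest validation loss and the
    full name -> loss mapping. `train_and_validate` is a hypothetical callback
    supplied by the caller; it is not part of any library.
    """
    # Each run gets its own copy of the shared settings so none can mutate them.
    scores = {name: train_and_validate(name, dict(shared)) for name in optimizers}
    best = min(scores, key=scores.get)
    return best, scores
```

Passing a copy of the shared settings into every run is a deliberate choice: it makes it hard for one run to leak a changed learning rate or dropout value into the next, which would quietly break the "otherwise identical" premise of the comparison.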
SGD is the most honest optimizer in the room and, on this problem, the most fragile. With no per-parameter adaptation, it leans entirely on the learning rate and schedule being right, and it is slow to traverse the ill-conditioned, high-curvature loss surface that a transformer with sparse set-prediction supervision presents. The Hungarian matching loss is unforgiving here: the supervision signal per query is sparse, the assignment is global, and SGD's uniform step size struggles to make progress on the cocktail of fast-moving classification logits and slow-moving regression heads at once. It is not that SGD cannot work — it is that, on a small, badly-conditioned problem, it leaves performance on the table that an adaptive method picks up for free.
Adam fixes the conditioning problem and converges fast, which is exactly why it is everyone's default. But on fourteen wells, fast convergence is a liability if it is not paired with effective regularisation — and as we just saw, Adam's coupled weight decay is not effective. It converges quickly to a solution that fits the training wells a little too well, with its nominal regularisation half-disarmed by its own adaptive scaling. Fast to a slightly overfit place is not where you want to be when generalisation across wells is the entire game.
AdamW keeps Adam's fast, well-conditioned steps and restores the weight decay to full strength. You get the convergence speed and the uniform shrinkage, working together instead of against each other. On a problem this small, that combination is decisive: it is the only one of the three that gives you both halves of what a thin-data transformer needs.
The structural lesson generalises beyond optimizers. When you ablate a training recipe under identical conditions, the component whose gradient or update best aligns with what you are actually trying to optimise is the one that wins — and the winner is frequently not the default. The same logic that picks AdamW over Adam here is the logic that picks one loss function over another in a segmentation pipeline.
The recipe is a system, not a list of knobs
It would be a misreading to take "use AdamW" as the lesson and walk away. The optimizer won because it sat inside a coherent regularisation system whose pieces reinforce each other. The 0.0004 learning rate is the value that lets AdamW's decoupled decay matter — too high and the model thrashes before decay can act; too low and early stopping fires before convergence. Dropout at 0.2 attacks overfitting from the activations side while weight decay attacks it from the weights side — complementary, not redundant. Early stopping at 40 epochs of no improvement is the safety net that makes aggressive convergence safe. And training the ResNet-10 from scratch keeps the capacity budget small enough that all of this regularisation has something to bite on.
Pull any one piece and the others compensate poorly. This is why "just use the default optimizer" is bad engineering on small data: the optimizer is load-bearing in a structure where every member carries weight. The hyperparameter sweep that found 0.0004 is the same sweep that should be finding your optimizer — they are not separate concerns, and treating the optimizer as fixed while you tune everything else leaves the single highest-leverage knob untouched.
Takeaways for the practitioner
If you are training a transformer — or any moderately sized model — on a genuinely small dataset, do not inherit your optimizer from a tutorial. Treat it as a first-class hyperparameter and sweep it alongside the learning rate, because the regime where optimizer choice matters most is exactly the regime you are in. Prefer AdamW over Adam by default whenever weight decay is part of your regularisation story, because Adam's coupled decay is quietly distorted by its own adaptive scaling and you will not see the damage until generalisation fails. Reach for SGD only when you have the data and the schedule budget to make it shine. And remember that the optimizer is one member of a regularisation system — learning rate, dropout, early stopping, and capacity all have to agree with it. On fourteen wells, that agreement is not a luxury. It is the only reason the model works at all.
Key takeaways
- On a 14-well fracture-picking transformer, the optimizer is a tuning lever, not a footnote. Swept under identical conditions (batch 128, LR 0.0004 from a 0.001→0.0005 sweep, dropout 0.2, early stop at 40 epochs of no improvement), AdamW beat both Adam and SGD.
- Adam's weight decay is quietly broken: implemented as an L2 penalty in the loss, the decay gets pushed through Adam's per-parameter adaptive scaling, so the regularisation each weight receives is distorted rather than uniform. This barely matters on big data and matters enormously on small data.
- AdamW decouples weight decay — applying the shrink directly to the weights after the gradient step — restoring the uniform pull toward zero that L2 only gives you under plain SGD, while keeping Adam's fast, well-conditioned steps.
- SGD is the most honest and most fragile: no per-parameter adaptation, slow on the ill-conditioned loss surface of a sparsely-supervised set-prediction transformer. Adam converges fast but to a slightly overfit place with its decay half-disarmed. AdamW gets both halves.
- The optimizer is load-bearing inside a regularisation system — LR 0.0004, dropout 0.2, early stopping, a from-scratch ResNet-10 backbone, and decoupled decay all reinforce each other. Sweep the optimizer alongside the learning rate; never inherit it as a default.
References
[1] Loshchilov, I., and Hutter, F. Decoupled Weight Decay Regularization (AdamW). ICLR (2019). The paper that diagnoses the coupling between L2 regularisation and Adam's adaptive update, and proposes decoupled weight decay as the fix. https://arxiv.org/abs/1711.05101