Tuning a Transformer for the Subsurface: The Search Behind a 0.0004 Learning Rate

A learning rate of 0.0004 looks like a number you would never write down on purpose. It is not a round default, not a textbook 1e-3, not a value any tutorial recommends. It is what the data chose. In our work with a mid-sized Middle East carbonate operator, getting a Detection Transformer to converge on a 14-well corpus of image logs from two different microresistivity imaging tools came down to a disciplined hyperparameter search, and the final recipe reads like a list of small, defensible departures from every DETR default we started with. This case study is about that search: what we swept, what broke, and why the engineering decisions matter more than the headline metric they eventually produced.

Why the defaults do not survive contact with borehole data

The model — internally GeoBFDT — is a Detection Transformer: a convolutional backbone feeds a transformer encoder–decoder that emits a fixed-size set of object queries, each resolving to a fracture, a bedding plane, or no-object, plus a regressed triplet of depth, dip, and azimuth. The architecture is proven on natural images. None of its standard training settings transfer.

The reason is the data regime. DETR was designed for COCO, with over a hundred thousand densely labelled images across eighty classes. Our starting reservoir interval contributed 32 sinusoids across 236 image patches, only 19 of which held a sinusoid at all. That is nearly three orders of magnitude fewer images (closer to four if only the 19 sinusoid-bearing patches count), two classes instead of eighty, and a long-tailed label distribution where most queries should resolve to no-object. Every default that assumes abundant data, whether a large pretrained backbone, a large batch, or a long epoch budget, becomes an overfitting machine. The search was, in effect, a controlled descent from a configuration built for data abundance to one built for data scarcity.

We made the regression targets learnable first. Depth, dip, and azimuth live on wildly different scales, one in metres and two on different angular ranges, so each target is normalised into a 0–1 range by dividing by a fixed divisor: 100 for depth (the patch height), 90 for dip, and 360 for azimuth. Without that, the L1 regression loss is dominated by azimuth, whose raw errors can be far larger than those of depth or dip, and the model never learns depth. It is a one-line change that no detection tutorial mentions, because no detection tutorial regresses geology.
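The normalisation described above can be sketched in a few lines. This is a minimal illustration and not the GeoBFDT code: the function names and the example values are hypothetical, and depth is assumed to be expressed in the same units as the patch height of 100.

```python
# Fixed divisors from the text: patch height 100, dip 90, azimuth 360.
DIVISORS = (100.0, 90.0, 360.0)


def normalise_target(depth, dip, azimuth):
    """Map a raw (depth, dip, azimuth) triplet into the 0-1 range."""
    return tuple(v / d for v, d in zip((depth, dip, azimuth), DIVISORS))


def denormalise_target(depth_n, dip_n, azimuth_n):
    """Invert normalise_target to recover raw values from model outputs."""
    return tuple(v * d for v, d in zip((depth_n, dip_n, azimuth_n), DIVISORS))


def l1(pred, target):
    """Sum of absolute errors over the three regression targets."""
    return sum(abs(p - t) for p, t in zip(pred, target))


# Hypothetical example values, chosen only to show the scale effect.
raw_target = (50.0, 45.0, 270.0)
raw_pred = (40.0, 40.0, 180.0)

raw_loss = l1(raw_pred, raw_target)  # 10 + 5 + 90: azimuth dominates
norm_loss = l1(normalise_target(*raw_pred), normalise_target(*raw_target))
print(raw_loss, norm_loss)
```

On raw values the azimuth error contributes 90 of the 105 units of loss in this example; after normalisation the three terms are of comparable size, so depth and dip errors are no longer drowned out.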

The first configuration, and why it overfit

Our first end-to-end DETR run used settings close to the published reference: AdamW, learning rate 0.0001, batch size 256, 100 epochs, the 100/90/360 normalisation in place. It trained. It also overfit — visibly, with a train/validation gap that widened as the network memorised the handful of real sinusoids rather than learning the geometry of a sine wave. The batch size of 256 was part of the problem: with so few sinusoid-bearing patches, a 256-image batch was reusing the same augmented examples many times per step, and the large batch flattened the gradient noise that small-data training actually needs.

That run told us the search space was not "which learning rate" but "which entire regime." So we set up a proper sweep rather than hand-tuning. The configurable space spanned the learning rate (log-uniform from 1e-4 to 1e-1), a separate backbone learning rate (log-uniform 1e-6 to 1e-4), weight decay (log-uniform 1e-6 to 1e-2), the backbone family (ResNet-18, ResNet-50, EfficientNet-B2 as starting candidates), positional-embedding type (sine vs learned), encoder/decoder depth (3 to 6 layers), feedforward dimension (512, 1024, 2048), hidden dimension (128, 256, 512), and the number of attention heads (4, 8, 16). Early sweep bookkeeping was unglamorous — two experiments complete, twenty in flight, a tuning run on the 30th producing a cross-entropy result on a single well interval a week later — but it is the bookkeeping that turns a guess into a result.
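The search space above can be written down as a small sampler. The sketch below assumes plain random search, since the search strategy is not stated here; the dictionary keys, the backbone identifiers, and the choice to sample encoder and decoder depth independently are illustrative assumptions.

```python
import math
import random

BACKBONES = ["resnet18", "resnet50", "efficientnet_b2"]
POSITION_EMBEDDINGS = ["sine", "learned"]
FFN_DIMS = [512, 1024, 2048]
HIDDEN_DIMS = [128, 256, 512]
NUM_HEADS = [4, 8, 16]


def log_uniform(rng, low, high):
    """Sample so that every decade between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one candidate configuration from the ranges listed in the text."""
    return {
        "lr": log_uniform(rng, 1e-4, 1e-1),
        "lr_backbone": log_uniform(rng, 1e-6, 1e-4),
        "weight_decay": log_uniform(rng, 1e-6, 1e-2),
        "backbone": rng.choice(BACKBONES),
        "position_embedding": rng.choice(POSITION_EMBEDDINGS),
        "enc_layers": rng.randint(3, 6),  # inclusive on both ends
        "dec_layers": rng.randint(3, 6),
        "dim_feedforward": rng.choice(FFN_DIMS),
        "hidden_dim": rng.choice(HIDDEN_DIMS),
        "nheads": rng.choice(NUM_HEADS),
    }


rng = random.Random(0)  # a fixed seed keeps every trial reproducible
trials = [sample_config(rng) for _ in range(20)]
```

Seeding the generator and logging each sampled dictionary next to its result is one simple way to get the reproducible bookkeeping the text describes.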

Search, do not guess

The point of a defined search space is not to find a magic value. It is to make the dead ends reproducible. Every setting we rejected — batch 256, ResNet-50, a 1e-3 learning rate — is a documented run, not a hunch, which is what lets the next engineer trust the recipe instead of re-deriving it.

The recipe the data chose

The search converged on a configuration that is, line by line, a small-data argument:

Optimiser: AdamW. We tried SGD and Adam against AdamW under matched conditions. AdamW won — its decoupled weight decay is the difference between a transformer that regularises cleanly on a tiny corpus and one that does not.
Learning rate: 0.0004. The initial sweep ran from 0.001 down to 0.0005, but the optimum landed just below the bottom of that range, at 0.0004. That is four times the original 0.0001, and it held up only once the batch size came down with it. Higher rates destabilised the set-prediction loss; lower rates stalled inside the 40-epoch patience window.
Batch size: 128. Half the starting 256. Smaller batches restored useful gradient noise and stopped the optimiser from over-smoothing across the scarce positive patches.
Backbone: ResNet-10, from scratch, no pretrained weights. Smaller than any candidate we began with. On this corpus, ImageNet features were a liability, not a head start.
Depth: 4 encoder and 4 decoder layers, feedforward dimension 1024. We tested feedforward widths of 512, 1024, and 2048; 1024 was the sweet spot between capacity and overfitting.
Regularisation: dropout 0.2, early stopping after 40 epochs without improvement. The early-stop patience is doing real work — most runs were halted well before any nominal epoch budget.
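For reference, the recipe above can be captured as a plain configuration plus a patience-based early-stopping helper. This is a sketch under stated assumptions: the key names are hypothetical, and the monitored quantity is assumed to be a validation loss where lower is better.

```python
# Values taken from the recipe above; the key names are illustrative.
RECIPE = {
    "optimizer": "AdamW",
    "lr": 4e-4,
    "batch_size": 128,
    "backbone": "resnet10",
    "pretrained_backbone": False,
    "enc_layers": 4,
    "dec_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.2,
    "early_stop_patience": 40,
}


class EarlyStopping:
    """Stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Usage with a short patience so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(v) for v in [1.0, 0.9, 0.95, 0.93, 0.91]]
```

In the toy run, the loss improves twice and then fails to beat 0.9 for three consecutive epochs, so the final call signals a stop. With the recipe's patience of 40, the same logic would halt most runs well before a nominal epoch budget.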

The loss is where detection-for-geology departs most sharply from detection-for-COCO. Classification uses Focal loss, weighted 5; the depth/dip/azimuth regression uses L1, weighted 1. The 5:1 ratio is deliberate: get the class wrong and the regressed geometry is meaningless, so the matcher is told to care about what a query is before where it is. Focal loss earns its place because the no-object class swamps the positives — it down-weights the easy negatives so the rare fractures still drive the gradient. At inference we keep queries above a 0.5 probability threshold, the recall-favouring operating point, rather than the 0.9 a precision-first cross-entropy head would want. The resulting best checkpoint encodes its own recipe in its filename — learning rate, epoch count, resnet10, l1_focal — which is how we keep a reproducible audit trail from a config to a weight file.
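A per-query version of that loss weighting can be sketched as follows. The focal term uses the form from Lin et al. without the alpha balancing factor, and gamma = 2.0 is that paper's common default rather than a value confirmed for GeoBFDT; the function names and the dictionary layout of a query are assumptions.

```python
import math

FOCAL_WEIGHT = 5.0      # classification weight from the text
L1_WEIGHT = 1.0         # regression weight from the text
SCORE_THRESHOLD = 0.5   # inference threshold from the text


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one query, given the probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def l1_loss(pred, target):
    """Sum of absolute errors over normalised depth, dip, and azimuth."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def query_loss(p_true, pred_geom=None, target_geom=None):
    """Weighted loss for one query; no-object queries carry no geometry."""
    loss = FOCAL_WEIGHT * focal_loss(p_true)
    if target_geom is not None:
        loss += L1_WEIGHT * l1_loss(pred_geom, target_geom)
    return loss


def keep_detections(queries, threshold=SCORE_THRESHOLD):
    """Keep queries whose class probability is above the threshold."""
    return [q for q in queries if q["score"] > threshold]


easy_negative = focal_loss(0.95)  # confident, correct no-object query
hard_positive = focal_loss(0.30)  # fracture the model is unsure about
kept = keep_detections([{"score": 0.62}, {"score": 0.48}, {"score": 0.91}])
```

The easy negative contributes a loss several orders of magnitude smaller than the hard positive, which is the down-weighting effect the paragraph describes. With the 0.5 threshold two of the three example queries survive; a 0.9 threshold would keep only one.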

Why we trust the recipe: the ablations

A converged loss curve is not evidence that the configuration is right; it is evidence that some configuration trained. The ablations are what make the recipe defensible, because each one isolates a single decision and shows the cost of getting it wrong.

The backbone ablation is the sharpest. ResNet-10 reached a class error of 0.499; ResNet-14 held at 0.799; ResNet-18 jumped to 21.013; ResNet-34 collapsed to 26.759. The relationship is monotonic and unambiguous: the more capacity a backbone had, the harder it overfit. This is the empirical justification for choosing the smallest backbone we tested, smaller than any candidate in the initial search space, and it is precisely the opposite of the instinct most engineers bring from natural-image detection, where bigger backbones win.

The augmentation ablation is existential rather than incremental. With the geometry-preserving augmentation pipeline — colour jitter, Gaussian blur, sharpen, Gaussian noise, emboss, median blur — applied at roughly ten augmentations per sinusoid-bearing patch, the corpus grew from 236 patches / 19 positives / 32 sinusoids to 4,212 / 2,046 / 3,565, a greater-than-tenfold increase that varies texture and contrast while leaving sinusoid shape intact. Train without it and classification error is 100% — the model learns nothing usable. Train with it and class error is 2.618%, with the combined Hungarian loss falling from 0.174 to 0.0135. No learning-rate choice rescues a model that never sees enough variation; augmentation is a precondition for the rest of the recipe even mattering.
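What geometry-preserving means in practice is that only pixel intensities change, so the sinusoid labels can be copied across unchanged. The sketch below illustrates this with two of the listed operations (a contrast jitter and Gaussian noise) on a patch stored as nested lists of values in the 0-1 range; the function names, parameter values, and data layout are assumptions, not the production pipeline.

```python
import random


def _clip(v):
    """Keep a pixel value inside the 0-1 range."""
    return min(1.0, max(0.0, v))


def jitter_contrast(patch, rng, max_delta=0.2):
    """Scale contrast about the patch mean; pixel positions are untouched."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    flat = [v for row in patch for v in row]
    mean = sum(flat) / len(flat)
    return [[_clip(mean + factor * (v - mean)) for v in row] for row in patch]


def add_gaussian_noise(patch, rng, sigma=0.02):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[_clip(v + rng.gauss(0.0, sigma)) for v in row] for row in patch]


def augment(patch, labels, n_copies, seed=0):
    """Return n_copies photometric variants, each paired with the same labels."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        variant = add_gaussian_noise(jitter_contrast(patch, rng), rng)
        # No flips, crops, or warps are applied, so depth, dip, and
        # azimuth labels stay valid for every variant.
        out.append((variant, list(labels)))
    return out


patch = [[0.1, 0.5, 0.9], [0.3, 0.7, 0.2]]
labels = [("fracture", 0.5, 0.5, 0.75)]  # class plus normalised geometry
copies = augment(patch, labels, n_copies=10)
```

Blur, sharpen, emboss, and median filters fit the same pattern: each changes local texture without moving the sinusoid, which is why no label transformation is needed.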

From a configuration built for data abundance to one built for data scarcity

Before (DETR defaults): LR 0.0001, batch 256, ResNet-50/pretrained instincts, 100-epoch budget; trains but overfits on 14 wells.

After (searched recipe): AdamW, LR 0.0004, batch 128, ResNet-10 from scratch, FFN 1024, focal(×5)+L1(×1), dropout 0.2, early-stop@40.

Ablation outcomes: backbone class error 26.759 → 0.499; augmentation class error 100% → 2.618%.

What this transfers to the next engagement

The specific numbers are tied to one confidential 14-well carbonate dataset and should not be lifted wholesale. The method is what transfers, and it is the part we carry into every adjacent subsurface engagement — across operators in the Middle East and the United States, the small-data failure modes rhyme even when the geology does not.

Three rules survive the move. First, on a small geoscience corpus the search direction is almost always toward smaller — smaller backbone, smaller batch, more regularisation — because the binding constraint is overfitting, not underfitting. Second, the loss weighting is a domain decision, not a default: a 5:1 classification-to-regression ratio reflects that a misclassified sinusoid makes its geometry worthless, and that ratio should be re-derived, not inherited. Third, every rejected configuration is an asset. The reproducible record of what failed — batch 256, the heavier backbones, the higher learning rates — is what lets a downstream team start from a trusted recipe instead of re-running the entire descent. That record, version-controlled alongside the checkpoint that encodes its own hyperparameters, is the difference between a research artefact and a production training pipeline.
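A checkpoint that encodes its own hyperparameters can be as simple as a naming function and its inverse. The exact filename format is not given here, so the pattern below, including the `geobfdt_` prefix, the `.pth` extension, and the epoch count in the example, is purely hypothetical.

```python
import re

# Hypothetical naming scheme: the real filename format is not published.
_PATTERN = re.compile(
    r"^geobfdt_lr(?P<lr>[0-9.e+-]+)_ep(?P<epochs>\d+)"
    r"_(?P<backbone>[a-z0-9]+)_(?P<loss>[a-z0-9_]+)\.pth$"
)


def checkpoint_name(lr, epochs, backbone, loss_tag):
    """Build a filename that records the key training settings."""
    return f"geobfdt_lr{lr:g}_ep{epochs}_{backbone}_{loss_tag}.pth"


def parse_checkpoint_name(name):
    """Recover the settings from a filename built by checkpoint_name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name}")
    return {
        "lr": float(match["lr"]),
        "epochs": int(match["epochs"]),
        "backbone": match["backbone"],
        "loss": match["loss"],
    }


# The epoch count here is an illustrative placeholder.
name = checkpoint_name(0.0004, 57, "resnet10", "l1_focal")
settings = parse_checkpoint_name(name)
```

Round-tripping the name through the parser in a test is a cheap guard against the audit trail silently drifting from the configuration that produced the weights.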

The search behind a 0.0004 learning rate

DETR defaults do not survive 14-well borehole data: the search descended from a configuration built for data abundance (LR 0.0001, batch 256, large pretrained backbones) to one built for scarcity (LR 0.0004, batch 128, ResNet-10 from scratch).
The loss design is the domain decision — Focal classification weighted 5 against L1 regression weighted 1, with regression targets normalised by 100/90/360, plus a recall-favouring 0.5 inference threshold rather than a precision-first 0.9.
Ablations make the recipe defensible: smaller backbones win monotonically (ResNet-10 class error 0.499 vs ResNet-34's 26.759), and geometry-preserving augmentation is a precondition, not a tweak — without it classification error is 100%, with it 2.618%.

References

GeoBFDT hyperparameter search, training configuration, and ablation figures derived from internal validation on a 14-well Middle East carbonate borehole-image-log dataset; data, checkpoints, and code withheld under operator confidentiality.
Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
Loshchilov & Hutter (2019). Decoupled Weight Decay Regularization (AdamW). ICLR 2019. https://arxiv.org/abs/1711.05101
Lin et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002
