The instinct every ML engineer carries from ImageNet is that a deeper backbone is a better backbone. On the subsurface dataset we trained for a mid-sized Middle East carbonate operator, that instinct was not merely unhelpful — it was actively wrong, and the failure was loud enough to read off a single ablation table. Swapping the feature extractor inside our Detection Transformer from a 10-layer residual network to a 34-layer one did not buy a fraction of a point of accuracy; it detonated the model. Class error went from 0.499 to 26.759. This is the story of why, and why we shipped the smallest backbone we tried.
The setting: a transformer detector trained on scarce well data
The model is GeoBFDT, a Detection Transformer we built to pick fractures and bedding planes (collectively, sinusoids) directly from image logs acquired with two different microresistivity imaging tools across 14 vertical wells of a fractured carbonate play. It is a DETR-derived, mask-free detector that predicts a whole set of sinusoids end-to-end, emitting class plus depth, dip, and azimuth per sinusoid in one forward pass, with no anchors, no masks, and no non-maximum suppression. The architecture is the standard set-prediction spine: a convolutional backbone extracts a feature map, a 4-layer transformer encoder builds global context across the patch, a 4-layer decoder turns learned query embeddings into a set of predictions, and a Hungarian bipartite matching loss assigns each prediction to at most one ground-truth sinusoid. The classification head uses Focal loss (weight 5); the regression head emits depth, dip, and azimuth under an L1 loss (weight 1).
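The one-to-one assignment described above can be illustrated with a minimal sketch. It is not the GeoBFDT matcher: it brute-forces the assignment over a tiny set instead of running the Hungarian algorithm (a real pipeline would use a polynomial-time solver such as SciPy's `linear_sum_assignment`), it assumes the matching cost reuses the loss weights of 5 and 1, and it uses a plain `1 - p` classification cost rather than Focal loss. The dictionary layout, the function names, and the assumption that depth, dip, and azimuth are pre-normalised to [0, 1] are all hypothetical.

```python
from itertools import permutations

def match_cost(pred, truth, w_class=5.0, w_param=1.0):
    """Cost of assigning one prediction to one ground-truth sinusoid."""
    # Classification term: small when the true class gets high probability.
    class_cost = 1.0 - pred["probs"][truth["label"]]
    # Regression term: L1 distance over (depth, dip, azimuth), assumed in [0, 1].
    param_cost = sum(abs(p - t) for p, t in zip(pred["params"], truth["params"]))
    return w_class * class_cost + w_param * param_cost

def bipartite_match(preds, truths):
    """Return the minimum-cost one-to-one assignment and its total cost.

    Brute force over all assignments; assumes len(preds) >= len(truths).
    """
    best_cost, best_pairs = float("inf"), []
    # Each permutation gives every ground truth a distinct prediction index.
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(match_cost(preds[p], truths[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(perm)]
    return best_pairs, best_cost

# Illustrative values only. Class 0 = fracture, class 1 = bedding plane.
preds = [
    {"probs": [0.9, 0.1], "params": [0.20, 0.50, 0.30]},
    {"probs": [0.2, 0.8], "params": [0.70, 0.10, 0.60]},
    {"probs": [0.5, 0.5], "params": [0.95, 0.95, 0.95]},
]
truths = [
    {"label": 1, "params": [0.72, 0.12, 0.58]},
    {"label": 0, "params": [0.22, 0.48, 0.31]},
]

pairs, cost = bipartite_match(preds, truths)
print(pairs, round(cost, 2))
```

Prediction 2 is left unmatched, which is the behaviour set prediction relies on: surplus queries are assigned to no ground truth instead of being suppressed after the fact.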
The backbone is the one component the practitioner picks more or less by reflex, and it is the only part of this architecture whose capacity scales with a single integer: its layer count. Everything downstream of it (the encoder, the decoder, the matching loss) is fixed. That makes the backbone the cleanest possible lever to study capacity in isolation, which is exactly what the engagement forced us to do.
The forcing function was data. This was not an ImageNet-scale corpus; it was a confidential producing reservoir with a hard ceiling of 14 wells — too few, the reviewers later noted, to even hold out an entire well as a test set. We trained from scratch, with no pretrained weights, because no public borehole-image model exists to fine-tune from. The training recipe was deliberately conservative: AdamW at a learning rate of 0.0004, batch size 128, dropout 0.2, feedforward dimension 1,024, early stopping after 40 epochs without improvement. In a regime that sparse, a feature extractor's capacity stops being a free upgrade and becomes a liability you have to actively manage.
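As a rough sketch, the recipe above can be written down as a frozen config plus a patience-based early-stopping rule. The hyperparameter values come from the text; the field names, the `EarlyStopper` class, and the choice of validation loss as the monitored quantity are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the text; field names are hypothetical."""
    optimizer: str = "AdamW"
    learning_rate: float = 4e-4
    batch_size: int = 128
    dropout: float = 0.2
    ffn_dim: int = 1024
    encoder_layers: int = 4
    decoder_layers: int = 4
    patience: int = 40  # epochs without improvement before stopping

class EarlyStopper:
    """Stop once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

config = TrainConfig()

# Short demo with patience 3 so the stop is visible in five epochs.
demo = EarlyStopper(patience=3)
decisions = [demo.step(loss) for loss in [1.0, 0.8, 0.9, 0.85, 0.81]]
print(decisions)
```

In the demo, the best loss of 0.8 is never beaten again, so the third consecutive non-improving epoch triggers the stop.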
The experiment: four backbones, one knob
We held the entire pipeline constant (same patches, same augmentation, same encoder/decoder, same loss, same optimiser) and varied only the backbone across four residual depths: ResNet-10, ResNet-14, ResNet-18, and ResNet-34. Three metrics travelled with each run: the Hungarian matching loss (the assignment quality), the parameter loss (the L1 error on depth/dip/azimuth), and the validation class error. The class error is the one to watch: it is the share of sinusoids the model mis-classifies, expressed in percent, and it is where the capacity story becomes impossible to ignore.
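To make the parameter loss concrete, here is a small sketch of how depth, dip, and azimuth define a sinusoid on an unrolled borehole image, and how an L1 error over those three numbers might be computed. It assumes depth is positive downward and that the trace is deepest at the dip azimuth; sign conventions differ between tools and software. The function names, the 0.1 m borehole radius, the mean reduction, and the normalised parameter values are hypothetical, and azimuth wrap-around at 360° is ignored.

```python
import math

def sinusoid_depth(theta_deg, centre_depth_m, dip_deg, azimuth_deg, radius_m):
    """Depth of a planar feature's trace at borehole-wall angle theta.

    Assumed convention: depth increases downward and the trace is deepest
    where theta equals the dip azimuth.
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    return centre_depth_m + amplitude * math.cos(math.radians(theta_deg - azimuth_deg))

def l1_param_loss(pred, truth):
    """Mean absolute error over (depth, dip, azimuth), assumed normalised."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

# A 45-degree plane dipping towards azimuth 90 in a 0.1 m radius hole.
deepest = sinusoid_depth(90, 100.0, 45, 90, 0.1)
shallowest = sinusoid_depth(270, 100.0, 45, 90, 0.1)
loss = l1_param_loss((0.5, 0.2, 0.3), (0.4, 0.2, 0.5))
print(round(deepest, 3), round(shallowest, 3), round(loss, 3))
```

The peak-to-trough height of the trace is `2 * radius * tan(dip)`, which is why steeper features produce taller sinusoids on the image.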
The numbers do not slope; they cliff:
- ResNet-10 — Hungarian loss 0.0151, parameter loss 0.056, class error 0.499. The lightest backbone, and the winner outright.
- ResNet-14 — 0.071 / 0.129 / 0.799. Still healthy; class error not quite doubled.
- ResNet-18 — 0.053 / 0.1562 / 21.013. The hinge. Going from 14 to 18 layers raised class error by a factor of roughly twenty-six. The model is no longer in the same regime.
- ResNet-34 — 0.682 / 0.3029 / 26.759. Every metric is the worst in the table. The deepest backbone produced the deepest failure.
There is no interpretation of this table in which more capacity helps. Class error and parameter loss rise with every step in depth; the one exception in the table is the Hungarian loss, which dips from 0.071 to 0.053 between ResNet-14 and ResNet-18 before jumping to 0.682 at ResNet-34. And the jump in class error from ResNet-14 to ResNet-18 is not a gentle degradation: it is a phase change from a working model to a broken one.
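The step-to-step ratios quoted for the table can be checked with a few lines of arithmetic. The figures are copied from the four rows above; the variable names are only for illustration.

```python
# (Hungarian loss, parameter loss, class error) per backbone, from the table.
ablation = {
    "ResNet-10": (0.0151, 0.056, 0.499),
    "ResNet-14": (0.071, 0.129, 0.799),
    "ResNet-18": (0.053, 0.1562, 21.013),
    "ResNet-34": (0.682, 0.3029, 26.759),
}

names = list(ablation)  # insertion order is shallowest to deepest
class_error = {name: ablation[name][2] for name in names}

# Ratio of class error between each backbone and the next deeper one.
step_ratios = {
    f"{a} -> {b}": class_error[b] / class_error[a]
    for a, b in zip(names, names[1:])
}
for step, ratio in step_ratios.items():
    print(f"{step}: x{ratio:.2f}")

# End-to-end ratio between the deepest and the lightest backbone.
overall = class_error["ResNet-34"] / class_error["ResNet-10"]
best = min(names, key=lambda name: class_error[name])
print(f"best: {best}, overall: x{overall:.1f}")
```

The middle step is the outlier: class error grows by a factor of about 26 between ResNet-14 and ResNet-18, against factors below 2 on either side of it.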
Why deeper broke: capacity versus a 14-well manifold
The mechanism is the oldest one in the book, and it is worth stating precisely because the magnitude here is unusual. A ResNet-34 has the representational capacity to memorise a corpus this size. With no pretrained prior to anchor it and only a few thousand augmented patches drawn from 14 wells, the extra layers do not learn more general borehole physics; they learn the training set's idiosyncrasies, including its label noise and its per-well imaging quirks. The encoder and decoder downstream then inherit a feature map that is exquisitely tuned to the patches it was trained on and useless on the geology it has not seen. Validation class error is the receipt.
The lighter backbone wins for the symmetric reason. ResNet-10 simply cannot memorise the corpus, so the only loss it can drive down is the one that generalises: the shared sinusoid structure — the conductive-versus-resistive contrast of a fracture trace, the gentle curvature of a bedding plane — that recurs across every well. Capacity control is not a regularisation afterthought layered on top of the architecture; on a dataset this small it is the architecture decision. The backbone you would reach for on ImageNet is the backbone that overfits here, and the one you would dismiss as a toy is the one that ships.
This is also the cleanest demonstration of a principle we apply across every small-data subsurface engagement: the feature extractor is the reusable, load-bearing part of the model, and choosing its capacity to match the data manifold — not the task's apparent difficulty — is where generalisation is won or lost. The head can be swapped; the backbone is what the rest of the network is forced to trust.
Ruling out the easy explanations
A single ablation row is a hypothesis, not a finding. Before we committed ResNet-10 to production we had to confirm the collapse was about backbone capacity and not a confound hiding in the training recipe — a learning rate that happened to suit one depth, or an unlucky seed.
Two adjacent ablations did that work. The augmentation study ran the same ResNet-10 backbone with and without our geometry-preserving augmentation pipeline; without it, class error sat at 100% (the model learned nothing usable), and with it, 2.618%. The well-count study held the architecture fixed and grew the training set from 3 wells to 14: Hungarian loss fell from 0.801 to 0.015 and class error from 93.115% to a low single digit. Both results point the same direction the backbone table does: this model's behaviour is governed by the data-to-capacity ratio, not by a fragile hyperparameter. The augmentation pipeline was what made even the small backbone trainable. It grew a brutally sparse seed interval of 236 patches (just 19 of them sinusoid-bearing, 32 sinusoids total) into 4,212 patches with 2,046 sinusoid patches and 3,565 sinusoids. That is roughly an 18-fold expansion in patches and more than a 100-fold expansion in labelled sinusoids, and it preserves sinusoid shape while varying contrast and texture. With data that scarce and capacity that abundant, ResNet-34 never stood a chance.
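The augmentation pipeline itself is not described beyond "preserves sinusoid shape while varying contrast and texture", so the following is only a hypothetical sketch of that idea. It rescales contrast around the patch mean and adds pixel noise without moving any pixel, so depth, dip, and azimuth labels stay valid. The function name, the gain range, the noise level, and the assumption that intensities lie in [0, 1] are all illustrative.

```python
import random

def augment_patch(patch, rng, gain_range=(0.8, 1.2), noise_std=0.02):
    """Vary contrast and texture of a 2D patch without changing its geometry.

    `patch` is a list of rows of intensities assumed to lie in [0, 1].
    No pixel is moved, so the sinusoid labels remain unchanged.
    """
    gain = rng.uniform(*gain_range)
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    augmented = []
    for row in patch:
        new_row = []
        for value in row:
            # Stretch or compress contrast about the mean, then add texture noise.
            shifted = mean + gain * (value - mean) + rng.gauss(0.0, noise_std)
            new_row.append(min(1.0, max(0.0, shifted)))  # keep within [0, 1]
        augmented.append(new_row)
    return augmented

rng = random.Random(0)  # seeded so the demo is repeatable
patch = [
    [0.2, 0.2, 0.8, 0.2],
    [0.2, 0.8, 0.2, 0.2],
    [0.8, 0.2, 0.2, 0.2],
]
variants = [augment_patch(patch, rng) for _ in range(5)]
print(len(variants), len(variants[0]), len(variants[0][0]))
```

Because the labels are untouched, each seed patch can yield many training variants, which is the kind of expansion the seed-to-augmented counts above reflect.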
The dynamic-versus-static log ablation tells the same story from the data-quality side: training on dynamically normalised images cut class error from 63.45 to 2.536, because the contrast the attention layers need is in the dynamic image, not the static one. Capacity cannot substitute for signal, and a deeper backbone cannot manufacture contrast that was never in the input.
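The difference between the two image types can be sketched as follows, assuming the usual convention that a static image is scaled with one range for the whole logged interval while a dynamic image is rescaled inside a sliding depth window. This sketch uses simple min-max scaling for both; production image-log software typically uses histogram-based normalisation. Function names, the window size, and the resistivity values are hypothetical.

```python
def static_normalise(rows):
    """Min-max scale every pixel with one global range for the interval."""
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat image
    return [[(v - lo) / span for v in row] for row in rows]

def dynamic_normalise(rows, window=3):
    """Min-max scale each depth row using only its neighbouring rows."""
    half = window // 2
    out = []
    for i, row in enumerate(rows):
        local = [v for r in rows[max(0, i - half): i + half + 1] for v in r]
        lo, hi = min(local), max(local)
        span = (hi - lo) or 1.0
        out.append([(v - lo) / span for v in row])
    return out

# A low-contrast interval sitting above one very resistive streak.
rows = [
    [1.0, 1.2],
    [1.1, 1.3],
    [1.0, 1.2],
    [1.1, 1.3],
    [50.0, 100.0],
]

static_row = static_normalise(rows)[1]
dynamic_row = dynamic_normalise(rows)[1]
static_contrast = static_row[1] - static_row[0]
dynamic_contrast = dynamic_row[1] - dynamic_row[0]
print(round(static_contrast, 4), round(dynamic_contrast, 4))
```

In the static version the resistive streak sets the global range and flattens the quiet interval to almost no contrast; the sliding window restores it, which is consistent with the ablation result above.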
What we shipped, and the metrics that justify it
ResNet-10 was not a compromise we settled for; it was the configuration that made the production fracture model viable. On that backbone, the fractures-only model reached 80.84% overall dip-intensity accuracy, ahead of the 77.10% the combined fracture-plus-bedding model managed on fractures and the 72.52% the standalone bedding model reached. At an 8 cm depth tolerance — the window inside which an interpreter treats a pick as correct — fracture sensitivity sat near 85%. None of those numbers are reachable from a backbone that has overfit its way to a class error of 26.
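A sensitivity figure at a depth tolerance can be computed along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code behind the numbers above: it matches on depth only, pairs picks greedily instead of optimally, and the function name and sample depths are hypothetical.

```python
def sensitivity_at_tolerance(pred_depths_m, true_depths_m, tol_m=0.08):
    """Fraction of true picks with an unused prediction within tol_m metres."""
    unused = sorted(pred_depths_m)
    hits = 0
    for true_depth in sorted(true_depths_m):
        # Nearest prediction that has not already been claimed.
        nearest = min(unused, key=lambda p: abs(p - true_depth), default=None)
        if nearest is not None and abs(nearest - true_depth) <= tol_m:
            hits += 1
            unused.remove(nearest)  # a prediction can satisfy only one true pick
    return hits / len(true_depths_m) if true_depths_m else 0.0

# Illustrative depths in metres.
true_picks = [1000.00, 1000.50, 1001.20, 1002.00]
pred_picks = [1000.03, 1000.62, 1001.25, 1003.00]

sensitivity = sensitivity_at_tolerance(pred_picks, true_picks)
print(sensitivity)
```

Two of the four true picks have a prediction within 8 cm; the pick at 1000.50 m misses by 12 cm and the pick at 1002.00 m has no nearby prediction at all.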
The engineering lesson generalises well beyond this one reservoir, and it is one we carry into every subsurface model we build for operators across the Middle East and the United States: pick backbone capacity to fit the data manifold, not the task's apparent glamour. A from-scratch model on a few thousand labelled patches is in a different universe from a fine-tuned model on millions, and the universe you are in dictates the depth you can afford. The deeper-is-better heuristic is a fine-tuning heuristic; it does not survive contact with scarce, confidential, train-from-scratch field data. The cheapest, most reliable regulariser available to us was simply choosing fewer layers — and letting the ablation table, not the reflex, make the call.
Before
ResNet-34 — class error 26.759
Deepest backbone tried; Hungarian loss 0.682, parameter loss 0.3029 — every metric the worst in the ablation table
After
ResNet-10 — class error 0.499
Lightest backbone tried; Hungarian loss 0.0151, parameter loss 0.056 — the production choice
~54x lower class error from fewer, not more, layers
Why the smaller backbone won
- On a from-scratch, 14-well corpus, capacity is a liability you manage, not a feature you buy: ResNet-10 reached class error 0.499 while ResNet-18 hit 21.013 and ResNet-34 hit 26.759 — a phase change, not a gradient.
- The deeper backbones had enough capacity to memorise a corpus this small, so they fit the patches they were trained on and failed to generalise; the lighter ResNet-10 could only learn the sinusoid structure shared across wells, which is the part that transfers.
- Adjacent ablations confirmed the mechanism is the data-to-capacity ratio, not a hyperparameter fluke — augmentation moved the same backbone from 100% to 2.618% class error, and 3 to 14 wells moved class error from 93.115 to single digits — so we shipped ResNet-10 and let the lightest model carry the ~85%-sensitivity fracture detector.
References
- Backbone, augmentation, well-count, and static-versus-dynamic ablation figures derived from internal validation on a 14-well Middle East carbonate microresistivity image-log dataset (Table 11 backbone effect, Table 10 augmentation, Table 9 static/dynamic, Table 7 well count); data and code withheld under operator confidentiality.
- Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
- He et al. (2016). Deep Residual Learning for Image Recognition (ResNet). CVPR 2016. https://arxiv.org/abs/1512.03385