There is a reflex in applied deep learning that says: when the model underperforms, scale the backbone. Swap ResNet-18 for ResNet-50, reach for a heavier feature extractor, add capacity. On large datasets that reflex is usually right. On the kind of dataset that real subsurface engagements hand you (fourteen wells, painstakingly interpreted, every additional well a quarter of negotiation and acquisition) it is exactly backwards. During a roughly twenty-month engagement with a mid-sized Middle East national oil company (NOC) operating carbonate reservoirs, we built a Detection Transformer to pick fractures and beddings on borehole resistivity image logs. The backbone ablation produced one of the most counterintuitive results of the whole programme: a tiny, from-scratch ResNet-10 beat every deeper variant on every metric we tracked, and ResNet-34, at roughly three and a half times the depth, was effectively unusable. This piece is about why that happens, what the numbers actually were, and what it tells you about choosing model capacity when data, not compute, is the binding constraint.
The architecture, and where the backbone sits
The fracture model — internally, GeoBFDT — is a DETR-style set-prediction network. A convolutional backbone turns an 800×360 image-log patch into a dense feature map (in our configuration, a [batch, 256, 50, 23] tensor); a transformer encoder–decoder then attends over that feature map and emits a fixed set of object queries, each regressing the depth, dip, and azimuth of one sinusoid. The whole thing trains end-to-end against a Hungarian bipartite matching loss — a focal classification term and an L1 parameter term, class-weighted 5 to 1, with no anchors and no non-maximum suppression.
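The quoted shapes are consistent with a standard ResNet downsampling path tapped after its third residual stage (four stride-2 layers, 256 channels). The sketch below checks that arithmetic with the usual convolution output-size formula. The layer list is our assumption about the layout, not a confirmed description of GeoBFDT, and `conv_out` and `feature_map_hw` are illustrative names.

```python
from typing import Tuple


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed (kernel, stride, padding) for each stride-2 layer of a standard
# ResNet, up to and including the third residual stage.
DOWNSAMPLING = [
    (7, 2, 3),  # stem convolution
    (3, 2, 1),  # max-pool
    (3, 2, 1),  # first convolution of stage 2
    (3, 2, 1),  # first convolution of stage 3
]


def feature_map_hw(height: int, width: int) -> Tuple[int, int]:
    """Spatial size of the feature map for a given input patch size."""
    for kernel, stride, padding in DOWNSAMPLING:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


print(feature_map_hw(800, 360))  # (50, 23)
```

Under these assumptions an 800×360 patch maps to a 50×23 grid, which matches the [batch, 256, 50, 23] tensor quoted above.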
The backbone is the front door. It is the only component whose job is to extract visual features from raw imagery, and it is where much of a vision model's parameter count typically lives. Everything downstream (the encoder, the decoder, the matching loss) operates on whatever representation the backbone produces. So the backbone is exactly the place where the capacity-versus-data tension bites hardest. Pick it too small and you bottleneck the features. Pick it too large and, on a small dataset, it memorises.
One detail matters before the numbers: the backbone was trained from scratch, with no pretrained weights. ImageNet pretraining — the usual escape hatch that lets a big backbone behave on small data — does not transfer cleanly to greyscale borehole imagery, where the relevant texture is a sinusoid traced across an unrolled cylinder, not a cat. So every parameter in the feature extractor had to be learned from those fourteen wells. That makes the capacity choice unforgiving, and it makes the ablation honest.
The ablation table
We swept four backbones under otherwise identical training conditions — same matching loss, same AdamW optimiser, same augmentation, same data — and read off the Hungarian loss, the L1 parameter loss, and the classification error (defined simply as 100 − class accuracy). The result is not subtle.
| Backbone | Hungarian loss | Parameter (L1) loss | Class error (%) |
|---|---|---|---|
| ResNet-10 | 0.0151 | 0.056 | 0.499 |
| ResNet-14 | 0.071 | 0.129 | 0.799 |
| ResNet-18 | 0.053 | 0.156 | 21.013 |
| ResNet-34 | 0.682 | 0.303 | 26.759 |
Read the classification-error column from top to bottom: 0.50, 0.80, 21.01, 26.76. The smallest backbone is the best on every one of the three metrics, and both class error and L1 parameter loss rise monotonically with depth: more layers, worse performance. (The Hungarian loss is the one exception to strict monotonicity: it dips from 0.071 at ResNet-14 to 0.053 at ResNet-18 before jumping to 0.682 at ResNet-34.) The gap between the lightest and the heaviest backbone is more than fifty-fold on classification error. A gap of that size is hard to attribute to measurement noise or a seed effect. It is the signature of a model that has run out of data long before it has run out of capacity.
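The arithmetic behind these claims can be checked directly from the table. The snippet below is a small sketch that re-enters the four rows by hand, then reports the lightest-to-heaviest class-error ratio, which backbone wins each metric, and whether each metric strictly increases with depth. The dictionary keys are our own labels.

```python
# Values copied from the ablation table above.
TABLE = {
    "ResNet-10": {"hungarian": 0.0151, "l1": 0.056, "class_error": 0.499},
    "ResNet-14": {"hungarian": 0.071, "l1": 0.129, "class_error": 0.799},
    "ResNet-18": {"hungarian": 0.053, "l1": 0.156, "class_error": 21.013},
    "ResNet-34": {"hungarian": 0.682, "l1": 0.303, "class_error": 26.759},
}
ORDER = ["ResNet-10", "ResNet-14", "ResNet-18", "ResNet-34"]  # shallow to deep


def column(metric):
    """One metric, ordered from the shallowest to the deepest backbone."""
    return [TABLE[name][metric] for name in ORDER]


def strictly_increasing(values):
    return all(a < b for a, b in zip(values, values[1:]))


def best_backbone(metric):
    return min(ORDER, key=lambda name: TABLE[name][metric])


ratio = TABLE["ResNet-34"]["class_error"] / TABLE["ResNet-10"]["class_error"]
print(f"class-error ratio, ResNet-34 vs ResNet-10: {ratio:.1f}x")

for metric in ("hungarian", "l1", "class_error"):
    print(metric, "| best:", best_backbone(metric),
          "| strictly increasing with depth:", strictly_increasing(column(metric)))
```

ResNet-10 wins all three columns. Class error and L1 loss increase strictly with depth, while the Hungarian loss does not, because of its dip at ResNet-18.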
Notice the shape of the degradation. ResNet-10 to ResNet-14 is a gentle, expected drift — a slightly larger network, marginally worse, both excellent. Then between ResNet-14 and ResNet-18 something breaks: classification error jumps from 0.80 to 21.01, a step change of more than 20 points for what is, in architecture terms, a modest increase in depth. That cliff is the diagnostic. A smooth capacity–error curve would suggest you are merely past the optimum; a cliff says the deeper networks have crossed into a regime where they can fit the training set's idiosyncrasies faster than the matching loss can teach them the geology. Once a backbone has enough parameters to memorise fourteen wells, the regulariser of last resort — more data — is the one thing we did not have.
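One simple way to make "cliff versus drift" operational is to look at the steps between adjacent backbones and ask how much of the total degradation sits in the single largest one. This is an illustrative heuristic of ours, not a procedure from the engagement, and `largest_step` is a hypothetical helper name.

```python
# Class error (%) from the ablation table, ordered shallow to deep.
NAMES = ["ResNet-10", "ResNet-14", "ResNet-18", "ResNet-34"]
CLASS_ERROR = [0.499, 0.799, 21.013, 26.759]


def largest_step(names, errors):
    """Return (step size, from-backbone, to-backbone) for the biggest jump."""
    steps = [
        (errors[i + 1] - errors[i], names[i], names[i + 1])
        for i in range(len(errors) - 1)
    ]
    return max(steps)


step, before, after = largest_step(NAMES, CLASS_ERROR)
share = step / (CLASS_ERROR[-1] - CLASS_ERROR[0])

print(f"largest step: {before} -> {after}, +{step:.2f} points")
print(f"share of total degradation in that one step: {share:.0%}")
```

On these numbers the largest step is ResNet-14 to ResNet-18, and it carries roughly three quarters of the total rise in class error. A smooth drift would spread the rise more evenly across the steps.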
Why deeper overfits harder here
The mechanism is worth being precise about, because "it overfits" is a slogan, not an explanation.
A from-scratch ResNet-34 has on the order of tens of millions of parameters in its feature extractor. The labelled signal available to constrain those parameters is the set of interpreted sinusoids across fourteen vertical wells — a few thousand patches after augmentation, with the no-object class dominating most of them. The ratio of free parameters to independent supervisory signal is enormous. With that ratio, a deep network has more than enough capacity to find a low-training-loss configuration that keys on well-specific artefacts — a particular tool's response, a particular operator's interpretation habit, the noise floor of one logging run — rather than on the depth-dip-azimuth structure that generalises. The classification head, downstream of those features, then inherits a representation that separates the training wells beautifully and the validation patches not at all.
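To put rough numbers on that ratio, the sketch below counts convolutional weights for the standard basic-block ResNet layouts. It makes several assumptions: the usual 64/128/256/512 stage widths, a three-channel stem (a greyscale stem would be slightly smaller), and no truncation of the later stages. Batch-norm parameters are ignored. ResNet-14 is omitted because its block layout is not standardised, and `HYPOTHETICAL_PATCHES` is a placeholder for "a few thousand", not a figure from the engagement.

```python
from typing import List

STAGE_WIDTHS = [64, 128, 256, 512]  # assumed standard ResNet stage widths


def basic_block_params(c_in: int, c_out: int, stride: int) -> int:
    """Conv weights in one basic block: two 3x3 convs plus optional 1x1 shortcut."""
    params = 9 * c_in * c_out + 9 * c_out * c_out
    if stride != 1 or c_in != c_out:
        params += c_in * c_out  # 1x1 projection on the shortcut path
    return params


def resnet_conv_params(blocks_per_stage: List[int], in_channels: int = 3) -> int:
    """Total convolutional weights (no batch-norm, no classifier head)."""
    total = 7 * 7 * in_channels * 64  # stem convolution
    c_in = 64
    for stage, (width, n_blocks) in enumerate(zip(STAGE_WIDTHS, blocks_per_stage)):
        for block in range(n_blocks):
            stride = 2 if (stage > 0 and block == 0) else 1
            total += basic_block_params(c_in, width, stride)
            c_in = width
    return total


LAYOUTS = {
    "ResNet-10": [1, 1, 1, 1],
    "ResNet-18": [2, 2, 2, 2],
    "ResNet-34": [3, 4, 6, 3],
}
HYPOTHETICAL_PATCHES = 3000  # placeholder for "a few thousand patches"

for name, layout in LAYOUTS.items():
    n = resnet_conv_params(layout)
    print(f"{name}: {n / 1e6:.1f}M conv weights, "
          f"~{n // HYPOTHETICAL_PATCHES} per patch")
```

Under these assumptions ResNet-34 carries about 21M convolutional weights against about 5M for ResNet-10, so each labelled patch has to constrain several thousand parameters in the deeper network.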
A ResNet-10 cannot do this. It has far less capacity to memorise, so the lowest-loss configuration it can reach is, of necessity, one that exploits the genuinely repeatable structure: the sinusoid. The constraint is the feature. This is the small-data version of a very old lesson: when supervision is scarce, the inductive bias of a smaller hypothesis class is not a limitation, it is the regulariser. We saw the same pattern from the other direction in the well-count sweep, where classification error fell from 93.1% at 3 wells to 1.06% at 9 wells and sat at roughly 2.5% across the full set. That is the same capacity-versus-data axis viewed by adding data instead of subtracting parameters. Both knobs move you along the same curve; on a fixed fourteen-well budget, the only knob we controlled was capacity.
This is an MLOps decision, not a hyperparameter
It would be easy to file "use ResNet-10" as a tuning footnote. It is not. The backbone choice propagated through the entire production stack of the engagement, and treating it as a first-class, ablation-governed decision is what kept the model deployable.
A ResNet-10 trained from scratch is dramatically cheaper to train and to retrain — and retraining cadence matters when an operator drips new interpreted wells in over months. It fits comfortably on the modest on-prem GPU stack the engagement ran on, leaving headroom for the augmentation pipeline and the hyperparameter search (the same search that settled AdamW over Adam and SGD, a 0.0004 learning rate, and a 1024-dimensional feed-forward width). And a smaller backbone with a saner generalisation gap is a smaller drift surface in production: fewer parameters keyed on well-specific quirks means fewer ways for the model to silently degrade when the next field's imagery looks slightly different. The architecture decision, the training-engineering decision, and the MLOps decision are the same decision, and the ablation table is what makes it defensible to a reviewer, a partner, or an operator's own data-science team.
The broader point — the one we carry into every small-data subsurface engagement, across operators in the Middle East and the United States — is procedural. You do not reason your way to the right backbone from first principles and a parameter count. You sweep it, under controlled conditions, and you let the classification-error column tell you where the cliff is. The reflex to scale up is so strong that the only reliable defence against it is an ablation you actually ran.
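A controlled sweep is easiest to keep honest when everything except the backbone is frozen in one place. The sketch below shows one way to structure that, assuming a training entry point you supply yourself. `TrainConfig`, `run_sweep`, and `replay_from_table` are hypothetical names, and the stand-in trainer simply replays the class errors reported above rather than training anything.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TrainConfig:
    """Shared training conditions; only `backbone` varies across the sweep."""
    backbone: str
    optimizer: str = "AdamW"      # from the hyperparameter search in the text
    learning_rate: float = 4e-4   # from the text
    ffn_dim: int = 1024           # from the text
    pretrained: bool = False      # backbones were trained from scratch
    seed: int = 0                 # hypothetical; the source does not state seeds


def run_sweep(
    base: TrainConfig,
    backbones: List[str],
    train_and_eval: Callable[[TrainConfig], float],
) -> Dict[str, float]:
    """Train each backbone under identical conditions; return class error per backbone."""
    results = {}
    for name in backbones:
        cfg = replace(base, backbone=name)  # change the backbone and nothing else
        results[name] = train_and_eval(cfg)
    return results


# Stand-in for a real trainer: replays the class errors reported in the table.
SOURCE_CLASS_ERROR = {
    "ResNet-10": 0.499, "ResNet-14": 0.799,
    "ResNet-18": 21.013, "ResNet-34": 26.759,
}


def replay_from_table(cfg: TrainConfig) -> float:
    return SOURCE_CLASS_ERROR[cfg.backbone]


base = TrainConfig(backbone="ResNet-10")
results = run_sweep(base, list(SOURCE_CLASS_ERROR), replay_from_table)
print(results)
```

Freezing the dataclass means a stray in-place change to the learning rate or optimiser between runs raises an error instead of silently confounding the comparison.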
Takeaways for the practitioner
If you are building a vision model on a dataset measured in tens of wells, or a few thousand patches, rather than tens of thousands of independent examples, invert your priors about capacity. The question is not "how much backbone can I afford to compute" but "how little backbone can I get away with before features bottleneck," and the answer is almost always smaller than your instinct. Run the sweep from genuinely small upward (ResNet-10 was the smallest backbone we swept, and nothing deeper beat it), watch for the cliff where class error steps rather than drifts, and treat the lightest backbone that clears your metric as the production choice, not a compromise. On this engagement, that discipline was the difference between a 0.50 and a 26.76 classification error: the same data, the same loss, the same pipeline, and a fifty-fold swing decided entirely by how many layers we did not use.
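The selection rule in that paragraph, "the lightest backbone that clears your metric", is simple enough to write down. The sketch below applies it to the class errors reported above. The 1.0% acceptance threshold is a hypothetical value chosen for illustration, and `lightest_passing` is our own helper name.

```python
from typing import List, Optional, Tuple

# (backbone, depth in layers, class error %) from the ablation table.
SWEEP: List[Tuple[str, int, float]] = [
    ("ResNet-34", 34, 26.759),
    ("ResNet-18", 18, 21.013),
    ("ResNet-14", 14, 0.799),
    ("ResNet-10", 10, 0.499),
]


def lightest_passing(
    sweep: List[Tuple[str, int, float]], max_class_error: float
) -> Optional[str]:
    """Return the shallowest backbone whose class error clears the threshold."""
    for name, _depth, error in sorted(sweep, key=lambda row: row[1]):
        if error <= max_class_error:
            return name
    return None  # nothing clears the bar: revisit the data, not only the depth


HYPOTHETICAL_THRESHOLD = 1.0  # illustrative acceptance bar, in percent
choice = lightest_passing(SWEEP, HYPOTHETICAL_THRESHOLD)
print(choice)  # ResNet-10
```

Sorting by depth before checking the threshold is the point of the rule: a deeper backbone is only considered if every lighter one has already failed.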
Key takeaways
- On a 14-well geoscience dataset, a from-scratch ResNet-10 backbone posted a 0.499 classification error against 26.759 for ResNet-34, a 50x+ gap that is structural, not noise. Class error rose monotonically with depth: 0.50 → 0.80 → 21.01 → 26.76 for ResNet-10/14/18/34.
- The degradation is a cliff, not a drift: class error stepped from 0.80 (ResNet-14) to 21.01 (ResNet-18), the signature of deeper networks crossing into a regime where they memorise the training wells before the matching loss can teach the geology.
- With no usable ImageNet transfer for greyscale borehole imagery, every backbone trained from scratch — so a ResNet-34's tens of millions of free parameters had only a few thousand patches to constrain them. A ResNet-10 lacks the capacity to memorise, which forces it onto the generalising sinusoid structure. Scarcity makes the smaller hypothesis class the regulariser.
- Capacity and data are the same axis: subtracting parameters (ResNet-34→10) and adding wells (3→14, 93.1%→~2.5% class error) move you along one curve. On a fixed well budget, capacity is the only knob you control.
- Backbone choice is an architecture, training-engineering, and MLOps decision at once — cheaper retrains, fits modest on-prem GPUs, smaller production drift surface. Don't reason your way to a backbone from a parameter count; sweep it under controlled conditions and let the class-error column locate the cliff.