Why ImageNet Weights Are Useless Underground: Training a Detector From Scratch on Borehole Textures

Open almost any applied computer-vision repo and the first line of the model does the same thing: it loads a backbone with pretrained=True. That backbone is a ResNet or a ViT that has already seen over a million natural photographs, its early layers tuned to edges and corners and its later layers to fur, wheels and faces. You fine-tune from there. On natural-image tasks this is close to free performance, and skipping it looks like negligence. So when we built a detector for planar features on borehole resistivity image logs, the instinct across the team was to start from those weights and adapt.

We did the opposite. Every parameter in the feature extractor was learned from scratch, from the client's own logs, with no pretrained weights anywhere in the network. This piece is about why that was the correct call for out-of-domain scientific imagery, and how the backbone ablation turned into the cleanest evidence we had that natural-image features simply do not describe our data.

The engagement was a roughly twenty-month programme with a mid-sized Middle East carbonate operator, building a Detection Transformer to pick fractures and beddings on borehole-image logs. The detector's design and the small-data ablation itself are covered in "Why a Smaller Backbone Won"; this piece is narrower: the decision to carry no pretrained weights at all, and what that does to how you choose capacity.

What a pretrained backbone actually assumes

A pretrained backbone is not neutral machinery. It is a strong prior about what the world looks like. ImageNet-style pretraining bakes in an assumption that useful low-level structure is the statistics of natural photographs: three colour channels, illumination gradients, object-scale textures, the particular distribution of edges you get from lenses pointed at rooms and animals and streets. Fine-tuning works because it keeps most of that prior and nudges the top.

Now look at the input we actually feed the model. A borehole-image patch is a single-channel grayscale raster, 800 by 360 pixels, cut from a well wall that runs roughly 690,000 pixels tall. The only structure that matters in it is a sinusoid: a planar geological feature intersecting a cylinder and unrolling into one sine cycle across the image, its amplitude encoding dip and its phase encoding azimuth. No colour, no illumination model, no object at natural-image scale. The texture a fracture leaves on micro-resistivity imagery has essentially nothing in common with the texture a cat leaves on a photograph, and the prior a pretrained backbone carries is built entirely from the latter.
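To make the geometry concrete, here is a minimal sketch of the sine trace a planar feature leaves on an unrolled borehole image. It assumes the 360-pixel side of the patch is the azimuth axis, and the pixel size, borehole diameter and phase convention are hypothetical values chosen for illustration, not parameters from the engagement. The dip relation used is the standard one for a plane cutting a cylinder: the tangent of the dip (relative to the borehole axis) equals the peak-to-trough height divided by the borehole diameter.

```python
import math


def sinusoid_trace(amplitude_px, phase_deg, center_row, width=360):
    """Row position of a planar feature at each column of an unrolled image.

    Assumes (hypothetically) that the `width` columns cover one full turn
    of the borehole wall, so the trace completes exactly one sine cycle.
    """
    phase = math.radians(phase_deg)
    return [
        center_row + amplitude_px * math.sin(2.0 * math.pi * col / width - phase)
        for col in range(width)
    ]


def apparent_dip_deg(amplitude_px, metres_per_pixel, borehole_diameter_m):
    """Dip relative to the borehole axis: tan(dip) = peak-to-trough / diameter."""
    peak_to_trough_m = 2.0 * amplitude_px * metres_per_pixel
    return math.degrees(math.atan2(peak_to_trough_m, borehole_diameter_m))


# Illustrative values only: 2.5 mm pixels and a 0.2 m borehole are assumptions.
trace = sinusoid_trace(amplitude_px=40.0, phase_deg=90.0, center_row=400.0)
dip = apparent_dip_deg(amplitude_px=40.0, metres_per_pixel=0.0025,
                       borehole_diameter_m=0.2)
print(f"trace spans rows {min(trace):.1f} to {max(trace):.1f}, dip = {dip:.1f} deg")
```

How the phase maps to a compass azimuth depends on the tool's reference orientation, which the sketch deliberately leaves unspecified.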

When we stated this to a reviewer during the paper's first revision, we put it plainly: open-source weights have learned to identify natural features that are not applicable to our dataset. That is not a hedge. It is the whole reason the usual escape hatch was closed to us. The prior does not just fail to help; on a dataset this small it actively costs you, because you spend capacity un-learning features you never wanted.

The feature map is the tell

There is a concrete way to see how far this domain sits from a natural image. In our configuration the backbone collapses that 800 by 360 patch into a feature map of shape [Batch, 256, 50, 23]: 256 channels over a 50-by-23 spatial grid. When a reviewer asked us to visualise the transformer's attention as a heatmap over the input log, we had to decline, and the reason is instructive. Self-attention operates on that feature grid, which does not map cleanly back onto the 800-by-360 pixels of the log. The representation the network learns is its own coordinate system, wrung out of resistivity texture, not a downsampled photograph you can overlay on the rock.
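The 50-by-23 grid is what four stride-2 reductions of an 800-by-360 input produce, which is one way to see why the attention grid does not overlay the log pixel for pixel. The sketch below applies the standard convolution output-size formula; the particular stage layout (a 7x7 stem, a 3x3 pooling step and two further stride-2 convolutions) is an assumption for illustration, since the exact backbone layout is not given here.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_grid(height, width, stages):
    """Apply a sequence of (kernel, stride, padding) reductions to both axes."""
    for kernel, stride, padding in stages:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


# Hypothetical layout: any four stride-2 stages with "same"-style padding
# give the same grid; this one mirrors a common ResNet stem.
ASSUMED_STAGES = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

grid = feature_grid(800, 360, ASSUMED_STAGES)
tokens = grid[0] * grid[1]
print(grid, tokens)  # (50, 23) 1150
```

Under this assumption each grid cell summarises roughly a 16-by-16 pixel neighbourhood, and the 360-pixel axis does not divide evenly (360 / 16 = 22.5), so the last column of cells extends past the image edge.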

That mismatch is why a borrowed prior is worthless here. The features a pretrained ResNet would hand you are organised around a visual world the sinusoid does not live in. Better to let the 256 channels learn what a sine trace on an unrolled cylinder looks like directly, from the only fourteen wells we had, than to spend the first half of training arguing the network out of its ImageNet habits.

Then the deepest net should not be the worst, but it was

If natural-image pretraining genuinely does not transfer, a specific and testable consequence follows. Normally a heavier backbone earns its keep on small data precisely because pretraining regularises it: the big net starts from a good place and does not have to memorise. Take pretraining away and that safety net is gone. A deeper from-scratch extractor now has more free parameters to fit, the same fourteen wells to fit them on, and nothing holding it to sensible features. It should overfit harder as it gets deeper. The prediction is blunt: on this data, past some small size, deeper should mean worse, not better.

We swept four ResNet backbones under otherwise identical conditions (same matching loss, same AdamW optimiser, same augmentation, same data), and the prediction held.

Figure: The backbone ablation read as a transfer argument. Four residual feature extractors were swept under identical training (same matching loss, optimiser, augmentation and data) and ranked by depth, left to right, against classification error on a log axis, because the sourced errors span 0.499 to 26.759 and a linear axis would crush the two lightest nets into one line. Every weight was learned from scratch: there are no pretrained weights, because features learned on natural images do not transfer to micro-resistivity borehole textures, where the signal is a sinusoid unrolled from a roughly 690,000-pixel-tall grayscale cylinder and each patch collapses to a [Batch, 256, 50, 23] feature map. With no pretrained backbone to regularise a big net, added depth only buys room to overfit the tiny out-of-domain set, so error climbs monotonically with depth, jumps sharply past ResNet-14, and ends with ResNet-34 more than an order of magnitude above the ResNet-10 floor. The orange marker highlights the from-scratch ResNet-10 rung; the figure's read-out states how many times worse the selected heavier net is. ResNet-10 uses basic rather than bottleneck blocks. The class errors and the paired Hungarian and parameter losses are sourced from the engagement archive; nothing here is illustrative.

Read the class-error column. A from-scratch ResNet-10 posted 0.499. ResNet-14 drifted to 0.799. Then the floor gives way: ResNet-18 jumps to 21.013 and ResNet-34 to 26.759, roughly 42 and 54 times the error of the smallest net, on the same fourteen wells. The Hungarian and parameter losses move the same direction, with ResNet-34 worst on every column. This is not the ranking you get when pretraining does its usual job. It is the ranking you get when there is no prior at all and capacity is a liability. If ImageNet features transferred, the deep nets would be fine, and they are not remotely fine.
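The size of the gap is easier to judge as a ratio against the ResNet-10 floor. The short calculation below uses only the four class errors quoted above; the variable names are ours.

```python
# Class errors from the backbone ablation, as quoted in the text.
CLASS_ERROR = {
    "ResNet-10": 0.499,
    "ResNet-14": 0.799,
    "ResNet-18": 21.013,
    "ResNet-34": 26.759,
}

baseline = CLASS_ERROR["ResNet-10"]

# How many times worse each backbone is than the smallest one.
ratios = {name: error / baseline for name, error in CLASS_ERROR.items()}

for name, ratio in ratios.items():
    print(f"{name}: error {CLASS_ERROR[name]:>6.3f} = {ratio:.1f}x ResNet-10")
```

ResNet-14 comes out at about 1.6 times the floor, ResNet-18 at about 42 times and ResNet-34 at about 54 times, which is why the figure needs a log axis.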

One detail sharpens it. The winning ResNet-10 uses basic residual blocks, the leaner of the two standard building blocks, rather than bottleneck blocks, chosen to keep the extractor as small as the problem allows. We did not stop there and try even simpler hand-rolled CNNs, on the argument that a basic-block ResNet-10 is already close to the minimal sensible convolutional extractor for this task. The sweep was never meant to crown the deepest net that still works. It was meant to find how little feature-extraction capacity the data could actually support before the model started memorising it.
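For readers unfamiliar with the naming, the number in a basic-block ResNet follows the counting convention of He et al. [1]: one stem convolution, two 3x3 convolutions per basic block, and one final fully connected layer. The sketch below applies that convention. The ResNet-18 and ResNet-34 layouts are the ones published in [1]; the ResNet-10 and ResNet-14 layouts are common community conventions and are assumptions here, not a statement of the layouts used in the engagement.

```python
def basic_block_depth(blocks_per_stage):
    """Nominal depth: stem conv + 2 convs per basic block + final FC layer."""
    convs_per_basic_block = 2
    return 1 + convs_per_basic_block * sum(blocks_per_stage) + 1


# Blocks in each of the four residual stages.
LAYOUTS = {
    "ResNet-10": [1, 1, 1, 1],  # assumed: one basic block per stage
    "ResNet-14": [2, 2, 1, 1],  # hypothetical: one of several layouts in use
    "ResNet-18": [2, 2, 2, 2],  # He et al. [1]
    "ResNet-34": [3, 4, 6, 3],  # He et al. [1]
}

depths = {name: basic_block_depth(layout) for name, layout in LAYOUTS.items()}
print(depths)
```

Under these layouts ResNet-10 has four residual blocks against sixteen in ResNet-34, so the deepest net in the sweep stacks four times as many blocks on the same fourteen wells.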

What transfer learning means when there is nothing to transfer from

The phrase "transfer learning" quietly assumes a source domain close enough to your target that its features carry over. For most vision teams that assumption is invisible because it is almost always satisfied: the target is another set of natural images. Out-of-domain scientific imagery is where it breaks, and borehole logs break it about as cleanly as anything: grayscale, single-channel, texture defined by a physics of resistivity and geometry that no natural-image corpus contains.

When there is nothing to transfer from, the useful reframing is that your capacity budget is set by your labels, not by your compute. We had fourteen wells, each costing roughly a quarter of a year in negotiation and acquisition, and no realistic path to more. That number, not the size of the GPU, decided how large a ResNet backbone we could train, and the honest ceiling was small. The lesson is not "small models are better" as a slogan. It is narrower: absent a transferable prior, every parameter has to be paid for out of your labelled data, and on a scarce out-of-domain dataset a light from-scratch extractor beats a heavy one because it has less room to overfit and no borrowed features fighting the ones you need.

Limitations

This is one dataset and one task: fourteen vertical wells of carbonate borehole imagery, a set-prediction detector regressing sinusoid parameters. The from-scratch verdict follows from those specifics (single-channel resistivity texture, extreme label scarcity, a target with no natural-image analogue) and is not a blanket case against pretraining. On larger out-of-domain corpora, or where self-supervised pretraining on in-domain data is feasible, a pretrained-then-fine-tuned backbone can and often does win; masked-reconstruction pretraining on the logs themselves was an avenue we noted rather than exhausted. The sweep covered four ResNet depths under a fixed training recipe; it does not rule out that a different heavy architecture, or heavier regularisation, could narrow the gap. The class errors and paired losses are the sourced ablation values from the engagement; the transfer-failure reading is our interpretation of why they fall the way they do, not an independently measured quantity.

References

[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016. Introduces ResNet and the basic-versus-bottleneck residual block distinction referenced above. https://arxiv.org/abs/1512.03385

[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-End Object Detection with Transformers. ECCV 2020. The Detection Transformer whose set-prediction design the detector adapts. https://arxiv.org/abs/2005.12872
