How Many Wells Is Enough? A Well-Count Ablation for Fracture Detection

The first question any operator asks before funding a borehole-AI programme is the one the literature almost never answers: how much labelled data do we actually need? Buy too little and the model never leaves the lab; demand too much and the labelling bill kills the business case before a single fracture is detected. For a mid-sized Middle East carbonate operator we partnered with, we answered it empirically, by training the same Detection Transformer on a deliberately growing pool of wells and watching exactly where its error stopped falling. The answer was sharp enough to plan a data-acquisition budget around: classification error collapses from about 93% to roughly 1% between 3 and 9 wells, bottoms out at about 0.8% at 11 wells, and then, counter-intuitively, climbs again to about 2.5% at 14. The lever is not raw image count. It is the geological diversity of consistently-picked wells.

Why a well-count ablation, not a learning curve

The model under test is GeoBFDT, a DETR-derived set-prediction network: a ResNet-10 backbone, a 4-layer transformer encoder and 4-layer decoder, and a fixed bank of learned queries that each emit a class (fracture, bedding, or no-object) plus a depth/dip/azimuth triplet. Training matches predictions to ground-truth sinusoids one-to-one with the Hungarian algorithm, then minimises a combined Focal classification loss (weight 5) and L1 regression loss (weight 1) against that pairing.
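The matching step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GeoBFDT's code: it uses an exhaustive search over assignments (which returns the same optimum as the Hungarian algorithm on a tiny set), assumes the 5:1 Focal/L1 weighting is also used as the matching cost, assumes a focal exponent of 2, and assumes parameters normalised to [0, 1]. All function and field names are hypothetical.

```python
import math
from itertools import permutations

W_CLS, W_L1 = 5.0, 1.0  # loss weights quoted in the text


def focal_term(p, gamma=2.0):
    """Focal penalty for giving probability p to the true class (gamma is assumed)."""
    p = min(max(p, 1e-9), 1.0)
    return -((1.0 - p) ** gamma) * math.log(p)


def l1(a, b):
    """Mean absolute difference between two (depth, dip, azimuth) triplets."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pair_cost(pred, truth):
    """Weighted cost of pairing one query output with one labelled sinusoid."""
    cls_cost = focal_term(pred["probs"][truth["cls"]])
    return W_CLS * cls_cost + W_L1 * l1(pred["params"], truth["params"])


def match(preds, truths):
    """Exhaustive one-to-one assignment; needs len(preds) >= len(truths).

    Returns (total_cost, [(pred_index, truth_index), ...]).
    """
    best_cost, best_pairs = math.inf, []
    for perm in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[p], truths[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, [(p, t) for t, p in enumerate(perm)]
    return best_cost, best_pairs


# Toy example: three queries, two labelled sinusoids (normalised parameters).
preds = [
    {"probs": {"fracture": 0.10, "bedding": 0.80, "none": 0.10}, "params": (0.52, 0.30, 0.70)},
    {"probs": {"fracture": 0.90, "bedding": 0.05, "none": 0.05}, "params": (0.21, 0.60, 0.10)},
    {"probs": {"fracture": 0.05, "bedding": 0.05, "none": 0.90}, "params": (0.90, 0.10, 0.50)},
]
truths = [
    {"cls": "fracture", "params": (0.20, 0.62, 0.12)},
    {"cls": "bedding", "params": (0.50, 0.31, 0.72)},
]
cost, pairs = match(preds, truths)
print(sorted(pairs))  # query 0 pairs with the bedding, query 1 with the fracture
```

The third query is left unmatched, which is the role the no-object class plays in training. At production scale the exhaustive search would be replaced by a proper Hungarian solver.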

The naive way to study data sufficiency is a learning curve over number of image patches. That is the wrong axis for borehole geology, and it is worth being precise about why. Patches from one well are not independent samples — they share the same tool string, the same mud system, the same diagenetic history, the same interpreter's picking conventions. Adding ten thousand patches from a single well teaches the model that well's idiosyncrasies, not the population of fractures it will face in production. The unit of generalisation is the well, not the patch. So we ablated on well count: train on 3, then 6, 9, 11, and finally all 14 vertical wells logged with two different microresistivity imaging tools, holding architecture, augmentation, and optimiser fixed, and read three quantities off each run — the Hungarian matching loss, the L1 parameter loss, and the percentage classification error.
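A small sketch of what ablating on wells rather than patches looks like in practice. It assumes each patch record carries a well identifier and that the pools are nested, so each larger run contains every well from the smaller ones. The well names, patch counts, and helper names are all hypothetical.

```python
from collections import defaultdict

WELL_COUNTS = (3, 6, 9, 11, 14)  # run sizes used in the ablation


def group_by_well(patches):
    """Index patches by the well they were cut from."""
    by_well = defaultdict(list)
    for patch in patches:
        by_well[patch["well"]].append(patch)
    return dict(by_well)


def nested_pools(well_order, counts=WELL_COUNTS):
    """Map each run size to the first n wells, so pools grow by whole wells."""
    return {n: list(well_order[:n]) for n in counts if n <= len(well_order)}


def patches_for(pool, by_well):
    """All patches belonging to the wells in a pool."""
    return [patch for well in pool for patch in by_well[well]]


# Toy corpus: 14 hypothetical wells with uneven patch counts.
wells = [f"W{i:02d}" for i in range(1, 15)]
patches = [
    {"well": well, "patch_id": k}
    for i, well in enumerate(wells)
    for k in range(10 + 5 * i)
]
by_well = group_by_well(patches)
pools = nested_pools(wells)
for n, pool in pools.items():
    print(n, "wells ->", len(patches_for(pool, by_well)), "patches")
```

Patch count becomes a consequence of which wells are in the pool, never the quantity being varied.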

What each metric is telling you

The Hungarian loss measures how cleanly the predicted set lines up with the true set of sinusoids. The L1 parameter loss measures geometry — how far off the depth, dip, and azimuth regressions are. Classification error is the blunt instrument: the share of sinusoids assigned the wrong class. Read together, they separate a model that cannot find sinusoids from one that finds them but mis-measures them.
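As a rough illustration of the last two metrics, the sketch below computes a percentage class error and a mean L1 parameter error over already-matched pairs. The input layout and function names are assumptions made for the example, and the parameters are assumed to be normalised.

```python
def class_error_pct(matched_classes):
    """Percentage of matched sinusoids whose predicted class differs from the label."""
    if not matched_classes:
        raise ValueError("no matched pairs to score")
    wrong = sum(1 for predicted, actual in matched_classes if predicted != actual)
    return 100.0 * wrong / len(matched_classes)


def mean_l1(matched_params):
    """Mean absolute error over every depth/dip/azimuth component of every pair."""
    errors = [
        abs(p - t)
        for predicted, actual in matched_params
        for p, t in zip(predicted, actual)
    ]
    return sum(errors) / len(errors)


# Toy example: four matched pairs, one of them given the wrong class.
matched_classes = [
    ("fracture", "fracture"),
    ("bedding", "bedding"),
    ("fracture", "bedding"),
    ("bedding", "bedding"),
]
# Two matched pairs; the second is off by 0.25 on the first component only.
matched_params = [
    ((0.50, 0.25, 0.75), (0.50, 0.25, 0.75)),
    ((0.25, 0.50, 0.50), (0.50, 0.50, 0.50)),
]
print(class_error_pct(matched_classes))  # 25.0
print(mean_l1(matched_params))
```

A run can score well on one of these and badly on the other, which is why the ablation reads them side by side.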

The data wall, measured

Here is the full ablation. Every figure is from internal validation on the operator's data; none is invented.

Wells	Hungarian loss	L1 param loss	Class error
3	0.801	0.291	93.115%
6	0.054	0.210	18.370%
9	0.032	0.110	1.055%
11	0.025	0.083	0.817%
14	0.015	0.059	2.536%

Three wells is not a small model; it is a broken one. At 93.115% class error the network is effectively guessing; the Hungarian loss of 0.801 says it cannot even form a coherent set of detections to match against. The jump to 6 wells is the single largest gain in the whole study: class error falls more than fivefold to 18.370%. By 9 wells the model crosses into usable territory, with class error at 1.055%, and at 11 wells it reaches its floor of 0.817%. The matching and geometry losses fall monotonically throughout: Hungarian loss tracks 0.801 → 0.054 → 0.032 → 0.025 → 0.015, and L1 tracks 0.291 → 0.210 → 0.110 → 0.083 → 0.059. In other words, finding and measuring sinusoids keeps improving with every well. Classification is where the data wall lives.
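The arithmetic behind that reading can be checked directly from the table. The snippet below is a small sketch: the numbers are the table's own, while the helper names are introduced here for illustration.

```python
# wells: (Hungarian loss, L1 param loss, class error %), copied from the table.
ABLATION = {
    3: (0.801, 0.291, 93.115),
    6: (0.054, 0.210, 18.370),
    9: (0.032, 0.110, 1.055),
    11: (0.025, 0.083, 0.817),
    14: (0.015, 0.059, 2.536),
}


def fold_reduction(wells_from, wells_to):
    """How many times smaller class error is at wells_to than at wells_from."""
    return ABLATION[wells_from][2] / ABLATION[wells_to][2]


def strictly_decreasing(values):
    """True when every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


well_counts = sorted(ABLATION)
floor_wells = min(well_counts, key=lambda n: ABLATION[n][2])

print(f"3 -> 6 wells: class error falls {fold_reduction(3, 6):.2f}x")
print("class-error floor at", floor_wells, "wells")
for column, name in enumerate(("Hungarian", "L1", "class error")):
    series = [ABLATION[n][column] for n in well_counts]
    print(name, "strictly decreasing:", strictly_decreasing(series))
```

Only the class-error column fails the monotonicity check, which is the data wall expressed as a single boolean.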

GeoBFDT emits the whole (class, depth, dip, azimuth) tuple in one forward pass, but the three axes are not equally hard, as a tolerance ladder for each axis shows. Detection along the depth axis is the binding constraint: at a tight 3 cm window fracture F1 is only ~65% (beddings ~63%), and it clears the useful regime for structural work only once the tolerance loosens to 5 cm (~75% / ~69%); horizontal wells hold ~55% at 4 cm. The geometric axes the interpreter actually fits sinusoids for are already strong at tight tolerance: dip ~90% at 3°, azimuth ~92% (fractures) / ~84% (beddings) at 15°. Depth is the lever you loosen; dip and azimuth are already past the line. The ~70% 'useful regime' threshold is an illustrative reading aid, not an exact F1 cutoff.

Once the model has enough wells to clear that wall, the per-axis picture above is what an interpreter inherits: depth detection is the binding constraint at a tight tolerance, while the dip and azimuth geometry the model regresses in the same forward pass is already strong. The ablation explains how many wells it takes to land there; the ladder shows what you get once you do. The two are the same engagement read from opposite ends.
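To make the depth-tolerance idea concrete, here is a minimal sketch of an F1 score computed at a chosen depth window. It assumes a simple greedy one-to-one pairing of sorted depths, which may differ from the evaluation protocol actually used, and the depths are made-up values in metres.

```python
def f1_at_tolerance(pred_depths, true_depths, tol):
    """F1 when a prediction counts as a hit if it lies within tol of an unused label."""
    preds, truths = sorted(pred_depths), sorted(true_depths)
    if not preds and not truths:
        return 1.0
    i = j = hits = 0
    while i < len(preds) and j < len(truths):
        gap = preds[i] - truths[j]
        if abs(gap) <= tol:
            hits += 1          # pair them and move both pointers on
            i += 1
            j += 1
        elif gap < 0:
            i += 1             # prediction sits above every remaining label
        else:
            j += 1             # label has no prediction close enough
    return 2.0 * hits / (len(preds) + len(truths))


# Toy interval: four labelled sinusoids and four predictions (metres).
true_depths = [10.00, 10.50, 11.20, 12.00]
pred_depths = [10.02, 10.54, 11.20, 12.30]

for tol_cm in (3, 5):
    score = f1_at_tolerance(pred_depths, true_depths, tol_cm / 100.0)
    print(f"{tol_cm} cm window: F1 = {score:.2f}")
```

In this toy case the prediction that is 4 cm off is a miss at 3 cm and a hit at 5 cm, so F1 moves from 0.50 to 0.75 without the model changing at all. That is the sense in which depth tolerance is a lever.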

The number that surprises everyone: 14 is worse than 11

The counter-intuitive result is the rise from 0.817% at 11 wells to 2.536% at 14. More data made classification worse. This is not noise, and it is not overfitting in the usual capacity sense — the backbone and regulariser are unchanged. It is a data-quality effect, and it is the most operationally important finding in the study.

The three wells added between the 11- and 14-well runs were brought in to extend the fractures-only dataset, and they carried less consistent sinusoid picks than the core set. In a set-prediction model trained with Hungarian matching, label inconsistency is uniquely corrosive: the optimal assignment is computed against the labels, so a sinusoid that one interpreter picked and another did not, or picked with a different class convention, doesn't just add a hard example — it actively teaches the matcher a contradictory target. The Focal classification head, weighted five-to-one over the geometry head, absorbs that contradiction as elevated class error while the L1 geometry losses, which depend less on the binary find/no-find decision, keep improving (0.083 → 0.059). That divergence — geometry better, classification worse — is the signature of a labelling-consistency problem, not a capacity one.
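That divergence is simple enough to test for mechanically. The sketch below flags any step between consecutive runs where the L1 loss fell while class error rose. The figures come from the ablation table, and the function name is hypothetical.

```python
# (wells, L1 param loss, class error %) for each run, from the ablation table.
RUNS = [
    (3, 0.291, 93.115),
    (6, 0.210, 18.370),
    (9, 0.110, 1.055),
    (11, 0.083, 0.817),
    (14, 0.059, 2.536),
]


def drift_signature(previous, current):
    """True when geometry improved but classification got worse between two runs."""
    _, prev_l1, prev_err = previous
    _, curr_l1, curr_err = current
    return curr_l1 < prev_l1 and curr_err > prev_err


flagged = [
    (previous[0], current[0])
    for previous, current in zip(RUNS, RUNS[1:])
    if drift_signature(previous, current)
]
print(flagged)  # [(11, 14)]
```

Only the 11-to-14 step is flagged. A check like this does not prove the cause is labelling, but it narrows the search to the wells added in that step.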

Loss-function choice decides whether the network learns curve continuity. VeerNet tested five losses under identical conditions; only the one whose gradient aligns with the IoU/F1 metric (Lovász-Softmax) wins, and the shipped answer is a two-loss schedule, SCE warm-up followed by Lovász fine-tuning, which reaches the same accuracy in half the wall-clock time. The five candidates, their verdicts, the F1 35% / IoU 30% figures, and the two-loss schedule are the whitepaper's own.

The practical reading is blunt: nine to eleven consistently-picked wells generalise better than fourteen unevenly-picked ones. A model is only ever as good as the agreement between its annotators, and a well-count ablation is one of the few instruments that will surface annotation drift quantitatively rather than as a vague suspicion. In our work with this Middle East operator the same pattern held across every retraining we ran — and it is the kind of result that generalises to any image-log programme, in the Gulf or further afield: past a threshold, marginal data helps only if it meets the labelling bar set by the early data.

What the ablation forced us to build

A finding like this is only useful if the pipeline can act on it, which is where the engineering — not the geoscience — earns its keep. The ablation drove four concrete decisions in the production stack.

A small backbone, deliberately. The same ablation discipline that produced the well-count table told us to train a ResNet-10 from scratch rather than fine-tune a deeper pretrained network. On this data regime, heavier backbones overfit hard — a parallel ablation put ResNet-34 at 26.759% class error against ResNet-10's 0.499%. Capacity is not the constraint when wells are; diversity is.

Augmentation as a multiplier, not a substitute. The reservoir interval we started from was savagely sparse — 32 sinusoids across 236 patches, only 19 of which held a sinusoid at all. Geometry-preserving augmentation (ColorJitter, GaussianBlur, Sharpen, Gaussian noise, Emboss, MedianBlur, at roughly ten augmentations per sinusoid-bearing patch) grew the corpus more than tenfold, to 4,212 patches / 2,046 sinusoid-patches / 3,565 sinusoids. Crucially, the well-count ablation proves augmentation cannot stand in for diversity: every run above used the same augmentation, and the curve still demanded real wells. Augmentation multiplies within a well's distribution; it cannot manufacture a new well's geology.

A model-selection split that respects the wall. Because the combined fractures-plus-beddings dataset is consistently picked across only 11 wells (1,492 non-overlapping images: 1,322 with beddings, 754 with fractures) while the fractures-only set spans all 14 (2,291 images, 1,217 with fractures), we ship separate models rather than one. The combined model is trained where the labels agree; the fractures-only model accepts the noisier extra wells only for the task that tolerates them. The ablation is what justifies that split instead of a single monolith.

A label-consistency gate in the MLOps loop. The 11-to-14 reversal turned into a standing CI check: before any new well enters a training pool, its picks are scored for agreement against the existing corpus, and a well that would raise validation class error is quarantined for re-picking rather than merged. In our work with this operator, that gate is the difference between data acquisition that compounds and data acquisition that quietly degrades — exactly the failure the 2.536% number warns about. The incremental evidence backs the floor's flatness, too: moving from 8 to 11 wells improved depth/dip/azimuth by only ~0.007 MAE, confirming the geometry has plateaued and that further gains must come from label quality, not volume.
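The gate logic can be sketched as a small decision function. This is an illustration under assumptions, not the production check: it takes the "would raise validation class error" criterion literally, treats the retrain-and-validate step as an injected callable, and uses a toy stand-in for that callable. The well names are hypothetical; the two error values in the stub are borrowed from the 11- and 14-well rows of the table purely for illustration.

```python
def consistency_gate(pool, candidate, evaluate, margin=0.0):
    """Decide whether a candidate well may join the training pool.

    evaluate(pool) must return validation class error in percent.
    margin is an assumed tolerance for run-to-run noise.
    Returns (decision, resulting_pool).
    """
    baseline = evaluate(pool)
    trial = evaluate(pool + [candidate])
    if trial > baseline + margin:
        return "quarantine", list(pool)   # send back for re-picking
    return "merge", pool + [candidate]


def toy_evaluate(pool):
    """Stand-in for retraining and validating; not a real measurement."""
    return 2.536 if "W_inconsistent" in pool else 0.817


core_pool = [f"W{i:02d}" for i in range(1, 12)]  # 11 consistently-picked wells

decision_bad, pool_bad = consistency_gate(core_pool, "W_inconsistent", toy_evaluate)
decision_ok, pool_ok = consistency_gate(core_pool, "W_consistent", toy_evaluate)
print(decision_bad, len(pool_bad))  # quarantine 11
print(decision_ok, len(pool_ok))    # merge 12
```

Retraining for every candidate is expensive, so a cheaper agreement score against the existing picks could serve as the first filter, with a check like this kept as the final arbiter.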

What this means for an acquisition budget

For a programme owner, the ablation converts an open-ended data ask into a bounded one. You do not need eighty wells to start; you need roughly nine to eleven wells whose sinusoids are picked to a single, enforced convention. Past that, the marginal well only pays off if it clears the same labelling bar — and the cheapest way to keep it there is a consistency gate that refuses the well otherwise. That reframing is what let the operator move from an R&D model to one a validator would accept, and it is the reframing that travels to any field where overlapping geological features must be detected from image logs.

The honest caveats stand. The corpus is 14 vertical wells of a single confidential carbonate play; horizontal wells, where sinusoid height drops sharply, are a different distribution and would need their own ablation. The 14-well reversal is specific to the picks we received — re-pick those three wells to convention and the floor would almost certainly extend rather than reverse. And a well-count ablation tells you when more data stops helping; it cannot tell you which well to acquire next. For that, the geology — and the interpreter — still lead.

What a well-count ablation tells you that a learning curve cannot

The unit of generalisation in borehole AI is the well, not the patch — GeoBFDT's class error falls from 93.115% (3 wells) to 1.055% (9) to a floor of 0.817% (11), so nine to eleven consistently-picked wells, not raw image volume, is the data target an operator should budget for.
More data can make a set-prediction model worse: class error rose to 2.536% at 14 wells because the three added wells carried inconsistent picks, and Hungarian matching is corrosively sensitive to label disagreement — geometry losses kept falling (L1 0.083 → 0.059) while classification regressed, the signature of an annotation problem, not a capacity one.
The ablation drove the production stack: a deliberately small ResNet-10 over heavier backbones, augmentation as a within-well multiplier rather than a substitute for diversity, separate combined-vs-fractures-only models split along where labels agree, and a label-consistency gate in CI that quarantines any new well that would raise validation class error.

References

Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
GeoBFDT well-count ablation (Hungarian loss / L1 parameter loss / classification error across 3, 6, 9, 11, 14 wells), backbone and augmentation ablations, and dataset counts derived from internal validation on a 14-well Middle East carbonate dataset logged with two different microresistivity imaging tools; data and code withheld under operator confidentiality.
