Crossing the Domain Gap: Well-to-Well Adaptation That Lifted Interpretation Productivity 60% and Accuracy 75%

A fracture model that works beautifully on the well it was trained on, and falls apart on the next one, is not a product — it is a demo. The hard part of operationalising borehole-image interpretation across a real asset is not training the first model; it is the domain gap between wells. Tool string, mud system, borehole rugosity, dynamic-normalisation window, even the vintage of the imaging run all shift the pixel statistics a Detection Transformer has learned to read. Cross that gap and you have an interpretation engine for the field. Fail to, and you are back to bespoke labelling and a fresh training run at every wellbore.

For a mid-sized Middle East carbonate operator we partnered with, crossing that gap was the Phase-3 milestone of the engagement. Well-to-well (W2W) adaptation generalised the fracture-and-bedding model across the operator's wells, lifting interpretation productivity by 60% and interpretation accuracy by 75%, at a 95% target precision and 90% stratigraphic success rate. This case study is about the engineering that earned those numbers — the feature-alignment problem underneath them, and why it is a transfer-learning problem before it is a geoscience one.

The domain gap is a feature-distribution gap

The model at the centre of this work is GeoBFDT, a DETR-derived set-prediction network: a deliberately small ResNet-10 backbone, a transformer encoder–decoder, and a bank of learned queries that each emit a class (fracture, bedding, or no-object) and a regressed depth/dip/azimuth triplet for one sinusoid. We covered its architecture in our work on detection transformers for borehole geology. What matters here is what happens to that network when you point it at a well it has never seen.

In principle, the geology rhymes from well to well across a contiguous carbonate. In practice, the feature maps do not. A model trained on dynamically normalised high-resolution borehole image logs from one set of wells encodes a particular distribution of contrast, texture, and sinusoid amplitude in the activations the encoder builds. Move to a well logged with a different micro-imager — say a compact microresistivity tool whose static range is compressed relative to a standard high-resolution borehole image log — and the same geological sinusoid lands in a different region of feature space. The decoder queries, tuned to attend to the training distribution, mis-read it. This is the textbook signature of covariate shift, and on borehole images it is severe enough that a naive port degrades hardest exactly where interpreters need help: the densely fractured intervals where sinusoids overlap.

The diagnostic question, then, is geometric: how far apart do the two wells sit in the model's feature space, and where is the decision boundary that separates a confident, correct pick from a mis-read? The interactive below makes that tangible — drag the alignment boundary and watch how many target-well sinusoids fall on the wrong side before adaptation pulls the two distributions together.

The mechanism behind EAN-DDA, in one move. At the encoder bottleneck, labelled F3 (source) and unlabelled Penobscot (target) patches map into a feature space. Train on F3 alone and the two clouds sit apart — the domain gap — so the F3 decision boundary cuts through the target cloud and facies get mis-read. The deep-domain-adaptation alignment loss pulls the target cloud onto the source manifold: drag the alignment strength λ from 0 to 1 and the orange gap collapses while grey target rings turn teal as they cross to the correct side of the boundary. Only '6 facies classes labelled on F3' is the article's own number; the point clouds, decision boundary, gap, and the live mis-placed-point count are a schematic of feature-distribution alignment — no benchmark metrics are shown.

The lesson the manifold encodes is the one that governed the whole phase: W2W is feature alignment, not retraining. The objective is to shift the target well's feature distribution onto the source manifold the decoder already understands — so that a model trained on the wells we have labelled reads a new well without a from-scratch training run on fresh annotations the operator does not yet have.

Why more wells, not more data, was the lever

The first thing the ablations told us is that this model is hungry for geological diversity, not raw patch count. Augmentation had already inflated a single sparse interval more than tenfold; that bought convergence, not generalisation. Generalisation came from wells. Across a well-count sweep of 3 → 6 → 9 → 11 → 14 wells, the Hungarian matching loss fell monotonically from 0.801 to 0.015 — finding and measuring sinusoids kept improving with every well added. Classification error told a sharper story: it collapsed from 93.115% at 3 wells to its floor of 0.817% at 11 wells, then — counter-intuitively — rose to 2.536% at 14 wells. That reversal was not noise. The three wells added beyond 11 carried less consistent sinusoid picks, and in a set-prediction model trained with Hungarian matching, label inconsistency is uniquely corrosive: it teaches the matcher contradictory targets. The lesson stuck: the floor is set by labelling consistency, not well count. The marginal step from 8 to 11 wells improved depth, dip, and azimuth mean-absolute error by only roughly 0.007 MAE — confirming the geometry had already plateaued and that further gains had to come from label quality, not volume.

That finding reframed the data strategy. Each consistently-picked well is not "more of the same"; it is a new sample of the field's feature distribution, and every clean sample widens the manifold the decoder can align to — while an inconsistently-picked one can narrow it. The phase ladder shows the trajectory directly: Phase 1 trained on 3 wells, Phase 2 on 8, and Phase 3 on 16 — the count of well-diverse, consistently-labelled wells, not the epoch count, was the axis we pushed.

Interpretation workflow: per-well retraining vs well-to-well adaptation

Before

Per-well bespoke model

Train from scratch on each well's hand-labelled sinusoids; no transfer; labelling cost recurs at every wellbore

After

W2W feature alignment

Align a new well's feature distribution onto the source manifold; the existing model reads it without a fresh training run

+60% interpretation productivity, +75% accuracy at 95% target precision, 90% stratigraphic success

From a single well's picks to a field-scale stratigraphic surface

A ported model that picks sinusoids is necessary but not sufficient. The product the operator's subsurface team actually consumes is a cross-well picture: fracture and bedding density correlated along the stratigraphy, so a geologist can see a fractured corridor track from one well to the next. That correlation layer is where W2W becomes a geomatics-and-software problem, not just a model.

The pipeline aggregates the model's per-sinusoid picks into fracture and bedding density counts at 0.1 m (10 cm) intervals down each well. We swept the density grid at 5, 10, and 50 cm and settled on a 10 cm grid with a rolling-average kernel of 5 — fine enough to resolve a discrete fractured zone, smoothed enough to suppress single-pick noise. Those per-well density curves were then correlated across the field and interpolated between wellbores with 2D kriging, computed across 13 vertical wells in Phase 3, to project bedding and fracture intensity into the inter-well volume the image logs never see directly. The output is the artefact a subsurface manager can act on: a stratigraphically aligned density model where a high-fracture interval in one well is correlated, not coincidentally adjacent, to the same interval in its neighbour.

This is the part of the build that turns a per-well computer-vision model into an asset-scale interpretation system — and it is why the headline metric is interpretation productivity rather than raw model throughput. The dividend is not only that AutoFrac and AutoVug pick roughly 5x faster than a manual pass; it is that the picks land in a correlated, kriged framework an interpreter would otherwise assemble by hand.

Proving it on wells the model had not seen

Productivity gains are easy to claim and hard to trust, so the acceptance gate was a blind test, not a validation curve. Adapted models were scored on a continuous 12 m blind zone held entirely out of training — no overlapping patches leaking across the train/test boundary — and stress-tested at a tighter 4 m blind length to confirm the result was not an artefact of a generous window. On those held-out intervals the adapted model held its accuracy envelope: fracture sensitivity around 85% at an 8 cm depth threshold, with the combined fracture model's geometry errors tight where they count — mean absolute error of 1.72 cm on depth, 1.71° on dip, and 9.34° on azimuth. Those residuals sit inside the uncertainty already baked into a carbonate reservoir model, which is the bar that separates "validator-ready" from "research-stage."

What 95% / 90% actually gate

The 95% target-precision figure is a quality floor on the picks the system surfaces — the operator's interpreters were not asked to accept a flood of low-confidence detections. The 90% stratigraphic-success figure gates the harder claim: that an adapted model's picks, once correlated and kriged across wells, reproduce the field's stratigraphy rather than each well's idiosyncrasies. A model can pick well and still tell an incoherent cross-well story; this metric is the one that says it does not.

The MLOps layer that made adaptation routine

None of this is a research result if every new well demands a hand-run experiment. The W2W milestone was as much an MLOps milestone as a modelling one. Adaptation runs were tracked end-to-end — every well, checkpoint, density-grid setting, and blind-zone score versioned — so that porting the model to well N+1 was a parameterised, reproducible job rather than a one-off. The 11-to-14-well reversal in particular hardened into a standing gate: before any new well enters a training pool, its picks are scored for agreement against the existing corpus, and a well that would raise validation class error is quarantined for re-picking rather than merged. The whole stack stayed deliberately lean: a small ResNet-10 backbone and a serving footprint of roughly 8 GB of GPU memory per machine on commodity hardware, which matters enormously to a national operator that wants the engine to run inside its own environment rather than depend on a hyperscale training cluster for every new wellbore.

That production discipline is what makes the headline figures durable rather than anecdotal. A 60% productivity lift that holds only on the engineer's laptop is a slide; the same lift behind a versioned, blind-tested, reproducible adaptation pipeline is a capability the operator's interpretation team can keep using after the engagement ends.

Why the result generalises beyond one field

The argument under W2W is not Middle East-specific, and the small-data failure modes we engineer against recur across the subsurface engagements we run for national and independent operators in the Middle East and the United States. Any asset with multiple imaged wells and a shifting acquisition footprint faces the same feature-distribution gap; the public FORCE 2020 benchmark — 118 wells across the Norwegian Continental Shelf — exists precisely because well-to-well generalisation is the recurring failure mode of subsurface ML. What transfers is the recipe: treat the gap as feature alignment rather than retraining, push geological diversity of consistently-picked wells over raw augmentation, correlate per-well picks into a kriged cross-well framework, and gate everything on blind zones the model never saw. That recipe is what turns a single-well demo into a field-scale interpretation engine.

What well-to-well adaptation actually delivered

W2W adaptation lifted interpretation productivity 60% and accuracy 75% at 95% target precision and 90% stratigraphic success — by aligning a new well's feature distribution onto the source manifold the model already understood, instead of retraining from scratch at every wellbore.
Geological diversity of consistently-picked wells, not data volume, was the lever: across 3 to 14 wells the Hungarian loss fell monotonically from 0.801 to 0.015, but class error bottomed at 0.817% at 11 wells and rose to 2.536% at 14 as less-consistent picks entered — and the 8-to-11-well step moved depth/dip/azimuth by only ~0.007 MAE.
Per-sinusoid picks were aggregated into 10 cm density grids (kernel 5) and correlated across 13 wells with 2D kriging, turning a per-well CV model into an asset-scale, stratigraphically aligned interpretation system.
The gains were earned on blind zones (12 m and 4 m) the model never saw, served at ~8 GB GPU per machine inside the operator's own environment — an MLOps milestone (versioned runs plus a label-consistency gate) as much as a modelling one.

Crossing the Domain Gap: Well-to-Well Adaptation That Lifted Interpretation Productivity 60% and Accuracy 75%

The domain gap is a feature-distribution gap

Why more wells, not more data, was the lever

From a single well's picks to a field-scale stratigraphic surface

Proving it on wells the model had not seen

The MLOps layer that made adaptation routine

Why the result generalises beyond one field

Continuous AI for explorers

About Earthscan

Products

Legal

Follow us on