Case Study

From Field Engagement to Peer Review: Turning a Subsurface-AI Programme into Citable Science

A multi-year subsurface-AI engagement we ran with a mid-sized Middle East carbonate operator produced not just tooling but defensible science: two peer-reviewed journal manuscripts (the fracture paper in SPE Journal) and an EAGE Annual 2024 extended abstract, reviewer-validated on 14 wells and ~1,500 image-log patches.

by Tannistha Maiti

Most applied-AI programmes end where the dashboard ships. The harder, rarer outcome is a result that an external reviewer, who has never met the team and has every incentive to be sceptical, is willing to put their name behind. Over a multi-year engagement with a mid-sized Middle East carbonate operator, we set out to clear that second bar. The deliverable was working tooling for borehole-image interpretation; the capstone was that the same work survived peer review as two journal manuscripts (the fracture paper in SPE Journal) and an EAGE Annual 2024 extended abstract, validated on 14 wells and roughly 1,500 image-log patches.

This case study is about that conversion — how a production deep-learning and computer-vision programme becomes citable science, what the review process demanded that an internal pilot never would, and why a national-oil-company-grade engagement should care about the difference.

Two artefacts, two different proofs

The work split into two papers because it solved two genuinely different geological problems, with two different families of model.

The first is a supervised detection problem: find fractures and bedding planes in unwrapped image logs from two different microresistivity imaging tools and recover their geometry. We built GeoBFDT, a Detection-Transformer-derived architecture: a ResNet backbone feeds a transformer encoder–decoder that, via bipartite Hungarian matching, predicts a set of sinusoids end-to-end — depth, dip and azimuth per feature, no masks, no anchor boxes, no non-maximum suppression. That is a model-architecture choice, not a geoscience one, and the reviewers tested it as such.
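The set-prediction idea in the paragraph above can be made concrete with a minimal sketch. It is not the GeoBFDT implementation: the sinusoid convention (depth increasing downward, trace deepest at the dip azimuth), the default borehole diameter, the cost weights and every function name are assumptions for illustration. The one-to-one assignment is found by brute force over permutations, which gives the same optimum as the Hungarian algorithm on small sets but is not that algorithm.

```python
import math
from itertools import permutations

def sinusoid_depth(theta_deg, depth_m, dip_deg, azimuth_deg, borehole_diameter_m=0.216):
    """Depth of a planar feature's trace at borehole azimuth theta on an unwrapped image.

    Assumed convention: depth increases downward and the trace is deepest
    at the dip azimuth. The default diameter is a hypothetical value.
    """
    amplitude_m = 0.5 * borehole_diameter_m * math.tan(math.radians(dip_deg))
    return depth_m + amplitude_m * math.cos(math.radians(theta_deg - azimuth_deg))

def angular_diff_deg(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pair_cost(pred, truth, w_depth=1.0, w_dip=0.01, w_azimuth=0.001):
    """Cost of pairing one prediction with one label.

    Both are (depth_m, dip_deg, azimuth_deg); the weights are hypothetical.
    """
    return (w_depth * abs(pred[0] - truth[0])
            + w_dip * abs(pred[1] - truth[1])
            + w_azimuth * angular_diff_deg(pred[2], truth[2]))

def match_sets(preds, truths):
    """Optimal one-to-one matching by exhaustive search.

    Assumes at least as many predictions as labels; surplus predictions
    stay unmatched, playing the role of "no object" slots.
    Returns (total_cost, [(pred_index, truth_index), ...]).
    """
    best_cost, best_pairs = math.inf, []
    for chosen in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[p], truths[t]) for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(chosen)]
    return best_cost, best_pairs

# Hypothetical predictions and labels: (depth_m, dip_deg, azimuth_deg)
preds = [(100.50, 30.0, 90.0), (102.00, 10.0, 200.0), (105.00, 45.0, 350.0)]
truths = [(102.02, 12.0, 205.0), (100.48, 28.0, 95.0)]
total_cost, pairs = match_sets(preds, truths)
print(pairs)  # [(1, 0), (0, 1)]
```

Because each label is paired with exactly one prediction before the loss is computed, duplicate detections are penalised during training rather than pruned afterwards, which is why no non-maximum suppression step is needed.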

The second is an unsupervised quantification problem: measure vuggy porosity — the irregular dissolution cavities that dominate flow in a dolomitised carbonate — without any labelled training set. Here a Detection Transformer is the wrong tool; there is nothing to learn a class boundary against. So this workstream is a deterministic computer-vision pipeline: local-variation enhancement, adaptive thresholding, contour extraction, then a cascade of area, circularity, contrastive and vicinity filters that replace the slow path-morphology operator of the prior literature. It computes a vug-percentage curve every 10 cm and a full statistical spectrum — count, area, and a circularity distribution that ran 0.28–0.85 (peaking 0.45–0.70, i.e. semi-circular cavities dominate).
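As a rough sketch of two stages in that cascade, the snippet below applies an area filter and a circularity filter to candidate contours and then computes a vug percentage for one depth window. It assumes the common definition of circularity, 4πA/P², which the text does not state; the thresholds, the window size and all function names are hypothetical, and the contrastive and vicinity filters are omitted.

```python
import math

def polygon_area_perimeter(points):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    twice_area, perimeter = 0.0, 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter

def circularity(points):
    """4*pi*A / P**2: 1.0 for a circle, approaching 0 for thin slivers (assumed definition)."""
    area, perimeter = polygon_area_perimeter(points)
    return 4.0 * math.pi * area / perimeter ** 2 if perimeter else 0.0

def filter_vugs(contours, min_area_cm2, max_area_cm2, min_circularity):
    """Keep contours that pass the area filter and then the circularity filter."""
    kept = []
    for contour in contours:
        area, _ = polygon_area_perimeter(contour)
        if min_area_cm2 <= area <= max_area_cm2 and circularity(contour) >= min_circularity:
            kept.append(contour)
    return kept

def vug_percentage(contours, window_area_cm2):
    """Share of one depth window covered by the retained contours, in percent."""
    covered = sum(polygon_area_perimeter(c)[0] for c in contours)
    return 100.0 * covered / window_area_cm2

# Hypothetical contours in centimetres: a compact cavity and a thin sliver.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
sliver = [(0.0, 0.0), (10.0, 0.0), (10.0, 0.1), (0.0, 0.1)]
kept = filter_vugs([square, sliver], min_area_cm2=0.5, max_area_cm2=50.0, min_circularity=0.28)
print(len(kept), vug_percentage(kept, window_area_cm2=200.0))  # 1 2.0
```

Every rejection here traces back to a named threshold, which is the auditability the text attributes to a deterministic pipeline.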

Why two models, not one

Fractures are sparse, labelled, and geometric — a supervised set-prediction transformer fits. Vugs are dense, unlabelled, and morphological — a classical CV pipeline with explicit, auditable filters fits. Forcing one architecture onto both problems is the most common way an applied-AI programme produces a result that does not survive review.

The throughput dividend that funded the science

Before the publication arc, the engagement had to earn its keep operationally, and it did, against the metric that matters to an asset team: interpreter time. Manual sinusoid picking and vug counting are the bottleneck of image-log interpretation; both scale linearly with logged metres and with the depth of the interpretation backlog. The automated pipeline turned interpretation that took 6–18 weeks at field scale into a job measured in hours: a throughput dividend on the order of 10×, with the well-to-well workflow lifting interpreter productivity around 60% and interpretation consistency around 75% on the operator's own internal scoring.

The instinct on seeing a 10× dividend is to cut the team by 90%. That is the wrong read, and it is worth making the counter-argument concrete, because it is exactly the argument that justified pushing for publication rather than quietly banking a cost saving.

[Interactive figure: the 10× dividend (6–18 weeks → hours). An allocator trades team retained (10–100%) against acreage covered (1×–10×), with acreage = team × throughput.]
The naive read of a 10× productivity jump is 'cut 90% of the team.' The article argues the opposite: at the same headcount, the dividend buys ten times the interpreted acreage. Spend the dividend on headcount cuts and acreage stays at 1×; hold the team and acreage climbs toward 10×. The ~10× throughput multiple (6–18 weeks → hours) is sourced; the headcount-retained axis is the arithmetic inverse of 10× and is illustrative, since the article states no specific headcount figure.

Hold the headcount and the same interpreters cover roughly ten times the acreage — which is what let scarce senior-geologist hours flow into the analysis, ablation studies and rebuttals that peer review demands. The science was paid for by the productivity, not traded against it.
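The arithmetic behind that trade-off is a single multiplication, acreage = team × throughput. The sketch below spells it out; the function name is hypothetical and the team fractions are illustrative, since no headcount figure is given.

```python
def acreage_multiple(team_fraction, throughput_multiple=10.0):
    """Interpreted acreage relative to today, for a retained share of the team.

    throughput_multiple defaults to the ~10x dividend quoted in the text;
    team_fraction values are illustrative only.
    """
    if not 0.0 <= team_fraction <= 1.0:
        raise ValueError("team_fraction must lie between 0 and 1")
    return team_fraction * throughput_multiple

for fraction in (0.1, 0.5, 1.0):
    print(f"team retained {fraction:.0%} -> acreage x{acreage_multiple(fraction):.1f}")
```

Retaining 10% of the team holds acreage at 1×, so the whole dividend goes to cost; retaining all of it moves coverage to the 10× ceiling.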

What peer review demanded that a pilot never does

An internal pilot is graded by the people who commissioned it. A journal submission is graded by three reviewers and an associate editor who did not, and whose default is rejection. The gap between those two bars is where most applied-AI work quietly fails — and where this programme had to do its hardest engineering. Three demands stand out.

Honest data partitioning under genuine scarcity. With only 14 wells, holding out an entire well for test would have starved training; reviewers pressed hard on whether the split was leaking. The answer was a patch-level partition with documented overlap control — and, critically, the engineering discipline to report it as a limitation rather than hide it. We added a blind-zone validation over a 12 m interval of an unseen well and reported recall around 90% within a 10 cm depth offset, so the generalisation claim rested on data the model had never touched, not on a favourable random split.
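One way such overlap control might be checked is sketched below: flag any training and test patches from the same well whose depth intervals intersect. The `Patch` record, its fields and the sample intervals are hypothetical; the text does not describe the actual bookkeeping.

```python
from typing import NamedTuple

class Patch(NamedTuple):
    """Hypothetical patch record: source well plus top and base depth in metres."""
    well: str
    top_m: float
    base_m: float

def overlaps(a, b):
    """True when two patches share a well and their depth intervals intersect.

    Intervals that only touch at a boundary are not counted as overlapping.
    """
    return a.well == b.well and a.top_m < b.base_m and b.top_m < a.base_m

def leaking_pairs(train, test):
    """All (train_patch, test_patch) pairs that would leak image content across the split."""
    return [(a, b) for a in train for b in test if overlaps(a, b)]

train = [Patch("W1", 100.0, 101.0), Patch("W1", 101.0, 102.0)]
test = [Patch("W1", 100.5, 101.5), Patch("W2", 100.0, 101.0), Patch("W1", 102.0, 103.0)]
leaks = leaking_pairs(train, test)
print(len(leaks))  # 2
```

An empty result is a necessary condition for a clean patch-level split, not a sufficient one: adjacent patches from the same well can still share geology, which is why the blind-zone test on an unseen well carries the generalisation claim.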

Defensible model selection, not a single lucky run. A reviewer asking "why this architecture?" is asking for an ablation, and we had run them. Augmentation was not cosmetic: a geometry-preserving augmentation regime grew the corpus from 236 patches (just 19 sinusoid-bearing, 32 sinusoids total) to 4,212 patches / 2,046 sinusoid-patches / 3,565 sinusoids — a greater-than-tenfold expansion — and classification error fell from effectively 100% to low single digits in lockstep. The well-count ablation told the deeper story: from 3 to 14 wells the Hungarian matching loss fell from 0.801 to 0.015 and class error from 93.115 to a low single digit, evidence that geological diversity, not raw patch count, was the binding constraint. And a small ResNet backbone beat heavier ones on this corpus — more capacity overfit. Each of these is a model-development result that an internal demo would never have surfaced, because an internal demo only has to work once.
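The text does not say which transforms made up the geometry-preserving regime. One transform that fits the description, shown here purely as an assumption, is a circular shift of the unwrapped image around the borehole: pixels wrap from one edge to the other, depth and dip are unchanged, and the azimuth label moves by the same angle. The function name and the column-0-equals-azimuth-0 convention are hypothetical.

```python
def roll_azimuth(image, labels, shift_cols):
    """Circularly shift an unwrapped image log and update the sinusoid labels.

    image  : list of rows (lists), columns assumed to span 0-360 degrees of azimuth
    labels : list of (depth_m, dip_deg, azimuth_deg) tuples
    Returns a new (image, labels) pair; depth and dip are left untouched.
    """
    width = len(image[0])
    s = shift_cols % width
    # Column j moves to column (j + s) % width, wrapping around the borehole.
    rolled = [row[-s:] + row[:-s] if s else list(row) for row in image]
    delta_deg = 360.0 * s / width
    new_labels = [(depth, dip, (az + delta_deg) % 360.0) for depth, dip, az in labels]
    return rolled, new_labels

# Toy 2 x 4 image: each column covers 90 degrees of azimuth.
image = [[0, 1, 2, 3], [4, 5, 6, 7]]
labels = [(100.0, 30.0, 350.0)]
rolled, rolled_labels = roll_azimuth(image, labels, shift_cols=1)
print(rolled)         # [[3, 0, 1, 2], [7, 4, 5, 6]]
print(rolled_labels)  # [(100.0, 30.0, 80.0)]
```

A transform of this kind multiplies patch count without adding geological diversity, which is consistent with the well-count ablation pointing to diversity as the binding constraint.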

Reproducible metrics with stated tolerances. Reviewers do not accept "≈85% accurate." They accept an accuracy defined as a pick landing inside an explicit tolerance window, swept across thresholds. So the fracture model reports sensitivity around 85% at an 8 cm depth threshold tightening to roughly 65% at an unforgiving 3 cm; dip accuracy around 90% at 3°; azimuth around 92% at 15°; with combined-model geometry residuals of mean absolute error 1.72 cm on depth, 1.71° on dip and 9.34° on azimuth.
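A threshold-swept metric of this kind can be sketched as follows. The greedy nearest-neighbour matching, the function names and the sample depths are assumptions for illustration; the manuscript's exact matching rule is not given in the text.

```python
def sensitivity_at(truth_depths_m, pred_depths_m, tol_m):
    """Fraction of true picks that have an unused prediction within tol_m.

    Greedy one-to-one matching: each prediction can satisfy at most one pick.
    """
    unused = list(pred_depths_m)
    hits = 0
    for t in truth_depths_m:
        if not unused:
            break
        nearest = min(unused, key=lambda p: abs(p - t))
        if abs(nearest - t) <= tol_m:
            hits += 1
            unused.remove(nearest)
    return hits / len(truth_depths_m) if truth_depths_m else 0.0

def sweep(truth_depths_m, pred_depths_m, tolerances_m):
    """Sensitivity reported at every tolerance, not at one convenient value."""
    return {tol: sensitivity_at(truth_depths_m, pred_depths_m, tol) for tol in tolerances_m}

def mean_absolute_error(pairs):
    """MAE over matched (predicted, true) value pairs."""
    return sum(abs(p - t) for p, t in pairs) / len(pairs)

# Hypothetical picks in metres; tolerances of 3, 5 and 8 cm.
truth = [100.00, 101.00, 102.00, 103.00]
pred = [100.02, 101.06, 102.10, 110.00]
result = sweep(truth, pred, (0.03, 0.05, 0.08))
print(result)  # {0.03: 0.25, 0.05: 0.25, 0.08: 0.5}
```

The same predictions score 25% or 50% depending only on the window, which is why a bare accuracy figure without its tolerance carries little information.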

[Interactive figure: GeoBFDT, 14 wells, one forward pass. Depth (F1), dip (accuracy) and azimuth (accuracy) against tolerance, for fractures and beddings. F1 65/63 at 3 cm, 75/69 at 5 cm, 55 at 4 cm (fractures, horizontal wells); dip 90 at 3°; azimuth 92/84 at 15°. The ~70% useful-regime line is illustrative.]
GeoBFDT emits the whole (class, depth, dip, azimuth) tuple in one forward pass, but the three axes are not equally hard. Detection along the depth axis is the binding constraint: at a tight 3 cm window fracture F1 is only ~65% (beddings ~63%), and it clears the useful regime for structural work only once tolerance loosens to 5 cm (~75% / ~69%); horizontal wells hold ~55% at 4 cm. The geometric axes the interpreter actually fits sinusoids for are already strong at tight tolerance: dip ~90% at 3°, azimuth ~92% (fractures) / ~84% (beddings) at 15°. Depth is the lever you loosen; dip and azimuth are already past the line. All accuracies and tolerances are the article's own; the dashed ~70% 'useful regime' line is an illustrative reading aid (the article names no exact F1 cutoff).

The honesty cut both ways. When the model was extended to 5 horizontal wells, a genuinely different distribution where average fracture sinusoid height collapses, performance dropped to roughly 55% at a 4 cm threshold, and we reported it. The vug pipeline was held to the same standard: a mean absolute error of 1.21 cm² against expert picks and roughly 85% alignment with expert interpretation, while running at about 15 seconds per metre against the ~5 minutes per metre of the path-morphology baseline it replaced. That is roughly a 20× speed-up, with the accuracy stated, not implied.

The rebuttal as engineering work

Peer review is not a formality you pass; it is a second development cycle. The fracture manuscript drew three reviewers and an associate editor, including one who initially judged it unpublishable. The manuscript was first submitted in mid-2024; the revision that answered them, resubmitted that October, was substantial engineering, not prose polish.

Reviewers wanted a head-to-head against Mask R-CNN and YOLO. We declined that specific comparison, and the reason is itself a defensible engineering argument: those models solve a different objective — per-pixel masks or anchored boxes followed by instance grouping and a geometric fit — whereas GeoBFDT regresses dip and azimuth directly. A mask-IoU or box-mAP number is not commensurable with end-to-end depth/dip/azimuth accuracy; reporting it would have flattered a method by measuring it on the task it happens to be built for. Saying so, with the architecture's set-prediction rationale spelled out, is what a reviewer means by novelty defended rather than asserted.
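To make the extra "geometric fit" step concrete, the sketch below recovers depth, dip and azimuth from a trace of the kind a mask-based pipeline might produce after instance grouping. It assumes a clean trace sampled at evenly spaced azimuths over the full circle, a known borehole diameter, and the convention that the trace is deepest at the dip azimuth; real mask outputs are noisy and partial, and the function names are hypothetical.

```python
import math

def fit_sinusoid(depths_m):
    """Fit depth(theta) = c + A*cos(theta - phi) to N >= 3 evenly spaced samples.

    Uses the discrete Fourier projection, which equals the least-squares fit
    when the samples cover the full circle at equal spacing.
    Returns (centre_depth_m, amplitude_m, phase_deg).
    """
    n = len(depths_m)
    thetas = [2.0 * math.pi * k / n for k in range(n)]
    centre = sum(depths_m) / n
    a = 2.0 / n * sum(d * math.cos(t) for d, t in zip(depths_m, thetas))
    b = 2.0 / n * sum(d * math.sin(t) for d, t in zip(depths_m, thetas))
    amplitude = math.hypot(a, b)
    phase_deg = math.degrees(math.atan2(b, a)) % 360.0
    return centre, amplitude, phase_deg

def dip_from_amplitude(amplitude_m, borehole_diameter_m):
    """Apparent dip in degrees from the sinusoid amplitude and the borehole diameter."""
    return math.degrees(math.atan(2.0 * amplitude_m / borehole_diameter_m))

# Synthetic trace: centre 100 m, amplitude 0.1 m, deepest at 120 degrees azimuth.
n_samples = 36
trace = [100.0 + 0.1 * math.cos(2.0 * math.pi * k / n_samples - math.radians(120.0))
         for k in range(n_samples)]
centre, amplitude, azimuth_deg = fit_sinusoid(trace)
dip_deg = dip_from_amplitude(amplitude, borehole_diameter_m=0.2)
print(round(centre, 3), round(amplitude, 3), round(azimuth_deg, 1), round(dip_deg, 1))
```

In a mask-first pipeline, errors in segmentation, grouping and this fit all compound before any dip or azimuth exists to score, so a mask-IoU figure says little about the final geometry.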

From internal pilot to peer-reviewed result

Before (pilot-grade evidence): single favourable split, one model, accuracy quoted without tolerance, no out-of-distribution test.

After (reviewer-grade evidence): patch-level partition + 12 m blind zone, well-count and backbone ablations, threshold-swept metrics, horizontal-well OOD result reported.

2 peer-reviewed journal manuscripts (1 in SPE Journal) + EAGE Annual 2024 abstract, 3 reviewers + 1 associate editor cleared

The other demands were data-engineering and reproducibility work: defining every error metric with explicit equations, documenting the QC and exploratory-data-analysis steps that caught corrupted image scaling, adding blind-well and horizontal-well analyses, and reorganising the manuscript against an 11,000-word limit that pushed the full sensitivity tables into supplementary material. One reviewer comment we could not satisfy — the request to release code and data — was blocked by operator confidentiality, and we said so plainly rather than fudging it. That single unresolved item is, in its own way, the most honest line in the submission.

Why a citable result is a different kind of asset

For an operator, a peer-reviewed result is not vanity. It is a form of technical de-risking that an internal deck cannot provide. A reviewer-validated accuracy figure has been adversarially tested by domain experts with no stake in the outcome; a method that survived three reviewers transfers to an adjacent field with a documented, defensible baseline rather than a folkloric one. And because the metrics are defined with stated tolerances and out-of-distribution caveats, the next team to deploy the model knows exactly where its competence ends — the difference between a tool an asset team trusts and one it quietly stops using.

The breadth matters here too. The architectural arguments — set prediction for overlapping geometry, classical CV for unlabelled morphology — are not region-specific, and the engineering discipline behind them is the same we bring to operators across the Middle East and the United States. What was specific, and what made the science possible, was a partner willing to let the work be tested in the open to the limit that confidentiality allowed.

Turning an applied-AI engagement into peer-reviewed science

  1. The same multi-year engagement we ran with a mid-sized Middle East carbonate operator produced both production tooling and defensible science — two peer-reviewed journal manuscripts (the fracture paper in SPE Journal) and an EAGE Annual 2024 extended abstract, reviewer-validated on 14 wells and ~1,500 image-log patches.
  2. Two problems, two model families: a DETR-derived set-prediction transformer (GeoBFDT) for supervised fracture/bedding detection, and a deterministic computer-vision pipeline for unsupervised vug quantification — forcing one architecture onto both is how applied-AI work fails review.
  3. Peer review is a second development cycle: it forced honest patch-level partitioning plus a 12 m blind-zone test, well-count and backbone ablations, threshold-swept metrics with stated tolerances, and a reported horizontal-well out-of-distribution drop — engineering an internal pilot never produces.
  4. The ~10× throughput dividend (field-scale weeks to hours; ~60% productivity, ~75% consistency on the operator's scoring) funded the science by freeing senior-geologist hours for ablations and rebuttals, rather than being traded against headcount.

References

  1. GeoBFDT fracture/bedding detection performance, ablation, and blind-zone figures derived from internal validation on a 14-well Middle East carbonate dataset acquired with two different microresistivity imaging tools; data and code withheld under operator confidentiality.

  2. Automated vug-quantification computer-vision pipeline: accuracy (MAE 1.21 cm² vs expert picks, ~85% alignment) and throughput (~15 s/m vs ~5 min/m path-morphology baseline) measured on Middle East carbonate image logs.

  3. Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872

  4. Two peer-reviewed journal manuscripts (the fracture paper in SPE Journal, plus a companion vug manuscript) and an EAGE Annual 2024 (Oslo) extended abstract; the fracture paper was reviewer-validated through three reviewers and one associate editor under the journal's standard double-blind process.

© 2026 Copyright. Earthscan