Why Detection Transformers, Not Mask R-CNN, for Borehole Geology

When two fractures and a bedding plane cross the same two metres of an image log, the hard problem is not finding them — it is deciding which detection belongs to which sinusoid. That assignment problem is where per-pixel segmentation and anchor-box detectors quietly fail, and it is exactly the problem a Detection Transformer is built to solve. For a mid-sized Middle East carbonate operator, we built GeoBFDT, trained on high-resolution borehole image logs from two different microresistivity imaging tools across 14 vertical wells. It treats fractures and bedding planes collectively as sinusoids, and emits depth, dip, and azimuth for each in a single forward pass — no masks, no anchor boxes, no non-maximum suppression.

The problem with masks and anchors on a borehole image

Every planar feature that intersects a wellbore — a fracture, a bedding surface — projects onto the unwrapped image log as a sinusoid. The interpreter's job is to pick each trace, fit a curve, and recover three numbers: depth, dip, and azimuth. In a fractured carbonate, those sinusoids overlap. A 2.2-metre patch can hold dozens of them, crossing and occluding one another, conductive traces (dark) tangled with resistive ones (bright).
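The geometry behind those three numbers can be sketched in a few lines. This is a minimal illustration, assuming a cylindrical borehole, depth measured positive downward, and a trace whose deepest point lies in the dip direction; the function names and the borehole radius are hypothetical and are not taken from GeoBFDT.

```python
import math

def sinusoid_depth(theta_deg, centre_depth_m, dip_deg, dip_azimuth_deg, radius_m):
    """Depth of a planar feature's trace at borehole azimuth theta.

    Assumed convention: depth is positive downward and the trace is deepest
    in the dip direction.
    """
    amplitude_m = radius_m * math.tan(math.radians(dip_deg))
    phase = math.radians(theta_deg - dip_azimuth_deg)
    return centre_depth_m + amplitude_m * math.cos(phase)

def dip_from_amplitude(amplitude_m, radius_m):
    """Invert the amplitude relation: dip = atan(amplitude / radius)."""
    return math.degrees(math.atan(amplitude_m / radius_m))

RADIUS_M = 0.108  # hypothetical borehole radius, for illustration only

# Sample a 60-degree plane dipping toward azimuth 135 at eight azimuths.
trace = [sinusoid_depth(theta, 1500.0, 60.0, 135.0, RADIUS_M)
         for theta in range(0, 360, 45)]
deepest_azimuth = max(range(len(trace)), key=lambda i: trace[i]) * 45
```

Amplitude carries the dip and phase carries the azimuth, which is why a model can regress the triplet directly instead of fitting a curve to a mask.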

This is precisely the regime where the obvious computer-vision tools break down. The borehole-imaging literature has leaned on Mask R-CNN: Liu (2022) reports an mIoU of about 81.2%, and Du (2023) reports 96% precision and 92% recall, both on segmentation-style fracture tasks. Those are respectable numbers on isolated features, but the approach carries two structural liabilities when sinusoids overlap.

The first is the mask itself. Segmentation labels every pixel, then a downstream step must group pixels into instances and fit a curve to each. When two sinusoids cross, the pixels at the intersection belong to both; the mask cannot represent that, and the instance-grouping heuristic has to guess. The second is the anchor-and-NMS machinery that detectors like YOLO and Faster/Mask R-CNN use to propose and de-duplicate boxes. Non-maximum suppression assumes that two highly overlapping detections are duplicates of one object and deletes one of them. For overlapping sinusoids that is the wrong prior by construction: the overlap is the signal, not a duplicate. You end up spending your hyperparameter budget tuning NMS thresholds so that they do not delete real fractures.
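The NMS failure mode is easy to reproduce on paper. The sketch below is a generic greedy NMS, not the code of any particular detector, applied to hypothetical bounding boxes around three sinusoids. Each box spans the full 0 to 360 degree azimuth range, so two fractures only 5 cm apart in depth overlap almost completely.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold):
    """Keep the highest-scoring box, drop any box overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical boxes: x is azimuth in degrees, y is depth in metres.
boxes = [
    (0.0, 100.00, 360.0, 100.60),  # fracture A
    (0.0, 100.05, 360.0, 100.65),  # fracture B, a distinct plane crossing A
    (0.0, 101.50, 360.0, 101.90),  # an isolated bedding plane
]
scores = [0.95, 0.90, 0.80]

overlap_ab = iou(boxes[0], boxes[1])  # about 0.85
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

With a threshold of 0.5, `kept` holds indices 0 and 2: fracture B is discarded as a duplicate even though it is a real, separate plane. Raising the threshold far enough to keep B also lets true duplicates through, which is the tuning trap described above.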

The assignment problem, stated plainly

A mask labels pixels; an anchor detector proposes boxes and then suppresses overlaps. Neither has a native, differentiable way to say “these N predictions correspond one-to-one to those M ground-truth sinusoids.” That correspondence is the whole task.

Set prediction, and why bipartite matching is the right tool

A Detection Transformer reframes detection as set prediction. GeoBFDT keeps that spine: a ResNet-10 backbone extracts features, a 4-layer transformer encoder builds global context across the patch, and a 4-layer decoder turns a fixed bank of learned query embeddings into a set of predictions. Each query emits a class (fracture, bedding, or no-object) plus the regression triplet — depth, dip, azimuth — normalised into a 0–1 range.
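One way to picture the normalised regression target is shown below. The ranges are assumptions made for illustration (depth relative to the top of a 2.2 m patch, dip over 0 to 90 degrees, azimuth over 0 to 360 degrees); the scaling GeoBFDT actually uses is not stated here.

```python
PATCH_HEIGHT_M = 2.2  # patch height quoted in this article

def normalise(depth_in_patch_m, dip_deg, azimuth_deg):
    """Map a (depth, dip, azimuth) triplet into the 0-1 range (assumed ranges)."""
    return (
        depth_in_patch_m / PATCH_HEIGHT_M,
        dip_deg / 90.0,
        (azimuth_deg % 360.0) / 360.0,
    )

def denormalise(depth_n, dip_n, azimuth_n, patch_top_depth_m):
    """Invert normalise() and place the pick at its measured depth."""
    return (
        patch_top_depth_m + depth_n * PATCH_HEIGHT_M,
        dip_n * 90.0,
        azimuth_n * 360.0,
    )

target = normalise(1.1, 45.0, 270.0)
pick = denormalise(*target, patch_top_depth_m=2000.0)
```

Keeping all three outputs on a common 0 to 1 scale lets a single L1 term weigh depth, dip, and azimuth errors comparably.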

The mechanism that makes this work end-to-end is Hungarian matching. During training, every prediction is matched to at most one ground-truth sinusoid (and the rest to “no-object”) by solving a bipartite assignment that minimises a combined classification-plus-geometry cost. The loss is then computed only against that optimal pairing.
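The matching step can be sketched on a toy case. The example below brute-forces the optimal assignment over all permutations, which is only practical for a handful of queries; a production system would typically use a Hungarian solver such as SciPy's `linear_sum_assignment`. The cost follows the DETR convention of negative class probability plus an L1 geometry term. The equal cost weights, the data layout, and the function names are assumptions for illustration.

```python
from itertools import permutations

def pair_cost(pred, target, class_weight=1.0, geom_weight=1.0):
    """Cost of assigning one query to one ground-truth sinusoid."""
    class_cost = -pred["probs"][target["label"]]
    geom_cost = sum(abs(p - t) for p, t in zip(pred["geom"], target["geom"]))
    return class_weight * class_cost + geom_weight * geom_cost

def match(preds, targets):
    """Return ({target_index: query_index}, total_cost) with minimum cost."""
    best_cost, best = float("inf"), ()
    for queries in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[q], targets[t]) for t, q in enumerate(queries))
        if cost < best_cost:
            best_cost, best = cost, queries
    return dict(enumerate(best)), best_cost

# Two ground-truth sinusoids at almost the same normalised depth.
targets = [
    {"label": "fracture", "geom": (0.50, 0.60, 0.25)},
    {"label": "bedding", "geom": (0.52, 0.10, 0.70)},
]
# Four queries; geometry is (depth, dip, azimuth), all normalised to 0-1.
preds = [
    {"probs": {"fracture": 0.10, "bedding": 0.10}, "geom": (0.90, 0.50, 0.50)},
    {"probs": {"fracture": 0.80, "bedding": 0.10}, "geom": (0.51, 0.58, 0.26)},
    {"probs": {"fracture": 0.10, "bedding": 0.85}, "geom": (0.52, 0.12, 0.69)},
    {"probs": {"fracture": 0.05, "bedding": 0.05}, "geom": (0.10, 0.10, 0.10)},
]

assignment, total_cost = match(preds, targets)
no_object = [q for q in range(len(preds)) if q not in assignment.values()]
```

Here the fracture is assigned to query 1 and the bedding plane to query 2, while queries 0 and 3 fall through to "no-object". The two targets sit almost on top of each other in depth, yet each keeps its own query because the assignment is solved jointly rather than by suppression.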

Why set prediction survives overlap where suppression fails. Left: a fixed bank of decoder queries. Right: a schematic unwrapped borehole patch carrying overlapping ground-truth sinusoids (fractures and beddings). Under Hungarian matching, the one-to-one assignment is re-solved as overlap density rises, so every real sinusoid keeps its own query and surplus queries route to 'no-object'; overlap is the signal, and nothing is deleted. Under NMS, rising overlap makes non-maximum suppression read the tightest-overlapping pair as a duplicate and delete a real fracture. This is a structural illustration of the mechanism: it sources no benchmark numbers, and the query count, sinusoid count, and overlap geometry are schematic.

Three consequences follow directly, and each removes a failure mode of the segmentation/anchor stack:

No NMS. Because matching is one-to-one by construction, two overlapping fractures are assigned to two different queries. There is nothing to suppress, no overlap threshold to tune, and no mechanism that can delete a real fracture for being too close to another.
No anchors. Queries are learned, not tiled priors over scale and aspect ratio. The model is not biased toward a box geometry that does not describe a sinusoid in the first place.
No masks, no curve fitting. The decoder regresses dip and azimuth directly. The mask-creation pre-step and the post-hoc sinusoid fit — each a separate source of error and tuning debt — simply do not exist in the pipeline.

This is also why we declined a like-for-like benchmark against Mask R-CNN or YOLO. Those models solve a different objective: per-pixel masks or anchored boxes, followed by instance grouping and a geometric fit. GeoBFDT regresses the geometry itself. An mIoU or a box-mAP number is not commensurable with end-to-end depth/dip/azimuth accuracy; comparing them would flatter one method by measuring it on the task it happens to be built for.

What we actually trained

The reservoir interval we started from was brutally sparse: it contributed 32 sinusoids across 236 image patches, only 19 of which contained a sinusoid at all. No transformer trains on that. We attacked the scarcity two ways. First, geometry: the model ingests 800-pixel patches — 2.2 m of borehole, the height inside which more than 95% of fracture and bedding sinusoids fit — extracted with overlap so no trace is clipped at a patch boundary. Second, geometry-preserving augmentation (ColorJitter, GaussianBlur, Sharpen, Gaussian noise, Emboss, MedianBlur) at roughly 10 augmentations per sinusoid-bearing patch grew the corpus from 236 patches / 19 sinusoid-patches / 32 sinusoids to 4,212 / 2,046 / 3,565 — a greater-than-tenfold increase that preserves sinusoid shape while varying contrast and texture.
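Overlapping patch extraction can be sketched as follows. The 800-pixel patch height and its 2.2 m span come from the text; the 400-pixel stride (50% overlap) and the function name are assumptions, since the actual overlap is not stated.

```python
PATCH_PX = 800  # patch height in pixels, from the text
PATCH_M = 2.2   # the same height in metres, from the text

def patch_starts(image_height_px, patch_px=PATCH_PX, stride_px=400):
    """Top pixel row of each overlapping patch covering the whole log."""
    if image_height_px <= patch_px:
        return [0]
    last_start = image_height_px - patch_px
    starts = list(range(0, last_start + 1, stride_px))
    if starts[-1] != last_start:
        # Add a final patch flush with the bottom so no rows are dropped.
        starts.append(last_start)
    return starts

mm_per_pixel = PATCH_M / PATCH_PX * 1000.0  # about 2.75 mm per pixel row
starts = patch_starts(2100)
```

For a 2,100-row log this gives patches starting at rows 0, 400, 800, 1200, and 1300. With a 50% overlap, any sinusoid shorter than half a patch falls wholly inside at least one window. At about 2.75 mm per pixel row, an 8 cm depth tolerance corresponds to roughly 29 rows.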

The model is deliberately small for the data regime: ResNet-10 trained from scratch (no pretrained weights), 4 encoder and 4 decoder layers, feedforward dimension 1,024, dropout 0.2, AdamW at learning rate 0.0004, batch size 128, early stopping after 40 epochs without improvement. The loss combines Focal loss for classification (weight 5) with L1 loss for the depth/dip/azimuth regression (weight 1); inference keeps queries above a 0.5 probability threshold. The choice of a small backbone was not an oversight — it was the result that the ablations forced.
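The weighted loss and the inference threshold can be written out for a single matched query. The weights (5 for classification, 1 for regression) and the 0.5 threshold come from the text; the focal-loss form, its `gamma` and `alpha` values (common defaults from the focal-loss literature), and the function names are assumptions rather than GeoBFDT's confirmed settings.

```python
import math

def focal_loss(prob_true_class, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction: -alpha * (1 - p)^gamma * log(p)."""
    p = min(max(prob_true_class, 1e-12), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def l1_loss(pred_geom, target_geom):
    """Mean absolute error over the normalised (depth, dip, azimuth) triplet."""
    return sum(abs(p - t) for p, t in zip(pred_geom, target_geom)) / len(target_geom)

def matched_pair_loss(prob_true_class, pred_geom, target_geom, w_cls=5.0, w_l1=1.0):
    """Weighted loss for one query matched to one sinusoid.

    Queries matched to "no-object" would contribute only the classification term.
    """
    return w_cls * focal_loss(prob_true_class) + w_l1 * l1_loss(pred_geom, target_geom)

def keep_query(class_probs, threshold=0.5):
    """Inference rule: keep a query whose best real class exceeds the threshold."""
    label = max(class_probs, key=class_probs.get)
    return class_probs[label] > threshold, label

confident = matched_pair_loss(0.9, (0.51, 0.58, 0.26), (0.50, 0.60, 0.25))
uncertain = matched_pair_loss(0.5, (0.51, 0.58, 0.26), (0.50, 0.60, 0.25))
kept, label = keep_query({"fracture": 0.72, "bedding": 0.10})
```

The focal term shrinks quickly for confident, correct predictions, so the many easy "no-object" queries do not swamp the few queries that carry real sinusoids.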

Loss-function choice decides whether the network learns curve continuity. VeerNet tested five losses under identical conditions; only the one whose gradient aligns with the IoU/F1 metric (Lovász-Softmax) wins, and the shipped answer is a two-loss schedule, SCE warmup followed by Lovász finetuning, which gives the same accuracy in half the wall-clock time. The five candidates, their verdicts, F1 35%/IoU 30%, and the two-loss schedule are the whitepaper's own; the podium bar heights are ordinal (rank sourced) and the prediction thumbnails are schematic.

The ablation evidence is unambiguous, and it is what separates a working model from a plausible-looking one:

Backbone. ResNet-10 beat heavier backbones on this small corpus: ResNet-34 collapsed to a class error of 26.759 against ResNet-10's 0.499. The larger network overfit; the smaller one generalised.
Augmentation is not optional. Without it, classification error sat at 100% (the model learned nothing usable); with it, 2.618%. The combined Hungarian loss fell from 0.174 to 0.0135 in lockstep.
Dynamic beats static logs. Training on dynamically normalised image logs rather than static logs cut class error from 63.45 to 2.536; the dynamic image carries the contrast the attention layers need.
Wells compound. Across 3 → 6 → 9 → 11 → 14 wells, the Hungarian matching loss fell from 0.801 to 0.015 and class error from 93.115 to a low single digit. Geological diversity, not raw patch count, is what the model is hungry for.

The numbers that matter to an interpreter

Accuracy here is measured the way an interpreter would judge a pick: a prediction counts as correct only if it lands within a tolerance window on depth, dip, and azimuth. At an 8 cm depth threshold, the window inside which interpreters treat a pick as right, the fracture model reached approximately 85% sensitivity, falling to roughly 65% at an unforgiving 3 cm. Fracture dip accuracy stabilised at 91% from a 3° threshold upward (peaking near 98%), and azimuth accuracy stabilised at 92% from a 15° threshold upward (peaking near 96%). Depth precision held at 85%.
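A tolerance-window metric of this kind can be sketched as below. The 8 cm, 3°, and 15° defaults are the thresholds quoted above; requiring all three at once, pairing picks greedily, and the function names are assumptions about how such a score might be computed, not the evaluation code behind the reported figures.

```python
def azimuth_diff(a_deg, b_deg):
    """Smallest angular difference, so that 350 and 2 degrees are 12 apart."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def is_hit(pred, truth, depth_tol_m=0.08, dip_tol_deg=3.0, az_tol_deg=15.0):
    """pred and truth are (depth_m, dip_deg, azimuth_deg) triplets."""
    return (
        abs(pred[0] - truth[0]) <= depth_tol_m
        and abs(pred[1] - truth[1]) <= dip_tol_deg
        and azimuth_diff(pred[2], truth[2]) <= az_tol_deg
    )

def sensitivity(preds, truths, **tolerances):
    """Fraction of true sinusoids recovered; each pick can be used only once."""
    used, hits = set(), 0
    for truth in truths:
        for i, pred in enumerate(preds):
            if i not in used and is_hit(pred, truth, **tolerances):
                used.add(i)
                hits += 1
                break
    return hits / len(truths) if truths else 0.0

truths = [(1000.00, 60.0, 350.0), (1000.50, 20.0, 90.0)]
preds = [(1000.05, 61.0, 2.0), (1000.60, 21.0, 92.0)]

at_8cm = sensitivity(preds, truths)                     # second pick is 10 cm off
at_3cm = sensitivity(preds, truths, depth_tol_m=0.03)   # both picks miss
at_12cm = sensitivity(preds, truths, depth_tol_m=0.12)  # both picks hit
```

The same two picks score 0.5 at 8 cm, 0.0 at 3 cm, and 1.0 at 12 cm, which is why a sensitivity figure only means something alongside its tolerance. The circular azimuth difference matters as well: a pick at 2° against a truth at 350° is 12° off, not 348°.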

An unwrapped borehole image log: azimuth runs 0–360° across, depth runs down, and every fracture plane intersecting the wellbore traces a sinusoid (amplitude tracks dip, phase tracks azimuth). The DETR model predicts each sinusoid's depth; a pick is a hit only if it falls inside the interpreter's depth-match window. As the window loosens, more fractures are recalled and false flags creep in, so sensitivity trades against precision. At the 8 cm interpreter window the model recalls ~85%: validator-ready. Sensitivity, corpus and geometry are per the case study; the per-pick errors and precision are an illustrative detection model around that point.

The model-selection story is itself a finding. A fractures-only model reached 80.84% overall dip-intensity accuracy, ahead of the 77.10% the combined fracture-plus-bedding model managed on fractures — because bedding planes dominate the labels and the shared model spends capacity on the majority class. Beddings landed at 73.63% combined and 72.52% standalone. The combined fracture model's geometry errors are tight where it counts: mean absolute error of 1.72 cm on depth, 1.71° on dip, and 9.34° on azimuth. Those residuals sit inside the uncertainty already baked into a carbonate play model, which is the bar for “validator-ready” rather than “research-stage.”

Detection stack: segmentation/anchors vs set prediction

Before: multi-stage, overlap-fragile. Mask/anchor pipeline with per-pixel masks, instance grouping, NMS threshold tuning, and a post-hoc curve fit.

After: end-to-end, overlap-native. GeoBFDT emits a Hungarian-matched set of sinusoids with depth/dip/azimuth in one forward pass, reaching ~85% fracture sensitivity @ 8 cm, dip 91% @ 3°, and azimuth 92% @ 15°.

Why this generalises beyond one field

The architectural argument is not field-specific. Anywhere geological features overlap in a 2D image (densely fractured carbonates, faulted intervals, cross-cutting bedding), the assignment problem dominates, and a model that solves it natively with bipartite matching will outperform one that tunes overlap thresholds after the fact. The end-to-end regression of dip and azimuth removes two error-propagating stages (mask creation and curve fitting) that every segmentation pipeline carries, which is also why transfer to an adjacent field costs far less labelling than a from-scratch segmentation effort.

The honest caveats travel too. The corpus is 14 vertical wells of a single confidential reservoir; horizontal wells, where average fracture sinusoid height drops sharply, are a separate distribution. Attention-map visualisation is not directly interpretable on this architecture, because self-attention runs over compact ResNet feature maps rather than the raw image log — a real limitation for explainability work. And the well-count ablation says plainly that more geological diversity, not more augmentation, is the next lever.

Why set prediction beats segmentation for overlapping sinusoids

Overlapping fractures are an assignment problem; Hungarian bipartite matching solves it one-to-one, so there is no NMS threshold that can delete a real overlapping fracture.
Removing masks and anchor boxes removes two error-propagating stages — instance grouping and post-hoc curve fitting — and lets the decoder regress depth, dip, and azimuth directly in one pass.
Ablations forced the design: a small ResNet-10 backbone, geometry-preserving augmentation (class error 100% to 2.618%), and dynamic over static logs each moved the model from non-functional to ~85% fracture sensitivity at the 8 cm interpreter tolerance.

References

Liu (2022). Mask R-CNN borehole fracture segmentation (mIoU ~81.2%). Literature baseline cited in the GeoBFDT manuscript.
Du (2023). Mask R-CNN fracture detection (precision 96% / recall 92%). Literature baseline cited in the GeoBFDT manuscript.
Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
He et al. (2017). Mask R-CNN. ICCV 2017. https://arxiv.org/abs/1703.06870
GeoBFDT performance and ablation figures derived from internal validation on a 14-well Middle East carbonate dataset of high-resolution borehole image logs from two different microresistivity imaging tools; data and code withheld under operator confidentiality.
