When two fractures and a bedding plane cross the same two metres of an image log, the hard problem is not finding them; it is deciding which detection belongs to which sinusoid. That assignment problem is where per-pixel segmentation and anchor-box detectors quietly fail, and it is exactly the problem a Detection Transformer is built to solve. For a mid-sized Middle East carbonate operator, we built GeoBFDT (Geological Beddings and Fractures Detection Transformer), a DETR-derived, mask-free model trained on high-resolution borehole image logs from two different microresistivity imaging tools across 14 vertical wells. It treats fractures and bedding planes collectively as sinusoids and predicts the whole set end-to-end, emitting depth, dip, and azimuth for each in a single forward pass: no masks, no anchor boxes, no non-maximum suppression.
The problem with masks and anchors on a borehole image
Every planar feature that intersects a wellbore — a fracture, a bedding surface — projects onto the unwrapped image log as a sinusoid. The interpreter's job is to pick each trace, fit a curve, and recover three numbers: depth, dip, and azimuth. In a fractured carbonate, those sinusoids overlap. A 2.2-metre patch can hold dozens of them, crossing and occluding one another, conductive traces (dark) tangled with resistive ones (bright).
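As a minimal sketch of the geometry behind those three numbers, the trace of a plane on the unwrapped image can be written as a cosine whose mean is the depth, whose phase is the azimuth, and whose amplitude grows with dip. The function names and the default `radius_m` are hypothetical, and the sketch assumes a vertical well with azimuth taken as the dip direction; sign and reference conventions vary between tools.

```python
import math

def sinusoid_depth(angle_deg, depth_m, dip_deg, azimuth_deg, radius_m=0.108):
    """Depth at which a planar feature meets the borehole wall at a given
    angular position around the hole (degrees clockwise from north).

    Assumptions (not from the source): radius_m is a hypothetical borehole
    radius, the well is vertical, depth increases downward, and azimuth is
    the dip direction, so the trace is deepest at that azimuth.
    """
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    phase = math.radians(angle_deg - azimuth_deg)
    return depth_m + amplitude * math.cos(phase)

def sinusoid_height(dip_deg, radius_m=0.108):
    """Peak-to-trough height of the trace: borehole diameter times tan(dip)."""
    return 2.0 * radius_m * math.tan(math.radians(dip_deg))

# Sample one trace every 90 degrees around the borehole.
trace = [sinusoid_depth(a, 1000.0, 60.0, 120.0) for a in range(0, 360, 90)]
```

Steeper features produce taller traces, so with dozens of them in one patch their depth ranges routinely intersect.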
This is precisely the regime where the obvious computer-vision tools break down. The borehole-imaging literature has leaned on Mask R-CNN: Liu (2022) reports mIoU around 81.2% and Du (2023) reports precision 96% / recall 92%, both on segmentation-style fracture tasks (Liu, 2022; Du, 2023). Those are respectable numbers on isolated features, but they inherit two structural liabilities when sinusoids overlap.
The first is the mask itself. Segmentation labels every pixel, then a downstream step must group pixels into instances and fit a curve to each. When two sinusoids cross, the pixels at the intersection belong to both — the mask cannot represent that, and the instance-grouping heuristic has to guess. The second is the anchor-and-NMS machinery that detectors like YOLO and Faster/Mask R-CNN use to propose and de-duplicate boxes. Non-maximum suppression assumes that two highly overlapping detections are duplicates of one object and deletes one of them. For overlapping sinusoids that is the wrong prior by construction: the overlap is the signal, not a duplicate. You spend your hyperparameter budget tuning NMS thresholds to not delete real fractures.
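The NMS failure is easy to reproduce on paper. The sketch below is a plain greedy NMS, not any particular detector's implementation, applied to two hypothetical bounding boxes for crossing sinusoids; the box coordinates, scores, and the 0.5 IoU threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, dropping any box whose IoU
    with an already-kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical boxes: x is angle around the borehole (degrees), y is depth (m).
# Every sinusoid spans the full image width, so two traces 5 cm apart in depth
# have boxes that overlap almost completely.
boxes = [(0.0, 1.00, 360.0, 1.40), (0.0, 1.05, 360.0, 1.45)]
scores = [0.95, 0.90]
kept = greedy_nms(boxes, scores)  # only the higher-scoring box survives
```

Raising the threshold rescues the second fracture here, but the same setting then lets genuine duplicates through elsewhere, which is the tuning trap described above.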
The assignment problem, stated plainly
A mask labels pixels; an anchor detector proposes boxes and then suppresses overlaps. Neither has a native, differentiable way to say “these N predictions correspond one-to-one to those M ground-truth sinusoids.” That correspondence is the whole task.
Set prediction, and why bipartite matching is the right tool
A Detection Transformer reframes detection as set prediction, that is, predicting an unordered set of objects in one shot and then matching that whole set to the ground truth at once, rather than scoring boxes independently and de-duplicating afterwards. GeoBFDT keeps that spine: a ResNet-10 backbone extracts features, a 4-layer transformer encoder builds global context across the patch, and a 4-layer decoder turns a fixed bank of learned query embeddings into a set of predictions. Each query emits a class (fracture, bedding, or no-object) plus the regression triplet (depth, dip, azimuth) normalised into a 0–1 range.
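To make the output format concrete, here is a small sketch of how one patch's query outputs might be decoded into picks. The dictionary layout, the function name, and the de-normalisation ranges (depth over the 2.2 m patch, dip over 0–90°, azimuth over 0–360°) are assumptions for illustration; the article states only that the triplet is normalised to 0–1 and that inference keeps queries above a 0.5 probability.

```python
PATCH_HEIGHT_M = 2.2  # patch height quoted in the article

def decode_queries(queries, patch_top_m, threshold=0.5):
    """Turn raw query outputs into picks with physical units.

    Each query is a dict with 'probs' (keys 'fracture', 'bedding',
    'no_object') and 'geom' (depth, dip, azimuth), each in 0..1.
    The scaling ranges below are assumed, not taken from the source.
    """
    picks = []
    for q in queries:
        label = max(("fracture", "bedding"), key=lambda c: q["probs"][c])
        if q["probs"][label] <= threshold:
            continue  # treated as no-object
        depth, dip, azimuth = q["geom"]
        picks.append({
            "label": label,
            "depth_m": patch_top_m + depth * PATCH_HEIGHT_M,
            "dip_deg": dip * 90.0,
            "azimuth_deg": azimuth * 360.0,
        })
    return picks

queries = [
    {"probs": {"fracture": 0.90, "bedding": 0.05, "no_object": 0.05},
     "geom": (0.5, 0.5, 0.25)},
    {"probs": {"fracture": 0.20, "bedding": 0.30, "no_object": 0.50},
     "geom": (0.1, 0.2, 0.9)},
]
picks = decode_queries(queries, patch_top_m=1000.0)
```

Only the first query clears the threshold; the second falls through to no-object without any suppression step.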
The mechanism that makes this work end-to-end is Hungarian matching. During training, every prediction is matched to at most one ground-truth sinusoid (and the rest to “no-object”): the Hungarian algorithm finds the single lowest-cost one-to-one assignment between predicted queries and ground-truth sinusoids, minimising a combined classification-plus-geometry cost. The loss is then computed only against that optimal pairing.
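The assignment itself can be illustrated on a toy case. The sketch below finds the same optimum the Hungarian algorithm would, but by brute-force enumeration, which is only practical for a handful of queries; a real pipeline would use a polynomial-time solver. The cost form (one minus class probability plus L1 geometry distance) and the reuse of the 5:1 weights from the article's loss as matching weights are assumptions.

```python
from itertools import permutations

def match_cost(pred, target, class_weight=5.0, l1_weight=1.0):
    """Cost of pairing one query with one ground-truth sinusoid.
    Assumed form: weighted (1 - probability of the true class) plus
    weighted L1 distance on normalised (depth, dip, azimuth)."""
    class_cost = 1.0 - pred["probs"][target["label"]]
    geom_cost = sum(abs(p - t) for p, t in zip(pred["geom"], target["geom"]))
    return class_weight * class_cost + l1_weight * geom_cost

def optimal_assignment(preds, targets):
    """Lowest-cost one-to-one assignment as sorted (pred_idx, target_idx)
    pairs. Brute force; assumes len(preds) >= len(targets)."""
    best_cost, best_pairs = float("inf"), []
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t])
                   for t, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_ids)]
    return sorted(best_pairs)

# Two ground-truth sinusoids at nearly the same normalised depth.
targets = [
    {"label": "fracture", "geom": (0.50, 0.60, 0.20)},
    {"label": "bedding", "geom": (0.52, 0.10, 0.70)},
]
preds = [
    {"probs": {"fracture": 0.10, "bedding": 0.80}, "geom": (0.53, 0.12, 0.68)},
    {"probs": {"fracture": 0.05, "bedding": 0.05}, "geom": (0.90, 0.50, 0.50)},
    {"probs": {"fracture": 0.85, "bedding": 0.10}, "geom": (0.49, 0.58, 0.22)},
]
pairs = optimal_assignment(preds, targets)
unmatched = [i for i in range(len(preds)) if i not in {p for p, _ in pairs}]
```

Query 2 takes the fracture, query 0 takes the bedding plane, and query 1 is left unmatched and trained toward no-object, even though the two targets sit only 0.02 apart in normalised depth.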
Three consequences follow directly, and each removes a failure mode of the segmentation/anchor stack:
- No NMS. Because matching is one-to-one by construction, two overlapping fractures are assigned to two different queries. There is nothing to suppress, no overlap threshold to tune, and no mechanism that can delete a real fracture for being too close to another.
- No anchors. Queries are learned, not tiled priors over scale and aspect ratio. The model is not biased toward a box geometry that does not describe a sinusoid in the first place.
- No masks, no curve fitting. The decoder regresses dip and azimuth directly. The mask-creation pre-step and the post-hoc sinusoid fit — each a separate source of error and tuning debt — simply do not exist in the pipeline.
This is also why we declined a like-for-like benchmark against Mask R-CNN or YOLO. Those models solve a different objective: per-pixel masks or anchored boxes, followed by instance grouping and a geometric fit. GeoBFDT regresses the geometry itself. A mIoU or a box-mAP number is not commensurable with end-to-end depth/dip/azimuth accuracy; comparing them would flatter one method by measuring it on the task it happens to be built for.
What we actually trained
The reservoir interval we started from was brutally sparse: it contributed 32 sinusoids across 236 image patches, only 19 of which contained a sinusoid at all. No transformer trains on that. We attacked the scarcity two ways. First, geometry: the model ingests 800-pixel patches — 2.2 m of borehole, the height inside which more than 95% of fracture and bedding sinusoids fit — extracted with overlap so no trace is clipped at a patch boundary. Second, geometry-preserving augmentation (ColorJitter, GaussianBlur, Sharpen, Gaussian noise, Emboss, MedianBlur) at roughly 10 augmentations per sinusoid-bearing patch grew the corpus from 236 patches / 19 sinusoid-patches / 32 sinusoids to 4,212 / 2,046 / 3,565 — a greater-than-tenfold increase that preserves sinusoid shape while varying contrast and texture.
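Overlapping extraction can be sketched as a sliding window over the log's pixel rows. The 800-pixel patch height is from the article; the 400-pixel stride (50% overlap) and the function name are assumptions, since the article does not state the overlap used.

```python
PATCH_PX = 800
METRES_PER_PIXEL = 2.2 / PATCH_PX  # 2.75 mm per pixel row, from 800 px = 2.2 m

def patch_starts(log_height_px, patch_px=PATCH_PX, stride_px=400):
    """Top pixel row of each patch. stride_px is an assumed value.
    A final patch is added flush with the bottom of the log so the
    last rows are always covered."""
    if log_height_px <= patch_px:
        return [0]
    starts = list(range(0, log_height_px - patch_px + 1, stride_px))
    if starts[-1] + patch_px < log_height_px:
        starts.append(log_height_px - patch_px)
    return starts

starts = patch_starts(2100)  # [0, 400, 800, 1200, 1300]
```

With this stride, any trace no taller than 400 pixels (1.1 m) lies wholly inside at least one patch; a smaller stride extends that guarantee to taller traces at the cost of more patches.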
The model is deliberately small for the data regime: ResNet-10 trained from scratch (no pretrained weights), 4 encoder and 4 decoder layers, feedforward dimension 1,024, dropout 0.2, AdamW at learning rate 0.0004, batch size 128, early stopping after 40 epochs without improvement. The loss combines Focal loss for classification (weight 5) with L1 loss for the depth/dip/azimuth regression (weight 1); inference keeps queries above a 0.5 probability threshold. The choice of a small backbone was not an oversight — it was the result that the ablations forced.
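A rough sketch of how the two loss terms combine, written in plain Python rather than a tensor library. The 5:1 weighting is from the article; the focal parameters `gamma=2.0` and `alpha=0.25` are the commonly used defaults, and the normalisation by the number of matched queries is an assumption, as are all function names.

```python
import math

def focal_loss(prob_true_class, gamma=2.0, alpha=0.25):
    """Focal loss for one query, given the probability it assigns to its
    correct class. gamma and alpha are assumed defaults."""
    p = min(max(prob_true_class, 1e-7), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def l1_loss(pred_geom, target_geom):
    """Sum of absolute errors on normalised (depth, dip, azimuth)."""
    return sum(abs(p - t) for p, t in zip(pred_geom, target_geom))

def combined_loss(matched, unmatched_no_object_probs, w_class=5.0, w_l1=1.0):
    """matched: (prob_true_class, pred_geom, target_geom) per matched query.
    unmatched_no_object_probs: probability each unmatched query gives to
    no-object. Only matched queries contribute to the regression term."""
    cls = sum(focal_loss(p) for p, _, _ in matched)
    cls += sum(focal_loss(p) for p in unmatched_no_object_probs)
    reg = sum(l1_loss(pg, tg) for _, pg, tg in matched)
    n = max(len(matched), 1)  # assumed normalisation
    return (w_class * cls + w_l1 * reg) / n

loss = combined_loss(
    matched=[(0.5, (0.5, 0.5, 0.5), (0.4, 0.5, 0.6))],
    unmatched_no_object_probs=[],
)
```

The focal term down-weights queries that are already confidently correct, which matters when most queries in a patch are no-object.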
The ablation evidence is unambiguous, and it is what separates a working model from a plausible-looking one:
- Backbone. ResNet-10 beat heavier backbones on this small corpus: ResNet-34 collapsed to a class error of 26.759 against ResNet-10's 0.499. The larger network overfit; the smaller one generalised.
- Augmentation is not optional. Without it, classification error sat at 100% (the model learned nothing usable); with it, 2.618%. The combined Hungarian loss fell from 0.174 to 0.0135 in lockstep.
- Dynamic beats static logs. Training on dynamically normalised image logs instead of static logs cut class error from 63.45 to 2.536; the dynamic image carries the contrast the attention layers need.
- Wells compound. Across 3 → 6 → 9 → 11 → 14 wells, the Hungarian matching loss fell from 0.801 to 0.015 and class error from 93.115 to a low single digit. Geological diversity, not raw patch count, is what the model is hungry for.
The numbers that matter to an interpreter
Accuracy here is measured the way an interpreter would judge a pick: a prediction counts as correct only if it lands within a tolerance window on depth, dip, and azimuth. At an 8 cm depth threshold, the window inside which interpreters treat a pick as right, the fracture model reached approximately 85% sensitivity, falling to roughly 65% at an unforgiving 3 cm. Fracture dip accuracy stabilised at 91% for tolerances above 3° (peaking near 98%), and azimuth accuracy at 92% for tolerances above 15° (peaking near 96%). Depth precision held at 85%.
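A tolerance-window score of this kind can be sketched in a few lines. The record layout and function names are hypothetical, and the sketch assumes predictions have already been paired with ground-truth picks; the one subtlety worth showing is that azimuth error must wrap around 360°.

```python
def azimuth_diff_deg(a, b):
    """Smallest angular difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def within_tolerance(pred, truth, depth_tol_cm=8.0, dip_tol_deg=3.0,
                     az_tol_deg=15.0):
    """Per-attribute pass/fail for one paired prediction."""
    return {
        "depth": abs(pred["depth_cm"] - truth["depth_cm"]) <= depth_tol_cm,
        "dip": abs(pred["dip_deg"] - truth["dip_deg"]) <= dip_tol_deg,
        "azimuth": azimuth_diff_deg(pred["azimuth_deg"],
                                    truth["azimuth_deg"]) <= az_tol_deg,
    }

def depth_sensitivity(pairs, n_truth, depth_tol_cm=8.0):
    """Fraction of ground-truth picks that have a paired prediction
    within the depth tolerance. Unpaired truths count as misses."""
    hits = sum(1 for pred, truth in pairs
               if abs(pred["depth_cm"] - truth["depth_cm"]) <= depth_tol_cm)
    return hits / n_truth if n_truth else 0.0

pairs = [
    ({"depth_cm": 102.0, "dip_deg": 31.0, "azimuth_deg": 355.0},
     {"depth_cm": 100.0, "dip_deg": 30.0, "azimuth_deg": 5.0}),
    ({"depth_cm": 250.0, "dip_deg": 60.0, "azimuth_deg": 90.0},
     {"depth_cm": 240.0, "dip_deg": 58.0, "azimuth_deg": 95.0}),
]
checks = within_tolerance(*pairs[0])
score = depth_sensitivity(pairs, n_truth=4)
```

Sweeping `depth_tol_cm` from 3 to 8 and recomputing `depth_sensitivity` at each step gives the kind of sensitivity-versus-threshold curve summarised above.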
The model-selection story is itself a finding. A fractures-only model reached 80.84% overall dip-intensity accuracy, ahead of the 77.10% the combined fracture-plus-bedding model managed on fractures — because bedding planes dominate the labels and the shared model spends capacity on the majority class. Beddings landed at 73.63% combined and 72.52% standalone. The combined fracture model's geometry errors are tight where it counts: mean absolute error of 1.72 cm on depth, 1.71° on dip, and 9.34° on azimuth. Those residuals sit inside the uncertainty already baked into a carbonate play model, which is the bar for “validator-ready” rather than “research-stage.”
Before: multi-stage, overlap-fragile. A mask/anchor pipeline with per-pixel masks, instance grouping, NMS threshold tuning, and a post-hoc curve fit.
After: end-to-end, overlap-native. GeoBFDT makes one forward pass and returns a Hungarian-matched set of sinusoids with depth/dip/azimuth, at ~85% fracture sensitivity @ 8 cm, dip 91% @ 3°, azimuth 92% @ 15°.
Why this generalises beyond one field
The architectural argument is not field-specific. Anywhere geological features overlap in a 2D image (densely fractured carbonates, faulted intervals, cross-cutting bedding) the assignment problem dominates, and a model that solves it natively with bipartite matching will behave better than one that tunes overlap thresholds after the fact. The end-to-end regression of dip and azimuth removes two error-propagating stages (mask creation and curve fitting) that every segmentation pipeline carries, which is also why transfer to an adjacent field costs far less labelling than a from-scratch segmentation effort.
The honest caveats travel too. The corpus is 14 vertical wells of a single confidential reservoir; horizontal wells, where average fracture sinusoid height drops sharply, are a separate distribution. Attention-map visualisation is not directly interpretable on this architecture, because self-attention runs over compact ResNet feature maps rather than the raw image log — a real limitation for explainability work. And the well-count ablation says plainly that more geological diversity, not more augmentation, is the next lever.
Why set prediction beats segmentation for overlapping sinusoids
- Overlapping fractures are an assignment problem; Hungarian bipartite matching solves it one-to-one, so there is no NMS threshold that can delete a real overlapping fracture.
- Removing masks and anchor boxes removes two error-propagating stages — instance grouping and post-hoc curve fitting — and lets the decoder regress depth, dip, and azimuth directly in one pass.
- Ablations forced the design: a small ResNet-10 backbone, geometry-preserving augmentation (class error 100% to 2.618%), and dynamic over static logs each moved the model from non-functional to ~85% fracture sensitivity at the 8 cm interpreter tolerance.
References
- Liu (2022). Mask R-CNN borehole fracture segmentation (mIoU ~81.2%). Literature baseline cited in the GeoBFDT manuscript.
- Du (2023). Mask R-CNN fracture detection (precision 96% / recall 92%). Literature baseline cited in the GeoBFDT manuscript.
- Carion et al. (2020). End-to-End Object Detection with Transformers (DETR). ECCV 2020. https://arxiv.org/abs/2005.12872
- He et al. (2017). Mask R-CNN. ICCV 2017. https://arxiv.org/abs/1703.06870
- GeoBFDT performance and ablation figures derived from internal validation on a 14-well Middle East carbonate dataset of high-resolution borehole image logs from two different microresistivity imaging tools; data and code withheld under operator confidentiality.