A Depth-Tolerance Confusion Matrix: Scoring Sinusoid Picks at 3 cm and 5 cm

Every object-detection metric you have ever trusted assumes the thing you are detecting is a box. Intersection-over-union measures the overlap of two rectangles. Mean average precision integrates precision over recall using IoU as the matching rule. Even a humble pixel-accuracy score presumes a mask with area. A fracture on a borehole image log has none of these. It is a sinusoid, a one-pixel-thick sine wave traced across an unrolled cylinder of rock, whose entire physical meaning lives in three scalars: the depth at which it crosses the borehole, its dip, and its azimuth. When we built a Detection Transformer to pick these features from two different microresistivity imaging tools, during a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, the hardest engineering decision was not the architecture. It was the metric. This piece is about why the standard detection metrics fail on sinusoids, and how a depth-tolerance confusion matrix, scored at 3 cm and 5 cm, became the domain-correct way to grade a model whose output a petrophysicist has to sign off on.

Why IoU is meaningless for a sinusoid

Start with the geometry, because the geometry is the whole argument. A planar feature cutting a cylindrical borehole, once that cylinder is unrolled into a flat image, traces a clean sine wave. The amplitude of that wave encodes dip; its horizontal phase encodes azimuth; the vertical position of its centreline is the depth. There is no width and no height to speak of — a fracture is, to first approximation, a curve, not a region.
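The relationship between the three scalars and the traced curve can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name, the borehole radius parameter, and the sign convention (depth increasing downward, trace deepest in the down-dip direction) are all assumptions, and dip is taken relative to the plane perpendicular to the borehole axis.

```python
import math

def sinusoid_depth(theta_deg, depth_m, dip_deg, azimuth_deg, radius_m):
    """Depth of a planar feature's trace at angle theta around the borehole wall.

    Assumed convention (illustrative only): depth increases downward and the
    trace is deepest in the down-dip direction, i.e. where theta == azimuth.
    """
    # Amplitude grows with dip: a horizontal plane (dip 0) traces a flat line.
    amplitude_m = radius_m * math.tan(math.radians(dip_deg))
    # Azimuth only shifts the phase of the wave; depth_m is the centreline.
    return depth_m + amplitude_m * math.cos(math.radians(theta_deg - azimuth_deg))

# Hypothetical feature: centreline at 1500 m, dipping 45 degrees toward
# azimuth 90 degrees, in a borehole of 0.1 m radius.
trace = [sinusoid_depth(t, 1500.0, 45.0, 90.0, 0.1) for t in range(0, 360, 90)]
print([round(z, 3) for z in trace])
```

Changing only `depth_m` slides the whole curve up or down without altering its shape, which is why depth has to be scored separately from dip and azimuth.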

Now try to apply IoU. To compute the overlap of two sinusoids you would first have to give them area, which means wrapping each in a bounding box. But the bounding box of a sine wave is dominated by its amplitude and the patch height, not by the feature's actual position. Two fractures at quite different depths can have nearly identical boxes, and a small vertical shift that a geologist would call a misplaced pick barely moves the IoU at all. The metric is blind to exactly the error you care about and sensitive to the geometry you do not. Mean average precision inherits the same blindness, because it uses IoU as its matching rule. Mask-based scores fail in the opposite direction: there is no mask, only a one-pixel curve, so a single pixel of vertical jitter can flip a "perfect" pick to a "miss" under pixel IoU even though the depth error is no larger than the log's own resolution.
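A small numerical sketch makes the box problem concrete. The dimensions here are hypothetical (a sinusoid with 50 cm amplitude whose box spans the full image width), and the helper names are illustrative rather than taken from any library.

```python
def vertical_box_iou(top_a, bottom_a, top_b, bottom_b):
    """IoU of two boxes that span the same full image width.

    With identical widths, box IoU reduces to a 1-D overlap of depth intervals.
    """
    overlap = max(0.0, min(bottom_a, bottom_b) - max(top_a, top_b))
    union = (bottom_a - top_a) + (bottom_b - top_b) - overlap
    return overlap / union if union > 0 else 0.0

def sinusoid_box(centre_cm, amplitude_cm):
    """Bounding box (top, bottom) of a sinusoid: centreline +/- amplitude."""
    return centre_cm - amplitude_cm, centre_cm + amplitude_cm

# Hypothetical numbers: 50 cm amplitude, i.e. 100 cm from peak to trough.
truth = sinusoid_box(1000.0, 50.0)
for shift_cm in (3.0, 5.0, 20.0):
    pred = sinusoid_box(1000.0 + shift_cm, 50.0)
    print(shift_cm, round(vertical_box_iou(*truth, *pred), 3))
# 3.0 0.942
# 5.0 0.905
# 20.0 0.667
```

Under these assumed dimensions a 20 cm depth error still scores about 0.67, comfortably above the 0.5 IoU threshold commonly used for matching, while a 3 cm and a 5 cm error are almost indistinguishable.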

There is a second, more insidious problem, and it comes from the sensor itself. The depth resolution of the binary wireline log file feeding the model is finite: one pixel of the unrolled image corresponds to roughly 3 cm of true depth. That means a ±3 cm depth uncertainty is baked into the ground truth before the model ever runs — even a flawless picker cannot be more precise than the raster it learns from. Any metric that demands pixel-exact agreement is therefore measuring the sensor's quantisation, not the model's skill. The metric has to be tolerant by construction, and the tolerance has to be expressed in centimetres of rock, not pixels of image.

The depth-tolerance confusion matrix

The fix is to stop pretending sinusoids are boxes and score them the way a geoscientist actually compares two interpretations: by depth, within a tolerance. We replaced IoU matching with a one-dimensional depth-distance match, and rebuilt the entire confusion matrix on top of it.

The rule is simple to state and physically honest:

True positive. A predicted sinusoid is a true positive if its picked depth falls within a chosen tolerance band of a ground-truth fracture's depth. Concretely, a ground-truth pick matched by a prediction three centimetres shallower is a 3 cm error — about 1.5% of the height of an average ~2 m fracture — and counts as a hit.
False positive. A predicted sinusoid with no ground-truth fracture inside its tolerance band is a false positive — the model invented structure that is not there.
False negative. A ground-truth fracture with no prediction inside its band is a false negative — a real feature the model missed.

From those three counts, precision, recall, and F1 follow in the ordinary way. What is different is that the matching tolerance is now a tunable, physically interpretable knob measured in centimetres, not an IoU threshold measured in dimensionless overlap. And because the assignment between predictions and ground truth has to be one-to-one — you cannot let one lucky prediction claim credit for two nearby fractures — the match is solved as a bipartite assignment, the same Hungarian-style pairing the model already uses in its training loss, but here applied at evaluation time with depth distance as the cost. We swept that pairing tolerance during development across 0, 1, 3, 5, 8, and 12 units before settling on the operating points that mattered for interpretation.
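The counting rule above can be sketched as follows. This is a simplified stand-in, not the evaluation code described in the article: instead of a Hungarian solver it uses a sorted two-pointer sweep, which in one dimension with a single fixed tolerance still yields a one-to-one pairing with the maximum number of matches, but does not minimise total depth distance. All function names are hypothetical.

```python
def match_by_depth(pred_depths_cm, truth_depths_cm, tolerance_cm):
    """One-to-one pairing of predicted and ground-truth depths within a tolerance.

    Simplification: sorted two-pointer sweep rather than a Hungarian assignment.
    Returns a list of (pred_depth, truth_depth) pairs.
    """
    preds = sorted(pred_depths_cm)
    truths = sorted(truth_depths_cm)
    pairs, i, j = [], 0, 0
    while i < len(preds) and j < len(truths):
        if abs(preds[i] - truths[j]) <= tolerance_cm:
            pairs.append((preds[i], truths[j]))
            i += 1  # each prediction is consumed once...
            j += 1  # ...and so is each ground-truth pick
        elif preds[i] < truths[j]:
            i += 1  # prediction is too shallow for this and every deeper truth
        else:
            j += 1  # truth is too shallow for this and every deeper prediction
    return pairs

def confusion_counts(pred_depths_cm, truth_depths_cm, tolerance_cm):
    """Return (TP, FP, FN) under the depth-tolerance rule."""
    tp = len(match_by_depth(pred_depths_cm, truth_depths_cm, tolerance_cm))
    return tp, len(pred_depths_cm) - tp, len(truth_depths_cm) - tp

def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)

# Hypothetical depths in cm. The prediction at 102 lies within 3 cm of both
# 100 and 104, but it may claim credit for only one of them.
truths = [100.0, 104.0, 250.0]
preds = [102.0, 255.0, 400.0]
print(confusion_counts(preds, truths, 3.0))  # (1, 2, 2)
print(confusion_counts(preds, truths, 5.0))  # (2, 1, 1)
```

Where the cost-minimising pairing matters, for example when two predictions compete for the same fracture, a general assignment solver such as SciPy's `linear_sum_assignment` over a depth-distance cost matrix is the closer analogue to what the article describes.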

This is the engineering point worth dwelling on: the evaluation metric is not a bolt-on. It is the same set-prediction machinery the DETR head is trained against — focal classification loss weighted 5 against an L1 parameter loss weighted 1, with predictions kept above a 0.5 object probability — re-expressed in physical units at test time. Training objective and acceptance metric speak the same language, which is the only way to avoid the classic failure where a model optimises a loss that has nothing to do with how it will be judged.

Reading the numbers at 3 cm and 5 cm

Once the confusion matrix is built on depth tolerance, the model's true behaviour becomes legible — and it is a story about a single binding axis. At a tight 3 cm tolerance, fracture detection F1 is about 65%, with beddings at 63%. That is marginal: at the resolution limit of the log itself, the model and the human interpreter genuinely disagree on a third of the picks, and much of that disagreement is the ±3 cm sensor quantisation, not model error. Loosen the band to 5 cm — still well inside the depth uncertainty any two human interpreters would tolerate — and F1 jumps to roughly 75% for fractures and 69% for beddings. Loosen further to 6 cm and fractures reach ~78%; by a 9 cm band the combined model clears ~82% for fractures. The depth mean-absolute error sits at just 1.04–1.20 cm at the 3 cm operating point, which tells you the misses are not wild — the model is landing within a centimetre or two of truth, and the metric's strictness, not the model's aim, is what suppresses the score at 3 cm.
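The way F1 recovers as the band loosens can be reproduced on toy data. The sketch below assumes the one-to-one pairing has already been fixed at the loosest tolerance, so each absolute depth error belongs to exactly one prediction and one ground-truth pick. The error values and counts are invented for illustration and are not the project's results.

```python
def f1_by_tolerance(abs_errors_cm, n_pred, n_truth, tolerances_cm):
    """F1 at each tolerance from the absolute depth errors of paired picks.

    Uses F1 = 2*TP / (n_pred + n_truth), which is the usual formula rewritten
    with n_pred = TP + FP and n_truth = TP + FN.
    """
    curve = {}
    for tol in tolerances_cm:
        tp = sum(1 for e in abs_errors_cm if e <= tol)
        denom = n_pred + n_truth
        curve[tol] = 2 * tp / denom if denom else 0.0
    return curve

def mae_within(abs_errors_cm, tolerance_cm):
    """Mean absolute depth error over the pairs that count as hits."""
    hits = [e for e in abs_errors_cm if e <= tolerance_cm]
    return sum(hits) / len(hits) if hits else None

# Invented example: eight paired picks, several landing just outside 3 cm.
errors_cm = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 11.0]
curve = f1_by_tolerance(errors_cm, n_pred=9, n_truth=10, tolerances_cm=[3.0, 5.0])
print({tol: round(f1, 3) for tol, f1 in curve.items()})  # {3.0: 0.421, 5.0: 0.737}
print(mae_within(errors_cm, 3.0))  # 1.25
```

In this toy case the large gain between the two bands comes entirely from the three picks sitting between 3 cm and 5 cm, which is the pattern the paired 3 cm and 5 cm figures are meant to expose.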

Figure: accuracy against tolerance for each output axis. GeoBFDT emits the whole (class, depth, dip, azimuth) tuple in one forward pass, but the three axes are not equally hard. Detection along the depth axis is the binding constraint: at a tight 3 cm window fracture F1 is only ~65% (beddings ~63%), and it clears the useful regime for structural work only once tolerance loosens to 5 cm (~75% / ~69%); horizontal wells hold ~55% at 4 cm. The geometric axes the interpreter actually fits sinusoids for are already strong at tight tolerance: dip ~90% at 3°, azimuth ~92% (fractures) / ~84% (beddings) at 15°. The dashed ~70% 'useful regime' line is an illustrative reading aid; the article names no exact F1 cutoff.

The crucial design lesson is in the contrast between the axes. The geometric quantities a geologist fits sinusoids for — dip and azimuth — are already strong at tight tolerance: dip accuracy near 90% at a 3° threshold, azimuth near 92% at 15°. It is only the detection axis, depth, that lags at a tight window. A naive global "accuracy" number would have blended these together and hidden the one thing a reservoir engineer needs to know: depth is the lever you loosen, geometry is already past the line. The depth-tolerance confusion matrix surfaces that structure because it scores each axis in its own physical unit instead of collapsing everything into a single box-overlap proxy.
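Scoring each axis in its own unit might look like the sketch below. It assumes dip and azimuth accuracy are computed over picks already paired by depth, which the article does not spell out, and it treats azimuth as a circular quantity so that 358° and 5° are 7° apart. Function names and sample values are hypothetical.

```python
def angular_difference_deg(a_deg, b_deg, period_deg=360.0):
    """Smallest absolute difference between two angles on a circle."""
    diff = abs(a_deg - b_deg) % period_deg
    return min(diff, period_deg - diff)

def axis_accuracy(matched_pairs, dip_tol_deg=3.0, azimuth_tol_deg=15.0):
    """Fraction of depth-matched picks within tolerance on each geometric axis.

    matched_pairs: list of ((dip_pred, az_pred), (dip_true, az_true)) in degrees.
    Returns (dip_accuracy, azimuth_accuracy).
    """
    if not matched_pairs:
        return 0.0, 0.0
    dip_hits = sum(
        1 for (dp, _), (dt, _) in matched_pairs if abs(dp - dt) <= dip_tol_deg
    )
    az_hits = sum(
        1 for (_, ap), (_, at) in matched_pairs
        if angular_difference_deg(ap, at) <= azimuth_tol_deg
    )
    return dip_hits / len(matched_pairs), az_hits / len(matched_pairs)

# Invented example: the first pair wraps through north on azimuth.
pairs = [
    ((30.0, 358.0), (31.0, 5.0)),     # dip off by 1, azimuth off by 7
    ((60.0, 100.0), (65.0, 110.0)),   # dip off by 5, azimuth off by 10
    ((10.0, 200.0), (11.0, 230.0)),   # dip off by 1, azimuth off by 30
]
dip_acc, az_acc = axis_accuracy(pairs)
print(round(dip_acc, 3), round(az_acc, 3))  # 0.667 0.667
```

Keeping the two fractions separate, rather than averaging them with the depth F1, is what preserves the per-axis picture described above.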

Why this is the right metric, not just a softer one

It would be easy to mistake depth tolerance for grade inflation — pick a generous band, watch the F1 climb, declare victory. That is not what is happening here, and the distinction matters.

A tolerance band is only legitimate when it is anchored to a real physical scale, and ours is anchored twice over: to the sensor (one log pixel ≈ 3 cm, so anything tighter than that is unmeasurable) and to the feature (an average fracture is ~2 m tall, so a 3–5 cm depth error is 1.5–2.5% of the object, the equivalent of a sub-pixel slip in ordinary detection). Reporting F1 at both 3 cm and 5 cm, rather than cherry-picking the flattering one, is what keeps the metric honest. The pair of numbers tells the reader how fast performance recovers as the constraint relaxes, which is itself a measurement. A model whose F1 barely moves from 3 cm to 5 cm has few near misses, so whatever it gets wrong it gets wrong by a wide margin. Ours gains about 10 points for fractures and about 6 for beddings, which indicates that many of the misses sit just outside the tightest band.

We also stress-tested the metric on geometry it was never tuned on. Across three held-out validation wells — one vertical well, one horizontal well logged with a compact microresistivity tool roughly 10 km away, and a third vertical well about 12 km out — the same depth-tolerance scheme transferred without modification, and the harder horizontal-well geometry held a fracture F1 near 55% at a 4 cm band. A metric that survives a change of well trajectory and tool type is measuring the feature, not the dataset. That portability — the same confusion matrix running unchanged across vertical and horizontal wells, across two different imaging tools — is the production property that lets the metric live in an MLOps acceptance gate rather than a one-off notebook.

Takeaways for the practitioner

If you are porting object detection to any domain where the target is a parameterised physical object rather than a pixel region — sinusoids, well picks, seismic horizons, lab spectra — resist the reflex to reach for IoU and mAP. Ask first what error a domain expert would actually flag, express that error in the expert's own units, and build your confusion matrix on a tolerance in those units. For borehole fractures the unit is centimetres of depth, the tolerances that matter are 3 cm and 5 cm, and the discipline of reporting both is what turns a model score into something a petrophysicist will trust enough to put in the loop. The metric is not a formality you compute after the model is done. It is part of the model, and on this problem it was the part that decided whether anyone believed the rest — the same floor-first discipline we carry across the operators we have worked with, in the Middle East and the United States.

Key takeaways

A fracture on an image log is a one-pixel sinusoid parameterised by depth, dip, and azimuth — it has no box and no mask, so IoU, mAP, and pixel accuracy are structurally the wrong metrics: they are blind to depth error and sensitive to amplitude geometry no one cares about.
The sensor sets a hard floor: one log pixel ≈ 3 cm, so a ±3 cm depth uncertainty is baked into the ground truth. Any metric demanding pixel-exact agreement measures the raster's quantisation, not the model — the metric must be tolerant in centimetres of rock by construction.
The domain-correct metric is a depth-tolerance confusion matrix: a pick within tolerance of a ground-truth fracture is a TP, an unmatched prediction is an FP, an unmatched fracture is an FN. Matching is one-to-one (bipartite), reusing the model's own training-time assignment at test time in physical units.
Reporting F1 at both 3 cm and 5 cm is the honest disclosure: fractures 65%→75% and beddings 63%→69% as the band loosens, with depth MAE only 1.04–1.20 cm at 3 cm. The ~10-point gain for fractures indicates that many misses sit just outside the tightest band rather than being wild errors.
Scoring each axis in its own unit exposes that depth detection is the binding constraint while geometry (dip ~90% at 3°, azimuth ~92% at 15°) is already strong — a structure a single blended 'accuracy' number would have hidden. The same scheme transferred unchanged across vertical/horizontal wells and two different imaging tools, which is what lets it serve as a production acceptance gate.
