When Your Baseline Is Not Ground Truth: Scoring AI Against Subjective Expert Software

The word we crossed out

For most of a project the label on the comparator does not matter. You call it ground truth, everyone knows what you mean, and the arithmetic is the same whatever you write on the axis. On the vug-quantification work it stopped being harmless, and we crossed the word out partway through peer review.

The comparator was a per-interval vug ratio produced by one expert, in one interpretation tool, on a static borehole-image log. It was the best reference available and it was not ground truth in any sense the phrase carries. It was a subjective software reading with a bias we could name: it missed pores. Our detector, working per-vug, kept finding real vugs the expert reading did not record, and the direction of the disagreement was consistent. The comparator under-counted. Calling a reference that systematically misses things "ground truth" is not a rounding error in language; it decides which scoring instruments are even allowed, because most of the familiar ones assume the reference is right.

So we renamed it. Throughout the manuscript, "ground truth" became "a borehole-image estimation": an incumbent reading we compared against, not an oracle we were graded by. That single relabel is the subject of this piece. It changed the instrument.

Two instruments the rename disqualified

Once the comparator is a fallible reading rather than truth, the two obvious scoring choices both fail, and they fail for different reasons.

A confusion matrix fails twice. First, it is the wrong task: vug quantification predicts a continuous vug ratio (area of pore over area of image), not a discrete class, and a confusion matrix needs classes for its four boxes. Force one by thresholding and you are scoring a task you did not run. Second, and this is what made the choice easy, a confusion matrix counts any detection the comparator did not record as a false positive. When the comparator's known bias is that it misses real pores, that box fills with the detector's correct extra picks. The better the detector does at the thing it was built for, the worse the matrix says it is doing. An instrument that punishes accuracy is not measuring accuracy.
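The second failure can be sketched with a toy example. Everything below is an assumption made for illustration: the pore IDs, the rule that the comparator misses every fourth pore, and the helper name `confusion_counts`. None of it is data from the project. The sketch counts only the three boxes that involve a pick, and it gives the detector a perfect record so that any penalty comes from the comparator alone.

```python
def confusion_counts(detector_picks, comparator_picks):
    """Score detector picks as if the comparator's picks were correct labels."""
    tp = len(detector_picks & comparator_picks)   # both recorded the pore
    fp = len(detector_picks - comparator_picks)   # detector only
    fn = len(comparator_picks - detector_picks)   # comparator only
    return tp, fp, fn


# Hypothetical interval with 100 real pores, identified by integer IDs.
real_pores = set(range(100))

# Assumed comparator bias: it fails to record every fourth real pore.
comparator_picks = {p for p in real_pores if p % 4 != 0}

# Assumed best case: the detector finds every real pore and nothing else.
detector_picks = set(real_pores)

tp, fp, fn = confusion_counts(detector_picks, comparator_picks)

# Precision as the matrix reports it, with the comparator treated as truth.
apparent_precision = tp / (tp + fp)

# Precision against the real pores, which is only knowable in a toy example.
true_precision = len(detector_picks & real_pores) / len(detector_picks)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"apparent precision={apparent_precision:.2f}, true precision={true_precision:.2f}")
```

In this toy setup every entry in the false-positive box is a real pore the comparator missed, so the gap between the two precision figures is produced entirely by the reference.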

A cross-plot fails once, more quietly, and the quietness is what makes it dangerous. Plot detector against comparator, draw the 45-degree line, and the picture reads as truth-versus-prediction: on the line is right, off it is error. But the line is the comparator, and the comparator is not truth. Every pore the detector correctly caught and the expert missed shows up as scatter above the line (with the detector on the vertical axis), indistinguishable from noise, and the reader's eye is trained to call scatter error. A cross-plot does not lie in its numbers. It lies in its frame, by implying an objective diagonal the data does not have.
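A minimal sketch of why that scatter is not noise, under assumptions stated loudly: the per-interval vug ratios are invented, the comparator is modelled as under-reading every interval by the same illustrative fraction, and the detector is assumed to be exactly right. Real residuals would be messier. The point is only that a miss bias pushes every point to the same side of the diagonal.

```python
def signed_residuals(detector, comparator):
    """Detector minus comparator, per interval (positive: detector reports more pore)."""
    return [d - c for d, c in zip(detector, comparator)]


# Hypothetical per-interval vug ratios (pore area over image area).
real_ratio = [0.02, 0.05, 0.08, 0.12, 0.20]

# Illustrative miss rate, not a measured error rate.
miss_rate = 0.25

# Assumed comparator model: it records only (1 - miss_rate) of the real pore area.
comparator = [r * (1.0 - miss_rate) for r in real_ratio]

# Assumed best case: the detector reports the real ratio.
detector = list(real_ratio)

residuals = signed_residuals(detector, comparator)

# Share of intervals where the detector reads higher than the comparator.
share_detector_higher = sum(r > 0 for r in residuals) / len(residuals)
mean_residual = sum(residuals) / len(residuals)

print(f"share of intervals with detector higher: {share_detector_higher:.2f}")
print(f"mean signed residual: {mean_residual:.4f}")
```

Zero-mean noise would put points on both sides of the line. Here every residual has the same sign, which is the signature of a biased axis rather than of a detector error.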

We refused both, not because they are bad instruments in general, but because both encode the one assumption we had just crossed out: that the comparator is correct.

What we drew instead

What survives the rename is the comparison overlay: the detector's estimate and the comparator's estimate on the same depth track, green and red, side by side, with no line of truth between them. It asserts nothing about which is right. It shows where they agree and where they part, and hands the parting to a reader who can look at the raw image and decide. Where green sits above red, the honest caption is that our method reports more pore than the incumbent tool did, a difference to investigate, not a false positive to penalise.
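The data behind such an overlay can be sketched as follows. The function name `overlay_rows`, the depths, the ratios, and the 0.01 tolerance are all hypothetical. The property that matters is that the function returns paired readings with a neutral note per interval and never collapses them into a score.

```python
def overlay_rows(depths, detector, comparator, tolerance=0.01):
    """Pair both estimates on one depth track and note where they part.

    No aggregate score is computed and neither curve is treated as correct.
    The tolerance is an assumed display threshold, not a calibrated value.
    """
    rows = []
    for depth, d, c in zip(depths, detector, comparator):
        difference = d - c
        if difference > tolerance:
            note = "detector reports more pore: investigate"
        elif difference < -tolerance:
            note = "comparator reports more pore: investigate"
        else:
            note = "agree"
        rows.append({
            "depth_m": depth,
            "detector": d,
            "comparator": c,
            "difference": difference,
            "note": note,
        })
    return rows


# Hypothetical three-interval track.
rows = overlay_rows(
    depths=[1000.0, 1000.5, 1001.0],
    detector=[0.05, 0.12, 0.03],
    comparator=[0.05, 0.07, 0.06],
)

for row in rows:
    print(f"{row['depth_m']:8.1f}  det={row['detector']:.2f}  "
          f"cmp={row['comparator']:.2f}  {row['note']}")
```

Both directions of disagreement get the same treatment: each is flagged for a look at the raw image, and neither is labelled an error.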

Figure: Scoring an AI detector when the comparator is a subjective software interpretation of a borehole-image log, not ground truth, and carries a known false-negative bias. The 15-year arc on the left runs from the 2009 state of the art, which admitted no objective evaluation was possible and graded by eye against one interpreter with false picks removed by hand, to the 2024 work that renamed the comparator from ground truth to a borehole-image estimation and refused two instruments on principle. The three cards are the candidate instruments: a confusion matrix (wrong task, and it scores the detector's correct extra picks as false positives), a cross-plot (it implies a 45-degree line of objective truth the comparator does not carry), and a comparison overlay (it makes no truth claim, so it cannot be made to lie). The orange element is the only one that argues: the overlay's honesty stays at 100 percent while you drag the comparator's miss rate up and watch the other two degrade. Three things are sourced from the engagement archive: the arc facts, the two refusals, and the area-ratio circularity band of 0.3 to 1.0, which was kept over Fréchet and Hausdorff distances because it normalises to 0 to 1. The swept comparator miss-rate axis is an illustrative control used to show how each instrument responds, not a measured error rate.

The instrument above makes the argument mechanical. Each candidate carries a small honesty meter, and the lever sweeps the one property that decides everything, the rate at which the comparator misses real pores. Push it up and the confusion matrix and the cross-plot both lose honesty, because both read the detector's correct extra picks as error against a reference that is quietly getting worse. The overlay's meter does not move. It never claimed the comparator was right, so nothing about the miss rate can make it wrong. That invariance is the point: when you cannot trust the baseline, the only honest instrument is the one that never trusted it either. (The swept miss rate is an illustrative control; the two refusals and the rename are sourced.)
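The direction of that degradation can be worked through under an idealised model. This is a sketch, not the slopes used in the figure. It assumes a detector that finds every one of N real pores and nothing else, and a comparator that misses a fraction m of them uniformly, so that its per-interval ratio c is (1 - m) times the real ratio r while the detector reports d = r. The symbols N, m, r, c and d are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{TP} &= (1-m)\,N, \qquad \mathrm{FP} = m\,N, \qquad \mathrm{FN} = 0,\\[4pt]
\text{apparent precision} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}
  = \frac{(1-m)\,N}{(1-m)\,N + m\,N} = 1 - m,\\[4pt]
c &= (1-m)\,r, \quad d = r
  \;\Longrightarrow\; d = \frac{c}{1-m}, \qquad
  \text{cross-plot slope} = \frac{1}{1-m} \ge 1 \quad (0 \le m < 1).
\end{align*}
```

Under these assumptions the confusion matrix charges the detector exactly the comparator's miss rate, and the cross-plot tilts a perfect detector off the diagonal by the same factor. The overlay has no term in m to degrade, because it never divides by or subtracts from a curve it has declared correct.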

One smaller decision rode along on the same principle. For the shape gate that separates real pores from image artefacts we kept an area-ratio circularity (contour area over the area of its minimum enclosing circle, normalised to a 0-to-1 band) over a curve-distance metric like Fréchet or Hausdorff. When the reference is itself uncertain, a bounded, interpretable number a reader can reason about beats a distance whose scale depends on the object. The gate and the detector's accuracy against the incumbent readings are in the companion pieces on counting vugs and individual-vug quantification, not re-derived here.
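A minimal sketch of an area-ratio circularity in plain Python, with the assumptions flagged: the function names are hypothetical, the contour is taken to be a simple polygon given as ordered vertices, the enclosing circle is found by brute force over point pairs and triples (fine for short contours, too slow for long ones), and the 0.3-to-1.0 band is read here as the accept range of the gate. The companion pieces, not this sketch, describe the actual gate.

```python
import math
from itertools import combinations


def polygon_area(points):
    """Shoelace formula for a simple polygon given as ordered (x, y) vertices."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _circle_through_three(a, b, c):
    """Circumcircle of three points as (cx, cy, r), or None if they are collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    cx = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    cy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return cx, cy, math.dist((cx, cy), a)


def min_enclosing_circle(points, eps=1e-9):
    """Smallest circle containing all points, by brute force (needs >= 2 points)."""
    candidates = []
    for a, b in combinations(points, 2):
        # Circle with segment a-b as its diameter.
        candidates.append(((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0,
                           math.dist(a, b) / 2.0))
    for a, b, c in combinations(points, 3):
        circle = _circle_through_three(a, b, c)
        if circle is not None:
            candidates.append(circle)
    enclosing = [
        (cx, cy, r) for cx, cy, r in candidates
        if all(math.dist((cx, cy), p) <= r + eps for p in points)
    ]
    if not enclosing:
        raise ValueError("need at least two distinct points")
    return min(enclosing, key=lambda circle: circle[2])


def area_ratio_circularity(points):
    """Contour area over the area of its minimum enclosing circle, in (0, 1]."""
    _, _, radius = min_enclosing_circle(points)
    return polygon_area(points) / (math.pi * radius ** 2)


def passes_shape_gate(points, low=0.3, high=1.0):
    """Assumed gate: keep contours whose circularity falls inside the band."""
    return low <= area_ratio_circularity(points) <= high


# Hypothetical contours: a compact square and a thin streak.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
streak = [(0, 0), (10, 0), (10, 1), (0, 1)]

for name, contour in (("square", square), ("streak", streak)):
    print(name, round(area_ratio_circularity(contour), 3), passes_shape_gate(contour))
```

The number is bounded regardless of the object's size: scaling either contour by any factor leaves its circularity unchanged, which is the property a curve distance does not have.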

The 15-year arc this sits on

The instinct to refuse a fake objectivity is not new, and the field said the honest thing out loud a long time ago. The 2009 state of the art for automated fracture detection in borehole images is a good marker [1]. It is a competent classical pipeline: adaptive histogram equalisation over subregions, an orientation-field estimate on a 15-pixel window, rotated directional filters, a 13-by-13 local-mean binarisation, and a fast Hough transform for the sinusoids. What matters is not the method but two sentences in the paper. The authors write that earlier approaches "fail to perform well," showing only a few isolated fractures, and then, on evaluation, state it plainly: "we cannot provide a truly objective evaluation of the results, because there is no ground truth." Their evaluation was to eyeball the output against a single experienced interpreter and remove the false picks by hand.

Read that admission next to the rename and the arc is clear. In 2009 the honest move was to say there is no ground truth and grade by eye against one person, then leave the problem standing. Fifteen years later the absence of ground truth is still true and the reference is still one expert's reading in one tool, but the response is different. Instead of quietly treating that reading as truth and reaching for a confusion matrix or a cross-plot, you name what the reference is, a borehole-image estimation with a known miss bias, and choose an instrument that does not depend on it being correct. The rigor did not come from finding ground truth. It came from being precise about its absence and refusing the instruments that pretend otherwise.

Discussion

The general claim is portable. The choice of a scoring instrument is an implicit claim about the comparator, and you should make it explicit first. A confusion matrix claims the comparator is a correct label. A cross-plot claims it is an objective axis. A comparison overlay claims nothing. When the comparator earns those claims, use the sharp instruments; they are more informative. When it does not, and a single subjective reading with a known false-negative bias does not, the sharp instruments do not become slightly optimistic, they invert: they turn the detector's correctness into its penalty. The overlay refuses to compute one headline score, and on this task that refusal is exactly the property you want. We did not solve the ground-truth problem the 2009 paper named, because on a subjective interpretation there is nothing to solve, only something to stop pretending: cross out a word, follow it to its consequences for the instrument, and let the reader see the two curves part rather than hand them a number that had already decided who was right.

Limitations

The honesty meters in the instrument are an argument rendered as a mechanism, not a measurement. The rates at which the confusion matrix and the cross-plot lose honesty as the miss rate rises are illustrative slopes chosen to show the direction and the inversion, not fitted curves; only the two refusals, the rename, the 15-year arc facts, and the area-ratio circularity band are sourced. We also cannot quantify the comparator's true false-negative rate, because doing so would require the very ground truth whose absence is the point; we know the direction of the bias from repeated intervals where the detector caught pores the incumbent reading did not, and from the physics of a single 2D pad reading, but not a clean number. The overlay's invariance is a claim about its framing, not a measured constant: a reader determined to treat the red curve as truth can reintroduce every error the frame was built to avoid. Finally, this is one task's epistemology, a regression against a subjective reading; a task with a genuinely objective reference should use the sharp instruments, and nothing here argues against a confusion matrix or a cross-plot where the comparator has earned the claim each one makes.

References

[1] Zhang, Y., and Xiao, C. Detection of Fractures in Borehole Image. Proc. SPIE, International Symposium on Multispectral Image Processing and Pattern Recognition (MIPPR), 2009. An automated fracture-detection pipeline whose authors state that prior methods fail to perform well and that no truly objective evaluation is possible because there is no ground truth, evaluating instead by eye against one experienced interpreter with false picks removed by hand. https://doi.org/10.1117/12.833568
