The most consequential edit we made to a peer-reviewed manuscript was a find-and-replace. Every instance of the phrase "ground truth", and its abbreviation "GT", came out and "incumbent-estimation" went in. It reads like a cosmetic change. It was not. It was the moment we stopped pretending our reference labels were truth and started calling them what they were: the output of an incumbent interpretation package. That one relabel then forced a chain of metric decisions that a reviewer had been right to press on. This is the story of that chain, from a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, and why label provenance deserves to be the first question you ask about any evaluation, not the last.
The word a reviewer would not let pass
We had built a classical computer-vision pipeline that finds dissolution pores, called vugs, in borehole-image logs and reports how much of each interval they occupy. To evaluate it we compared our per-interval vug percentage against a reference vug percentage. In the first submission we called that reference the ground truth. A reviewer asked the obvious question that we had glossed over: where did the ground truth come from, and how was it annotated?
The honest answer exposed the problem. The reference had not been produced by an expert drawing an outline around each pore. It was a vug ratio computed by the operator's interpretation software, an area fraction per interval and nothing more. There were no per-vug contours, no masks, no object-level labels of any kind on the reference side. An interpreter had run their standard package and it had emitted a single number per depth window. Calling that number "ground truth" borrowed a word that implies human-verified certainty for something that was a second model's estimate.
Provenance is not a footnote, it is a constraint on your metrics
Once you say plainly that the reference is a set of software-derived area fractions with no masks, two evaluation choices that a machine-learning reader expects to see simply cease to be available. This is not a matter of taste. It is what the reference can and cannot support.
Intersection-over-union needs two regions to overlap. It is defined as
where A and B are pixel sets. If the reference side has no per-vug mask, there is no B to intersect. You cannot compute the numerator. IoU is not merely inadvisable here; it is undefined, because one of its two operands does not exist. The same logic retires the confusion matrix. A confusion matrix sorts predictions into discrete classes: true positive, false positive, false negative. Our task produces a continuous vug ratio and compares it against a continuous incumbent estimate. There are no classes to bin, so there are no cells to fill. Reaching for a confusion matrix would mean inventing a classification problem the task does not have.
What survives is the comparison the data actually supports: our area fraction against their area fraction, interval by interval. We report vug-ratio, and we frame every plot as our estimate versus the incumbent estimate rather than our estimate versus truth. On the composite figures, a green curve is our result and a red curve is the incumbent result, and we say in the text that green sitting above red is not a false positive count. It often means our pipeline caught pores the incumbent software missed, because the incumbent software is not the arbiter of correctness. It is the thing being compared against.
Why a bare area comparison is still defensible
There is a fair objection to lean on here. If you drop IoU and the confusion matrix and keep only an area fraction against another area fraction, what stops the comparison from becoming a curve-fitting exercise where you quietly re-tune the detector until its number matches the reference on each well? The answer is that the knobs are frozen, and we can show exactly which knobs and how.
Six parameters are universal constants, identical for every well and every section, never adjusted to make a result look better: the number of intensity modes k = 5, the mode separation delta-m = 5, the centroid difference threshold of 5 pixels that de-duplicates overlapping contours, the 20 percent IoU threshold that merges contours during detection, the adaptive block size of 31 pixels, and the local constant C set to the mean of the patch. A circularity gate from 0.3 to 1.0 filters out elongated fractures and pad-boundary artefacts on the same fixed basis everywhere. Only two thresholds are calibrated per well, a Laplacian-based value and a mean-based value, and even those are set once from one or two representative sections using heat maps and then frozen for the rest of that well. A comparison you can defend is one where the parameters producing your number were not chosen after seeing the reference number. That separation, six frozen universals against two calibrate-once-then-freeze thresholds, is what lets a single area fraction stand as a comparison rather than a fit.
The relabel is the transferable part
None of this is specific to carbonate pores. The pattern shows up any time your labels come out of a tool you did not build. A legacy classifier's outputs become the "gold" set for its replacement. A commercial package's segmentation becomes the reference for an in-house model. A previous team's spreadsheet becomes the target your pipeline is scored against. In each case the reference is another system's estimate, and treating it as truth quietly imports that system's blind spots into your metric and your paper.
The discipline is small and it is mostly naming. Write down where each reference number came from and what artefact it is: a human mask, a software area fraction, a prior model's logits. Let that provenance decide which metrics are even applicable before you pick one. If the reference has no masks, do not report IoU. If the target is continuous, do not draw a confusion matrix. And if the reference is an incumbent tool, name it after the tool, not after truth. We had a hard word limit of 11,000 words on the manuscript, which pushed most of the parameter analysis into supplementary material, but the relabel itself cost nothing but the willingness to concede the point. The paper was stronger for conceding it.
Limitations
This is a single engagement and a single reviewer exchange, on a classical computer-vision vug-quantification pipeline evaluated against one incumbent interpretation package. The vug-ratio comparison we retained still inherits whatever the incumbent software systematically misses, so agreement with it is a bounded claim, not a claim about absolute pore volume. The frozen-parameter argument depends on the two per-well thresholds genuinely being fixed after one or two sections, and a project with sparser representative sections would have less room to make that guarantee. The point generalises to any labels-from-software situation, but the specific parameters and gates here are tuned to borehole-image vug detection and should not be copied as defaults.
References
[1] Jaccard, P. The distribution of the flora in the alpine zone. New Phytologist, 1912 (origin of the intersection-over-union coefficient). https://doi.org/10.1111/j.1469-8137.1912.tb05611.x