The comparator was never a mask
We built a classical computer-vision pipeline that reads a static borehole-image log and reports a vug ratio: the fraction of a depth interval taken up by dissolution pores, computed every 0.1 m along with per-vug area, count, and circularity. The mechanics of that detector, its eight frozen parameters and how it stays within about 1.21 cm² of expert picks at a 10 cm scale, are covered elsewhere. This piece is about a narrower question that decided how we were allowed to report the results at all: what do you measure a vug detector against, when the only comparator is a human's software-assisted vug ratio?
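To make the quantity concrete, here is a minimal sketch of a per-interval vug ratio, assuming the detector's output is a binary pixel mask and that a fixed number of image rows spans each 0.1 m step. The function name, the mask layout, and rows_per_interval are illustrative assumptions, not the pipeline's actual interface.

```python
def vug_ratio_per_interval(mask, rows_per_interval):
    """Fraction of vug pixels in each consecutive block of image rows.

    mask: list of rows, each a list of 0/1 flags (1 = pixel inside a vug).
    rows_per_interval: assumed number of image rows spanning one 0.1 m step.
    """
    ratios = []
    for start in range(0, len(mask), rows_per_interval):
        block = mask[start:start + rows_per_interval]
        total = sum(len(row) for row in block)
        vuggy = sum(sum(row) for row in block)
        ratios.append(vuggy / total if total else 0.0)
    return ratios


# Toy 4-row image with two rows per interval (values invented for illustration).
toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
ratios = vug_ratio_per_interval(toy_mask, rows_per_interval=2)
print(ratios)
```

Note what the reduction discards: once the mask is collapsed to one number per interval, the pore boundaries cannot be recovered from it.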
That sounds like a formality. It is not. The comparator we were handed was a set of vug ratios produced by a reservoir interpreter working in the incumbent borehole-image software: one number per interval, a subjective read, carrying no per-vug contours. Nobody drew a boundary around each pore. The interpreter looked at the image, judged how much of the wall was vuggy, and wrote down a percentage. That is the entire label, and everything downstream follows from it.
Three metrics a reviewer reaches for, and why the labels reject all three
When you present a detection-shaped result, reviewers ask for detection-shaped metrics. We were asked, at various points, for intersection-over-union, for a confusion matrix, and for a cross-plot against ground truth. Each request is reasonable in the abstract and impossible against these labels.
Intersection-over-union is an area-overlap metric: the intersecting area of a predicted region and a reference region divided by the area of their union [1]. To compute it you need a region on both sides: a mask or box around each predicted vug and a matching one around each reference vug. Our comparator has no reference regions. There is no per-vug boundary in a vug-ratio number, so there is no intersection to measure and no union to divide by. IoU is not a weak metric here; it is undefined. Reporting one would mean inventing reference masks the interpreter never produced, then scoring against our own invention.
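A small sketch shows why the metric needs a region on both sides. It represents each region as a set of (row, col) pixel coordinates, which is one convenient assumption among several; the function name and the toy values are hypothetical.

```python
def mask_iou(pred_pixels, ref_pixels):
    """IoU of two regions given as sets of (row, col) pixel coordinates."""
    union = pred_pixels | ref_pixels
    if not union:
        raise ValueError("IoU is undefined when both regions are empty")
    return len(pred_pixels & ref_pixels) / len(union)


# Two overlapping 2x2 regions: 2 shared pixels, 6 pixels in the union.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
ref = {(0, 1), (1, 1), (0, 2), (1, 2)}
iou = mask_iou(pred, ref)

# The comparator in this study is a scalar per interval (value invented here).
# It carries no pixel set, so there is nothing to pass as ref_pixels.
incumbent_ratio = 0.12
```

Passing incumbent_ratio in place of ref raises a TypeError, which is the programmatic form of the argument above: the operation is not defined for a number without a boundary.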
A confusion matrix has a different problem. It counts true and false positives and negatives across discrete classes. Our task is regression: predict a continuous vug ratio, compare it to another continuous ratio. There is no class boundary to be right or wrong about. You can force a threshold and manufacture classes, but then you grade a discretisation you chose, not the quantity the pipeline estimates.
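The dependence on the chosen threshold is easy to demonstrate. The sketch below binarises two ratio series at an arbitrary cut-off and counts the cells of the resulting matrix; the function name, the series values, and both thresholds are invented for illustration, and the incumbent is treated as the reference only to show what forcing classes would entail.

```python
def forced_confusion(ours, incumbent, threshold):
    """Binarise both ratio series at `threshold` and count the four cells.

    Returns (tp, fp, fn, tn), with the incumbent playing the reference role
    purely for the sake of the illustration.
    """
    tp = fp = fn = tn = 0
    for o, i in zip(ours, incumbent):
        pred, ref = o >= threshold, i >= threshold
        if pred and ref:
            tp += 1
        elif pred:
            fp += 1
        elif ref:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented ratios for four intervals.
ours = [0.04, 0.09, 0.12, 0.18]
incumbent = [0.03, 0.06, 0.11, 0.20]

at_5_percent = forced_confusion(ours, incumbent, 0.05)
at_8_percent = forced_confusion(ours, incumbent, 0.08)
print(at_5_percent, at_8_percent)
```

The same two series yield no false positives at one cut-off and one at the other. The count changes with the threshold, not with anything the detector did.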
The cross-plot is the subtle one, because it looks honest. Plot the interpreter's ratio on one axis and ours on the other, draw the y = x line, and let the scatter speak. The trouble is what the y = x line asserts: that one axis is truth and the other is the estimate under test. Our comparator is not truth. It is a subjective interpretation with a known bias toward missing small pores, made in software that under-resolves secondary porosity. Drawing it as the reference axis of a cross-plot silently promotes a human's opinion to an objective standard. That is the overclaim we refused to make.
What we did instead: relabel the baseline, add a comparison plot
The fix was small and mostly a matter of honesty. First, we stopped calling the comparator "ground truth." Throughout the manuscript, the term "GT" became "incumbent-estimation." The rename is not cosmetic. "Ground truth" tells a reader the labels are correct by construction; "estimation" tells the reader they are one instrument's read, which is what they are. Once the baseline is named for what it is, the pressure to score against it as if it were objective goes away, and so does the temptation to reach for IoU or a cross-plot.
Second, we replaced the cross-plot with a comparison plot: two ratio series over depth, the incumbent estimate and ours, drawn side by side rather than one against the other. Neither series sits on a truth axis. A reader sees where the two agree and where they diverge, and can judge the divergence without being told in advance which one is right. Where our pipeline reports more vugs than the incumbent, that is an observation to explain, not an error to penalise, because in several intervals the extra pores our method caught are real pores the software missed. A comparison plot lets that show; a cross-plot against "ground truth" would have scored those same catches as false positives.
The plot scales are part of the same honesty. The vug-percentage bar plots are fixed to a 0-25% range so no interval is visually inflated by an auto-scaled axis, and the comparison-plot x-axis was reduced to 10% for the one figure (Fig 10) whose intervals never exceeded that, so empty white space does not read as low vug density. These choices change no number, only what the reader concludes.
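A comparison plot of this kind could be sketched as follows, assuming matplotlib is available. The function names, panel titles, bar height, and sample values are assumptions made for illustration and do not reproduce the manuscript's figures; only the fixed 0-25% range comes from the text above.

```python
def signed_difference(ours_pct, incumbent_pct):
    """Per-interval difference in percentage points (ours minus incumbent).

    The sign records direction only; neither series is treated as truth.
    """
    return [o - i for o, i in zip(ours_pct, incumbent_pct)]


def plot_comparison(depths_m, incumbent_pct, ours_pct, x_max_pct=25.0, path=None):
    """Draw the two ratio series in adjacent panels that share a depth axis."""
    import matplotlib.pyplot as plt  # imported lazily; only plotting needs it

    fig, (left, right) = plt.subplots(1, 2, sharey=True, figsize=(6, 8))
    panels = (
        (left, incumbent_pct, "Incumbent estimate"),
        (right, ours_pct, "Pipeline estimate"),
    )
    for ax, values, title in panels:
        ax.barh(depths_m, values, height=0.08)
        ax.set_xlim(0, x_max_pct)  # fixed range, never auto-scaled
        ax.set_xlabel("Vug ratio (%)")
        ax.set_title(title)
    left.set_ylabel("Depth (m)")
    left.invert_yaxis()  # depth increases downward; the shared axis follows
    if path is not None:
        fig.savefig(path)
    return fig


# Invented values for three 0.1 m intervals.
depths_m = [1000.0, 1000.1, 1000.2]
incumbent_pct = [4.0, 9.0, 2.0]
ours_pct = [5.0, 12.0, 2.0]
diff = signed_difference(ours_pct, incumbent_pct)
# plot_comparison(depths_m, incumbent_pct, ours_pct, x_max_pct=25.0, path="comparison.png")
```

Both panels take the same explicit x_max_pct, so a figure whose intervals stay low can pass 10.0 while the panels still share one scale.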
The metric is a property of the labels, not of the model
There is a general rule under this specific story. The right metric is fixed the moment you know your labels, not the moment you know your model. Our detector could have supported a mask-IoU evaluation perfectly well; it extracts per-vug contours internally and could report them. What it could not do was conjure a reference mask on the comparator side, because that side is a ratio and nothing else. The binding constraint was the label schema, and no amount of model sophistication relaxes it.
This is also why the parameter regime is the same story told from the other direction. The pipeline's constants (k = 5 modes, delta-m = 5 intensity separation, circularity kept to 0.3-1.0, a merge threshold of 20% IoU between overlapping contours, and block size 31 with C set to the patch mean) are all frozen across every well and depth section, so the results are not the product of per-well parameter tuning [2]. The one metric we report against the incumbent is fixed by the labels; the parameters that produce our estimate are fixed in advance. Neither was chosen to flatter the comparison.
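One way to enforce this kind of freeze in code is an immutable configuration object. The sketch below uses a frozen dataclass; the class and field names are hypothetical, and only the constants named in this piece are included. C is left out because it is computed from each patch's mean rather than stored as a constant.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class DetectorParams:
    """Constants shared by every well and depth section (names are illustrative)."""
    k_modes: int = 5
    delta_m: int = 5                # intensity separation
    circularity_min: float = 0.3
    circularity_max: float = 1.0
    merge_iou: float = 0.20         # merge threshold between overlapping contours
    block_size: int = 31


PARAMS = DetectorParams()


def try_retune(params):
    """Attempt a per-well override; report whether it was allowed."""
    try:
        params.block_size = 51
    except FrozenInstanceError:
        return False
    return True


retuned = try_retune(PARAMS)
print(retuned, PARAMS.block_size)
```

With frozen=True, any attempt to adjust a constant for a particular well fails at the assignment instead of silently changing the estimate.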
If you take one thing from this, take the discipline of checking your comparator before you pick your score. If the only thing you can compare against is a subjective number with no boundaries, then IoU, confusion matrices, and cross-plots against ground truth are not available to you, whatever the reviewers or your own instinct suggest. What is available is a named, honest baseline and a plot that puts two estimates next to each other and lets the reader see the difference. That is a smaller claim than an objective benchmark, and it is the true one.
Limitations
This is an argument about metric selection, not a validation study, and it inherits that boundary. Because the comparator carries no per-vug masks, we also cannot report a per-vug precision or recall against it: the claim that the extra pores we catch are genuine rests on interval-level agreement and spot inspection, not a boundary-level audit the labels could never support. The comparison plot shows where two estimates diverge but cannot attribute a divergence to a miss by the incumbent versus a false positive by us, since deciding that would itself need the mask-level reference we lack. The plot-scale choices, 0-25% bars and the 10% Fig-10 axis, are honest for these intervals but not universal; a denser interval would need a wider axis, and the point is that the axis follows the data, not auto-scale. Finally, some apparent vugs are acquisition artefacts rather than rock [2], and the circularity gate only filters the elongated ones, so absolute vug ratios from either source should be read as estimates, not measured porosity.
References
[1] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 658-666. Defines intersection-over-union as the ratio of the intersecting area to the union area of two regions, which requires a per-object region on both the prediction and the reference to be computed at all. https://doi.org/10.1109/CVPR.2019.00075
[2] Lofts, J. C., and Bourke, L. T. The recognition of artefacts from acoustic and resistivity borehole imaging devices. Geological Society, London, Special Publications 159(1) (1999): 59-76. A taxonomy of mechanical, electrical, and processing artefacts in borehole image logs, and why a feature that looks like a pore may be an acquisition artefact rather than rock. https://doi.org/10.1144/GSL.SP.1999.159.01.03