The segmenter, VeerNet, was doing well by every number we tracked while training it, and none of those numbers convinced the one person whose opinion decided whether the pipeline shipped. The model produced masks, and we scored the masks against ground-truth masks, and the scores were honest measurements of how many pixels the model got right. But a geoscientist asked to trust a digitised well does not read pixels. They read curves. They want to lay the recovered curve next to the log they already know and see, with their own eyes, whether the two agree over depth. A pixel-space score cannot be inspected that way. It is a proof written in a language the reviewer does not sign in. This is the account of the validation step we built to translate: the mask-to-CSV overlay, where a predicted mask becomes a curve, the curve is resampled onto the same depth axis as the truth, and the error is read off the gap between them, one curve at a time.
The mask was a proof nobody could read
A segmentation mask answers a question about pixels. Given the log image, which pixels belong to curve one, which to curve two, which to background. Training against that question is exactly right, and the loss functions that optimise it, from Dice through the Tversky variant we settled on for its control over the false-negative penalty on thin foreground, are built for it [3]. The encoder-decoder that produces the mask is the standard shape for this kind of dense prediction, contracting to features and expanding back to a per-pixel labelling with skip connections carrying the fine detail across [2]. All of that is defensible and we can defend it.
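To make the false-negative control concrete, here is a minimal sketch of a Tversky loss for a single foreground class, written in plain Python over flattened pixels. The function name, the alpha and beta defaults, and the toy pixel lists are illustrative assumptions, not the code or settings used for VeerNet. The formula follows the definition in [3], where alpha weights false positives and beta weights false negatives.

```python
def tversky_loss(probs, targets, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss for one foreground class over flattened pixels.

    probs   -- predicted foreground probabilities in [0, 1]
    targets -- ground-truth labels, 1 for curve pixels and 0 for background
    alpha   -- weight on false positives
    beta    -- weight on false negatives (raise it to punish dropped trace pixels)
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Toy 8-pixel example: four true curve pixels, four background pixels.
targets = [1, 1, 1, 1, 0, 0, 0, 0]
dropped = [1, 1, 0, 0, 0, 0, 0, 0]   # misses half the trace (2 false negatives)
smeared = [1, 1, 1, 1, 1, 1, 0, 0]   # finds the trace, adds 2 false positives

loss_dropped = tversky_loss(dropped, targets)
loss_smeared = tversky_loss(smeared, targets)
dice_dropped = tversky_loss(dropped, targets, alpha=0.5, beta=0.5)
dice_smeared = tversky_loss(smeared, targets, alpha=0.5, beta=0.5)
```

With alpha and beta both at 0.5 the expression reduces to the Dice loss. Raising beta above alpha makes the dropped trace cost more than it does under Dice and the smeared trace cost less, which is the behaviour that matters when the foreground is a thin line.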
What it cannot do is tell a petrophysicist that the resistivity curve on well such-and-such was recovered faithfully between two depths. The mask quality number and the question the reviewer actually asks are not the same question, and the distance between them is not rhetorical. A mask can score respectably on pixel counts while the curve it implies wanders off the true reading in exactly the interval the reviewer cares about, because a handful of misplaced pixels on a thin trace move the curve more than they move any pixel-averaged score. So we had a model that was good and a validation story that could not be signed, and until we fixed the second thing the first thing did not matter.
The overlay contract
The fix was to stop reporting on the mask and start reporting on the curve. Every predicted mask was collapsed to a curve, one value of the logged quantity per depth, which is the deliverable the whole system exists to produce. But a predicted curve and a ground-truth log are not directly comparable the moment they come off different rasters: they do not share a depth sampling, they do not have the same number of points, and the depth steps are not aligned. Comparing them point by point requires first putting them on a common footing.
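The collapse from mask to curve can be sketched as follows. This is one plausible reading rather than the pipeline's actual code: it assumes depth runs down the image rows, the logged quantity runs across the columns on a linear scale, and the curve position on each row is the mean column of that class's pixels. The function name, the scale limits, and the toy mask are all hypothetical.

```python
def mask_to_curve(mask, class_id, v_min, v_max):
    """Collapse one class of a label mask to (row, value) samples.

    mask     -- 2D list of integer labels, one inner list per depth row
    class_id -- label of the curve to extract
    v_min    -- logged value at the left edge of the track (assumed linear scale)
    v_max    -- logged value at the right edge of the track
    """
    width = len(mask[0])
    samples = []
    for row_idx, row in enumerate(mask):
        cols = [c for c, label in enumerate(row) if label == class_id]
        if not cols:
            continue  # trace missing on this row: leave a gap rather than guess
        centre = sum(cols) / len(cols)
        value = v_min + (v_max - v_min) * centre / (width - 1)
        samples.append((row_idx, value))
    return samples


# Toy 4-row, 5-column mask: 0 = background, 1 = curve one, 2 = curve two.
mask = [
    [0, 1, 0, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
    [0, 0, 1, 2, 0],
]
curve_one = mask_to_curve(mask, 1, 0.0, 1.0)
curve_two = mask_to_curve(mask, 2, 0.0, 1.0)
```

Curve one has no pixels on the third row, so it comes back with three samples while curve two has four. That is exactly the mismatch in sampling that makes a shared depth axis necessary before any point-by-point comparison.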
That common footing was a shared depth axis. We resampled both the predicted curve and the ground-truth log onto one regular grid of 300 depth points using a cubic spline, the standard piecewise-cubic interpolant that reconstructs a smooth curve through known samples without the wild oscillation of a single high-degree polynomial fit [1]. Three hundred points was the density the validation notebooks settled on: enough that the two traces resolve as continuous curves rather than a coarse staircase, few enough to stay cheap to compute across a validation set. Once both curves live on that grid, they are directly subtractable, and every downstream number, every plot, every verdict comes from that subtraction.
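A dependency-free sketch of that resampling step is below. It uses a natural cubic spline (zero curvature at both ends), which is an assumption: the text only says "cubic spline", and a library routine such as `scipy.interpolate.CubicSpline` would be the more usual choice in practice. The helper names, the choice to build the grid over the depth interval both curves cover, and the toy depths are all illustrative.

```python
from bisect import bisect_right


def _second_derivatives(x, y):
    """Second derivatives of the natural cubic spline through (x, y)."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    upper = [0.0] * n   # modified super-diagonal from the forward sweep
    rhs = [0.0] * n     # modified right-hand side
    for i in range(1, n - 1):
        sub, diag, sup = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        d = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
        denom = diag - sub * upper[i - 1]
        upper[i] = sup / denom
        rhs[i] = (d - sub * rhs[i - 1]) / denom
    m = [0.0] * n       # natural boundary: zero curvature at both ends
    for i in range(n - 2, 0, -1):
        m[i] = rhs[i] - upper[i] * m[i + 1]
    return m


def resample_onto(depth, value, grid):
    """Evaluate the natural cubic spline through (depth, value) at each grid depth."""
    m = _second_derivatives(depth, value)
    out = []
    for g in grid:
        # Index of the spline segment containing g, clamped to the valid range.
        i = min(max(bisect_right(depth, g) - 1, 0), len(depth) - 2)
        h = depth[i + 1] - depth[i]
        a, b = depth[i + 1] - g, g - depth[i]
        out.append(
            (m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
            + (value[i] / h - m[i] * h / 6.0) * a
            + (value[i + 1] / h - m[i + 1] * h / 6.0) * b
        )
    return out


def shared_grid(depth_a, depth_b, n_points=300):
    """Regular grid over the depth interval that both curves cover."""
    lo = max(depth_a[0], depth_b[0])
    hi = min(depth_a[-1], depth_b[-1])
    return [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]


# Two samplings of the same straight-line log with different depth steps.
pred_depth = [1000.0, 1003.0, 1004.5, 1010.0, 1016.0, 1020.0]
truth_depth = [1001.0 + 2.0 * k for k in range(10)]   # 1001 .. 1019
pred_value = [0.01 * d - 9.5 for d in pred_depth]
truth_value = [0.01 * d - 9.5 for d in truth_depth]

grid = shared_grid(pred_depth, truth_depth)
pred_on_grid = resample_onto(pred_depth, pred_value, grid)
truth_on_grid = resample_onto(truth_depth, truth_value, grid)
residual = [p - t for p, t in zip(pred_on_grid, truth_on_grid)]
```

Because both toy curves sample the same straight line, the residual on the shared grid is zero to rounding error even though the inputs have six and ten points at unrelated depths. Restricting the grid to the overlapping interval avoids extrapolating either curve past its own data.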
Reading the error off the gap
With the two curves on the same axis, the error is the space between them, and it is measurable in two ways that a reviewer understands. Mean absolute error is the average vertical distance between the predicted trace and the truth, in the curve's own units, which is precisely the quantity a person estimates when they eyeball how far the prediction sits from the log; it is the metric to lead with when you want the number to mean what the reader thinks it means [4]. Mean squared error is the same idea with large excursions weighted more heavily, which surfaces the occasional bad interval that MAE would average away. We reported both, per curve.
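Both metrics are a few lines once the curves share a grid. The sketch below uses the standard definitions; the function names and the synthetic curves are illustrative, chosen so that two predictions have the same MAE but very different MSE.

```python
def mae(pred, truth):
    """Mean absolute error between two curves sampled on the same grid."""
    if len(pred) != len(truth):
        raise ValueError("curves must be on the same grid")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def mse(pred, truth):
    """Mean squared error between two curves sampled on the same grid."""
    if len(pred) != len(truth):
        raise ValueError("curves must be on the same grid")
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


# Synthetic ten-point example: same average miss, different shape of miss.
truth = [0.0] * 10
steady = [0.1] * 10            # off by 0.1 everywhere
spiky = [0.0] * 9 + [1.0]      # perfect except one bad point

steady_mae, steady_mse = mae(steady, truth), mse(steady, truth)
spiky_mae, spiky_mse = mae(spiky, truth), mse(spiky, truth)
```

Both predictions have an MAE of 0.1, but the spiky one has ten times the MSE (0.1 against 0.01). That gap is why reporting the two numbers together says more than either alone.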
The numbers told a story with a shape. On the Tversky-loss multiclass run, curve one came out clean: mean MAE of 0.0277 and mean MSE of 0.0021, and on its best example the overlay reached an R-squared of 0.9891, the coefficient of determination saying the predicted trace explained almost all of the variance in the true log. A second example of the same curve was still strong at 0.8126. Curve two was the honest counterweight. Its mean MAE was 0.1241 and its mean MSE 0.0253, roughly 4.5 times curve one's error on MAE and 12 times on MSE, and its example landed at an R-squared of 0.5461. That is not a good curve, and the point of the overlay is that it does not pretend to be. The gap is visible, on the same axis, in the same frame as the good one.
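For readers who want to check the arithmetic, the sketch below computes the ratio between the two curves' reported errors and shows the usual definition of R-squared as one minus the residual sum of squares over the total sum of squares. The text does not say how R-squared was computed in the notebooks, so that definition is an assumption, and the four-point curve is synthetic.

```python
def r_squared(pred, truth):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, truth))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# Ratios of the reported per-curve means (curve two over curve one).
mae_ratio = 0.1241 / 0.0277   # about 4.5
mse_ratio = 0.0253 / 0.0021   # about 12

# Synthetic sanity checks on the definition.
truth = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(truth, truth)          # prediction equals truth
mean_only = r_squared([2.5] * 4, truth)    # prediction is just the mean of truth
```

A perfect overlay scores 1 and a flat line at the mean of the truth scores 0, which gives a scale for reading the three reported values: 0.9891 and 0.8126 sit close to the first, and 0.5461 is about halfway to the second.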
The exhibit is that comparison made interactive. Put each of the three sourced cases on the scope and the predicted trace sits over the truth log with the residual drawn as a band between them; the band is thin where curve one tracks its log and wide where curve two loses it. Drag the resample density and watch the reason the axis matters: at a dozen points the overlay is a staircase that hides the error, and at the 300 points the notebooks used the two traces resolve point for point and the residual becomes a verdict you can actually sign. The residual band is the only element that argues, because it is the error a geoscientist puts their name next to.
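The effect of grid density can be reproduced with a deliberately contrived example: a prediction that matches the truth everywhere except a narrow interval. Everything here is synthetic and hypothetical (the flat truth, the size and position of the excursion, the unit depth range); it only illustrates how a coarse grid can step straight over a local error.

```python
def overlay_mae(pred_fn, truth_fn, n_points, lo=0.0, hi=1.0):
    """MAE between two curves evaluated on a regular grid of n_points depths."""
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    return sum(abs(pred_fn(d) - truth_fn(d)) for d in grid) / n_points


def truth_curve(depth):
    """Synthetic truth: a flat log."""
    return 0.5


def pred_curve(depth):
    """Synthetic prediction: correct except for a narrow excursion near mid-depth."""
    return 0.5 + (0.3 if 0.49 <= depth <= 0.51 else 0.0)


coarse_mae = overlay_mae(pred_curve, truth_curve, 12)
dense_mae = overlay_mae(pred_curve, truth_curve, 300)
```

On the 12-point grid no sample lands inside the bad interval, so the reported error is exactly zero. On the 300-point grid six samples land inside it and the error shows up.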
Why this changed the sign-off
Before the overlay, a review conversation was a negotiation about whether to trust a number. After it, the conversation was a person looking at their own curve with our prediction on top and deciding, per curve, whether the fit was good enough for the use they had in mind. That is a categorically easier thing to approve, and it is also a categorically easier thing to reject, which mattered just as much. The 0.5461 case did not get quietly rolled into an average that flattered the pipeline. It showed up on the scope as a curve we had not yet recovered well, next to a curve we had, and that honesty is what made the good numbers believable.
The overlay also gave us a triage tool we did not have from mask scores alone. A wide residual concentrated in one depth interval points at a local failure, a crossing or a faint trace the segmenter dropped, which is a different bug from a residual spread evenly along the whole curve, which points at a systematic offset. The overlay separates those two at a glance, because the shape of the gap over depth is the diagnosis.
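One way to turn that glance into a number is to ask how much of the total absolute residual sits inside the worst contiguous depth window. This is a hypothetical heuristic, not something the validation notebooks are described as doing; the window length of 30 grid points and the 0.5 threshold are arbitrary illustrative choices.

```python
def residual_concentration(pred, truth, window=30):
    """Fraction of the total absolute residual inside the worst contiguous window."""
    abs_res = [abs(p - t) for p, t in zip(pred, truth)]
    total = sum(abs_res)
    if total == 0.0:
        return 0.0
    window = min(window, len(abs_res))
    running = sum(abs_res[:window])
    worst = running
    # Slide the window along the depth axis, updating the sum incrementally.
    for i in range(window, len(abs_res)):
        running += abs_res[i] - abs_res[i - window]
        worst = max(worst, running)
    return worst / total


def triage(pred, truth, window=30, threshold=0.5):
    """Label a residual as 'local' or 'systematic' (threshold is an assumption)."""
    share = residual_concentration(pred, truth, window)
    return "local" if share >= threshold else "systematic"


# Synthetic 300-point curves.
truth = [0.5] * 300
offset = [0.55] * 300                                   # small miss everywhere
dropped = [0.9 if 100 <= k < 120 else 0.5 for k in range(300)]  # one bad interval

offset_share = residual_concentration(offset, truth)
dropped_share = residual_concentration(dropped, truth)
```

A uniform offset puts about a tenth of its residual in any 30-point window out of 300, while the dropped interval puts all of it in one, so the two failure modes separate cleanly on this toy data. Real residuals will be noisier and the threshold would need tuning.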
Limitations
The overlay is a validation instrument, not a source of new signal, and it inherits every weakness of its inputs. The ground-truth logs it compares against are themselves digitised or hand-picked, so the residual measures agreement with a reference that is not error-free; a small MAE against an imperfect truth is not the same as a small error against the physical borehole. The cubic-spline resampling assumes the underlying curve is smooth between samples, which is usually fair for petrophysical logs but will round off genuinely sharp features and can introduce mild overshoot near steep transitions, so the residual near a hard step is partly an artifact of the interpolant rather than the model. The 300-point grid is a fixed choice tuned for the wells in this engagement; a much higher-resolution log would want a denser grid, and reporting on a grid coarser than the true sampling can hide real error. The per-curve statistics reported here come from specific examples on one loss configuration and are not a claim about population-level accuracy across the archive. And the whole contract validates the curve, not the depth registration that placed it: a curve that is right in shape but shifted in depth can still overlay poorly, and disentangling a shape error from a depth error is work the overlay flags but does not finish.
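The last limitation is easy to demonstrate on synthetic data. In the sketch below the predicted curve has exactly the right shape but is shifted by 2% of the depth interval; the sinusoidal log, the shift size, and the unit depth range are all made up for illustration.

```python
import math


def mae(pred, truth):
    """Mean absolute error between two curves on the same grid."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def log_shape(depth):
    """Synthetic log: three full oscillations over a unit depth interval."""
    return 0.5 + 0.4 * math.sin(2.0 * math.pi * 3.0 * depth)


grid = [k / 299 for k in range(300)]
truth = [log_shape(d) for d in grid]
aligned = [log_shape(d) for d in grid]          # right shape, right depth
shifted = [log_shape(d - 0.02) for d in grid]   # right shape, wrong depth

aligned_mae = mae(aligned, truth)
shifted_mae = mae(shifted, truth)
```

The shifted curve scores an MAE near 0.1 despite having no shape error at all, which is the same order as curve two's reported mean. The overlay alone cannot say which kind of error it is looking at.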
References
[1] de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag. https://link.springer.com/book/9780387953663
[2] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
[3] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. (2017). Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI 2017. https://arxiv.org/abs/1706.05721
[4] Willmott, C. J. and Matsuura, K. (2005). Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research. https://www.int-res.com/abstracts/cr/v30/n1/p79-82/