
Case Study

Mask-to-CSV Validation With Cubic-Spline Curve Overlays

A segmentation mask is a proof no geoscientist can read. It scores well or badly on pixels, but a reviewer signing off a digitised well does not care about pixels; they care whether the recovered curve tracks the log they can see with their own eyes. This is our account of the validation contract that closed that gap: convert each predicted mask to a curve, resample both the prediction and the ground-truth log onto one shared depth axis of 300 points by cubic spline, overlay them, and read the per-curve error straight off the space between the two traces. The overlay turned an opaque mask quality number into a verdict a person could sign, curve by curve, with the honest cases and the weak ones both visible in the same frame.

Case study

The segmenter, VeerNet, was doing well by every number we tracked while training it, and none of those numbers convinced the one person whose opinion decided whether the pipeline shipped. The model produced masks, and we scored the masks against ground-truth masks, and the scores were honest measurements of how many pixels the model got right. But a geoscientist asked to trust a digitised well does not read pixels. They read curves. They want to lay the recovered curve next to the log they already know and see, with their own eyes, whether the two agree over depth. A pixel-space score cannot be inspected that way. It is a proof written in a language the reviewer does not sign in. This is the account of the validation step we built to translate: the mask-to-CSV overlay, where a predicted mask becomes a curve, the curve is resampled onto the same depth axis as the truth, and the error is read off the gap between them, one curve at a time.

The mask was a proof nobody could read

A segmentation mask answers a question about pixels: given the log image, which pixels belong to curve one, which to curve two, and which to background? Training against that question is exactly right, and the loss functions that optimise it, from Dice through the Tversky variant we settled on for its control over the false-negative penalty on thin foreground, are built for it [3]. The encoder-decoder that produces the mask is the standard shape for this kind of dense prediction, contracting to features and expanding back to a per-pixel labelling with skip connections carrying the fine detail across [2]. All of that is defensible.
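To make the false-negative control concrete, here is a minimal NumPy sketch of the Tversky loss for a single foreground class, following the index defined in [3]. The function name and the default weights (alpha 0.3, beta 0.7) are illustrative assumptions, not the settings used in this engagement.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Return 1 - Tversky index for one foreground class.

    pred:   predicted foreground probabilities in [0, 1]
    target: binary ground-truth mask of the same shape
    alpha weights false positives, beta weights false negatives.
    The default weights are illustrative (assumption).
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)            # soft true positives
    fp = np.sum(pred * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - pred) * target)    # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

With alpha and beta both at 0.5 the index reduces to the Dice coefficient. Raising beta above alpha makes a missed curve pixel cost more than a spurious one, which is the property that matters for a thin trace.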

What it cannot do is tell a petrophysicist that the resistivity curve on well such-and-such was recovered faithfully between two depths. The mask quality number and the question the reviewer actually asks are not the same question, and the distance between them is not rhetorical. A mask can score respectably on pixel counts while the curve it implies wanders off the true reading in exactly the interval the reviewer cares about, because a handful of misplaced pixels on a thin trace move the curve more than they move any pixel-averaged score. So we had a model that was good and a validation story that could not be signed, and until we fixed the second thing the first thing did not matter.

The overlay contract

The fix was to stop reporting on the mask and start reporting on the curve. Every predicted mask was collapsed to a curve, one value of the logged quantity per depth, which is the deliverable the whole system exists to produce. But a predicted curve and a ground-truth log that come off different rasters are not directly comparable: they do not share a depth sampling, they do not have the same number of points, and their depth steps are not aligned. Comparing them point by point requires first putting them on a common footing.
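One way to collapse a mask to a curve can be sketched as follows. This is a sketch under stated assumptions, not the pipeline's actual routine: it assumes a binary mask for a single curve, with rows running down the depth axis and at least two columns running across the value axis, and it takes the mean foreground column in each row as the curve position. The function name and the linear pixel-to-unit scaling are hypothetical.

```python
import numpy as np

def mask_to_curve(mask, depth_top, depth_bottom, value_left, value_right):
    """Collapse a binary single-curve mask to (depth, value) samples.

    Assumes rows map linearly to depth and columns map linearly to the
    logged quantity (both assumptions). Rows with no foreground pixels
    are dropped rather than guessed.
    """
    mask = np.asarray(mask, dtype=bool)
    n_rows, n_cols = mask.shape
    cols = np.arange(n_cols)

    counts = mask.sum(axis=1)
    has_pixels = counts > 0

    # Mean foreground column per row, only where the row has any pixels.
    centre = (mask[has_pixels] * cols).sum(axis=1) / counts[has_pixels]

    depth = np.linspace(depth_top, depth_bottom, n_rows)[has_pixels]
    value = value_left + centre / (n_cols - 1) * (value_right - value_left)
    return depth, value
```

The two returned arrays can be written out as the CSV columns, for example with `np.savetxt` on `np.column_stack((depth, value))`. Dropping empty rows leaves the depth sampling irregular, which is one reason the result is not yet comparable with the truth log point by point.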

That common footing was a shared depth axis. We resampled both the predicted curve and the ground-truth log onto one regular grid of 300 depth points using a cubic spline, the standard piecewise-cubic interpolant that reconstructs a smooth curve through known samples without the oscillation of a single high-degree polynomial [1]. Three hundred points was the density the validation notebooks settled on: enough that the two traces resolve as continuous curves rather than a coarse staircase, few enough to stay cheap to compute across a validation set. Once both curves live on that grid, they are directly subtractable, and every downstream number, every plot, every verdict comes from that subtraction.
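The resampling step can be sketched with SciPy's `CubicSpline`. This is a minimal illustration, not the notebooks' code: the function name is hypothetical, the spline uses SciPy's default boundary condition, and it assumes each curve's depths are strictly increasing and that the shared axis spans only the interval both curves cover.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_to_shared_axis(depth_pred, value_pred,
                            depth_true, value_true, n_points=300):
    """Resample two curves onto one regular depth grid by cubic spline.

    Depth arrays must be strictly increasing (assumption). The grid is
    limited to the overlapping depth interval so that neither spline
    has to extrapolate.
    """
    depth_pred = np.asarray(depth_pred, dtype=float)
    depth_true = np.asarray(depth_true, dtype=float)

    top = max(depth_pred[0], depth_true[0])
    bottom = min(depth_pred[-1], depth_true[-1])
    grid = np.linspace(top, bottom, n_points)

    pred_on_grid = CubicSpline(depth_pred, value_pred)(grid)
    true_on_grid = CubicSpline(depth_true, value_true)(grid)
    return grid, pred_on_grid, true_on_grid
```

After this call the residual is simply `pred_on_grid - true_on_grid`, one value per grid point.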

Reading the error off the gap

With the two curves on the same axis, the error is the space between them, and it is measurable two ways that a reviewer understands. Mean absolute error is the average vertical distance between the predicted trace and the truth, in the curve's own units, which is precisely the quantity a person estimates when they eyeball how far the prediction sits from the log; it is the metric to lead with when you want the number to mean what the reader thinks it means [4]. Mean squared error is the same idea with large excursions weighted more heavily, which surfaces the occasional bad interval that MAE would average away. We reported both, per curve.
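The difference between the two metrics is easiest to see on a toy residual. The sketch below assumes both curves are already on the shared grid; the function name is illustrative.

```python
import numpy as np

def mae_mse(pred_on_grid, true_on_grid):
    """Return (MAE, MSE) of a predicted curve against the truth,
    both sampled on the same depth grid."""
    residual = (np.asarray(pred_on_grid, dtype=float)
                - np.asarray(true_on_grid, dtype=float))
    mae = float(np.mean(np.abs(residual)))
    mse = float(np.mean(residual ** 2))
    return mae, mse

# Two synthetic residuals with the same MAE but different MSE.
truth = np.zeros(10)
steady_offset = np.full(10, 0.1)   # small error everywhere
single_spike = np.zeros(10)
single_spike[4] = 1.0              # one bad interval, clean elsewhere

offset_mae, offset_mse = mae_mse(steady_offset, truth)
spike_mae, spike_mse = mae_mse(single_spike, truth)
```

Both cases give an MAE of 0.1, but the spike's MSE is ten times the offset's (0.1 against 0.01), which is why reporting both metrics per curve exposes the occasional bad interval.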

The numbers told a story with a shape. On the Tversky-loss multiclass run, curve one came out clean: mean MAE of 0.0277 and mean MSE of 0.0021, and on its best example the overlay reached an R-squared of 0.9891, the coefficient of determination saying the predicted trace explained almost all of the variance in the true log. A second example of the same curve was still strong at 0.8126. Curve two was the honest counterweight. Its mean MAE was 0.1241 and its mean MSE 0.0253, about 4.5 times curve one's MAE and about 12 times its MSE, and its example landed at an R-squared of 0.5461. That is not a good curve, and the point of the overlay is that it does not pretend to be. The gap is visible, on the same axis, in the same frame as the good one.
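For reference, the coefficient of determination can be computed from the same two resampled arrays. The sketch below uses the usual definition, one minus the ratio of residual to total sum of squares; the source does not say which R-squared variant the notebooks used, so treat this form as an assumption.

```python
import numpy as np

def r_squared(pred_on_grid, true_on_grid):
    """Coefficient of determination of a prediction against the truth.

    Not symmetric: the variance being explained is that of the truth,
    so the argument order matters.
    """
    pred = np.asarray(pred_on_grid, dtype=float)
    true = np.asarray(true_on_grid, dtype=float)
    ss_res = np.sum((true - pred) ** 2)           # unexplained variation
    ss_tot = np.sum((true - true.mean()) ** 2)    # total variation in the truth
    return float(1.0 - ss_res / ss_tot)
```

A perfect overlay gives 1.0, and a prediction no better than the mean of the true log gives 0.0. The value is undefined for a perfectly flat truth log, where the total sum of squares is zero.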

[Interactive exhibit: mask-to-CSV overlay of the predicted curve on the truth log. Panel A selects the curve and example on the scope (curve1 ex3, curve1 ex2, curve2 ex2). Panel B is the overlay scope at up to 300 depth points, showing the truth log, the predicted trace, and the residual over the depth window, with a resample-density control.]
The validation step that made a segmentation mask signable. Rather than report a mask quality score a geoscientist cannot read, we resampled both the predicted curve and the ground-truth log onto one shared depth axis of 300 points by cubic spline, overlaid them, and read the per-curve error straight off the gap. Lever A puts one of three sourced Tversky-loss cases on the scope: curve1 example 3 at R-squared 0.9891 (the clean overlay), curve1 example 2 at 0.8126, and curve2 example 2 at 0.5461, the case where the overlay honestly shows a curve the mask could not pin. Lever B drags the resample density: below a usable count the overlay is a coarse staircase no reviewer would sign, and at the 300 points the validation notebooks used the two traces line up point for point. The orange residual band is the only element that argues, because it is the visible error a geoscientist signs off. The per-curve R-squared values, the mean MAE of 0.0277 and 0.1241, the mean MSE of 0.0021 and 0.0253, and the 300 depth points are sourced from the engagement archive; the two plotted curve shapes are illustrative traces generated to the sourced statistics, not the wells themselves.

The exhibit is that comparison made interactive. Put each of the three sourced cases on the scope and the predicted trace sits over the truth log with the residual drawn as a band between them; the band is thin where curve one tracks its log and wide where curve two loses it. Drag the resample density and watch the reason the axis matters: at a dozen points the overlay is a staircase that hides the error, and at the 300 points the notebooks used the two traces resolve point for point and the residual becomes a verdict you can actually sign. The residual band is the only element that argues, because it is the error a geoscientist puts their name next to.

Why this changed the sign-off

Before the overlay, a review conversation was a negotiation about whether to trust a number. After it, the conversation was a person looking at their own curve with our prediction on top and deciding, per curve, whether the fit was good enough for the use they had in mind. That is a categorically easier thing to approve, and it is also a categorically easier thing to reject, which mattered just as much. The 0.5461 case did not get quietly rolled into an average that flattered the pipeline. It showed up on the scope as a curve we had not yet recovered well, next to a curve we had, and that honesty is what made the good numbers believable.

The overlay also gave us a triage tool we did not have from mask scores alone. A wide residual concentrated in one depth interval points at a local failure, a crossing or a faint trace the segmenter dropped, which is a different bug from a residual spread evenly along the whole curve, which points at a systematic offset. The overlay separates those two at a glance, because the shape of the gap over depth is the diagnosis.
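The distinction between a local failure and a systematic offset can also be put into rough numbers. The following is a hypothetical heuristic, not something the engagement used: it reports how much of the absolute error sits in the worst contiguous window, and how one-sided the residual is. The function name, the window length of 30 grid points, and any thresholds applied to the outputs are assumptions.

```python
import numpy as np

def residual_profile(pred_on_grid, true_on_grid, window=30):
    """Summarise the shape of the residual over depth.

    concentration: share of total absolute error inside the worst
                   contiguous window (near window/len when the error
                   is spread evenly, near 1 when it is local).
    bias_ratio:    |mean signed residual| / MAE (near 1 when the
                   prediction sits on one side of the truth).
    worst_start:   grid index where the worst window begins.
    """
    residual = (np.asarray(pred_on_grid, dtype=float)
                - np.asarray(true_on_grid, dtype=float))
    abs_res = np.abs(residual)
    total = abs_res.sum()
    if total == 0.0:
        return {"concentration": 0.0, "bias_ratio": 0.0, "worst_start": 0}

    # Sliding sum of absolute error over every full window.
    window_sums = np.convolve(abs_res, np.ones(window), mode="valid")
    worst_start = int(np.argmax(window_sums))
    return {
        "concentration": float(window_sums[worst_start] / total),
        "bias_ratio": float(abs(residual.mean()) / abs_res.mean()),
        "worst_start": worst_start,
    }
```

On a 300-point grid, an evenly spread offset gives a concentration near 0.1 with a bias ratio near 1, while a dropped interval pushes the concentration toward 1 and `worst_start` points at the depth to inspect.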

Limitations

The overlay is a validation instrument, not a source of new signal, and it inherits every weakness of its inputs. The ground-truth logs it compares against are themselves digitised or hand-picked, so the residual measures agreement with a reference that is not error-free; a small MAE against an imperfect truth is not the same as a small error against the physical borehole. The cubic-spline resampling assumes the underlying curve is smooth between samples, which is usually fair for petrophysical logs but will round off genuinely sharp features and can introduce mild overshoot near steep transitions, so the residual near a hard step is partly an artifact of the interpolant rather than the model. The 300-point grid is a fixed choice tuned for the wells in this engagement; a much higher-resolution log would want a denser grid, and reporting on a grid coarser than the true sampling can hide real error. The per-curve statistics reported here come from specific examples on one loss configuration and are not a claim about population-level accuracy across the archive. And the whole contract validates the curve, not the depth registration that placed it: a curve that is right in shape but shifted in depth can still overlay poorly, and disentangling a shape error from a depth error is work the overlay flags but does not finish.
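The interpolant artifact near a hard step is easy to reproduce on synthetic data. The sketch below resamples an idealised 0-to-1 step, not a real log, with SciPy's `CubicSpline` and measures how far the resampled trace leaves the range of the original samples; the function name and sample counts are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def step_overshoot(n_samples=20, n_points=300):
    """Resample a synthetic hard step by cubic spline and return how far
    the result overshoots above and undershoots below the sample range."""
    depth = np.arange(n_samples, dtype=float)
    value = np.where(depth < n_samples / 2, 0.0, 1.0)   # step at mid-depth

    grid = np.linspace(depth[0], depth[-1], n_points)
    resampled = CubicSpline(depth, value)(grid)

    overshoot = float(resampled.max() - value.max())
    undershoot = float(value.min() - resampled.min())
    return overshoot, undershoot

overshoot, undershoot = step_overshoot()
```

Both returned values are positive: the spline rings on either side of the step even though every input sample lies between 0 and 1. A residual of that size next to a sharp bed boundary therefore should not be attributed to the model alone.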

References

  1. de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag. https://link.springer.com/book/9780387953663

  2. Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

  3. Salehi, S. S. M., Erdogmus, D., and Gholipour, A. (2017). Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI 2017. https://arxiv.org/abs/1706.05721

  4. Willmott, C. J. and Matsuura, K. (2005). Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research. https://www.int-res.com/abstracts/cr/v30/n1/p79-82/


© 2026 Copyright. Earthscan