The step that decides whether a curve-prediction score means anything happens before any score is computed, and it is easy to skip because it looks like plumbing. A model that lifts curves off a scanned well log reads pixels and emits a curve sampled at whatever spacing the raster and the network's stride imply. The reference it will be graded against is a digital LAS log sampled at its own fixed depth step, a step chosen years earlier by whoever logged the well and unrelated to our pixels [4]. Those two depth axes almost never coincide. Two series that do not share a depth index cannot be subtracted, so the residual between prediction and truth does not exist yet, and R-squared, mean absolute error (MAE), and mean squared error (MSE) are not merely noisy but undefined. This note is about the contract that fixes that: resample both series onto one common depth grid first, score second. It is a validation-side concern, separate from how VeerNet, the encoder-decoder EarthScan uses to digitise raster logs, is trained or served.
We want to be precise about what problem this is and is not. It is not about the model's accuracy; a perfect predictor still produces a curve on the wrong depth samples relative to the reference, and a naive scorer would either error out or, worse, silently pair up depths that do not correspond and report a meaningless number. It is also not about interpolating to invent data the model never produced. The resample is purely an evaluation convenience: it puts prediction and truth on identical depths so that a residual is well defined at every point, and it does so with a method chosen to add as little of its own shape as possible.
Two depth axes that were never going to match
Start with why the mismatch is the normal case rather than an edge case. A LAS reference curve is a column of values on a regular depth step, but the step is a property of the logging run, not of anything we control. Our prediction comes out of a raster, and its sampling after the network is set by the raster and the architecture. There is no reason for the two samplings to align: they can have different start depths, different steps, and sometimes different spans where the scan clips a track the LAS includes. If you zip the two arrays together by position, you are comparing our value at one depth to the reference at a different depth, and the error you compute is dominated by that offset rather than by the model. The comparison has to be re-expressed on a shared axis before it says anything about the model at all.
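A small sketch makes the positional mismatch concrete. The two depth axes here are invented for illustration (a 0.5 m reference step and a 0.37 m prediction step with a different start depth); neither comes from the archive.

```python
import numpy as np

# Hypothetical depth axes, not archive values: the reference on a 0.5 m step,
# the prediction on an unrelated 0.37 m step with a different start depth.
ref_depth = np.arange(1000.0, 1050.0, 0.5)
pred_depth = np.arange(1001.3, 1049.0, 0.37)

# Pairing by array position compares sample i with sample i,
# regardless of which depth each one actually sits at.
n = min(len(ref_depth), len(pred_depth))
offset = pred_depth[:n] - ref_depth[:n]

print(f"paired samples: {n}")
print(f"depth offset at first pair: {offset[0]:+.2f} m")
print(f"depth offset at last pair:  {offset[-1]:+.2f} m")
```

The offset is not even constant: because the steps differ, it drifts along the well, so no single shift would repair a positional pairing.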
The shared axis we used is a fixed grid of 300 interpolated depth points spanning the overlap of the two curves. Three hundred is enough to represent the curve shape a petrophysicist reads without pretending to a resolution neither series really has, and fixing it as a constant across every validation example is what makes the metrics comparable between examples: a score computed on 300 points for one well and 300 for another is measuring the same thing, where two different native step counts would not be. The grid is the contract. Both the prediction and the reference are lifted onto it, and only then is a residual taken.
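One way the grid could be built is sketched here. The function name `common_grid` and the choice of evenly spaced points are assumptions for illustration; the text fixes only the count (300) and the span (the overlap).

```python
import numpy as np

N_GRID = 300  # fixed grid size stated in the text


def common_grid(depth_a, depth_b, n_points=N_GRID):
    """Return n_points evenly spaced depths spanning the overlap of two axes.

    Hypothetical helper: even spacing is an assumption, not a sourced detail.
    """
    lo = max(np.min(depth_a), np.min(depth_b))
    hi = min(np.max(depth_a), np.max(depth_b))
    if not lo < hi:
        raise ValueError("depth ranges do not overlap; nothing can be scored")
    return np.linspace(lo, hi, n_points)


# Illustrative axes with different starts, steps, and spans.
ref_depth = np.arange(1000.0, 1050.0, 0.5)
pred_depth = np.arange(1001.3, 1049.0, 0.37)
grid = common_grid(ref_depth, pred_depth)
print(len(grid), grid[0], grid[-1])
```

Because the count is a constant, every validation example contributes the same number of residuals no matter how many native samples either series had.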
Cubic splines, because the resampler should be quiet
Given that you must resample, the question is with what. The requirement is unusual for an interpolant: it should be as opinionated as possible about smoothness and as unopinionated as possible about everything else, because any shape the interpolant invents between known points becomes part of the residual and gets charged to the model. Nearest-neighbour resampling introduces stair-steps that are pure artefact. Linear interpolation is honest but kinks the curve at every knot, and those kinks are not in either the prediction or the truth. A cubic spline threads a piecewise-cubic through the known samples with continuous first and second derivatives, so the resampled curve is smooth through the knots without the overshoot a single high-degree polynomial would produce across the whole span [1]. That second-derivative continuity is the property we actually want: it means the value the grid reads between two native samples is a smooth, local blend rather than a fabricated wiggle, so the residual reflects the model and the reference, not the resampler. The routine is standard and we used the standard one rather than writing our own [2].
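The difference between interpolants can be seen on a synthetic curve where the true in-between values are known. This sketch assumes `scipy.interpolate.CubicSpline` as the standard routine (the text names SciPy but not a specific function) and uses a sine wave as a stand-in for a smooth log; real curves are not this well behaved.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Smooth synthetic "log" on a coarse native axis (illustrative only).
native_depth = np.linspace(0.0, 2.0 * np.pi, 20)
native_value = np.sin(native_depth)

# Fine evaluation grid where the true curve is known analytically.
grid = np.linspace(native_depth[0], native_depth[-1], 300)
truth_on_grid = np.sin(grid)

# Linear interpolation: continuous, but with a slope break at every knot.
linear = np.interp(grid, native_depth, native_value)
# Cubic spline: continuous first and second derivatives through the knots.
spline = CubicSpline(native_depth, native_value)(grid)

linear_err = np.max(np.abs(linear - truth_on_grid))
spline_err = np.max(np.abs(spline - truth_on_grid))
print(f"max error, linear: {linear_err:.2e}")
print(f"max error, cubic spline: {spline_err:.2e}")
```

On this smooth test curve the spline's between-sample values sit much closer to the true curve than the linear ones, which is the sense in which it adds less of its own shape to the residual.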
There is a discipline point hiding here. The spline is applied to both series, prediction and reference alike, with the same grid and the same method. Resampling only one of them, or the two with different methods, would bias the comparison by whatever the two methods disagree about. The contract is symmetric on purpose, so that anything left in the residual is a real disagreement between what the model said and what the log said, at a depth where both are now defined.
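The symmetric contract can be sketched as a single function that treats both series identically. The name `resample_pair` is hypothetical, `CubicSpline` is again an assumed choice of routine, and both depth arrays are assumed to be strictly increasing.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def resample_pair(pred_depth, pred_value, ref_depth, ref_value, n_points=300):
    """Resample prediction and reference onto one grid with one method.

    Assumes both depth arrays are strictly increasing.
    """
    lo = max(pred_depth[0], ref_depth[0])
    hi = min(pred_depth[-1], ref_depth[-1])
    grid = np.linspace(lo, hi, n_points)
    # Same interpolant, same grid, for both series.
    pred_on_grid = CubicSpline(pred_depth, pred_value)(grid)
    ref_on_grid = CubicSpline(ref_depth, ref_value)(grid)
    return grid, pred_on_grid, ref_on_grid


def curve(depth):
    """Synthetic stand-in for a log curve (illustrative only)."""
    return np.sin(depth / 5.0)


# A "perfect predictor": both series sample the same underlying curve,
# but on different, unaligned depth axes.
ref_depth = np.arange(1000.0, 1050.0, 0.5)
pred_depth = np.arange(1001.3, 1049.0, 0.37)
grid, pred_on_grid, ref_on_grid = resample_pair(
    pred_depth, curve(pred_depth), ref_depth, curve(ref_depth)
)
residual = ref_on_grid - pred_on_grid
print(f"largest |residual|: {np.max(np.abs(residual)):.1e}")
```

With a perfect predictor the residual on the shared grid is close to zero, which is what the contract is meant to guarantee: whatever remains in a real comparison is disagreement between model and log, not between two resamplers.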
What the honest numbers were
Once both curves live on the common grid, the metrics compute cleanly and can be read for what they are. On the validation set the best coefficient of determination we recorded was an R-squared of 0.9891, from the Tversky-loss run on curve 1, and the lowest errors on the resampled comparison were an MAE of 0.0132 and an MSE of 0.0004. Those are the headline numbers, and the reason we trust them is not that they are high but that they are computed on a grid where a residual is actually meaningful. A high R-squared taken from a positionally zipped comparison would be an accident of how the two step sizes happened to line up; the same number on the 300-point grid is a statement about the model.
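For reference, the three metrics reduce to a few lines once both arrays share the grid. This is a minimal sketch using the textbook definitions; the function name `grid_metrics` is hypothetical and the toy inputs are not archive data.

```python
import numpy as np


def grid_metrics(pred_on_grid, ref_on_grid):
    """R-squared, MAE, and MSE for two series already on the same grid."""
    pred = np.asarray(pred_on_grid, dtype=float)
    ref = np.asarray(ref_on_grid, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("series must be resampled onto the same grid first")
    residual = ref - pred
    ss_res = np.sum(residual ** 2)              # unexplained sum of squares
    ss_tot = np.sum((ref - ref.mean()) ** 2)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residual)),
        "mse": np.mean(residual ** 2),
    }


# Toy values chosen so the arithmetic is easy to check by hand.
scores = grid_metrics([1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0])
print(scores)
```

The shape check is the contract in miniature: the function refuses to score arrays that have not been put on one grid.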
The per-curve picture is less flattering and more useful. When we reconstructed the curves from the multiclass masks and compared them as CSV series under Dice loss, the mean absolute error was 0.11 for curve 1 and 0.12 for curve 2. Those are eight to nine times the best single-run MAE of 0.0132, and the gap is the point: the best number is a peak on the most favourable loss and curve, while the per-curve CSV figures are the everyday accuracy across the harder multiclass reconstruction. Both are honest because both are on the shared grid; they simply answer different questions, one about the ceiling and one about the working average. Reporting only the first would flatter the model, and reporting only the second would understate the best case. The resample is what lets us put them side by side without comparing apples to a different well's oranges.
It is also on this grid that the choice between MAE and MSE stops being cosmetic. With residuals now defined pointwise, an absolute-error summary tracks the typical miss in curve value at a given depth, while a squared-error summary is pulled around by the few largest misses and folds in the spread of the errors as well as their average size [3]. We read MAE as the number a geoscientist would feel and kept MSE as the tail check, but the precondition for reading either is the same: both curves on one axis first.
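A toy comparison shows how the two summaries diverge. The two residual vectors are invented to have the same MAE; they are not taken from any validation run.

```python
import numpy as np


def mae(residual):
    return np.mean(np.abs(residual))


def mse(residual):
    return np.mean(residual ** 2)


# Two hypothetical residual vectors on a 300-point grid.
even = np.full(300, 0.01)   # every point misses by 0.01
spiky = np.zeros(300)
spiky[:3] = 1.0             # three large misses, the rest exact

for name, residual in (("even", even), ("spiky", spiky)):
    print(f"{name:>5}: MAE={mae(residual):.4f}  MSE={mse(residual):.6f}")
```

Both vectors have the same MAE, but the MSE of the spiky one is far larger, which is why MSE works as a tail check and MAE as the typical-miss figure.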
Limitations
This is an evaluation contract, and it should not be asked to carry more than that. The 300-point grid, the peak R-squared of 0.9891, the lowest MAE of 0.0132 and MSE of 0.0004, and the CSV MAE of 0.11 and 0.12 are the real archive figures. The two curve shapes drawn in the exhibit, by contrast, are illustrative geometry chosen to show the alignment mechanism, not a plotted well; the resample and the residual it enables are the sourced parts. A cubic spline is the right default here, but it is still an assumption. Where a native series is very sparse or genuinely discontinuous, the spline will smooth across a feature that was real, so the resampled residual can flatter or penalise the model in ways that are properties of the interpolant, not the prediction. The honest response is to keep the grid no finer than the coarser series can support, which is part of why the grid has 300 points rather than thousands. The grid also spans only the overlap of the two depth ranges, so any interval the scan clipped or the LAS omitted is simply not scored, and a model could be silently good or bad there without the metric noticing. And none of this speaks to whether the validation split was representative, whether the synthetic training logs covered the field failure modes, or whether a curve that scores well on the grid is actually usable downstream, which remain the questions that decide whether the model is any good.
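The discontinuity caveat is easy to reproduce. This sketch uses an invented ten-sample step series, not a real log, and again assumes `scipy.interpolate.CubicSpline` as the interpolant.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical sparse series with a genuine jump between two samples.
depth = np.arange(10.0)                    # 10 native samples at 0, 1, ..., 9
value = np.where(depth < 5.0, 0.0, 1.0)    # step from 0 to 1

grid = np.linspace(depth[0], depth[-1], 300)
resampled = CubicSpline(depth, value)(grid)

print("native range:    [0.000, 1.000]")
print(f"resampled range: [{resampled.min():.3f}, {resampled.max():.3f}]")
```

The resampled curve turns the jump into a smooth ramp and swings slightly outside the native range on either side of it. Any residual computed there carries that shape, which belongs to the interpolant rather than to the series.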
The score starts at the shared axis
The habit this left us with is to treat the resample as part of the metric rather than as setup for it. A number computed before both curves share a depth axis is not a worse version of the real score; it is a different quantity that happens to have the same name, and the difference is exactly the offset between two sampling schemes that were never designed to agree. Put both series on one fixed grid with a quiet interpolant first, and R-squared and MAE go back to meaning what they are supposed to mean: a statement about the model, measured where the model and the truth are both defined.
References
[1] de Boor, C. A Practical Guide to Splines, revised ed. Applied Mathematical Sciences 27, Springer (2001). The standard treatment of spline interpolation, including the cubic spline whose second-derivative continuity gives a smooth resampled curve without ringing between knots. https://link.springer.com/book/9780387953663
[2] Virtanen, P., et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), pp. 261-272. The interpolation routines, cubic splines among them, used to place two differently sampled series on a shared axis in practice. https://www.nature.com/articles/s41592-019-0686-2
[3] Willmott, C. J., and Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research 30 (2005), pp. 79-82. Why an absolute-error and a squared-error summary of the same residuals answer different questions. https://www.int-res.com/abstracts/cr/v30/cr030079
[4] Canadian Well Logging Society. LAS Version 2.0: A Digital Standard for Logs. The format context for why a reference curve arrives on a fixed but arbitrary depth step that seldom matches a model's pixel-derived sampling. https://www.cwls.org/products/#products-las