Most write-ups about subsurface machine learning spend their pages on the model. The interesting failure lives one layer down, in a place nobody photographs: the depth reference. On a roughly twenty-month engagement with a mid-sized Middle East carbonate operator we spent real weeks not on architecture but on one unglamorous question, whether three files that all claim to describe the same borehole agree on where anything is. They did not. The image log, the interpreter's Excel pick sheet, and the derived-property file each carried their own depth reference, and the offsets between them were small enough to miss and large enough to poison every label we trained on. This note is about that offset, why it does not behave the way you expect, and the reconciliation discipline we put in front of the model before any accuracy number was worth reporting.
Three files, three ideas of where a feature is
Start with the physical fact that sets the floor. In this dataset one pixel of the unrolled image corresponded to about 3 cm of depth, a consequence of the finite resolution of the binary wireline log the image comes from. So a plus-or-minus 3 cm depth uncertainty is present in every pick before any model runs, because you cannot localise a feature more precisely than one raster cell. That number, 3 cm, is not a tuning knob. It is the smallest depth difference the data can even represent, and it becomes the yardstick everything else is measured against.
Now the three files. The image log is a raster indexed in its own depth channel. The interpreter works in separate software and hands back an Excel sheet of picks, each row a measured depth with a dip and azimuth. A third derived-property file carries computed quantities against yet another depth column. Each file is internally consistent; the problem is between them. If the Excel reference and the image-log reference disagree by even a few centimetres, nothing throws an error, nothing looks wrong at coarse scale, and the pipeline runs to completion. The labels it produces are simply wrong, and wrong in a way that stays invisible until a downstream metric refuses to move.
Why the offset displaces the label instead of blurring it
The mechanism is specific, and it comes straight from how we built the training labels. We did not hand-annotate pixel masks. We synthesised the ground truth by reading the operator's feature-listing Excel and reconstructing each sinusoid from its measured dip and azimuth, then placing that reconstructed curve at the depth the Excel gave it. In one Phase-1 well that meant 1,501 picks, all bedding, reconstructed and stacked at their listed depths across roughly 2,228 to 2,890 m. The label a model learns for a feature is therefore anchored entirely to the Excel depth reference.
Follow that through. If the Excel reference sits 9 cm below the image-log reference, every reconstructed sinusoid is placed 9 cm away from the feature the image actually shows. The label does not become fuzzy around the true position. It moves, cleanly and completely, to a depth where there is nothing, and the network learns to associate a correctly-shaped curve with the wrong place. It does so confidently, because the label is internally crisp. A blur you can sometimes train through. A rigid displacement you cannot, because the target itself is lying about location with a straight face.
This is the point where the whole thing is easiest to get wrong and hardest to notice, so it is worth seeing it move.
The metric only means something after the references agree
Here is why registration is not housekeeping you do before the real work, but the precondition for the real work having any meaning. We evaluate the detector in physical depth, not in an abstract overlap score, because a sinusoid has no bounding box a mask metric could score. The definitions we agreed with the client's expert interpreter are blunt on purpose. A correctly-shaped sinusoid predicted at the wrong depth counts as a false positive. A true positive is the least-depth-error match within 2 cm. Dip and azimuth mean-absolute-error are computed on the true positives only, so a model that misplaces its curves never even reaches the dip-accuracy table.
Read those two rules next to an unregistered offset and the trap is obvious. A 9 cm reference offset guarantees that a shape the model got exactly right is scored as a false positive, because it lands more than 2 cm from where the ground truth claims the feature is. Precision and recall collapse for a reason that has nothing to do with the model and everything to do with the depth column, and the network is learning fine while the numbers look terrible.
Because human depth picks carry uncertainty of their own, the agreed evaluation allowed permissible relaxations of 2, 4, and 6 cm in depth, set with the interpreter rather than assumed by us. Those relaxations are a negotiated tolerance on top of a reconciled reference, not a substitute for reconciliation. A 6 cm relaxation over a 9 cm silent offset still scores a right answer as wrong.
The evidence that made this non-negotiable
The reason we could not wave the mismatch away as noise is that the picks were already sparse and irregular before any offset was added. When we plotted the distance between adjacent human picks against depth across a whole vertical well, most spacings sat near zero, but the distribution had outliers reaching about 28 m. One horizontal well was worse: sampling-interval interrogation of its raw pick sheet showed gaps up to about 175 m, and when we flagged it the interpreter confirmed he had deliberately focused on fractures over beddings in that interval. That is a legitimate interpretation choice, not an error, but it means the label field is honestly incomplete and honestly uneven.
Sparse, uneven labels and a silent depth offset compound each other. With picks already far apart, a reference shift can walk a reconstructed sinusoid past the nearest real feature entirely, so it no longer competes for a match. You cannot separate a model that is missing features from a model whose labels have been slid off them unless you have first proven the references agree. The QC scatter that exposed the 28 m outliers was as load-bearing as any architecture decision, because it told us label quality was the thing under test.
What reconciliation looks like in practice
The discipline is not exotic. It is a gate you refuse to skip. Before a well enters the training set, its Excel and derived-file depth references are checked against the image-log reference at known tie points, the offset is measured, and the labels are shifted onto the image-log raster. We treated depth-range disagreement between files as a first-class data-QC signal, alongside range checks on raw pixel values, because a static and a dynamic version of the same log arriving with depth coverage differing by metres is the same failure wearing a different mask.
The floor sets the acceptance criterion. Once the offset between two references is at or under one pixel, 3 cm here, there is nothing left to fix, because you have reached the resolution of the data and any residual disagreement is indistinguishable from raster noise. Reconciliation has a natural stopping point rather than a chase after false precision. Get every file onto the same reference to within one pixel, and only then let the accuracy numbers speak.
The lesson we keep relearning
None of this is glamorous, which is exactly why it breaks quietly. A model architecture gets reviewed, argued over, and ablated; a depth column gets trusted. The failure we spent weeks on was not in the network and not in the loss. It was in the assumption that three files describing one borehole shared one idea of depth. They did not, the offset was a handful of centimetres, and it was enough to make a correct model look broken. Depth-reference reconciliation, gated by the 1-pixel-equals-3-cm floor, is the work that has to happen before an accuracy figure carries any information at all.
Limitations
The 3 cm-per-pixel figure, the false-positive and true-positive definitions, the 2, 4, and 6 cm permissible relaxations, the label-synthesis-from-Excel mechanism, and the pick-gap figures of roughly 28 m and 175 m are all specific to this Middle East carbonate engagement and its tools; the exact numbers will differ on other datasets, though the failure mode does not. The interactive figure draws an illustrative sinusoid and lets you inject an offset by hand, so its curve shape and the specific centimetre value you drag are not measured quantities. We report the mechanism and the discipline, not a controlled study isolating the contribution of registration to final accuracy, because in practice reconciliation and the rest of the pipeline matured together rather than as separable experiments. Finally, the permissible relaxations were negotiated with one expert interpreter for one operator and should be re-agreed, not copied, on any new engagement.