The assumption a synthetic benchmark cannot see
A synthetic image-log benchmark makes a quiet promise before a single model touches it: wherever an event exists in the image, a label exists in the ground truth. The generator that drew the sinusoid also wrote its depth, dip and azimuth, so density is total and uniform by construction. Nothing is missed, because nothing is human. This is the property that makes a synthetic set convenient, and it is also the property that makes an accuracy number computed on it optimistic in a way the number itself cannot report.
Real wells break the promise on the first pass through the raw picks. We work on borehole image logs from a mid-sized Middle East carbonate operator, and the ground truth we train and grade against is a set of expert picks recorded well by well: for each interpreted sinusoid, a depth and an orientation, entered by hand by an interpreter reading the image. When we plotted the distance between each pick and the next one below it down a whole well, the plot did not look like the dense uniform wall a synthetic set assumes. Most spacings sat near zero, as expected in a densely worked interval. But the tail ran long. Adjacent picks in one well sat as far as 28 m apart. That single number is the reason this note exists.
Reading the spacing between picks as a signal
The habit we want to install is small: before trusting any per-well accuracy figure, read the spacing between the picks that produced it. Pick spacing is not a nuisance to smooth away. It is a measurement of how the ground truth was made, and it carries information the accuracy number does not.
The raw pick sheets make the point concretely. In one well the picked depths jump from 917 m straight to 1109 m, then the interval from 1109 to 1313 m carries only 5 picked sinusoids, where a comparably thick worked interval elsewhere in the same well carries roughly 50 over 900 to 1764 m. Other jumps sit in the record the same way: 1731 m to 2054 m, 2096 m to 2318 m, 2898 m to 3100 m. A separate read of the sampling interval alone, the depth step between consecutive picks with no orientation attached, shows gaps of about 7, 10 and 27 m opening around 1750, 2000 and 2400 m, intervals where no sinusoids were picked at all.
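The spacing read itself takes only a few lines. What follows is a minimal sketch, not the tooling used on the study: the function names `pick_spacing` and `wide_gaps`, the 20 m default cut-off and the depth list are all assumptions made for illustration. The example depths reuse the 917 m to 1109 m jump and the five picks between 1109 and 1313 m described above; every other value is invented.

```python
from typing import List, Tuple


def pick_spacing(depths_m: List[float]) -> List[Tuple[float, float, float]]:
    """Return (upper_depth, lower_depth, gap) for each pair of adjacent picks."""
    ordered = sorted(depths_m)
    return [(a, b, b - a) for a, b in zip(ordered, ordered[1:])]


def wide_gaps(depths_m: List[float], threshold_m: float = 20.0):
    """Keep only adjacent-pick gaps wider than threshold_m (an assumed cut-off)."""
    return [g for g in pick_spacing(depths_m) if g[2] > threshold_m]


# Illustrative depths only. The 917 -> 1109 m jump mirrors the text; the
# individual pick depths around it are hypothetical, not read from a pick sheet.
example_depths = [900.0, 904.5, 909.0, 917.0,
                  1109.0, 1150.0, 1201.0, 1260.0, 1313.0]

for upper, lower, gap in wide_gaps(example_depths, threshold_m=100.0):
    print(f"no picks between {upper:.0f} m and {lower:.0f} m ({gap:.0f} m gap)")
```

The cut-off is the one judgment call here. A sensible value depends on how densely the rest of the well was worked, so it is better chosen per well than fixed globally.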
There are two explanations for a gap like this, one innocent and one that matters, and spacing is what prompts you to tell them apart. Either the rock genuinely had nothing to pick over those metres, or the interpreter did not work that interval in the same detail. A 27 m stretch with no picks in a carbonate that is densely fractured a hundred metres above is not obviously the first case. When we sent these spacing questions back to the interpreter, the answer was sometimes that the interval was deliberately deprioritised, and once, on a horizontal well, that the picks were deliberately concentrated on fractures rather than beddings, opening gaps as wide as 175 m. Spacing exposed the annotator's style before any model was trained: a gap wide enough to be improbable is a flag on the label, not a fact about the reservoir.
Density is not the same number as count
It is tempting to summarise a label set by its count. This well has so many thousand picks; that patch carries so many sinusoids. Count is easy to report and it is the wrong summary, because two wells with the same total pick count can have completely different density profiles, and density is what governs whether a per-well score means anything.
Two facts from the same archive show why count hides the problem. First, at the patch level the labels are dense to the point of collision: more than 95 percent of the training patches carry about 70 sinusoids each, and the remainder, under 5 percent, carry more, which means the common case is not a clean single event but a patch crowded with intersecting sinusoids that a detector has to separate. Second, at the well level the labels are sparse and uneven in exactly the way the spacing plot showed. The same corpus is simultaneously label-dense inside a worked patch and label-sparse across the depth of a well. A single count cannot express both, and an evaluation that quietly assumes uniform density inherits the assumption without stating it.
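The gap between count and density is easy to show with a toy case. The sketch below is an assumption-laden illustration: `density_profile`, the 100 m bin width and both wells are hypothetical, chosen only so that the two wells share a pick count while their profiles differ.

```python
from collections import Counter
from typing import Dict, List


def density_profile(depths_m: List[float], bin_m: float = 100.0) -> Dict[float, int]:
    """Count picks per depth bin, keyed by bin top (m). Empty bins are kept as 0."""
    first = int(min(depths_m) // bin_m)
    last = int(max(depths_m) // bin_m)
    counts = Counter(int(d // bin_m) for d in depths_m)
    return {i * bin_m: counts.get(i, 0) for i in range(first, last + 1)}


# Two hypothetical wells with the same total count (12 picks each).
well_a = [1000 + 25 * i for i in range(12)]               # evenly worked
well_b = [1000 + 2 * i for i in range(10)] + [1290, 1295]  # one cluster, then a long gap

print(len(well_a), density_profile(well_a))
print(len(well_b), density_profile(well_b))
```

Both wells report 12 picks, but the second has an empty 100 m bin in the middle of its picked range. Keeping the zero-count bins in the output is deliberate: they are the part of the profile a bare count throws away.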
The consequence for an accuracy claim is direct. A per-well F1 is computed over the picks that exist. If an interval was never worked, a model that correctly finds a sinusoid there is scored against no ground truth, or worse, penalised as a false positive because the interpreter's silence reads as absence. The score is a statement about the labelled fraction of the well, not about the well. On a densely, uniformly labelled synthetic set the labelled fraction is the whole thing, so the distinction never surfaces, which is why a model can look calibrated on the benchmark and drift the moment it meets a field well whose labels are sparse [1]. The label errors that reorder benchmark rankings in the wider literature are the same failure from a different angle: when the ground truth is uneven, the ranking measures the ground truth as much as the model [2].
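The scoring effect can be reproduced on a handful of numbers. The following is a sketch under stated assumptions, not the scoring code used on the study: the greedy nearest-depth matching rule, the 0.5 m tolerance, the function names and all depths are hypothetical. The picks contain one 28 m unpicked stretch, and one prediction falls inside it.

```python
from typing import List, Tuple


def match_within_tolerance(picks_m: List[float], preds_m: List[float],
                           tol_m: float = 0.5) -> Tuple[int, int, int]:
    """Greedy one-to-one matching by depth. Returns (tp, fp, fn).

    Assumed rule: each prediction claims the nearest unused pick
    within tol_m; anything left unmatched is an error.
    """
    picks = sorted(picks_m)
    used = [False] * len(picks)
    tp = 0
    for p in sorted(preds_m):
        best, best_d = None, tol_m
        for i, g in enumerate(picks):
            d = abs(p - g)
            if not used[i] and d <= best_d:
                best, best_d = i, d
        if best is not None:
            used[best] = True
            tp += 1
    return tp, len(preds_m) - tp, len(picks) - tp


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical picks with a 28 m stretch nobody worked (1002 m to 1030 m).
picks = [1000.0, 1001.0, 1002.0, 1030.0]
# Hypothetical predictions: four near the picks, one at 1015 m inside the gap.
preds = [1000.1, 1001.1, 1002.1, 1015.0, 1030.1]

tp, fp, fn = match_within_tolerance(picks, preds)
print(tp, fp, fn, round(f1(tp, fp, fn), 3))
```

The prediction at 1015 m is counted as a false positive whether or not a sinusoid is really there, which pulls F1 from 1.0 down to 8/9. The score cannot distinguish a model error from an unworked interval.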
What the supply side does to the problem
Annotation density is not only a property of how a given well was picked. It is also downstream of how much labelled data exists at all, and here the field reality is starker than a benchmark ever admits. The study ran on 14 wells in total. For the phase that most needed more annotated wells, the plan requested 25 wells for annotation and 8 arrived. You can hardly afford to hold out a whole well for testing when you have 14, and you cannot average away an uneven density profile when the wells that would balance it were never delivered.
This is why density has to be a first-class dataset property rather than a footnote. When labelled wells are scarce and each one is unevenly picked, the honest move is to measure and report the density, not to quote a single accuracy figure as though the ground truth behind it were uniform. A per-well F1 at a stated depth tolerance is a fine number, but it should travel with the spacing profile of the well it was computed on, so a reader can see how much of the well the number covers. We build this into our own reporting: the depth-tolerance confusion matrix we score against is paired with the pick-spacing read of the same well, so a strong score over a densely worked interval is never quietly generalised to the metres nobody picked.
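One way to make a score travel with its spacing profile is to put both in the same record. The sketch below is hypothetical throughout: the `WellReport` fields, the `worked_fraction` definition (share of the picked depth range not inside a gap wider than an assumed 20 m cut-off), the well name, the depths and the F1 value are placeholders, not figures from the study.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class WellReport:
    well_id: str
    f1: float
    tolerance_m: float
    n_picks: int
    median_gap_m: float
    max_gap_m: float
    worked_fraction: float  # share of picked depth range not inside a wide gap


def build_report(well_id: str, depths_m: List[float], f1: float,
                 tolerance_m: float, wide_gap_m: float = 20.0) -> WellReport:
    """Pair a per-well score with a summary of the picks it was computed on."""
    if len(depths_m) < 2:
        raise ValueError("need at least two picks to measure spacing")
    ordered = sorted(depths_m)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    span = ordered[-1] - ordered[0]
    unworked = sum(g for g in gaps if g > wide_gap_m)
    return WellReport(
        well_id=well_id, f1=f1, tolerance_m=tolerance_m, n_picks=len(ordered),
        median_gap_m=median(gaps), max_gap_m=max(gaps),
        worked_fraction=(1.0 - unworked / span) if span else 0.0,
    )


# Placeholder inputs: the F1 of 0.90 is not a measured result.
report = build_report("WELL-X", [1000.0, 1001.0, 1002.0, 1030.0, 1031.0, 1032.0],
                      f1=0.90, tolerance_m=0.5)
print(report)
```

In this toy well the median gap is 1 m but a single 28 m gap leaves only 12.5 percent of the picked range counted as worked, which is the context a reader needs before taking the F1 at face value.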
Why this bounds the claim rather than just complicating it
The framing we are arguing for is not that field labels are noisy and one should be careful, which is true of everything and helps no one. It is sharper. Annotation density sets a ceiling on what a per-well accuracy number can mean, and the ceiling is measurable in advance from the picks alone, before any model runs.
If the widest gap between adjacent picks in a well is 28 m, then any claim about the model's behaviour over those 28 m is unsupported by that well's ground truth, full stop. If a phase delivered 8 of 25 requested wells, then a study-wide accuracy averaged over the delivered wells is an average over a biased, sparse sample, and the honest error bar is wide. If more than 95 percent of patches carry about 70 intersecting sinusoids, then a metric that assumes one event per region is measuring a different task than the one the data poses. None of these are model criticisms. They are statements about the ground truth that a synthetic benchmark, dense and uniform by construction, is structurally unable to make. The work of this note is to move those statements to the front, where they belong, and to make the reading that produces them, pick spacing down depth, a standard QC step rather than an afterthought.
Limitations
This is a data-quality note grounded in one operator's archive, and its boundaries follow from that. The spacing figures we quote (adjacent picks up to 28 m apart, the specific jumps from 917 to 1109 m and the others, the sampling-interval gaps of roughly 7, 10 and 27 m around 1750, 2000 and 2400 m) are real values read from the raw pick sheets and the QC figures of a single carbonate study; the pattern they show, dense inside patches and sparse across depth, we expect to recur in field image-log work generally, but we have not measured it across operators, basins or tool vendors, and the exact numbers will differ elsewhere. The claim that a wide pick gap flags annotator style rather than reservoir absence rests on interpreter feedback for specific intervals, not on an independent re-pick of the whole well, which would be the stronger test and which we did not run. The patch-occupancy figure of about 70 sinusoids in more than 95 percent of patches is a property of our patching geometry and would move under a different patch size. In the instrument, the metre values and the sinusoid and well counts are sourced from the engagement archive, while the vertical scatter of dots inside each labelled depth band is illustrative jitter for legibility, not a measured per-pick position, and it is flagged as such on the canvas. Finally, we frame density as a bound on what an accuracy claim can mean; we do not offer a single corrected metric that folds density in, because the right correction depends on what the downstream consumer of the picks can tolerate, and that is a decision for the reservoir team, not a constant we can publish.
References
[1] Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet Classifiers Generalize to ImageNet? International Conference on Machine Learning (ICML), 2019. Rebuilds fresh test sets by replicating the original collection process and finds large accuracy drops, evidence that a leaderboard number can overstate how a model behaves once the input drifts from the exact sample the benchmark froze. https://arxiv.org/abs/1902.10811
[2] Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets and Benchmarks Track, 2021. Audits ten widely used benchmarks and shows that label errors in the test set alone can reorder model rankings, so a headline accuracy can be an artefact of how the ground truth was made rather than of the model. https://arxiv.org/abs/2103.14749