The hardest question in the whole peer review was one sentence long: do your splits leak? It is the right question to ask of any model trained on borehole-image data, because the failure it points at is real and quiet. Two patches cut from the same well a few centimetres apart are near-duplicates. Shuffle patches into train and test at random and you can land one patch in training and its almost-identical neighbour in test, so the model is graded on rock it has effectively already seen, and the score comes back flattering and wrong. The textbook fix is to make the well the unit you split on, so every patch from a borehole stays on one side of the line. On this engagement we did not use it, and this piece is the honest account of why, and of what we did instead so the numbers still mean what they claim.
The question a reviewer is really asking
When a reviewer asks whether a split leaks, they are asking whether the reported score would survive on data the model never touched. The leakage mechanics on depth-sorted logs, and the grouped cross-validation methods that close the channel, are covered in our research piece Splitting by Well, Not by Row; we will not re-derive them here. That piece makes the case for well-level splitting on a public benchmark of 118 wells, where you can spend wells freely. The interesting problem is the opposite regime: what an honest team does when the well-level split is the correct default and it is out of reach.
Our corpus was 14 vertical wells from a single Middle East carbonate field, logged with borehole-image tools. Of those, 11 carried consistent bedding picks and fed the combined and beddings-only datasets; all 14 fed the fractures-only dataset. That is the whole world of data. There was no held-back region to draw a fresh well from, and no second field to borrow one.
Why one well is a luxury you cannot buy at 14
Hold out a whole well for test and you pay a fixed tax on the corpus. With W wells, reserving one costs
At the 118-well public benchmark that tax is under one percent and no one notices. At W equal to 14 it is about 7%, and for the 11-well combined dataset closer to 9%. That is not a rounding error. Because every well in a carbonate field carries its own structural character, the well you happen to reserve swings the test distribution as much as the training distribution. Run the same experiment holding out a different well and both numbers move, so the score reports the luck of the draw as much as the skill of the model. At single-field scale, a one-well test set is not a measurement. It is a coin flip you have paid a training well to perform.
There was a second cost. A whole-well holdout would have left roughly a dozen wells to train a from-scratch detection stack that already runs on a 236-patch base dataset before augmentation. Thinning that base to buy a test well trades away the one thing the small-data regime cannot spare, which is training signal. So the choice was never patch-shuffle versus a cleaner alternative. It was patch-shuffle versus a test number nobody could trust, built on a training set nobody could afford to thin.
Two gates instead of a purist split
Choosing the shuffle does not make the leakage question go away. It shifts the burden of proof onto you to show, by other means, that the reported score is not the inflated one a leak produces. We leaned on two gates, and the shuffle was only defensible because both passed.
The first gate is validation-test parity. A leaked split has a signature: it flatters whichever set the model was tuned against, so the validation number pulls ahead of a truly independent test number. If validation and test track each other tightly across thresholds, the split is behaving. On the beddings-only model, F1 at a 3 cm depth threshold came in at 60.63 on validation and 60.66 on test, a gap of three hundredths of a point. The combined model held the same shape, with test F1 at 3 cm of 62.78 for beddings and 64.20 for fractures. Parity that tight is not what a leak looks like; a leak shows up as a validation score running ahead of test, and here the two are indistinguishable.
The second gate is a continuous blind zone, and it carries the weight. Rather than scatter test patches through the corpus, we set aside a single unbroken 12 m interval on a well the model never trained on, with no overlap with any training patch, inside a blind zone spanning 25 m. Continuity matters: a contiguous held-out interval cannot share a near-duplicate neighbour with training the way a shuffled test patch can, because there is no shuffled neighbour to share. On that blind zone the fracture model recovered about 90% of the picks within a 10 cm offset, and it repeated those picks run to run. That is the number that answers the reviewer, because it is measured on rock the model provably never saw, in a geometry that forecloses the exact leak the shuffle is accused of. The full protocol and what that recall means to an interpreter are in our case study on proving the model generalises on an unseen well.
The rubric below is the whole argument in one view. Drag the well count and watch the holdout cost climb along the 1/W curve; at 14 wells the verdict marker sits above the affordability line, so the split routes to a shuffle, and the two gates on the right are the price that shuffle has to pay.
The rubric other teams should use
The decision is not patch-shuffle good, well-split bad, or the reverse. It is gated, and the gate order matters.
Start with the well count. If wells are plentiful, hold out whole wells and stop reading; the holdout cost is negligible and the well-level split is both the honest default and the easy one. As the count falls, the 1/W tax rises until a single-well test set stops being a measurement and becomes noise. Where that line sits depends on how much between-well variation your field carries, which is why we draw it as an illustrative threshold rather than a universal constant. Below the line, a patch-level shuffle is legitimate, but only if you clear both gates. Check validation-test parity across thresholds; a validation number running ahead of test is the leak announcing itself, and it disqualifies the shuffle. Then build a continuous blind zone on an unseen well or an unseen contiguous interval, with no patch overlap, and report performance there as the real headline. If both hold, the shuffle is defensible and you say so plainly. If either fails, you are back to finding the wells for a proper holdout, whatever it costs.
What this reframes is where the burden of proof lives. A well-level split earns trust by construction; you do not argue for it. A patch-level shuffle earns trust only by evidence, and the evidence has to be the kind a leak cannot fake. Validation-test parity and a continuous blind zone are that kind, which is why they, and not a purist appeal to well-splitting we could not afford, are the honest defence at 14 wells.
Limitations
The affordability line in the rubric is a decision aid, not a measured constant. The well count at which a single-well holdout stops being trustworthy depends on the between-well variance of the specific field, and we have not estimated that variance formally; the 1/W cost curve is exact arithmetic, but the threshold drawn across it is illustrative. The parity and blind-zone figures are from one carbonate field of 14 wells, and both gates are necessary rather than sufficient: passing them is evidence against leakage, not a proof of generalisation to a different field, tool, or structural style. A continuous blind zone tests one unseen interval on one well, so it speaks to that well's rock and no further; broader generalisation still needs more fields. Finally, tight validation-test parity is consistent with an honest split but does not by itself rule out a leak shared equally by both sets, which is precisely why we pair it with a held-out zone the model never saw rather than relying on parity alone.