Most of the time, one more well of training data helped. Our fracture detector for a mid-sized Middle East carbonate operator got steadily better as wells came in, with the usual shape: sharp gains early, diminishing returns later. Then we added a fifteenth well and validation F1 at a 5 cm depth threshold fell, from about 60 percent to about 57. Nothing had broken. The pixel-range check passed, the labels loaded, the training curve looked normal. A well we had every reason to trust had made the model worse.
This piece is about why, and about the thing that mattered more than well count once we were past a certain amount of data: the style the ground truth was picked in.
The uptick we had already seen once
The clue was in the well-count ablation we had run months earlier. Adding wells drove class error down hard and then flattened, but the last step went the wrong way: class error fell to a floor around 0.82 and then rose again at the largest well count, a small non-monotone bump we noted and moved past. The full curve, and why the last well behaves differently from every well before it, is a separate case study; here it is enough that the trend has a real up-tick at the tail, not a smooth asymptote. (See How Many Wells Is Enough? A Well-Count Ablation for Fracture Detection.)
The fifteenth-well F1 drop was the same phenomenon in a different metric. Once we lined the two up, the question stopped being "is more data always good" and became "what is different about the wells at the tail." The answer was not in the pixels. It was in the picks.
A clean well, picked in a different hand
The image-log data in the fifteenth well was fine. What differed was the human ground truth attached to it. When we pulled the raw picks and checked the sampling interval, the well had long stretches with no annotations at all, gaps running up to 175 m in a single interval. That is not a data-quality defect. The interpreter had deliberately concentrated on fractures in that well and left beddings largely unmarked, which is a reasonable thing for a geologist to do when the fractures are what the study cares about in that section.
For a person reading the log, that emphasis is invisible and harmless. For a network learning a joint distribution over beddings and fractures across all wells, it is a distribution shift. The first fourteen wells had been picked in a broadly consistent style, dense in both classes, so the model had settled its class balance and its notion of a "typical" interval around that style. The fifteenth well arrived with a different balance baked into its labels: heavy on fractures, sparse on beddings, with multi-metre stretches the model reads as "nothing here" even where beddings exist. Averaged into the same objective, that pulled the model off the operating point the earlier wells had found, and the three-point F1 drop landed exactly where the two label styles disagreed.
The instrument puts the two facts side by side on purpose. On the left, the before-and-after: a fourteen-well model near 60 F1, and the same model near 57 once the style-different well is pooled in. On the right, the well-count class-error trend with its non-monotone tail. They are the same story told twice. Neither says "more data is bad." Both say the fifteenth well was not the same kind of thing as the first fourteen, and the network could tell even though we could not, until we looked at the picks.
This is a different failure from feeding a model too much synthetic data, which we hit separately when an aggressive augmentation run ballooned one dataset and degraded the result; that mechanism, overlap and augmentation drowning the real signal, is covered in The 92k-Patch Trap. Here the extra data was real, from a real well, and still hurt, because the label style behind it was inconsistent with the rest.
Why averaging punishes the minority style
The mechanism generalises past borehole images. A detection model trained on pooled wells minimises a single loss averaged over every labeled example. When most wells share a picking style, that style is the majority and the loss rewards fitting it. A well picked in a minority style pulls the gradient toward a different class balance and a different sense of what an empty interval means. If the minority is small, the model mostly ignores it and you lose a little. If it is systematically different, as ours was with its fractures-over-beddings emphasis and its 175 m gaps, the pull is enough to move the validation number the wrong way.
There is a threshold hiding in this. Below some amount of data you are starved, and almost any additional well helps because the model needs coverage more than consistency. Above that threshold the model already has enough of the majority style to be well-fit, and the marginal well no longer buys coverage; it buys either reinforcement of the existing style or contamination of it. Past the threshold, annotator-style homogeneity dominates raw well count. That is the sentence we would tell a past version of ourselves.
The fix was to stop pooling
The instinct when a model regresses is to add more data or tune harder. Neither addresses a label-style split, because the problem is not that the model is under-trained, it is that it is being asked to fit two distributions at once. The fix was to stop pooling and route by class onto style-consistent well sets.
Concretely, we split into separate models. Beddings and the combined dataset were built from the eleven wells whose picks were consistent for both classes. Fractures, where more wells carried usable picks, got their own model on fourteen wells. Each model then saw one label style, and the minority-style contamination went away because there was no longer a shared objective for it to contaminate. The beddings-only model validated at F1 60.63 at a 3 cm threshold and around 69 at 5 cm, in the range the combined model had held before the fifteenth well disturbed it, without the drop. The gain was not from more data. It was from removing the style conflict the pooling had created.
That is the counter-intuitive part worth carrying. We improved the model by training it on fewer, more homogeneous wells per head rather than more wells in one pool. The separate-model split is not elegant, and it costs you a second training run and a second checkpoint to serve, but it respects a fact about the data that a single pooled objective cannot: two interpreters, or one interpreter in two moods, do not produce interchangeable ground truth, and a network averaging over them pays for the difference.
What we check now before adding a well
The lasting change was procedural. A new well is no longer trusted just because its pixels are clean and its labels load. Before it goes into a training pool we profile its picks: class balance, pick density, largest sampling gap. A well whose profile sits far from the pool's is a candidate for its own model, not the shared one. This is upstream of any accuracy metric, cheap to run, and it would have flagged the fifteenth well before it cost us three F1 points.
None of this argues against inter-model consistency as a goal; the case for AI picks over variable human interpretation still holds, and we make it in Consistent, Auditable Interpretation. The point here is narrower and, for anyone assembling a training set well by well, more immediately useful: once you have enough data, the next well helps only if it was picked in the same style as the rest. Count is not the variable that matters at the tail. Style is.
Limitations
The fifteenth-well regression is a single validation result on one operator's data; the F1 endpoints near 60 and 57 are the sourced validation figures for that experiment, not a repeated measurement across seeds. The recovered per-class numbers are the sourced beddings-only figures, F1 60.63 at 3 cm and around 69 at 5 cm; where this piece describes the split-by-class recovery at 5 cm it is anchored to that bedding figure rather than a separately reported 5 cm number, and the instrument captions it as such. The threshold at which style begins to dominate count is described qualitatively; we did not run a controlled sweep to locate it, and it will vary with class balance, model capacity, and how strongly the minority style diverges. Finally, "picking style" here bundles class emphasis, pick density, and sampling gaps together, which a more careful study would separate; we treat them as one because in this well they moved together.