A 15-Year Arc of Evaluation Rigor in Borehole-Image AI: From No Objective Metric to Blind-Well F1

Abstract

The usual way to tell the story of automated borehole-image interpretation is as a rising accuracy curve: better features, then better architectures, then transformers, each step a higher number. This piece tells the same fifteen years as a different curve, the one that matters more for whether a number can be trusted: the rising bar for what an accuracy figure has to survive before it means anything. In 2009 the bar was on the floor: a paper on fracture detection said openly that no objective evaluation was possible because there was no ground truth, and graded its output by eye against one interpreter [1]. By 2020 a fast-region convolutional detector reported AUC of 98% for fractures and 90% for breakouts, but on roughly three hundred thousand synthetic images, and with a true positive counted whenever a predicted centroid landed within 50 pixels of the label for fractures, or within 150 pixels for breakouts [2]. The bar we work to now is blind-well F1 at a 5 cm depth tolerance, computed from depth-tolerance confusion matrices against thresholds negotiated with the client's expert interpreter. On that test our separate-Bb model scores 67% on validation and 60% on a held-out well, with 75% recall at 5 cm across 16 wells. The central claim is that the smaller, later number is the honest one, because the test underneath it got harder in every dimension. The real frontier is evaluation rigor, not one more accuracy point.

The year the field admitted it could not grade itself

Start at the least flattering place, because it is the most honest. A 2009 paper on automatic fracture detection in borehole images contains two admissions that frame everything the following fifteen years fixed. The first is about the prior art it compared against: those methods "typically show just a few, isolated, detected fractures, indicating that these methods indeed fail to perform well." The second is about its own results, and it is the sentence to carry through this whole piece: the authors "cannot provide a truly objective evaluation of the results, because there is no ground truth" [1].

Read that literally. It is not a caveat about a hard dataset. It is a statement that the evaluation instrument did not exist. What stood in for it was one experienced interpreter looking at the output and judging whether the picks looked right, with acknowledged false picks that needed manual post-processing afterward. That is not a weakness of that particular team; it was the state of the field. Borehole images are ambiguous, an interpreter's picks are themselves interpretive, and nobody had yet built the depth-registered, tolerance-aware apparatus you would need to turn "looks right" into a number two people could argue about. When there is no held-out score, there is nothing to inflate and nothing to deflate. There is only an expert's confidence, which is exactly what a metric is supposed to replace.

Synthetic images and a loose ruler

The next stop on the arc looks like a leap forward, and in one sense it is. By 2020, a fast-region convolutional detector was reporting AUC of 98% for fractures and 90% for breakouts on acoustic borehole images, against 81% and 73% for classical methods [2]. A single objective number, comfortably in the high nineties, is precisely the thing the 2009 work said it could not produce. But the number is only as strong as the test that produced it, and this test had two features worth stating plainly.

First, the training corpus was roughly three hundred thousand synthetic images, built by quilting backgrounds and inserting smoothed sinusoids and dark shapes, of which on the order of ten thousand carried fractures and ten thousand carried breakouts [2]. Synthetic data is a legitimate and often necessary move, but a score measured on it answers a narrower question than it appears to: how well the model separates events from a generator it has effectively memorised the conventions of, rather than how well it reads a real operator's scan. Second, the ruler was loose. A predicted detection counted as a true positive if its box centroid fell within 50 pixels of the ground truth for a fracture, or within 150 pixels for a breakout [2]. A tolerance measured in tens or hundreds of pixels forgives a great deal of positional error, and in a task where the whole point is where the feature sits in depth, forgiving position is forgiving the hard part.
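As a minimal sketch of the matching rule just described, the check below counts a detection as a true positive when its box centroid lies within the class tolerance of the label's centroid. The 50 and 150 pixel tolerances are the ones quoted from [2]; the function names, the (x_min, y_min, x_max, y_max) box layout, the comparison against the label's own centroid, and the use of straight-line (Euclidean) distance are assumptions made for illustration, not details taken from the paper.

```python
import math

# Tolerances quoted in the text from [2]. Everything else in this
# sketch (names, box layout, Euclidean distance) is an assumption.
CENTROID_TOLERANCE_PX = {"fracture": 50.0, "breakout": 150.0}


def box_centroid(box):
    """Return the (x, y) centre of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def is_true_positive(pred_box, label_box, event_class):
    """True if the predicted centroid lies within the class tolerance of the label centroid."""
    px, py = box_centroid(pred_box)
    lx, ly = box_centroid(label_box)
    distance_px = math.hypot(px - lx, py - ly)
    return distance_px <= CENTROID_TOLERANCE_PX[event_class]


# Hypothetical boxes: the label is centred at (150, 110).
label = (100, 100, 200, 120)
pred_40px_off = (100, 140, 200, 160)   # centroid 40 px below the label
pred_120px_off = (100, 220, 200, 240)  # centroid 120 px below the label
```

Under this rule the pick that sits 40 pixels off scores exactly the same as a perfect one for a fracture, and the pick that sits 120 pixels off still counts as a hit if the class is a breakout. That is the permissiveness the paragraph above describes.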

None of this makes the 98% wrong. It makes it a number about a specific, permissive test: single-event acoustic classification, on synthetic images, with a coarse centroid match. The honest reading is not that the model is bad but that the figure and the test have to be quoted together, because the figure without the test is an invitation to be fooled.

This piece walks the whole fifteen-year arc, so it treats that 98% as one waypoint on a longer line. The single-number autopsy is a separate study of its own, "The 98% AUC That Wasn't," which works through why the 98% and a real blind-well F1 differ by nearly two orders of magnitude once you put the 50-pixel tolerance and a 2 cm depth window on the same ruler. Read that one for the deep dive on the single figure; read this one for where that figure sits in the arc.

The waypoint: most of the image is valueless

Between the loose synthetic AUC and the blind-well bar sits a reframing that changed how we thought about the target. Ren and colleagues, writing on how models handle data they were not trained to recognise, give the open-set problem its sharp form: in an image-interpretation task, most of what the model sees belongs to none of the trained classes, and for the model that content is, in effect, valueless [3]. On a borehole image this is not an edge case; it is the majority of every frame. A resistivity image is mostly matrix, most of which is neither a fracture nor a bedding plane nor a vug, and a detector that has only ever been rewarded for finding the rare events has no honest way to say "nothing here" about the vast background.

This matters for evaluation because it explains why a high score on a benchmark that only shows the model event-bearing patches can collapse in the field. The test never asked the model to reject the background, so the score never measured the skill that real logs demand most. Reading the borehole-image task as an open-set problem, where correctly abstaining on valueless data is part of the job, is what pushes evaluation toward held-out wells and confusion matrices that count the negatives, not just the hits.

Figure: A fifteen-year arc of how borehole-image AI was judged, read across three evaluation regimes. In 2009 there was no held-out score at all: the authors stated plainly that there was no ground truth, and evaluation meant eyeballing the output against one experienced interpreter [1]. By 2020 the field reported AUC of 98% for fractures and 90% for breakouts, but on roughly three hundred thousand synthetic images and with a true positive counted whenever a predicted box centroid landed within 50 pixels of the ground truth for fractures, or within 150 pixels for breakouts [2]. Today the bar is blind-well F1 at a 5 cm depth tolerance: a separate-Bb model scores 67% on validation and 60% on a held-out well, with 75% recall at 5 cm across 16 wells, all read off depth-tolerance confusion matrices against thresholds negotiated with the client's expert interpreter. The orange meter on the right is the only element that argues: evaluation slack, an ordinal reading of how much each regime's test could hide, collapses from total to thin across the arc while the reported number falls, because the honest number is the smaller one earned on the harder test. The open-set framing of 'valueless data' from Ren et al. [3], that most of a resistivity image belongs to none of the trained classes, sits mid-arc as the waypoint between loose synthetic AUC and blind-well F1. Every plotted percentage is sourced; the slack heights are an illustrative ordinal reading of test permissiveness, not a measured quantity.

The bar we work to now

The current standard is where the arc lands, and it is deliberately unforgiving. We evaluate on a blind well, a continuous held-out zone the model never saw in training, and we score F1 at a 5 cm depth tolerance rather than a pixel count in the tens or hundreds. The 5 cm is not arbitrary either. It was negotiated with the client's expert interpreter as the depth error that still counts as the same feature, which ties the metric to what a petrophysicist would actually accept rather than to whatever tolerance flatters the model. The scoring instrument is a depth-tolerance confusion matrix: a prediction matches a label only if it falls within the agreed depth window, and everything else is counted honestly as a false positive or a miss.
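A minimal sketch of that scoring rule, not our production scorer, might look like the following. The 5 cm default comes from the text; the function names, the one-to-one greedy pairing of sorted depths, and the example depths are assumptions for illustration. The sketch tallies only true positives, false positives, and misses, which is the part of the confusion matrix that F1 needs.

```python
def match_at_depth_tolerance(pred_depths_m, label_depths_m, tolerance_m=0.05):
    """Pair predictions with labels one-to-one when they lie within tolerance_m.

    Returns (true_positives, false_positives, misses). The two-pointer
    walk over sorted depths is an assumed matching strategy: each label
    can be claimed by at most one prediction.
    """
    preds = sorted(pred_depths_m)
    labels = sorted(label_depths_m)
    tp = 0
    i = j = 0
    while i < len(preds) and j < len(labels):
        gap = preds[i] - labels[j]
        if abs(gap) <= tolerance_m:
            tp += 1      # same feature, within the agreed depth window
            i += 1
            j += 1
        elif gap < 0:
            i += 1       # prediction is shallower than every remaining label
        else:
            j += 1       # label has no prediction close enough: a miss
    return tp, len(preds) - tp, len(labels) - tp


def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical picks, in metres of measured depth.
labels = [1000.00, 1000.50, 1001.20, 1002.00]
preds = [1000.03, 1000.58, 1001.19, 1003.00]

tight = match_at_depth_tolerance(preds, labels, tolerance_m=0.05)  # (2, 2, 2)
loose = match_at_depth_tolerance(preds, labels, tolerance_m=0.10)  # (3, 1, 1)
print(f1_score(*tight), f1_score(*loose))                          # 0.5 0.75
```

The same four picks score 0.50 at 5 cm and 0.75 at 10 cm in this toy case. Nothing about the model changes between the two rows, only the width of the window, which is why the tolerance has to be fixed by the interpreter before the score is read.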

Under that regime our separate-Bb model, the one we split out from the combined model in early 2023 so that bedding and fracture picks would not add label noise to each other, scores an F1 of 67% on validation and 60% on the blind well, with 97% of its true positives within 2 degrees on dip and 90% within 25 degrees on azimuth. Across 16 wells, fracture recall at 5 cm reaches 75%, having climbed from 10% at 3 wells and 40% at 8 wells. Set the 60% blind-well figure next to the 98% synthetic AUC of 2020 and the naive reading is that the field went backward. The correct reading is the opposite. The 60% is earned on a held-out well, at a tight depth tolerance, on real logs, counting the background the model has to reject. It is a smaller number describing a much harder test, which is what a maturing field's honest metrics look like.
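The dip and azimuth figures quoted above are shares of matched picks that fall inside an angular tolerance. One way to compute such a share is sketched below, assuming azimuth is recorded in degrees on a 0 to 360 scale so that the error has to wrap around north. The function names and sample values are hypothetical, and this is an illustration of the idea, not the scoring code behind the reported numbers.

```python
def azimuth_error_deg(pred_azimuth_deg, label_azimuth_deg):
    """Smallest angular difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(pred_azimuth_deg - label_azimuth_deg) % 360.0
    return min(diff, 360.0 - diff)


def fraction_within(errors_deg, tolerance_deg):
    """Share of errors at or below the tolerance; 0.0 for an empty input."""
    errors = list(errors_deg)
    if not errors:
        return 0.0
    return sum(e <= tolerance_deg for e in errors) / len(errors)


# Hypothetical (predicted, labelled) azimuths for four matched picks.
pairs = [(350.0, 10.0), (90.0, 120.0), (200.0, 195.0), (5.0, 355.0)]
errors = [azimuth_error_deg(p, l) for p, l in pairs]  # [20.0, 30.0, 5.0, 10.0]
share_within_25 = fraction_within(errors, 25.0)       # 0.75
```

Without the wrap-around, a pick at 350 degrees against a label at 10 degrees would read as a 340 degree error instead of 20, so the handling of north is part of what an azimuth tolerance means.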

This piece shares one mechanic with our benchmark-literacy note: a held-out score is only meaningful if held-out means genuinely unseen rather than merely set aside from the same barrel. We do not re-derive that here; the argument lives in "How to Read a Benchmark Without Being Fooled by It." The tolerance mechanic, why the 60% blind-well F1 and the 98% synthetic AUC cannot simply be subtracted from one another when each was computed under its own acceptance rules, we also leave to its own study in "The 98% AUC That Wasn't." The point that is specific to this arc is temporal: the profession moved from having no held-out concept at all, through a held-out set drawn from the same synthetic generator, to a blind well from a different part of the field, and each move made the number both smaller and more trustworthy.

Why the falling number is the story

Put the three regimes on one line and the pattern is clear. Evaluation slack, the amount a test can hide, fell from total in 2009, when there was no objective score to hide behind, through high in 2020, when a permissive synthetic benchmark with a coarse ruler could carry a figure into the high nineties, to thin today, when a blind well at 5 cm leaves very little room between what the model does and what the number says. As the slack fell, the headline number fell with it, and that co-movement is the whole argument. A field that reports higher numbers on ever-easier tests is not improving; it is inflating. A field whose numbers fall as its tests harden is doing the opposite, and that is the shape our fifteen years actually trace.

For anyone buying or building a borehole-image model, the practical takeaway is to read the test before the score. Ask what data the number was measured on, real or synthetic, seen or genuinely blind. Ask how tolerant the matching rule is, and whether that tolerance was set by a domain expert or by convenience. Ask whether the background the model has to reject was ever part of the evaluation. A 60% that survives those questions is worth more than a 98% that does not, and the difference between them is not model quality. It is evaluation rigor, which is the part of this field that was still being built while the accuracy line was busy rising.

Limitations

This is a reading of an arc, not a controlled cross-method benchmark, and it inherits that boundary. The three regimes we place on the line come from three different bodies of work on different sensors and tasks: the 2009 paper and our own current work concern resistivity-style borehole images and fracture and bedding detection, while the 2020 comparator is an acoustic-image detector reporting AUC on a largely synthetic corpus. Because the sensors, the events, and even the metrics differ, the numbers are not directly commensurable, and we do not claim they are; the comparison is about the rigor of each evaluation regime, not a like-for-like accuracy race, and the falling headline number is evidence about test difficulty rather than a measured head-to-head. The "evaluation slack" in the instrument is an ordinal, illustrative reading of how permissive each regime's test was, not a quantity we computed; only the plotted percentages, the 2009 admission, the 2020 AUC and pixel tolerances, and our own blind-well and multi-well figures are sourced. Our blind-well metrics are from a specific engagement on a specific carbonate reservoir, and a single held-out well, however honest, is still one well; broader external validity would need blind evaluation across many operators and basins, which is exactly the direction the rigor argument points and exactly what no single project has yet delivered. Finally, we read three landmark points, not the whole literature, and a reader can surely name work that sat closer to blind-well rigor earlier than the arc's midpoint suggests.

What to carry from the arc

The honest history of borehole-image AI is not a rising accuracy line but a rising bar for what an accuracy number has to survive. Read as evaluation rigor, the field's real progress is a falling headline number on ever-harder tests.
In 2009 a fracture-detection paper stated there was no ground truth and no objective evaluation was possible, grading output by one interpreter's eye. There was no held-out score at all, so there was nothing to inflate and nothing to trust.
By 2020 a detector reported AUC of 98% for fractures and 90% for breakouts, but on roughly 300,000 synthetic images and with a true positive counted whenever a predicted centroid fell within 50 pixels (fractures) or 150 pixels (breakouts) of the label. The figure is real; the test underneath it was permissive, and the two have to be quoted together.
The current bar is blind-well F1 at a 5 cm depth tolerance, read off depth-tolerance confusion matrices against expert-negotiated thresholds. Our separate-Bb model scores 67% validation and 60% blind, with 75% recall at 5 cm across 16 wells; the smaller number is the honest one because the test is harder in every dimension.
Reading borehole images as an open-set problem, where most of the image is 'valueless data' belonging to no trained class (Ren et al. 2020), is the waypoint that pushed evaluation toward held-out wells and confusion matrices that count the background, not just the hits.

The smallest habit this arc would install is a single question to ask of any borehole-image accuracy figure before trusting it: what test produced this number, and how much could that test hide. A high score on a permissive, synthetic, loosely matched benchmark is silent about field performance in a way its own number will never reveal, and a lower score earned on a blind well at an expert's tolerance is the more valuable of the two precisely because it had less room to flatter.

References

[1] Zhang, X., and Xiao, C. Automatic Detection of Fractures in Borehole Images. Proceedings of SPIE, 2009. States plainly that no objective evaluation of the detected fractures is possible because there is no ground truth, and evaluates output by visual comparison against one experienced interpreter. https://doi.org/10.1117/12.827541

[2] Dias, L. O., Bom, C. R., Faria, E. L., Valentin, M. B., Correia, M. D., Marcio de Albuquerque, Marcelo de Albuquerque, and Coelho, J. M. Automatic detection of fractures and breakouts patterns in acoustic borehole image logs using fast-region convolutional neural networks. Journal of Petroleum Science and Engineering, 2020. Reports AUC of 98% for fractures and 90% for breakouts, trained largely on synthetic acoustic images, with a true positive counted when a predicted box centroid falls within 50 pixels (fractures) or 150 pixels (breakouts) of ground truth. https://doi.org/10.1016/j.petrol.2020.107099

[3] Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., and Wang, X. A Survey of Deep Active Learning. Tsinghua Science and Technology (TST), 2020. Framing reference for the open-set and background-rejection problem in image interpretation, where most of an image belongs to none of the trained classes and is, for the model, valueless data. https://doi.org/10.26599/TST.2019.9010020
