The 98% AUC That Wasn't: What Synthetic-Data Benchmarks Hide in Image-Log AI

Abstract

A benchmark number is only as strong as the acceptance rule behind it, and acceptance rules are rarely printed next to the score. A frequently cited result in borehole-image machine learning trained a fast-region convolutional network almost entirely on synthetic acoustic images, 300,000 of them, and reported an area-under-curve of 98% on fractures and 90% on breakouts, beating classical baselines of 81% and 73% [1]. Read quickly, that 98% looks like a target our own fracture work should clear. Read carefully, the two tasks are not the same test. The synthetic result counts a detection correct when a predicted box centroid lands within 50 pixels of ground truth for fractures, or 150 pixels for breakouts. Our task ran on real resistivity logs where one pixel is 3 cm of depth, the agreed match window with the human interpreter was 2 cm, events intersect inside one patch, the tool corrupts the image, and the human picks that form ground truth can sit as much as 28 m apart. On a horizontal blind well the model had never seen, fracture F1 fell from about 82% on the standard test split to about 63%. The honest comparison is not 98% against 82%. It is a synthetic classification score under a 50-pixel tolerance against a real regression-and-detection score under a 2-cm tolerance, and the second number is the one worth publishing.

Two results that look adjacent and are not

The prior result is a good piece of engineering and we cite it as the closest published comparator to our own [1]. Its pipeline is lean: a VGG-16 transfer backbone, stochastic gradient descent at batch size 2 for 30,000 iterations, about 18 hours of training on a single consumer GTX 1050, and roughly 8 seconds of inference per metre. The corpus is where the two projects diverge, because it was built by simulation. Image Quilting produced acoustic-looking backgrounds, and smoothed sinusoids and dark shapes were inserted as fractures and breakouts, giving 300,000 images of which 10,000 carried a fracture and 10,000 a breakout. On that data the network reached an AUC of 98% for fractures and 90% for breakouts, against 81% and 73% for the classical methods it was measured against.

Our task shared the vocabulary, fractures and breakouts on borehole-image logs, and almost nothing else. We worked on real resistivity logs from a mid-sized Middle East carbonate operator, not simulated acoustic ones. A resistivity image is a conductivity profile carrying real tool artifacts, and most of any image belongs to none of the trained classes, the open-set problem Ren and colleagues named directly: five valuable-data classes against a large background, with manual interpretation run at 1:10 to 1:20 scale to be reliable [2]. Our ground truth came from a human interpreter whose picks, on some wells, sat up to 28 m apart, so the supervision was sparse and uneven. And the output was not a yes-or-no class. The model had to place each event at the correct depth and then regress its dip and azimuth. That is detection plus regression on corrupted, sparsely labelled real data, not binary classification on clean synthetic data.

The hidden term is the tolerance

The number that makes the two scores incomparable is the acceptance tolerance, and it hides in a single sentence of each method. The synthetic result defines a true positive as a predicted centroid within 50 pixels of ground truth for fractures, or 150 pixels for breakouts [1]. Our evaluation defines a true positive as a match within 2 cm of depth, agreed with the interpreter as a permissible human-error window, on images where one pixel is 3 cm.
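To make the acceptance rule concrete, here is a minimal sketch of scoring depth picks against ground truth under a chosen tolerance. It assumes greedy one-to-one matching to the nearest unmatched pick, which is our illustration and not necessarily the matching either evaluation used; the function name `match_f1` and the depths are hypothetical.

```python
def match_f1(pred_cm, truth_cm, tol_cm):
    """Greedy one-to-one matching of predicted to true depths within tol_cm.

    Returns (precision, recall, f1). The greedy nearest-pick rule is an
    assumption made for this sketch.
    """
    unmatched = sorted(truth_cm)
    tp = 0
    for p in sorted(pred_cm):
        if not unmatched:
            break
        # Nearest ground-truth pick that has not been claimed yet.
        nearest = min(unmatched, key=lambda t: abs(t - p))
        if abs(nearest - p) <= tol_cm:
            unmatched.remove(nearest)
            tp += 1
    precision = tp / len(pred_cm) if pred_cm else 0.0
    recall = tp / len(truth_cm) if truth_cm else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical depths in cm, invented for illustration only.
truth = [1000.0, 1400.0, 2100.0, 2900.0]
pred = [1001.5, 1460.0, 2099.0, 3020.0]

tight = match_f1(pred, truth, tol_cm=2.0)    # the real-task window
loose = match_f1(pred, truth, tol_cm=150.0)  # 50 px at 3 cm per pixel
print("2 cm:", tight)
print("150 cm:", loose)
```

On these invented picks, the same four predictions score an F1 of 0.5 under the 2-cm window and 1.0 under the 150-cm one. Nothing about the model changed between the two lines; only the rule did.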

Put those on the same ruler. At 3 cm per pixel, a 50-pixel tolerance is 150 cm, a metre and a half of depth, and the 150-pixel breakout tolerance is 450 cm. A 2-cm window is under a single pixel. The synthetic bench accepts a fracture prediction that is a metre and a half off as correct; the real task rejects anything more than 2 cm off. Before either model sees any data, the two fracture acceptance rules differ by a factor of roughly 75 (150 cm against 2 cm), and the breakout rules by more. A score of 98% earned under the loose rule and a score of 63% earned under the tight one are not measuring the same competence, and subtracting one from the other produces a difference that means nothing.
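The conversion can be checked in a few lines. This is a sketch under one stated assumption: that the 3 cm-per-pixel scale of our resistivity images is the right ruler for the synthetic pixel tolerances. The constant and function names are illustrative and come from neither codebase.

```python
# Illustrative names. The 3 cm/pixel scale is that of our resistivity images
# and is an assumption when applied to the synthetic acoustic tolerances.
CM_PER_PIXEL = 3.0
REAL_TOLERANCE_CM = 2.0
SYNTHETIC_TOLERANCE_PX = {"fracture": 50, "breakout": 150}


def tolerance_cm(pixels: float, cm_per_pixel: float = CM_PER_PIXEL) -> float:
    """Convert a pixel tolerance to centimetres of depth."""
    return pixels * cm_per_pixel


def forgiveness_ratio(pixels: float, real_cm: float = REAL_TOLERANCE_CM) -> float:
    """How many times wider the pixel tolerance is than the real-task window."""
    return tolerance_cm(pixels) / real_cm


for event, px in SYNTHETIC_TOLERANCE_PX.items():
    print(f"{event}: {tolerance_cm(px):.0f} cm, "
          f"{forgiveness_ratio(px):.0f}x the 2 cm window")
```

Under that assumption the fracture tolerance comes out 75 times wider than the real-task window and the breakout tolerance 225 times wider. A different pixel scale changes both multiples proportionally.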

Figure. Two scores placed under the tolerances they were actually computed on. Left, the synthetic acoustic bench: 300,000 simulated images, an AUC of 98% on fractures and 90% on breakouts against classical baselines of 81% and 73%, with a true positive counted whenever a predicted centroid lands within 50 pixels (fractures) or 150 pixels (breakouts). Right, the real resistivity task: one image pixel is 3 cm of depth, the agreed human-error match window is 2 cm (under a single pixel), human picks fall as much as 28 m apart, events intersect, and the model must regress depth, dip and azimuth rather than classify a single event. The bars show the four numbers side by side; the orange bar marks the blind-well fracture F1, which falls from about 82% on the standard test split to about 63% on the horizontal blind well. The tolerance dial sweeps the centroid acceptance box from the synthetic 50-to-150-pixel range down to the real-task 2 cm, showing how much forgiveness the headline was standing on. Every number is sourced (Dias et al. 2020, JPSE 191:107099 [1]; Ren et al. 2020, Tsinghua Science and Technology 25(2) [2]; the engagement's blind-well metrics); nothing here is illustrative.

The figure puts both scores under the tolerances they were actually computed on. The two synthetic bars carry their 50-pixel and 150-pixel acceptance boxes as labels; the two real bars carry the 2-cm one. The dial sweeps the acceptance box from the synthetic 50-to-150-pixel range down to the real-task window, and the running read-out reports how many times more forgiving the loose rule is. The orange bar is the only element arguing: the blind-well fracture F1 of about 63%, the number a synthetic headline never has to face, because a synthetic benchmark can set its own tolerance and its own noise.

Why the blind well is the number that counts

Reporting a test-split score on real data beats reporting a synthetic one, but it is still not enough, because a random test split leaks. Adjacent depths in one well share the same tool response, the same borehole geometry, and often the same interpreter habits, so a model tested on held-out patches from wells it trained on is partly grading its own memory. The fair question is whether it transfers to a well it has never seen, drilled in a different geometry and picked in a different session.
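The difference between the two splits is easy to show in code. The sketch below is illustrative only: the patch records, the well names `V1`, `V2`, `H1`, and the helper names are hypothetical and do not describe our actual data index.

```python
import random


def random_patch_split(patches, test_fraction=0.2, seed=0):
    """Shuffle patches and split, ignoring which well each came from."""
    rng = random.Random(seed)
    shuffled = patches[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def blind_well_split(patches, blind_wells):
    """Hold out every patch from the named wells; no well spans both sides."""
    blind = set(blind_wells)
    train = [p for p in patches if p["well"] not in blind]
    test = [p for p in patches if p["well"] in blind]
    return train, test


def shared_wells(train, test):
    """Wells that contribute patches to both sides (the leakage path)."""
    return {p["well"] for p in train} & {p["well"] for p in test}


# Hypothetical patch index: three wells, ten patches each.
patches = [{"well": w, "patch": i} for w in ("V1", "V2", "H1") for i in range(10)]

rand_train, rand_test = random_patch_split(patches)
blind_train, blind_test = blind_well_split(patches, blind_wells=["H1"])

print("random split shares wells:", sorted(shared_wells(rand_train, rand_test)))
print("blind split shares wells:", sorted(shared_wells(blind_train, blind_test)))
```

In the random split every test patch comes from a well that also supplied training patches, which is the leak. In the blind split the overlap is empty by construction.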

That is the split where our fracture F1 fell from about 82% to about 63%. The drop is not a bug to hide; it is the actual generalisation number, and the one an operator deciding whether to trust the tool should be shown. It also lines up with the mechanism. A horizontal well changes the apparent geometry of every sinusoid the model learned on vertical wells, sparse and unevenly spaced picks give the network fewer and noisier anchors, and intersecting events in one patch are exactly the case a one-event synthetic image never contains. None of those pressures exist on a simulated acoustic bench, which is why a synthetic AUC can sit at 98% while a real blind-well F1 sits about 19 points below its own test-split score on a task that looks like the same one.

For readers who want the geometry behind the sinusoid a fracture traces on an unrolled image, the apparent curve follows

\[ y(\theta) = A\,\sin(\theta + \phi) + d, \qquad A = \tan(\mathrm{Dip}), \quad \theta = 0^\circ, \ldots, 359^\circ \]

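where φ is the phase set by the fracture azimuth, d is the depth of the plane's centre, and depth is measured in units of borehole radius. A small error in dip or azimuth therefore moves the whole curve, and a metre-and-a-half depth tolerance forgives errors that a 2-cm window treats as misses. The tolerance is not a detail of the scoring code; it is the definition of what the model is being asked to do.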
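A short numerical sketch shows the scale of that effect. It makes assumptions the text does not state: a borehole radius of 10.8 cm (roughly an 8.5-inch hole) to put the amplitude in centimetres, a phase taken as the angle plus the azimuth, and a hypothetical 3-degree dip error. The function names are ours.

```python
import math


def fracture_trace(theta_deg, dip_deg, azimuth_deg, depth_cm, radius_cm):
    """Depth of a planar fracture's trace at angle theta on the unrolled image.

    Same form as y = A sin(theta + phi) + d, with the amplitude scaled by an
    assumed borehole radius so the result is in centimetres.
    """
    amplitude = radius_cm * math.tan(math.radians(dip_deg))
    phase = math.radians(theta_deg + azimuth_deg)
    return amplitude * math.sin(phase) + depth_cm


def max_shift_cm(dip_a, dip_b, azimuth_deg=0.0, depth_cm=0.0, radius_cm=10.8):
    """Largest depth difference between two traces over theta = 0..359 degrees."""
    return max(
        abs(
            fracture_trace(t, dip_a, azimuth_deg, depth_cm, radius_cm)
            - fracture_trace(t, dip_b, azimuth_deg, depth_cm, radius_cm)
        )
        for t in range(360)
    )


# Hypothetical case: true dip 60 degrees, predicted dip 63 degrees.
shift = max_shift_cm(60.0, 63.0)
print(f"largest trace displacement: {shift:.2f} cm")
print("exceeds the 2 cm window:", shift > 2.0)
print("inside a 150 cm tolerance:", shift < 150.0)
```

Under these assumptions a 3-degree dip error displaces the crest of the trace by a little under 2.5 cm, which is outside the 2-cm window and far inside a 150-cm one.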

What we report instead

The practice that follows is narrow and we hold to it. We publish the blind-well number and the real-data number, and we print the tolerance next to each score so a reader can see which test it passed. We do not put a synthetic headline in the abstract as if it were the result, and we do not compare our blind-well F1 to a synthetic AUC without stating that the acceptance rules differ by nearly two orders of magnitude. When we cite the synthetic comparator, we cite it as what it is: a clean, single-event classification result under a loose centroid tolerance, useful as a lineage marker and not as a bar to clear [1].
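One way to hold to that practice is to make the tolerance part of the score record itself, so that a number cannot be printed or compared without it. The sketch below is a hypothetical structure, not our reporting code; the class `Score`, its fields, and the `comparable` rule are assumptions made for illustration, and the values are the approximate figures quoted in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    metric: str          # e.g. "F1" or "AUC"
    value: float         # fraction in [0, 1]
    tolerance_cm: float  # match window the score was computed under
    split: str           # e.g. "blind well", "test split", "synthetic"

    def label(self) -> str:
        """Render the score with its tolerance printed next to it."""
        return (f"{self.metric} {self.value:.0%} "
                f"({self.split}, tolerance {self.tolerance_cm:g} cm)")


def comparable(a: Score, b: Score) -> bool:
    """Treat two scores as comparable only if metric and tolerance both match."""
    return a.metric == b.metric and a.tolerance_cm == b.tolerance_cm


blind = Score("F1", 0.63, 2.0, "blind well")
test_split = Score("F1", 0.82, 2.0, "test split")
# 50 px converted at an assumed 3 cm per pixel.
synthetic = Score("AUC", 0.98, 150.0, "synthetic")

for s in (blind, test_split, synthetic):
    print(s.label())
print("blind vs test split comparable:", comparable(blind, test_split))
print("blind vs synthetic comparable:", comparable(blind, synthetic))
```

The equality test on tolerance is deliberately strict. A looser rule, such as allowing tolerances within some ratio of each other, would be a design choice to state alongside the scores.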

The open-set literature points at the same discipline from the other side. Ren and colleagues show that a resistivity image is mostly background no classifier was trained on, so a score that ignores the background flatters the model [2]. A synthetic benchmark removes the background, removes the intersecting events, removes the sparse labels, and loosens the match window, and each removal lifts the number. The real task reverses all four, and the real number is what we owe the reader.

Limitations

This paper is an argument about comparability, not a re-benchmark of the prior result, and it carries that boundary. We did not re-run the synthetic acoustic pipeline on our resistivity data, nor re-run our detector on the synthetic acoustic set, so the claim that the two scores are incomparable rests on their published acceptance rules and data descriptions rather than on a controlled head-to-head. The 50-pixel-to-2-cm conversion assumes the pixel-to-depth scale of our resistivity images, one pixel to 3 cm, and the exact multiple by which the synthetic bench is more forgiving would shift on acquisitions with a different scale. That the tolerances differ by a large factor holds regardless; the precise factor does not. The blind-well figures of about 82% and about 63% are single-split numbers on one horizontal well and one threshold of 0.55, not a distribution over many held-out wells, and a broader blind-well study would state the generalisation gap more tightly than one split can. Finally, the two prior works we lean on describe acoustic and resistivity imaging respectively, and some of the difference between them is physics rather than evaluation choice; we have tried to separate the tolerance argument from the modality argument, but a reader should keep both in view.

References

[1] Dias, L. O., Bom, C. R., Faria, E. L., Valentin, M. B., Correia, M. D., de Albuquerque, M. P., de Albuquerque, M. P., and Coelho, J. M. Automatic detection of fractures and breakouts patterns in acoustic borehole image logs using fast-region convolutional neural networks. Journal of Petroleum Science and Engineering 191 (2020): 107099. Trains a fast-RCNN largely on synthetic acoustic images and reports AUC 98% (fractures) and 90% (breakouts) on simulated data against classical baselines of 81% and 73%, with a true positive defined by a centroid distance tolerance of 50 pixels (fractures) and 150 pixels (breakouts). https://doi.org/10.1016/j.petrol.2020.107099

[2] Ren, S., Han, X., Wang, C., Guo, R., Sun, Y., and Fan, Y. Valuable Data Extraction for Resistivity Imaging Logging Interpretation. Tsinghua Science and Technology 25(2) (2020): 281-293. Argues that most of a resistivity image belongs to none of the trained classes and proposes three binary pre-filtering strategies ahead of a five-class geological feature classifier; notes manual interpretation must run at 1:10 to 1:20 scale. https://doi.org/10.26599/TST.2019.9010020
