A reviewer of our carbonate vug-quantification paper asked a fair question: how does your method stack up against the published path-morphology approach? We agreed it was worth answering. Then we tried to answer it, and ran into a wall that has nothing to do with method quality and everything to do with how our field publishes. None of the studies we cite release their code. Not the rival vug method, not the fracture methods, not ours. So the honest benchmark we could actually run was small, indirect, and in the end unusable. This is the story of that attempt, because the attempt is more instructive than any accuracy number we could have printed.
The comparison a reviewer wanted
The rival method is a classical computer-vision pipeline built on path-morphology operators. It separates rock pore space into matrix, fractures, and vugs, and it works on dynamic image logs, where a sliding-window rescaling sharpens the contrast of thin features. Our method is also classical, with no deep learning, but it works on static logs on purpose. Dynamic normalisation sharpens fractures and bedding, which for our task is noise: we are counting dissolution pores, and the raw static electrical measurement keeps the conductive vug blobs distinct instead of dressing up every hairline crack around them. So the two methods do a related thing on deliberately different inputs. That mismatch matters later.
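The difference between the two inputs is easy to see on a toy example. The sketch below is neither group's pipeline. It assumes a one-dimensional conductivity trace in arbitrary units and plain min-max rescaling, and the function names, window size, and values are all illustrative. Real image logs are two-dimensional, and production dynamic normalisation may use a different windowed scheme.

```python
from typing import List


def static_normalise(values: List[float]) -> List[float]:
    """Rescale with a single global min/max over the whole interval."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def dynamic_normalise(values: List[float], half_window: int) -> List[float]:
    """Rescale each sample with the min/max of a sliding depth window."""
    out = []
    n = len(values)
    for i, v in enumerate(values):
        window = values[max(0, i - half_window): min(n, i + half_window + 1)]
        lo, hi = min(window), max(window)
        span = hi - lo
        out.append(0.0 if span == 0 else (v - lo) / span)
    return out


# Toy trace: flat background, a faint hairline feature at index 3,
# and a strongly conductive vug-like blob at indices 12 to 14.
trace = [1.0] * 20
trace[3] = 1.5
for i in (12, 13, 14):
    trace[i] = 10.0

static = static_normalise(trace)
dynamic = dynamic_normalise(trace, half_window=2)

# Static scaling keeps the faint feature faint relative to the blob;
# windowed scaling stretches it to the same level as the blob.
print(round(static[3], 3), static[13])   # 0.056 1.0
print(dynamic[3], dynamic[13])           # 1.0 1.0
```

Under these assumptions the hairline feature is barely visible after static scaling and indistinguishable from the blob after windowed scaling, which is the behaviour a fracture detector wants and a vug counter does not.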
To compare them properly you need one of two things: the rival's code, so you can run it on your wells, or your labelled data handed to them so they can run theirs. We had neither, and neither did they.
Running a method on a screenshot
With no code to install and no dataset to share, we did the only concrete thing available. The rival paper prints a figure of its dynamic-log input and the threshold maps it produces. We took a screenshot of that figure, fed the image into our own pipeline, and looked at what our thresholding produced against what theirs did on the same visible pixels.
The result was encouraging in the narrow way a screenshot can be: our pipeline produced comparable threshold maps on that single figure. For a moment the comparison looked feasible. But a screenshot of one published figure is not a benchmark. It is a few hundred pixels lifted out of a paper, with no depth registration and no raw intensity range. You can eyeball agreement. You cannot report a number and call it fair.
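The missing raw intensity range is what makes a number unreportable, and a toy case shows why. In the sketch below, the values and the function name are hypothetical, and the min-max mapping to grey levels is an assumption, not necessarily how the published figure was rendered. Two traces with different raw ranges produce identical pixels.

```python
from typing import List


def render_to_grey(raw: List[float]) -> List[int]:
    """Map raw values to 0..255 grey levels by min-max scaling (assumed)."""
    lo, hi = min(raw), max(raw)
    span = hi - lo
    if span == 0:
        return [0] * len(raw)
    return [round(255 * (v - lo) / span) for v in raw]


# Two hypothetical raw traces with the same shape but different ranges.
raw_a = [2.0, 4.0, 10.0, 4.0, 2.0]
raw_b = [50.0, 150.0, 450.0, 150.0, 50.0]

pixels_a = render_to_grey(raw_a)
pixels_b = render_to_grey(raw_b)

# A threshold expressed in raw units separates the two traces differently.
RAW_THRESHOLD = 5.0
mask_a = [v > RAW_THRESHOLD for v in raw_a]
mask_b = [v > RAW_THRESHOLD for v in raw_b]

print(pixels_a == pixels_b)  # True: the rendered pixels are identical
print(mask_a == mask_b)      # False: the raw-unit masks are not
```

Once the figure is rendered, the pixels alone cannot say which raw range they came from, so any threshold tied to the original measurement cannot be recovered from a screenshot.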
The request that went unanswered
So we took the correct next step. We wrote to the authors and asked for permission to use their figure properly and for the raw data behind it, so we could run a real comparison on shared ground.
We never heard back.
That is the whole event, and it is the load-bearing one. No refusal to argue with, no licensing terms to negotiate, no dataset arriving with strings attached. Silence. And because none of the cited studies open-source their code either, there was no fallback path, no repository to clone, no reference implementation to reproduce the maps ourselves. The comparison had exactly one door, and no one answered it.
Two reasons to exclude, either one sufficient
We dropped the comparison from the manuscript, and it is worth being precise about why, because there were two independent reasons and only one of them is the silence.
The first is access: no permission, no raw data, no code. On its own that ends the comparison.
The second would have ended it even if the authors had written back the next morning. Their method reads dynamic logs; ours reads static. A head-to-head on that basis is unfair to both. Sharpening designed to make fractures pop is the exact contrast our vug detector is built to avoid, so scoring the two on the same interval would measure the input choice, not the methods. The interactive above lets you grant the permission that never came and watch the comparison fail anyway on this static-versus-dynamic mismatch. Silence was never the only thing standing in the way; it was just the first.
Excluding a comparison a reviewer asked for is uncomfortable. The alternative, printing a number derived from a screenshot and a guess at the other method's inputs, would have been worse. A benchmark you cannot reproduce is not evidence. It is decoration with error bars.
The mirror: our own code is closed too
This is where the story stops being a complaint about other people. We could not get their code. Nobody can get ours.
Our pipeline is closed under a confidentiality agreement with the operator whose producing-field data trained and validated it. When a different reviewer, on the fracture side of the same programme, asked us to post code and data to a public repository, we did not refuse on principle. We went back to the client and formally asked to release it. We were declined. The data is tied to active producing fields, the corresponding author is a doctoral candidate bound by the same agreement, and the answer was no. (We have written elsewhere about the mechanics of getting peer-reviewed work out at all under a national-oil-company NDA; this piece is about the benchmarking consequence, not the publishing one.)
So the wall cuts both ways. We could not read their method; a stranger cannot read ours. Every group in this corner of the field is running on data it cannot share and code it cannot post, and then citing each other across that gap with printed figures as the only common currency. The reproducibility problem is not that any one lab is secretive. It is structural. When the underlying data is somebody's producing reservoir, closed code is the default, and honest benchmarking degrades to what we did: a screenshot, a polite email, and silence.
What honest benchmarking actually looked like
Strip the story to its shape and it is a short chain. We reach as far as a public figure lets us. We get an encouraging but unusable result. We ask for the real thing. We hear nothing. We exclude the comparison and say so in the paper. Then we admit our own code is behind the same kind of wall.
None of that is satisfying, but it is honest, and it repeats across almost any pair of groups in subsurface image-log research today. The useful takeaway is not that we behaved well by dropping a weak comparison, though we would do it again. It is that the field's evidence base is quietly capped by its data agreements. Until operators release even anonymised evaluation sets, or until neutral shared benchmarks exist that no single client owns, comparisons between published methods will keep bottoming out at the screenshot and the unanswered email. We can report our own numbers cleanly. We cannot yet place them next to anyone else's without a wall in the way, and neither can they with ours.
Limitations
This account is about a process, not a result. It prints no comparative accuracy figures, because the only comparison we could run was not reportable. The screenshot experiment produced qualitatively comparable threshold maps on a single published figure; we make no quantitative claim from it, and it should not be read as evidence that either method outperforms the other. The static-versus-dynamic mismatch we describe is specific to the vug-detection task and does not generalise to fracture or bedding detection, where dynamic logs are the correct input. Finally, both the rival method's closed code and our own closed code reflect the norms of one industry and one set of data agreements; the reproducibility gap we describe may be narrower in fields where evaluation data is public.
References
[1] Li, X.-N., Shen, J.-S., Yang, W.-Y., Li, Z.-L. (2019). Automatic fracture-vug identification and extraction from electric imaging logging data based on path morphology. Petroleum Science, 16, 58-76. https://doi.org/10.1007/s12182-018-0282-6