The Reviewer Said Our Results Were Too Beautiful to Be Believed. Here's How We Answered.

A reviewer looked at Figure 9 of our fracture-detection paper, called it too beautiful to be believed, and demanded the code and data on GitHub. A confidentiality agreement made that impossible. This is the first-person account of what we added instead: failure-case figures, sensitivity tests, dip-intensity metrics, and full workflow diagrams, plus a novelty defense we would not trade away. The aim was to stay credible while the artifact stayed closed.

By Tarry Singh, 8 min read
EarthScan insight

The line that stopped me was five words long. A reviewer had read our fracture-detection paper, looked at Figure 9 (the qualitative panel where the model's predicted sinusoids sit almost perfectly on the ground-truth picks), and written that the result was "too beautiful to be believed." Then, in the same breath, the demand: put the code and the data on GitHub so a third party can rerun it. I read it twice. My first instinct was to argue that the figure was real, that we had not cherry-picked, that the overlay was exactly what the model produced. That instinct was wrong, and I am glad I ignored it.

This was a Detection-Transformer model we built with a mid-sized Middle East carbonate operator to pick fractures and bedding planes on borehole image logs. The work sat behind a confidentiality agreement covering producing-field data, so opening the repository was not a decision we got to make. We asked the operator for permission to release the code anyway. They declined. So the reviewer's request and our contract were in direct conflict, and no amount of pleading closed that gap. We have written up the general playbook for publishing under that kind of constraint separately (see Publishing Peer-Reviewed AI Research Under a National-Oil-Company NDA); this piece covers the narrower, more uncomfortable scene: one reviewer, one accusation, and the specific set of things we added to earn belief without the artifact.

Why "prove it's real" and "give us the repo" are not the same request

The accusation and the demand look like one thing but they are two. "Too beautiful to be believed" is a claim about credibility. "Put it on GitHub" is a proposed remedy. We could not supply the remedy, so we had to satisfy the underlying claim another way. A reviewer who suspects a result wants to know it was not staged, that it holds off the training distribution, and that the numbers would survive an adversarial read. A repository is one way to let them check. Under an NDA it is not an available one.

What we owed the reviewer, then, was an evaluation severe enough to convince and open enough to inspect without touching a single confidential byte. A figure that is genuinely too clean is a fair thing to distrust when it arrives alone. The fix is not to defend the pretty figure. It is to surround it with the ugly ones.

The four things we added instead of the code

We rewrote the revision around four disclosures, each of which is ordinary machine-learning hygiene made load-bearing by the fact that it was all the reviewer would get.

First, failure-case figures. The original submission showed the model working. The revision showed it missing. We added panels where predictions were absent, where the parameter error ran higher, where a busy fractured interval confused the set-prediction head. A model that only ever appears in its best light invites exactly the suspicion the reviewer voiced. Putting the misses on the page is the cheapest credibility you can buy, and we had been too proud to spend it the first time.

Second, sensitivity tests. The single most persuasive artifact under confidentiality is a curve that no memorising model could fake. As we swept the number of training wells, classification error fell from around 93% at three wells to roughly 1% at nine, then settled near 2.5% on the full fourteen-well fractures-only model. A model that had memorised patches would not shed nearly two orders of magnitude of error as unrelated geology accumulates. That behaviour is the argument.
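
The size of that drop is easy to check by hand. The sketch below uses only the rounded percentages quoted in this section; the dictionary and function names are illustrative, and treating the fourteen-well figure as a point on the same sweep is an assumption rather than something the rebuttal states.

```python
import math

# Approximate classification error (%) by number of training wells,
# as quoted in the text. The names here are illustrative only.
ERROR_BY_WELLS = {3: 93.0, 9: 1.0, 14: 2.5}


def fold_reduction(start_pct: float, end_pct: float) -> float:
    """Return how many times smaller the error became."""
    return start_pct / end_pct


def orders_of_magnitude(start_pct: float, end_pct: float) -> float:
    """Return the base-10 logarithm of the fold reduction."""
    return math.log10(fold_reduction(start_pct, end_pct))


if __name__ == "__main__":
    baseline = ERROR_BY_WELLS[3]
    for wells in (9, 14):
        error = ERROR_BY_WELLS[wells]
        print(
            f"3 -> {wells} wells: {fold_reduction(baseline, error):.1f}x, "
            f"{orders_of_magnitude(baseline, error):.2f} orders of magnitude"
        )
```

On these rounded figures the drop from three to nine wells is a 93-fold reduction, just under two orders of magnitude, and the drop to the 2.5% figure is about 37-fold.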

Third, dip-intensity metrics. Rather than reporting one headline accuracy, we broke performance out by dip band, so a reader could see where the model was strong and where it weakened. A metric disaggregated by the physics of the feature is much harder to game than a single averaged number, and it tells an honest reader something a scalar cannot.
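
A minimal sketch of that kind of breakdown follows. The band edges, the input format, and the function names are all hypothetical; the paper's actual bands and metric definitions are not reproduced here.

```python
from collections import defaultdict

# Hypothetical dip bands in degrees; not the bands used in the paper.
DIP_BANDS_DEG = [(0, 30), (30, 60), (60, 90)]


def band_label(dip_deg):
    """Return the label of the band containing dip_deg."""
    for low, high in DIP_BANDS_DEG:
        # Upper edges are exclusive, except that exactly 90 degrees
        # is counted in the last band.
        if low <= dip_deg < high or (dip_deg == high == 90):
            return f"{low}-{high}"
    raise ValueError(f"dip {dip_deg} is outside 0-90 degrees")


def recall_by_dip_band(features):
    """Compute the detected fraction per dip band.

    features: iterable of (true_dip_deg, was_detected) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for dip_deg, was_detected in features:
        label = band_label(dip_deg)
        totals[label] += 1
        hits[label] += 1 if was_detected else 0
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Made-up ground-truth features, for illustration only.
    toy = [(10, True), (20, True), (45, True), (50, False), (80, False)]
    print(recall_by_dip_band(toy))
```

In this toy input the overall detected fraction is 0.6, which hides the fact that every miss sits in the two steeper bands. That is the information a single averaged number cannot carry.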

Fourth, full training and inference workflow diagrams. We drew the whole pipeline end to end, from raw log to matched sinusoid parameters, so the method could be understood and criticised in detail even though it could not be rerun. Reproducible in spirit, if not in a git clone.

[Figure: One reviewer exchange, fracture paper first revision. Panels: the challenge ("Too beautiful to be believed"; demand for code and data on GitHub, blocked); what we added instead of the code (01 failure-case figures, 02 sensitivity tests, 03 dip-intensity metrics, 04 workflow diagrams); what we refused, on principle (no pretrained weights, GeoBFDT integration, no YOLO / Mask R-CNN duel); the evidence, restated at full rigour (detection rate against depth-matching threshold in cm, about 65% at 3 cm and about 85% at 8 cm).]
One peer-review exchange from the fracture paper's first revision, read as an arc. A reviewer called Figure 9 too beautiful to be believed and demanded the code and data on GitHub; the confidentiality agreement with the operator blocked any release. Rather than argue, the team added four disclosed moves in the rebuttal, shown in the teal arc: failure-case figures where the model missed or ran higher error, sensitivity tests, dip-intensity metrics, and full training and inference workflow diagrams. The novelty defense, in the lower strip, is what the team refused on principle: no pretrained weights (a ResNet-10 backbone trained from scratch with basic blocks), the custom GeoBFDT integration, and a refused head-to-head against YOLO or Mask R-CNN, because mask methods segment first and then still need manual dip and azimuth picking, a different objective. The orange challenge node is the only element that argues. The right-hand readout is the evidence those moves rest on: detection climbs from about 65 percent at a 3 cm depth-matching threshold to about 85 percent at 8 cm, with the depth-tolerance lever sweeping between the two points. Only the 3 cm and 8 cm detection figures are sourced from the reviewer letters; the dashed line between them is an illustrative guide, not measured intermediate points.

The instrument above reads that exchange as an arc. The orange node is the accusation. The teal arc is the four moves. The panel on the right is the evidence those moves rest on: detection climbing from about 65% at a 3 cm depth-matching threshold to about 85% at 8 cm, the two numbers we restated in the rebuttal so the claim carried a range, not a single flattering point.
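
To make the idea of a depth-matching threshold concrete, here is a small sketch of how a detection rate can depend on the tolerance. The greedy nearest-depth matching rule, the function name, and the depths are assumptions for illustration; the paper's actual matching procedure is not described in this post.

```python
def detection_rate(true_depths_cm, pred_depths_cm, tolerance_cm):
    """Fraction of true features matched by a prediction within tolerance.

    Greedy one-to-one matching: each true depth, in ascending order,
    claims its nearest unused prediction if that prediction lies within
    tolerance_cm. This is a simple stand-in, not an optimal assignment.
    """
    if not true_depths_cm:
        return 0.0
    unused = sorted(pred_depths_cm)
    matched = 0
    for true_depth in sorted(true_depths_cm):
        nearest = min(unused, key=lambda p: abs(p - true_depth), default=None)
        if nearest is not None and abs(nearest - true_depth) <= tolerance_cm:
            unused.remove(nearest)  # each prediction can match only once
            matched += 1
    return matched / len(true_depths_cm)


if __name__ == "__main__":
    # Made-up depths in centimetres, for illustration only.
    true_depths = [100, 200, 300, 400]
    pred_depths = [102, 205, 310, 398]
    for tolerance in (3, 8):
        rate = detection_rate(true_depths, pred_depths, tolerance)
        print(f"{tolerance} cm tolerance: {rate:.2f}")
```

With these toy depths the rate is 0.50 at 3 cm and 0.75 at 8 cm. The same predictions score differently depending on the tolerance, which is why reporting both ends of the range is more honest than reporting one point.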

The novelty defense we would not trade away

There was a second front to this review, and here we pushed back rather than added. The reviewer's framing implied our result was suspiciously good for something assembled from standard parts. Our answer was to be precise about what was standard and what was not.

We used no pretrained weights. Open-source backbones are trained on natural images, and the features they learn (edges of faces, textures of everyday objects) do not transfer to the resistivity texture of a carbonate borehole wall. So the ResNet-10 backbone was trained from scratch, using basic residual blocks rather than bottleneck blocks, deliberately kept small to resist overfitting on a fourteen-well corpus. The transformer, the loss, the matching are known ingredients, but they were integrated into our own detector (we call it GeoBFDT) with a modified architecture and a bespoke evaluation strategy. "Common components" is not the same as "off-the-shelf result," and we said so.

The sharper refusal was a benchmark. The reviewer effectively asked us to run a head-to-head against YOLO or Mask R-CNN. We declined, and not out of fear of the comparison. Those are mask-based methods: they segment a fracture, and then a human still has to pick dip and azimuth off the predicted mask by hand. Our model regresses depth, dip, and azimuth directly, end to end, with no mask and no manual picking. The objectives are different enough that a single accuracy number comparing them would be a category error dressed up as rigour. We had made the same argument about why detection transformers, not mask networks, fit this problem in the first place (the capability-matrix piece lays out that reasoning), and we held the line here.

The sinusoid geometry is why direct regression is possible at all. A planar feature crossing the borehole unrolls into a sine curve whose amplitude encodes dip and whose phase encodes azimuth:

Sinusoid fit: amplitude to dip, phase to azimuth
$$ y = A \cdot \sin\left(\tfrac{\pi}{180}\, x + \varphi\right) + \text{offset} $$

A mask method throws that structure away and reconstructs it by hand afterward. Ours predicts the parameters. Comparing the two on one scalar hides the thing that actually matters.
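
For readers who want to see how sinusoid parameters become geological angles, the sketch below applies the standard cylinder geometry: a plane crossing a borehole of diameter D traces a curve whose peak-to-trough height is 2A, so tan(dip) = 2A / D. Several things here are assumptions rather than details from the paper: the function name, that A and D share the same length unit, that y is depth increasing downward (so the plane dips toward the x where y is largest), that the phase is in radians, and that the result is apparent dip relative to the borehole, with no correction for borehole deviation.

```python
import math


def sinusoid_to_dip_azimuth(amplitude, phase_rad, borehole_diameter):
    """Convert sinusoid parameters to (apparent dip, dip azimuth) in degrees.

    Assumes y = A * sin(pi/180 * x + phase) + offset with x in degrees
    around the borehole and y as depth increasing downward.
    """
    if borehole_diameter <= 0:
        raise ValueError("borehole_diameter must be positive")

    # A negative amplitude is the same curve as a positive amplitude
    # with the phase shifted by pi.
    if amplitude < 0:
        amplitude = -amplitude
        phase_rad += math.pi

    # Peak-to-trough height is 2A across a cylinder of diameter D.
    dip_deg = math.degrees(math.atan2(2.0 * amplitude, borehole_diameter))

    # The deepest point is where the sine argument equals pi/2,
    # i.e. x = 90 degrees minus the phase in degrees.
    azimuth_deg = (90.0 - math.degrees(phase_rad)) % 360.0
    return dip_deg, azimuth_deg


if __name__ == "__main__":
    # Hypothetical values: 0.1 m amplitude in a 0.2 m diameter borehole.
    dip, azimuth = sinusoid_to_dip_azimuth(0.1, 0.0, 0.2)
    print(f"dip = {dip:.1f} deg, azimuth = {azimuth:.1f} deg")
```

With a peak-to-trough height equal to the borehole diameter, the apparent dip comes out at 45 degrees, and a zero phase puts the deepest point at x = 90 degrees under the stated convention. A different depth or phase convention would change the azimuth formula but not the idea.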

What the exchange taught me

The reviewer was, in the end, doing their job well. A too-clean figure standing alone should be distrusted. The mistake was ours: we had led with beauty and buried the rigour. The revision did not make the model better. It made the model's honesty visible. And the credibility we recovered came entirely from evidence we were always allowed to disclose, never from the artifact we were contractually forbidden to release.

That is the transferable lesson for anyone shipping confidential industry AI into peer review. When a reviewer says a result is too good to be true and asks for a repository you cannot provide, do not argue the figure. Add the failures. Add the sweeps that a memorising model could not survive. Disaggregate the metric along the physics. Draw the whole pipeline. Defend your design choices where they are genuinely principled, and refuse the comparisons that would only manufacture a misleading number. The paper was accepted. The code never moved.

Limitations

This is one exchange from one paper, recounted from the author's side, and reviewer motives are inferred from the written record rather than confirmed. The detection figures (about 85% at 8 cm and about 65% at 3 cm) are point restatements from the rebuttal; the curve drawn between them in the instrument is an illustrative guide, not measured intermediate points. Failure-case figures and disaggregated metrics raise the cost of faking a result but do not prove correctness the way an independent rerun would. Our refusal to benchmark against mask-based methods rests on an objective-mismatch argument that a reader may reasonably weigh differently. And "reproducible in spirit" is a weaker guarantee than reproducible in fact, which is precisely the trade a confidentiality agreement forces.

References

[1] Carion, N. et al. (2020). End-to-End Object Detection with Transformers. https://arxiv.org/abs/2005.12872

[2] Lin, T.-Y. et al. (2017). Focal Loss for Dense Object Detection. https://arxiv.org/abs/1708.02002


© 2026 Copyright. Earthscan