The line that stopped me was five words long. A reviewer had read our fracture-detection paper, looked at Figure 9 (the qualitative panel where the model's predicted sinusoids sit almost perfectly on the ground-truth picks), and written that the result was "too beautiful to be believed." Then, in the same breath, the demand: put the code and the data on GitHub so a third party can rerun it. I read it twice. My first instinct was to argue that the figure was real, that we had not cherry-picked, that the overlay was exactly what the model produced. That instinct was wrong, and I am glad I ignored it.
This was a Detection-Transformer model we built with a mid-sized Middle East carbonate operator, picking fractures and bedding planes on borehole image logs. The work sat behind a confidentiality agreement covering producing-field data, so opening the repository was not a decision we got to make. We asked the operator for permission to release code anyway. They declined. So the reviewer's request and our contract were in direct conflict, and no amount of pleading closed that gap. The general playbook for publishing under that kind of constraint we have written up separately (see Publishing Peer-Reviewed AI Research Under a National-Oil-Company NDA); this piece is the narrower, more uncomfortable scene: one reviewer, one accusation, and the specific set of things we added to earn belief without the artifact.
Why "prove it's real" and "give us the repo" are not the same request
The accusation and the demand look like one thing but they are two. "Too beautiful to be believed" is a claim about credibility. "Put it on GitHub" is a proposed remedy. We could not supply the remedy, so we had to satisfy the underlying claim another way. A reviewer who suspects a result wants to know it was not staged, that it holds off the training distribution, and that the numbers would survive an adversarial read. A repository is one way to let them check. Under an NDA it is not an available one.
What we owed the reviewer, then, was a severe evaluation they could inspect without touching a single confidential byte. A figure that is genuinely too clean is a fair thing to distrust when it arrives alone. The fix is not to defend the pretty figure. It is to surround it with the ugly ones.
The four things we added instead of the code
We rewrote the revision around four disclosures, each of which is ordinary machine-learning hygiene made load-bearing by the fact that it was all the reviewer would get.
First, failure-case figures. The original submission showed the model working. The revision showed it missing. We added panels where predictions were absent, where the parameter error ran higher, where a busy fractured interval confused the set-prediction head. A model that only ever appears in its best light invites exactly the suspicion the reviewer voiced. Putting the misses on the page is the cheapest credibility you can buy, and we had been too proud to spend it the first time.
Second, sensitivity tests. The single most persuasive artifact under confidentiality is a curve that no memorising model could fake. Sweeping the number of training wells, classification error falls from around 93% at three wells to roughly 1% at nine, and sits near 2.5% on the full fourteen-well fractures-only model. A model that had memorised patches would not shed nearly two orders of magnitude of error as unrelated geology accumulates. That behaviour is the argument.
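A sweep of this kind can be sketched in a few lines. This is a minimal illustration, not the code from the paper: the function name, the `fit` and `error_rate` callables, the well labels, and the assumption that every model is scored on the same wells held out of training are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence


def well_count_sweep(
    train_wells: Sequence[str],
    held_out_wells: Sequence[str],
    counts: Sequence[int],
    fit: Callable[[List[str]], object],
    error_rate: Callable[[object, List[str]], float],
    seed: int = 0,
) -> Dict[int, float]:
    """Fit on nested subsets of training wells; score each fit on the same held-out wells."""
    overlap = set(train_wells) & set(held_out_wells)
    if overlap:
        # Splitting by well, not by patch, is what keeps leakage out of the curve.
        raise ValueError(f"wells in both splits: {sorted(overlap)}")
    order = list(train_wells)
    random.Random(seed).shuffle(order)  # one fixed order, so each subset contains the previous one
    results: Dict[int, float] = {}
    for n in sorted(counts):
        if not 1 <= n <= len(order):
            raise ValueError(f"cannot train on {n} of {len(order)} wells")
        model = fit(order[:n])
        results[n] = error_rate(model, list(held_out_wells))
    return results


# Toy stand-ins so the sketch runs: the "model" is just the set of wells it saw,
# and the "error" shrinks as that set grows. A real fit/error_rate would wrap the detector.
toy_fit = lambda wells: set(wells)
toy_error = lambda model, held_out: 1.0 / len(model)

curve = well_count_sweep(
    train_wells=[f"W{i}" for i in range(1, 12)],
    held_out_wells=["W12", "W13", "W14"],
    counts=[3, 6, 9],
    fit=toy_fit,
    error_rate=toy_error,
)
```

Keeping the subsets nested and the held-out wells fixed means any change along the curve comes from the added wells alone, not from a reshuffled test set.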
Third, dip-intensity metrics. Rather than reporting one headline accuracy, we broke performance out by dip band, so a reader could see where the model was strong and where it weakened. A metric disaggregated by the physics of the feature is much harder to game than a single averaged number, and it tells an honest reader something a scalar cannot.
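As a rough sketch of what that disaggregation involves, assuming each ground-truth feature carries a dip in degrees and a flag for whether the model found it. The band edges, the record layout, and the function name are illustrative assumptions, not the bands used in the paper.

```python
from typing import Dict, Iterable, List, Tuple


def detection_rate_by_dip_band(
    records: Iterable[Tuple[float, bool]],
    band_edges: List[float],
) -> Dict[str, float]:
    """records: (true_dip_deg, was_detected). band_edges: ascending, e.g. [0, 30, 60, 90]."""
    bands = list(zip(band_edges[:-1], band_edges[1:]))
    hits = [0] * len(bands)
    totals = [0] * len(bands)
    for dip, detected in records:
        for i, (lo, hi) in enumerate(bands):
            is_last = i == len(bands) - 1
            # Bands are half-open [lo, hi), except the last, which includes its upper edge.
            if lo <= dip < hi or (is_last and dip == hi):
                totals[i] += 1
                hits[i] += int(detected)
                break
    # Bands with no ground-truth features are left out rather than reported as 0.
    return {
        f"{lo:g}-{hi:g}": hits[i] / totals[i]
        for i, (lo, hi) in enumerate(bands)
        if totals[i]
    }


# Toy records, invented for illustration only.
toy_records = [(10, True), (20, True), (45, True), (50, False), (80, False), (90, True)]
by_band = detection_rate_by_dip_band(toy_records, [0, 30, 60, 90])
```

Reporting the per-band counts alongside the rates is worth the extra column, since a band with three features and a band with three hundred should not be read with the same confidence.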
Fourth, full training and inference workflow diagrams. We drew the whole pipeline end to end, from raw log to matched sinusoid parameters, so the method could be understood and criticised in detail even though it could not be rerun. Reproducible in spirit, if not in a git clone.
The instrument above reads that exchange as an arc. The orange node is the accusation. The teal arc is the four moves. The panel on the right is the evidence those moves rest on: detection climbing from about 65% at a 3 cm depth-matching threshold to about 85% at 8 cm, the two numbers we restated in the rebuttal so the claim carried a range, not a single flattering point.
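To make "detection at a depth-matching threshold" concrete, here is a minimal sketch under stated assumptions: a prediction counts as a detection if it lies within the tolerance of a ground-truth pick, each pick and each prediction can be used once, and depths are in metres. The matching rule, the function name, and the toy depths are illustrative, not the paper's evaluation code.

```python
from typing import Sequence


def detection_rate(
    true_depths_m: Sequence[float],
    pred_depths_m: Sequence[float],
    tol_m: float,
) -> float:
    """Fraction of ground-truth picks matched one-to-one by a prediction within tol_m."""
    truth = sorted(true_depths_m)
    preds = sorted(pred_depths_m)
    if not truth:
        raise ValueError("need at least one ground-truth pick")
    i = j = matched = 0
    # Walk both sorted lists together, pairing the shallowest compatible pick and prediction.
    while i < len(truth) and j < len(preds):
        gap = preds[j] - truth[i]
        if abs(gap) <= tol_m:
            matched += 1
            i += 1
            j += 1
        elif gap < 0:
            j += 1  # prediction is too shallow to match this or any deeper pick
        else:
            i += 1  # pick is too shallow to match this or any deeper prediction
    return matched / len(truth)


# Toy depths, invented for illustration only.
toy_truth = [100.00, 101.00, 102.00, 103.00]
toy_preds = [100.02, 101.05, 102.50, 103.07]
rate_3cm = detection_rate(toy_truth, toy_preds, 0.03)
rate_8cm = detection_rate(toy_truth, toy_preds, 0.08)
```

Evaluating the same predictions at several tolerances is what turns a single flattering point into a range: the rate can only stay flat or rise as the tolerance loosens, so the tight end of the range is the honest stress test.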
The novelty defence we would not trade away
There was a second front to this review, and here we pushed back rather than added. The reviewer's framing implied our result was suspiciously good for something assembled from standard parts. Our answer was to be precise about what was standard and what was not.
We used no pretrained weights. Open-source backbones are trained on natural images, and the features they learn (edges of faces, textures of everyday objects) do not transfer to the resistivity texture of a carbonate borehole wall. So the ResNet-10 backbone was trained from scratch, using basic residual blocks rather than bottleneck blocks, deliberately kept small to resist overfitting on a fourteen-well corpus. The transformer, the loss, the matching are known ingredients, but they were integrated into our own detector (we call it GeoBFDT) with a modified architecture and a bespoke evaluation strategy. "Common components" is not the same as "off-the-shelf result," and we said so.
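For readers who want the size difference in numbers, the usual ResNet naming counts weighted layers: one stem convolution, the convolutions inside the residual blocks, and one final fully connected layer. The sketch below applies that convention. The stage layout `[1, 1, 1, 1]` is the conventional reading of "ResNet-10" and is an assumption here, since the exact layout is not spelled out above.

```python
from typing import Sequence

BASIC_CONVS = 2       # a basic residual block stacks two 3x3 convolutions
BOTTLENECK_CONVS = 3  # a bottleneck block stacks 1x1, 3x3, 1x1 convolutions


def resnet_depth(blocks_per_stage: Sequence[int], convs_per_block: int) -> int:
    """Weighted-layer count under the standard naming: stem conv + block convs + final FC."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


depth_small = resnet_depth([1, 1, 1, 1], BASIC_CONVS)         # assumed ResNet-10 layout
depth_18 = resnet_depth([2, 2, 2, 2], BASIC_CONVS)            # standard ResNet-18 layout
depth_50 = resnet_depth([3, 4, 6, 3], BOTTLENECK_CONVS)       # standard ResNet-50 layout
```

One basic block per stage is about as shallow as the family gets, which fits the stated aim of resisting overfitting on a small corpus.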
The sharper refusal was a benchmark. The reviewer effectively asked us to run a head-to-head against YOLO or Mask R-CNN. We declined, and not out of fear of the comparison. Those are mask-based methods: they segment a fracture, and then a human still has to pick dip and azimuth off the predicted mask by hand. Our model regresses depth, dip, and azimuth directly, end to end, with no mask and no manual picking. The objectives are different enough that a single accuracy number comparing them would be a category error dressed up as rigour. We had made the same argument about why detection transformers, not mask networks, fit this problem in the first place (the capability-matrix piece lays out that reasoning), and we held the line here.
The sinusoid geometry is why direct regression is possible at all. A planar feature crossing the borehole unrolls into a sine curve whose amplitude encodes dip and whose phase encodes azimuth.
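One common way to write that curve is given below. The symbols are assumptions introduced here for illustration rather than notation taken from the paper: z is depth along the borehole (positive downward), θ is the angle around the borehole wall, z_0 is the depth at which the plane crosses the borehole axis, r is the borehole radius, δ is the dip and α the dip azimuth, both measured relative to the borehole axis.

```latex
z(\theta) = z_0 + r \tan\delta \, \cos(\theta - \alpha),
\qquad
A = r \tan\delta
\;\;\Longrightarrow\;\;
\delta = \arctan\!\left(\frac{A}{r}\right)
```

Under this convention the deepest point of the trace sits at θ = α, so the phase of the sinusoid reads off the azimuth directly, and the amplitude A together with the borehole radius gives the dip.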
A mask method throws that structure away and reconstructs it by hand afterward. Ours predicts the parameters. Comparing the two on one scalar hides the thing that actually matters.
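A short sketch shows why three numbers are enough. It assumes the convention z(θ) = z0 + r·tan(dip)·cos(θ − azimuth) with depth positive downward and angles relative to the borehole axis; the function names and the example values are illustrative, not taken from the paper.

```python
import math
from typing import Tuple


def sinusoid_depth(
    theta_rad: float, z0_m: float, dip_deg: float, azimuth_deg: float, radius_m: float
) -> float:
    """Depth of the plane's trace on the borehole wall at angle theta."""
    amplitude = radius_m * math.tan(math.radians(dip_deg))
    return z0_m + amplitude * math.cos(theta_rad - math.radians(azimuth_deg))


def dip_azimuth_from_sinusoid(
    amplitude_m: float, phase_rad: float, radius_m: float
) -> Tuple[float, float]:
    """Invert the trace: amplitude and borehole radius give dip, phase gives azimuth."""
    dip_deg = math.degrees(math.atan2(amplitude_m, radius_m))
    azimuth_deg = math.degrees(phase_rad) % 360.0
    return dip_deg, azimuth_deg


# Illustrative values: a 0.1 m radius borehole and a trace with 0.1 m amplitude.
dip, azimuth = dip_azimuth_from_sinusoid(0.1, math.radians(90.0), 0.1)
deepest = sinusoid_depth(math.radians(90.0), 1000.0, 45.0, 90.0, 0.1)
```

A model that outputs depth, amplitude and phase has therefore already output depth, dip and azimuth, whereas a mask still needs a curve fitted to it before any of these can be read.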
What the exchange taught me
The reviewer was, in the end, doing their job well. A too-clean figure standing alone should be distrusted. The mistake was ours: we had led with beauty and buried the rigour. The revision did not make the model better. It made the model's honesty visible. And the credibility we recovered came entirely from evidence we were always allowed to disclose, never from the artifact we were contractually forbidden to release.
That is the transferable lesson for anyone shipping confidential industry AI into peer review. When a reviewer says a result is too good to be true and asks for a repository you cannot provide, do not argue the figure. Add the failures. Add the sweeps that a memorising model could not survive. Disaggregate the metric along the physics. Draw the whole pipeline. Defend your design choices where they are genuinely principled, and refuse the comparisons that would only manufacture a misleading number. The paper was accepted. The code never moved.
Limitations
This is one exchange from one paper, recounted from the author's side, and reviewer motives are inferred from the written record rather than confirmed. The detection figures (about 85% at 8 cm and about 65% at 3 cm) are point restatements from the rebuttal; the curve drawn between them in the instrument is an illustrative guide, not measured intermediate points. Failure-case figures and disaggregated metrics raise the cost of faking a result but do not prove correctness the way an independent rerun would. Our refusal to benchmark against mask-based methods rests on an objective-mismatch argument that a reader may reasonably weigh differently. And "reproducible in spirit" is a weaker guarantee than reproducible in fact, which is precisely the trade a confidentiality agreement forces.
References
[1] Carion, N. et al. (2020). End-to-End Object Detection with Transformers. https://arxiv.org/abs/2005.12872
[2] Lin, T.-Y. et al. (2017). Focal Loss for Dense Object Detection. https://arxiv.org/abs/1708.02002