A reviewer of our fracture-detection paper asked for something reasonable on its face: show the attention maps, painted back onto the borehole-image log, so a geologist can see which pixels the model looked at when it placed a fracture. We could not do it. Working out why turned into the clearest way to explain a design decision we had made much earlier and never fully justified in print: our detector runs no non-maximum suppression at all. The interpretability limit and the missing post-processing stage are the same architectural fact seen from two sides. This piece is about that fact.
Most of the mechanics of the detector we have written up before. For the sinusoid-to-label derivation and why a Detection Transformer regresses (depth, dip, azimuth) rather than painting masks, see the "Borehole Image Logs 101" primer; we will not re-derive it here. The narrow claim of this piece is more contrarian: a detector is not end-to-end because someone stamped the phrase on the diagram, but because of a specific stage it does not contain.
The stage a classic detector cannot delete
Take the YOLO and R-CNN families, the detectors most people reach for first. They share a shape. A convolutional backbone produces features, a head emits many candidate boxes, and because those candidates pile up many-to-one on every real object, the boxes overlap heavily. Something has to prune them. That something is non-maximum suppression: sort the candidates by confidence, keep the top one, delete every other box that overlaps it beyond a threshold, and repeat on what remains. It works. It is also a separate algorithm that runs after the network, with its own threshold and no gradients flowing through it, so the loss can never reach back and teach the network about the choice it just made.
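The sort, keep, delete, repeat loop described above can be sketched in a few lines. This is a minimal illustration of greedy non-maximum suppression, not the code of any particular YOLO or R-CNN release; the (x1, y1, x2, y2) box layout, the function names, and the 0.5 overlap threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2) corners.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of surviving boxes, highest score first."""
    # Sort candidate indices by confidence, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)          # keep the top remaining candidate
        keep.append(best)
        # Delete every remaining box that overlaps it beyond the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping candidates on one object, plus one distant box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

Nothing in this loop is learned. The comparison against iou_threshold is a hard cut with no gradient, which is the property the rest of this section turns on.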
The network never learns that two of its boxes were redundant. It emits redundancy on purpose, trusts a hand-tuned procedure to clean up, and that procedure's threshold becomes one more knob you tune by hand and hope generalises. Calling such a pipeline end-to-end is a stretch: the gradient path stops at the box head, and the last decision about what the detector outputs is made outside the model.
What we put in its place
The Detection Transformer [1] removes the pileup at the source. Instead of thousands of anchored candidates, the decoder carries a small fixed set of learned queries, and each query is responsible for at most one object. There is no forest of overlapping boxes to thin out, so there is nothing for non-maximum suppression to do. The redundancy that NMS exists to remove never gets generated.
Removing the redundancy is not free, though. You still have to decide which query answers for which ground-truth fracture, and you have to do it inside training, where the loss can act on the result. That is what bipartite Hungarian matching does. At each training step the matcher finds the single lowest-cost one-to-one assignment between the set of predictions and the set of ground-truth fits, and the loss is computed against that assignment. A prediction that matches a real fracture is pulled toward it; every unmatched query is pushed to the no-object class. The tie-breaking that a classic detector defers to a post-processing threshold happens here, inside the training loop. The assignment itself is a discrete step, but the loss computed against it is differentiable, so the gradient teaches the network to emit one prediction per fracture instead of leaving duplicates for a later stage to remove. In our engagement letters we stated it plainly: no non-maximum suppression is used, unlike YOLO and R-CNN, and that is what makes the method end-to-end. The claim is not decoration. It names the stage we removed and the mechanism that replaced it.
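To make the one-to-one assignment concrete, here is a minimal sketch. It is not our training code, and it deliberately swaps the Hungarian algorithm for a brute-force search over permutations, which returns the same minimum-cost assignment on a toy problem but would be far too slow at real scale. The cost values are made up for illustration; in practice each entry would combine a class term and a parameter-distance term, and the names here are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def min_cost_assignment(cost: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Lowest-cost one-to-one assignment of queries to ground-truth fits.

    cost[q][g] is the cost of letting query q answer for ground-truth g.
    Assumes there are at least as many queries as ground-truth fits.
    Brute force stands in for the Hungarian algorithm in this sketch.
    """
    n_queries = len(cost)
    n_gt = len(cost[0]) if cost else 0
    best_pairs: List[Tuple[int, int]] = []
    best_total = float("inf")
    # Each permutation picks n_gt distinct queries, one per ground-truth fit.
    for chosen in permutations(range(n_queries), n_gt):
        total = sum(cost[q][g] for g, q in enumerate(chosen))
        if total < best_total:
            best_total = total
            best_pairs = [(q, g) for g, q in enumerate(chosen)]
    return best_pairs, best_total


# Four queries, two ground-truth fractures (illustrative costs only).
# Queries 0 and 1 are both close to fracture 0: a would-be duplicate.
cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.9, 0.6],
]
pairs, total = min_cost_assignment(cost)
matched = {q for q, _ in pairs}
unmatched = [q for q in range(len(cost)) if q not in matched]
print(pairs, unmatched)
```

In this toy case query 1 is nearly as good a fit for fracture 0 as query 0, yet only one of them can win the assignment. The loser lands in unmatched and would be trained toward the no-object class, which is the training-time counterpart of the duplicate that NMS would otherwise delete after the fact.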
So "end-to-end" has an operational test. Trace the gradient from the loss to the input; if it passes through every decision the detector makes about its final output, the detector is end-to-end, and if the last decision is a thresholded prune the loss cannot see, it is not. Ours passes because there is no prune. It is worth heading off one confusion here: we report at an inference probability threshold of 0.5, and that is not a smuggled NMS. It is a single per-prediction boundary applied to each query independently, object versus no-object; it prunes nothing across queries and breaks no ties between boxes. Removing NMS did not remove all thresholds, only the one that had to run outside the model and could not be trained.
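The difference between this threshold and NMS is easiest to see in code. The sketch below is an illustration with made-up probabilities; the function name is hypothetical, and whether the boundary at 0.5 is inclusive is an assumption the text does not settle.

```python
from typing import List


def keep_queries(object_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Indices of queries whose object probability clears the threshold.

    Each query is judged on its own probability only. No query's fate
    depends on any other query, so no ties are broken between predictions.
    Assumption: the boundary is inclusive (>=).
    """
    return [i for i, p in enumerate(object_probs) if p >= threshold]


# Illustrative per-query object probabilities for one image patch.
probs = [0.92, 0.48, 0.51, 0.03]
kept = keep_queries(probs)

# The same query gets the same verdict when judged alone, which would
# not hold for NMS, where a neighbour's score can delete a box.
kept_alone = [i for i, p in enumerate(probs) if keep_queries([p]) == [0]]
print(kept, kept_alone)
```

Compare this with a suppression loop, where a box survives or dies depending on what else is in the candidate list. Here the list could be shuffled or truncated and every remaining query would get the same answer.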
The same fact, pointed at interpretability
Now the reviewer's request. Self-attention is where a transformer decides what relates to what, so an attention map should, in principle, tell you where the model looked. The catch is where in the pipeline attention lives. It does not run on the log. It runs on the feature map the ResNet backbone produces, and that map has shape [Batch, 256, 50, 23]. The input borehole-image patch is 800 pixels tall and 360 wide. By the time attention runs, the 800 x 360 image has been compressed to a 50 x 23 grid of 256-channel vectors. Every spatial cell of that grid stands in for a wide tile of the original log.
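As a rough picture of what "a wide tile" means, the sketch below maps one cell of the 50 x 23 grid to the block of log pixels it stands in for. It assumes an even partition of the 800 x 360 patch across the grid, which is a simplification for illustration: a real backbone's receptive fields overlap and reach further than the tile. The function name is hypothetical.

```python
from typing import Tuple

IMAGE_H, IMAGE_W = 800, 360   # input borehole-image patch, from the text
GRID_H, GRID_W = 50, 23       # spatial size of the feature map, from the text


def cell_to_tile(row: int, col: int) -> Tuple[int, int, int, int]:
    """First-order pixel tile (top, bottom, left, right) for one grid cell.

    Ranges are half-open: pixels top..bottom-1 and left..right-1.
    Assumes the image is split evenly across the grid.
    """
    if not (0 <= row < GRID_H and 0 <= col < GRID_W):
        raise IndexError("cell outside the attention grid")
    top = row * IMAGE_H // GRID_H
    bottom = (row + 1) * IMAGE_H // GRID_H
    left = col * IMAGE_W // GRID_W
    right = (col + 1) * IMAGE_W // GRID_W
    return top, bottom, left, right


first_tile = cell_to_tile(0, 0)
last_tile = cell_to_tile(GRID_H - 1, GRID_W - 1)

# The tiles cover every pixel of the patch exactly once.
total_pixels = 0
for r in range(GRID_H):
    for c in range(GRID_W):
        t, b, l, rt = cell_to_tile(r, c)
        total_pixels += (b - t) * (rt - l)
print(first_tile, last_tile, total_pixels)
```

Each tile is 16 pixels tall and 15 or 16 pixels wide under this partition. Any weight attached to a cell is attached to the whole tile at once; nothing in the cell says which pixel inside it mattered.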
Do the arithmetic and the interpretability limit stops being a matter of taste. The raw patch is 288,000 pixels; the attention grid is 1,150 cells, so one attention cell corresponds to roughly 250 log pixels (288,000 / 1,150). There is no faithful way to paint a 50 x 23 attention weight back onto an 800 x 360 image, because the map you would be painting was computed on a grid two orders of magnitude coarser, by area, than the thing you want to draw on. Attention rollout [2], the standard recipe, averages the heads, adds an identity matrix to keep each token's own signal, renormalises the rows, and multiplies the layer matrices recursively across the 4 encoder and 4 decoder layers. Every one of those operations stays on the compressed grid it started on. Rollout gives you a clean heat map over 1,150 cells; it cannot give you a meaningful one over 288,000 pixels, because the information to place a weight that precisely was compressed away at the backbone, on purpose, so the transformer would train on the few wells we had.
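The rollout recipe is short enough to sketch, and the sketch makes the limit visible: every matrix in it is tokens by tokens, and the tokens are grid cells. This is a minimal pure-Python illustration on a three-token toy, not our analysis code. It covers square self-attention layers only; how to fold in the decoder's cross-attention is left out, and the attention values are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square-compatible matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def average_heads(heads: List[Matrix]) -> Matrix:
    """Element-wise mean of the per-head attention matrices of one layer."""
    n = len(heads[0])
    return [[sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
            for i in range(n)]


def add_identity_and_normalise(attn: Matrix) -> Matrix:
    """Add the identity (each token's own signal), then renormalise rows."""
    n = len(attn)
    out: Matrix = []
    for i in range(n):
        row = [attn[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        row_sum = sum(row)
        out.append([v / row_sum for v in row])
    return out


def attention_rollout(layers: List[List[Matrix]]) -> Matrix:
    """layers[l][h] is the attention of head h in layer l, first layer first."""
    n = len(layers[0][0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:
        rollout = matmul(add_identity_and_normalise(average_heads(heads)), rollout)
    return rollout


# Toy input: 3 tokens, 2 layers, 2 heads per layer (illustrative values).
head_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
head_b = [[0.2, 0.4, 0.4], [0.4, 0.2, 0.4], [0.4, 0.4, 0.2]]
result = attention_rollout([[head_a, head_b], [head_b, head_a]])
print(len(result), len(result[0]))
```

With the real feature map the token count would be 50 x 23 = 1,150, so the result is a 1,150 by 1,150 matrix. No step in the recursion introduces an index finer than a grid cell, so the output can be reshaped to 50 x 23 and no further.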
This is the join we did not expect when we started writing the rebuttal. The compression that makes the feature map small enough to attend over efficiently is the same compression that severs the path back to the log. And the query-set formulation that lets us attend over that small map, rather than sliding boxes across the full image, is exactly the formulation that dispenses with non-maximum suppression. One architectural commitment, made for tractability on a 14-well dataset, produced both the end-to-end property we wanted and the interpretability limit we had to own.
One likely objection deserves a direct answer: our pipeline has preprocessing, and plenty of it, so how can it be end-to-end? Dynamic normalisation, sentinel-value imputation, and the apparent-to-true dip and azimuth correction all run before the network, yet none of it disqualifies the method. End-to-end is a statement about the learned part, about whether the gradient reaches every decision the model makes about its output. A detector with almost no preprocessing but an NMS threshold at the end is not end-to-end, however clean its front looks. The test is the gradient path, not the number of steps.
What we would tell the next team
If you are choosing a detector for a scientific measurement, where the output is a parameter vector a downstream user will trust, the presence or absence of non-maximum suppression is not a minor implementation detail. It decides whether the last thing your model says is something the model was trained to say. We skipped NMS, took on Hungarian matching in exchange, and got a detector whose every output decision is inside the differentiable graph. The cost surfaced later, at review time, as an attention overlay we could not honour, because the same compression that made the architecture trainable put attention two orders of magnitude away from the log. We would make the trade again, but state it up front: a reviewer will eventually ask to see the attention, and the answer was fixed the moment you chose to compress the image in the ResNet backbone.
Limitations
The end-to-end argument is about the gradient path, not a claim of superiority over NMS-based detectors on every metric; we did not benchmark head-to-head against a tuned YOLO or R-CNN, because the mask-and-pick objective of those families differs from direct parameter regression. The interpretability limit is specific to projecting spatial attention rollout back onto the log, not a claim that the model is uninterpretable in other ways. The 250-log-pixels-per-cell figure is a first-order area ratio, not a measured receptive field; the true effective receptive field of a cell is broader still. All numbers reflect a single engagement with a mid-sized Middle East carbonate operator across 14 vertical wells, and choices that follow from a small-well dataset may not transfer where data is abundant enough to fine-tune a heavier ResNet backbone.
References
[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. European Conference on Computer Vision (ECCV). https://arxiv.org/abs/2005.12872
[2] Abnar, S., and Zuidema, W. (2020). Quantifying Attention Flow in Transformers. Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2005.00928