When a Detection Transformer picks a fracture on a borehole image log, the geologist behind it wants the one thing the metrics do not answer: where on the image did the model actually look? A depth, a dip, and an azimuth fall out of the network with three-decimal confidence, but a sinusoid pick no human can interrogate is a pick no petrophysicist will sign. Attention rollout is the technique that closes that gap. It folds the raw, layer-by-layer self-attention of a transformer encoder into a single saliency map you can lay over the image strip — so the model can point, in effect, at the sine wave it traced. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, interpretability was not a nicety bolted on at the end of the fracture-detection pipeline; it was the difference between a research demo and something an interpretation team would trust. This piece explains how rollout works, the exact tensor gymnastics it needs, and — just as important — the one place on a borehole image log where it quietly stops working, and what we did instead.
Why a single attention layer lies to you
A transformer encoder is a stack of self-attention layers. Each layer, for every position in its input, produces a probability distribution over all the other positions — how much should I attend to you? It is tempting to grab the attention matrix from the last layer, reshape it, and call it an explanation. That picture is almost always misleading.
The problem is that attention composes. The output of layer one is the input to layer two, and what layer two attends to is already a blend of everything layer one mixed together. A token that looks like it receives little direct attention in the final layer may nonetheless be enormously influential, because its information was relayed forward through earlier layers. Reading one layer in isolation tells you about one hop in a long chain of message-passing — not about the information flow from the original image patches to the final representation the model decides on. For a fracture pick, the question that matters is the end-to-end one: which input patches, all the way down at the image, drove this output? A single layer cannot answer it.
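A toy illustration of the composition effect, as a minimal sketch. The three-token, two-layer attention matrices are invented for the example, and residual connections are deliberately left out so that only the relay between layers shows.

```python
import numpy as np

# Hypothetical attention matrices: row i is how token i distributes its attention.
layer1 = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],   # token 1 takes everything from token 0
                   [0.0, 0.0, 1.0]])
layer2 = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])  # token 2 reads only token 1

direct = layer2[2]                 # what the last layer alone reports for token 2
end_to_end = (layer2 @ layer1)[2]  # attention flow traced back to the inputs

print(direct)      # [0. 1. 0.]
print(end_to_end)  # [1. 0. 0.]
```

Read in isolation, the last layer says token 2 ignores token 0 entirely; traced end to end, token 0 is the only input it depends on.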
There is a second, subtler trap. Modern transformers carry a residual connection around every attention block — the layer adds its attention-mixed output back to its input rather than replacing it. So the effective mixing at each layer is not the attention matrix alone; it is the attention matrix plus an identity term that passes each token's own signal straight through. Ignore the residual and you systematically overstate how much tokens move information between positions and understate how much each token simply keeps to itself.
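In symbols, the residual argument can be sketched as follows. This is a simplification that ignores value projections, layer normalisation and the feed-forward sublayer, and the symbols X_l (token representations entering layer l) and A_l (head-averaged attention at layer l) are notation introduced here for illustration.

```latex
X_{l+1} = X_l + A_l X_l = (I + A_l)\,X_l
\quad\Longrightarrow\quad
\hat{A}_l = \tfrac{1}{2}\,(A_l + I)
```

Each row of A_l sums to one, so each row of I + A_l sums to two; dividing by two restores a matrix whose rows are again probability distributions.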
The rollout recipe
Attention rollout, introduced by Abnar and Zuidema, is the minimal correction for both problems at once. The recipe is short enough to state in one breath, and in our pipeline it was three lines of tensor code.
- Average the heads. Multi-head attention gives you several attention matrices per layer — one per head, each a different learned relation. Rollout collapses them by taking the mean across heads, yielding a single attention matrix per layer. (Averaging is the conservative default; you can also take the max per position if you want to chase the most attentive head, but the mean is what we used.)
- Add the identity, then renormalise. To account for the residual connection, add an identity matrix to each layer's averaged attention and re-normalise the rows so they sum to one again. Concretely: take half the attention plus half the identity. This bakes the "a token also attends to itself" pathway directly into the mixing matrix, so the rollout reflects the network's actual information flow, residual included.
- Recursively multiply across layers. Now propagate. Start with the first layer's corrected matrix and left-multiply it, layer by layer, by each subsequent corrected matrix, so that the later layer always sits on the left of the product (with rows indexing the attending position). Because matrix multiplication chains the per-layer distributions, the running product at layer k is the total attention flow from the input positions all the way to layer k. Multiply through every encoder layer and you have a single matrix describing how each output position draws on every input image patch.
That is the whole algorithm: average the heads, add an identity matrix, recursively multiply across layers. What you get out is a row — the rollout for the query position you care about — that you reshape back onto the spatial grid of the image and render as a heatmap. The bright regions are the patches the model's attention ultimately concentrated on; for a fracture query, a well-behaved model lights up the sinusoid.
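The three steps can be sketched in NumPy as below. This is a minimal illustration rather than the pipeline's own code: the function name `attention_rollout`, the `[heads, tokens, tokens]` input layout, and the convention that row i holds the attention paid by query token i are all assumptions, and the usage runs on random stand-in attention.

```python
import numpy as np

def attention_rollout(layer_attentions):
    """Roll out a list of per-layer attention arrays, first layer first.

    Each array is assumed to have shape [heads, tokens, tokens], with row i
    holding the attention that query token i pays to every key token.
    """
    n_tokens = layer_attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for attn in layer_attentions:
        mixed = attn.mean(axis=0)                          # average the heads
        mixed = 0.5 * mixed + 0.5 * identity               # add the residual identity
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)  # keep rows summing to one
        rollout = mixed @ rollout                          # later layer on the left
    return rollout


# Usage on random stand-in attention (6 layers, 8 heads, 10 tokens).
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 8, 10, 10))
weights = np.exp(logits)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
result = attention_rollout(list(weights))
print(result.shape)  # (10, 10)
```

Every row of the result is still a probability distribution over input tokens, which is what lets a single row be rendered as a heatmap.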
The schematic that accompanies this section illustrates the mechanism, not a metric: a residual encoder feeds a self-attention transformer bottleneck whose long-range span links a fracture in the upper third of a track to its continuation in the lower third. That is exactly the cross-position relay that a single CNN receptive field cannot see and that rollout is built to surface. No numbers ride on it; the geometry is schematic, because rollout's value is qualitative by construction: it tells you where, not how much.
The tensor gymnastics that actually bite
The recipe is clean on a whiteboard. On a real Detection Transformer it lands you in shape-bookkeeping that is worth walking through, because it is where most first attempts break.
A DETR-style model runs a convolutional backbone first, then flattens the resulting feature map into a sequence for the transformer encoder. Take the canonical reference configuration: a ResNet-50 backbone with a downsampling factor of 32 turns an input image into a feature map of shape [1, 2048, 25, 34] — a 25×34 grid of 2048-channel feature vectors. The encoder flattens that grid into a sequence of 25 × 34 = 850 tokens. Self-attention is all-pairs, so each encoder layer's attention matrix is 850 × 850: every one of the 850 spatial locations attending to every other.
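The arithmetic behind those shapes, as a short sketch. The 800 × 1066 input size is an assumption chosen because it reproduces the quoted feature map; the helper `feature_grid` and its ceiling-style rounding are likewise illustrative.

```python
import math

def feature_grid(height, width, stride):
    """Spatial size of a backbone feature map, assuming ceil-style rounding."""
    return math.ceil(height / stride), math.ceil(width / stride)

# Assumed input size; only the resulting feature-map shape is given in the text.
rows, cols = feature_grid(800, 1066, stride=32)
tokens = rows * cols
print(rows, cols, tokens)    # 25 34 850
print(tokens * tokens)       # 722500 entries in one attention matrix
```

The quadratic growth in the last line is why the encoder works on the downsampled grid rather than on pixels.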
Rollout's recursive multiply happens in that 850 × 850 space. The result is still 850 × 850, but to see it as a saliency map you must invert the flattening. You pick the row for the query position, drop it back onto the grid, and end up reshaping the [850, 850] attention into [25, 34, 25, 34]: a 25×34 map of 25×34 attention images, one heatmap per spatial query. Index the query you care about, upsample its 25×34 heatmap to the original image resolution, and overlay. The two reshape operations, [850] → [25, 34] and [850, 850] → [25, 34, 25, 34], are the entire trick to going from "a matrix the model uses" to "a picture a geologist reads," and getting the row/column ordering wrong is the classic way to produce a beautiful, confident, and completely meaningless heatmap.
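A sketch of the two reshapes and the ordering check, using a random stand-in for the rollout matrix. It assumes row-major flattening of the grid (flat index = row × width + column) and that rows of the matrix index the query; the query cell and the 800 × 1066 image size are hypothetical.

```python
import numpy as np

H, W = 25, 34                      # feature-grid height and width
rng = np.random.default_rng(0)
rollout = rng.random((H * W, H * W))
rollout /= rollout.sum(axis=-1, keepdims=True)   # stand-in for a real rollout

# [850, 850] -> [25, 34, 25, 34]: first two axes index the query, last two the key.
maps = rollout.reshape(H, W, H, W)

qy, qx = 12, 20                    # hypothetical query cell
heat = maps[qy, qx]                # 25 x 34 heatmap for that query

# Row-major flattening means the query's flat index is qy * W + qx.
assert np.array_equal(heat, rollout[qy * W + qx].reshape(H, W))

# Nearest-neighbour upsample by the backbone stride, then crop to the image.
stride, img_h, img_w = 32, 800, 1066   # image size is an assumption
overlay = np.kron(heat, np.ones((stride, stride)))[:img_h, :img_w]
print(overlay.shape)  # (800, 1066)
```

If the model's matrix stores keys on the rows instead, indexing the first two axes silently shows the wrong picture, so it is worth checking the convention against a query whose answer you already know.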
This same hook-based extraction — register a forward hook on the encoder's attention module, capture the matrices, run rollout offline — is how the reference DETR tooling visualises encoder self-attention, and it is the approach our fracture model's interpretability layer was built on. Nothing about it is geoscience-specific; it is generic transformer plumbing. Which is exactly why the place it fails on a borehole image log is so instructive.
The honest caveat: where rollout stops working
Here is the part that does not appear in the tutorials, and the part that mattered most in the field.
Rollout visualises encoder self-attention — the attention computed over the transformer's input tokens. But in a DETR-derived detector, those tokens are not the borehole image. They are the backbone feature map. By the time the data reaches the encoder, a ResNet has already compressed the original log — in our fracture model the high-resolution borehole image log enters at roughly 800 × 360 pixels and emerges from the backbone as a feature tensor of shape [batch, 256, 50, 23]. The self-attention runs over that 50 × 23 grid of abstract features, not over the raw image strip the geologist is actually looking at.
That has a hard consequence we had to be straight about when the work went through peer review: you cannot cleanly plot encoder-attention rollout back onto a raw borehole image log, because the attention does not live in the log's coordinate frame — it lives in a heavily downsampled, learned feature space whose 50 × 23 cells each smear together a sizeable depth window of the original image. Upsampling that grid produces a blurry region-of-interest, not the crisp "the model traced this sine wave" overlay the rollout demos promise on a clean ImageNet photo. The technique is sound; the resolution and the coordinate mismatch are what defeat it on this particular signal. Claiming a pixel-accurate attention map over a raw borehole image log would have been an overclaim, and we said so.
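To put a number on the smear, here is a small sketch mapping one feature cell back to the pixel block it nominally covers. The stride of 16 is inferred from the quoted shapes (800 / 50 = 16) rather than stated, treating the 800-pixel axis as depth is an assumption, and `cell_to_pixel_window` is a hypothetical helper.

```python
def cell_to_pixel_window(row, col, stride=16, img_h=800, img_w=360):
    """Pixel block nominally covered by one feature cell, clipped to the image."""
    top, left = row * stride, col * stride
    bottom = min(top + stride, img_h)
    right = min(left + stride, img_w)
    return (top, bottom), (left, right)

print(cell_to_pixel_window(0, 0))     # ((0, 16), (0, 16))
print(cell_to_pixel_window(49, 22))   # ((784, 800), (352, 360))
```

The backbone's receptive field is wider than this nominal block, so the window is best read as a lower bound on how much of the log each attention cell blends together.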
So the discipline is twofold. First, be precise about what you are visualising — encoder self-attention over backbone features is a real and useful diagnostic for whether the model's representation is spatially coherent, but it is not a per-pixel explanation of a pick. Second, when stakeholders need genuine pick-level interpretability on the log itself, lean on the things that do live in the log's frame: the regressed sinusoid parameters drawn back onto the image, the depth-thresholded true-positive overlays a petrophysicist can verify against ground truth, and per-patch failure cases shown as images. Rollout earns its place as an internal model-debugging instrument — it tells the ML team whether attention is concentrating sensibly — rather than as the customer-facing "proof" of a pick.
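Drawing a regressed sinusoid back onto the log can be sketched as below. The conventions are assumptions for illustration: columns span 0 to 360 degrees around the borehole, rows increase with depth, and the amplitude is given directly in pixels (for a planar feature it scales with borehole radius times the tangent of the dip, converted through the vertical pixel size). The function `fracture_trace_mask` and its parameters are hypothetical.

```python
import numpy as np

def fracture_trace_mask(center_row, amplitude_px, azimuth_deg, img_h=800, img_w=360):
    """Boolean mask of a sinusoidal fracture trace on an unrolled image log."""
    cols = np.arange(img_w)
    theta = 2.0 * np.pi * cols / img_w                  # column -> angle around the borehole
    rows = center_row + amplitude_px * np.cos(theta - np.deg2rad(azimuth_deg))
    rows = np.clip(np.rint(rows).astype(int), 0, img_h - 1)
    mask = np.zeros((img_h, img_w), dtype=bool)
    mask[rows, cols] = True                             # one trace pixel per column
    return mask

mask = fracture_trace_mask(center_row=400, amplitude_px=30, azimuth_deg=90)
print(mask.sum())            # 360, one pixel per column
print(mask[:, 90].argmax())  # 430, deepest point of the trace
```

Because the mask lives in the log's own pixel grid, a petrophysicist can compare it against the visible feature directly, with no upsampling step in between.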
What this buys an engineering team
Treated honestly, attention rollout is one of the cheapest high-leverage tools in a subsurface-AI toolkit: a few lines of post-hoc code with zero training cost, no architecture change, and no extra labels. It surfaces whether a transformer encoder is attending to geologically plausible structure or to edge artefacts and tool-banding — the kind of silent failure a classification metric will happily hide. And working out the [25, 34, 25, 34] reshape forces you to understand your model's spatial bookkeeping end to end, which pays for itself the next time a tensor-shape bug masquerades as a modelling problem.
The deeper lesson is one we kept relearning across this programme: interpretability is a systems property, not a single method. Rollout answers "where did the encoder attend, in feature space?" — a real question with a real answer. It does not answer "which pixels of the log produced this pick?", and pretending otherwise is how teams lose the geoscientists they are trying to win. Pick the explanation whose coordinate frame matches the claim you want to make.
Key takeaways
- A single transformer attention layer is a misleading explanation: attention composes across layers and a residual connection routes each token's own signal forward, so one layer shows one hop, not end-to-end information flow from the input patches.
- Attention rollout corrects both at once — average the heads, add an identity matrix and renormalise (to model the residual), then recursively multiply the per-layer matrices to get total attention flow from input to output.
- The tensor bookkeeping is the real work: a ResNet-50 backbone (downsample factor 32) yields a [1,2048,25,34] feature map, flattened to 850 tokens, so each encoder self-attention matrix is 850×850 and reshapes to [25,34,25,34] — one heatmap per spatial query — for overlay.
- Honest caveat: rollout visualises encoder self-attention over backbone features, not the raw log. In a DETR-style fracture model the ~800×360 high-resolution borehole image log is compressed to a [batch,256,50,23] feature grid, so attention cannot be cleanly plotted back onto a raw borehole image log at pixel resolution — an overclaim we declined to make.
- Use rollout as an internal model-debugging diagnostic (is attention spatially coherent?), and reserve customer-facing pick-level interpretability for things that live in the log's own coordinate frame: regressed sinusoid overlays, depth-thresholded true-positive maps, and per-patch failure images.
References
[1] Abnar, S., and Zuidema, W. Quantifying Attention Flow in Transformers. ACL (2020). The original attention-rollout formulation: average heads, add the residual identity, and recursively multiply layer attention matrices to estimate input-to-output attention flow. https://arxiv.org/abs/2005.00928
[2] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers (DETR). ECCV (2020). Source of the encoder self-attention extraction and the [2048, 25, 34] feature-map / 850-token reshape this piece works through. https://arxiv.org/abs/2005.12872