Two Attention Layers on the Bottleneck: How Much the Transformer Block Actually Bought Us

When people describe the network we use for raster well-log digitisation, they usually reach for one phrase, that it has attention in it, and then move on as if that settled the matter. It does not, because there are three different things in this architecture that all answer to the word attention, and they live in different places, cost different amounts, and buy different things. This post is about the one that is easiest to wave at and hardest to explain: the two transformer self-attention layers that sit on the U-Net bottleneck. The goal is narrow and specific. I want to say exactly where those two layers are, what they are not, and what feature mixing they perform at the lowest resolution that nothing else in the network can.

The three things in this network that are all called attention

Start with the geography, because the confusion is almost entirely geographic. The body is a five-stage encoder-decoder of the standard U-Net shape [2]. Five stride-2 convolution stages take the scanned page and halve its spatial resolution five times over, each halving trading area for channel depth, until the image has become a small, deep feature grid at 128 channels. Then five upsample stages climb back the other way, doubling resolution five times and thinning the channels, until the output is a per-pixel map over three classes: background, the first traced curve, and the second. The defining device of this shape is the skip connection, which copies each encoder feature map straight across the U and concatenates it onto the matching decoder stage, so the decoder gets to reuse the high-resolution detail the downsampling threw away [2].
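The geometry above can be traced in a few lines. This is a minimal sketch, not the network's code: the 512 x 512 input size and the `stage_sizes` helper are assumptions made for illustration, since the post does not state the real page dimensions.

```python
def stage_sizes(height, width, stages=5):
    """Spatial size after each stride-2 stage.

    Assumes height and width are divisible by 2**stages so every
    halving is exact.
    """
    sizes = [(height, width)]
    for _ in range(stages):
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes


# Hypothetical input size, chosen only to make the arithmetic easy to follow.
sizes = stage_sizes(512, 512)
bottleneck_h, bottleneck_w = sizes[-1]
bottleneck_tokens = bottleneck_h * bottleneck_w

for depth, (h, w) in enumerate(sizes):
    print(f"after {depth} stride-2 stages: {h} x {w}")
print("tokens at the bottleneck:", bottleneck_tokens)
```

Under that assumed input, the five halvings leave a 16 x 16 grid, which flattens to 256 tokens.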

The convolution stages are the first place attention could live, and in this network it does. Inside them we use the convolutional block attention module (CBAM), which is a channel-then-spatial reweighting [3]. It looks at a feature map and decides, per channel, which channels matter, and then, per spatial location, which locations matter, and scales the map accordingly. It is cheap, it is local in the sense that the spatial part uses a small kernel over a pooled map, and crucially it does no token-to-token mixing: it reweights features that the convolution already computed, but it does not let distant pixels exchange information directly. That is the first thing called attention, and it is not the subject of this post.

The second thing is the skip itself. The attention-gated U-Net showed you can put a small gate on each skip path that learns to suppress the parts of the copied encoder map the decoder does not need at that stage [4]. We lean on plain copy-and-concatenate skips rather than learned gates, but the family resemblance is why people fold skips into the attention story. A skip is a connection, not a mixer; it transports a feature map from one depth to another unchanged. That is the second thing, and it is also not the subject of this post.

The third thing is the one I mean. Sitting on the bottleneck, at the very base of the U, after the deepest encoder stage and before the first decoder stage, are two transformer self-attention layers [1]. This is the only place in the network where every position can attend to every other position directly. It is global, it is quadratic in the number of tokens, and per token it is the most expensive operation in the architecture. The reason it is affordable at all is that it runs only at the bottleneck, on the smallest grid the network ever holds, after five rounds of downsampling have shrunk the page to a feature map you can flatten into a short sequence. Put the same two layers on the full-resolution input and the cost would be ruinous; put them at the bottleneck and they are a rounding error on the convolution budget.
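To put rough numbers on the cost argument, the sketch below counts the query-key pairs one self-attention layer has to score when every grid position is a token. The 512 x 512 page size is a hypothetical stand-in, and `attention_pairs` is an illustrative helper, not part of any library.

```python
def attention_pairs(height, width):
    """Query-key pairs scored by one global self-attention layer
    when each grid position is treated as one token."""
    tokens = height * width
    return tokens * tokens


# Hypothetical page size; five stride-2 stages divide each side by 2**5 = 32.
full_h, full_w = 512, 512
bottleneck_h, bottleneck_w = full_h // 32, full_w // 32

full_pairs = attention_pairs(full_h, full_w)
bottleneck_pairs = attention_pairs(bottleneck_h, bottleneck_w)
ratio = full_pairs // bottleneck_pairs

print("pairs at full resolution:", full_pairs)
print("pairs at the bottleneck: ", bottleneck_pairs)
print("reduction factor:        ", ratio)
```

Five halvings per side cut the token count by 4^5 = 1024, so the pair count falls by 1024^2, about a million, whatever the starting size.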

What self-attention does that a convolution cannot

To say what these two layers buy, you have to say what they do that the rest of the network structurally cannot. A convolution mixes information within its receptive field, and the receptive field grows as you stack layers and downsample, but it grows slowly and it stays local at every individual layer. Even at the bottleneck, a single convolution sees only a small neighbourhood of the deep feature grid. Self-attention is the opposite: in one layer, every token computes a weighted combination of every other token, where the weights come from the content of the tokens themselves rather than from a fixed spatial offset [1]. This is global, content-addressed mixing. It is the property the transformer was introduced for in sequence modelling, where the thing you need is for a word to be able to depend on another word far away, and the property the vision transformer carried over to images by treating patches as tokens [5].
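A minimal sketch of that content-addressed mixing, in plain Python. It is a simplification for illustration: a single head, no learned query, key, or value projections, and no positional encoding, all of which a real transformer layer [1] would include. The four toy tokens are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys and values are the tokens themselves
    (identity projections), so the weights depend only on token content.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        )
    return outputs, weights


# Toy sequence: tokens 0 and 3 share content but sit at opposite ends.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

Token 0 puts more weight on token 3, the farthest position, than on its immediate neighbour, because the weights follow content rather than spatial offset.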

Why would a raster log need that, and why specifically at the bottleneck? Because the structures the network exists to recover are not blobs, they are curves: thin traces that run the full height of the page, sometimes crossing, sometimes overprinting where the ink has bled. A curve's identity at the bottom of the page depends on where it was at the top, because it is the same continuous line. A purely local operator has no way to enforce that continuity across the whole image in a single step; it has to propagate the constraint stage by stage and hope it survives. Self-attention can connect the two ends of a curve directly, in one layer, because they are simply two tokens that can attend to each other regardless of how far apart they sit on the grid. The bottleneck is the right place for this not only because it is cheap there but because it is where the representation is most semantic: by the deepest stage the feature grid encodes what is in the image rather than its raw pixels, so global mixing is mixing meaning, not noise.
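The stage-by-stage propagation can be made concrete with a small count. This is a sketch under stated assumptions: a hypothetical 16-row bottleneck grid, stride-1 convolutions applied on that grid only, and an illustrative helper named `conv_layers_to_span`. It ignores the reach already accumulated by the encoder's downsampling.

```python
def conv_layers_to_span(distance, kernel=3):
    """Stride-1 convolution layers needed before a cell can be influenced
    by another cell `distance` positions away on the same grid."""
    if kernel < 3 or kernel % 2 == 0:
        raise ValueError("kernel must be odd and at least 3")
    reach_per_layer = (kernel - 1) // 2
    return -(-distance // reach_per_layer)  # ceiling division


# Hypothetical bottleneck grid height; top row to bottom row is 15 steps.
grid_height = 16
distance = grid_height - 1

layers_3x3 = conv_layers_to_span(distance, kernel=3)
layers_7x7 = conv_layers_to_span(distance, kernel=7)
attention_layers = 1  # every token attends to every other token directly

print("3x3 conv layers needed:", layers_3x3)
print("7x7 conv layers needed:", layers_7x7)
print("self-attention layers: ", attention_layers)
```

The convolution count grows with the distance to be bridged, while the self-attention count stays at one regardless of grid size.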

This is not a design we invented. It is the convolution-plus-attention hybrid that the segmentation field converged on in 2021 and 2022, and the period-correct names are worth crediting precisely. TransUNet put transformer layers on the bottleneck of a convolutional U-Net and kept convolutions for the high-resolution stages, for exactly the cost reason above [6]. The U-Net Transformer added self-attention at the bottleneck and cross-attention on the skips [7]. Swin-Unet went further and built the whole U out of windowed transformer blocks [8]. That is only tractable because the Swin transformer made attention cost linear in image size by restricting it to shifted local windows [9]. UNETR pushed the transformer all the way into the encoder, feeding a convolutional decoder [10]. Our network sits at the conservative end of that axis: convolutions everywhere except two global self-attention layers at the one place a full transformer can be afforded. The contribution we claim is the application and the synthetic-data pipeline behind it, not the attention block, which is standard and credited above.
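The linear-versus-quadratic distinction can be checked with the complexity estimates given in the Swin Transformer paper [9] for global and windowed multi-head self-attention. The grid sizes, channel width, and window size below are illustrative values, not this network's configuration.

```python
def global_msa_cost(h, w, c):
    """Estimate for global self-attention on an h x w grid with c channels:
    4*h*w*c^2 for the projections plus 2*(h*w)^2*c for the attention itself."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h, w, c, m):
    """Estimate for attention restricted to m x m windows:
    the attention term becomes 2*m^2*h*w*c, linear in h*w."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


# Illustrative values only: 96 channels, 7 x 7 windows, grid side doubled.
c, m = 96, 7
small, large = 56, 112

global_growth = global_msa_cost(large, large, c) / global_msa_cost(small, small, c)
window_growth = window_msa_cost(large, large, c, m) / window_msa_cost(small, small, c, m)

print("global attention cost grows by:  ", round(global_growth, 2))
print("windowed attention cost grows by:", round(window_growth, 2))
```

Doubling each side quadruples the number of positions. The windowed cost grows by exactly that factor, while the global cost grows faster because of its squared term.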

Reading where the gain actually lands

The natural question is whether those two layers earn their keep, and the honest answer needs a measurement that distinguishes the classes, because the network's three output masks are not equally hard. The background mask is near-solved by coverage alone: it is most of every frame, so even a mediocre model scores well on it. The two curve masks are the entire reason the model exists and they are genuinely hard, because the curves are one to three pixels wide and the metric punishes a near-perfect trace shifted by a pixel almost as harshly as a miss. If bottleneck self-attention is doing what the theory says, its gain should be concentrated on the curves and nearly invisible on the background, because the background never needed long-range continuity in the first place.
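The asymmetry between the masks is easy to reproduce on a toy page. The sketch below assumes intersection-over-union as the per-mask metric, which is an assumption: the post does not name the metric it uses. The 100 x 100 page and the straight vertical traces are likewise made up for illustration.

```python
def iou(pred, target):
    """Intersection over union of two sets of (row, col) pixels."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0


def vertical_trace(left_col, width, height):
    """Pixels of a straight vertical trace, `width` pixels wide."""
    return {(r, c) for r in range(height) for c in range(left_col, left_col + width)}


height, page_width = 100, 100
page = {(r, c) for r in range(height) for c in range(page_width)}

# A one-pixel-wide trace predicted one pixel to the right of the truth.
thin_true = vertical_trace(50, 1, height)
thin_pred = vertical_trace(51, 1, height)
thin_iou = iou(thin_pred, thin_true)

# The same one-pixel shift on a three-pixel-wide trace.
wide_true = vertical_trace(50, 3, height)
wide_pred = vertical_trace(51, 3, height)
wide_iou = iou(wide_pred, wide_true)

# Background is everything that is not the trace.
background_iou = iou(page - thin_pred, page - thin_true)

print("thin curve IoU:", thin_iou)
print("wide curve IoU:", wide_iou)
print("background IoU:", background_iou)
```

Under these assumptions the shifted one-pixel trace scores the same as a complete miss, while the background mask for the same prediction stays close to perfect.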

Figure: interactive ablation strip for the five-stage encoder-decoder. Toggling the two bottleneck transformer attention layers on and off shows which of the three masks moves: the two thin curve masks lift while the near-solved background barely notices. The architecture counts (five encoder stages, five decoder stages, 128 embedding dimension, two transformer attention layers, three output classes) are the engagement's recorded configuration; the per-mask quality operating points are illustrative, chosen to read against the project's real per-class spread, not a measured layer-by-layer ablation table.

The strip above ablates the block against the five-stage path. Toggle the two transformer layers off and the bottleneck reduces to convolution alone; toggle them on and you restore global mixing at the base of the U. The shape of the result is the argument: the background bar barely moves, because the easy majority class has no use for the long-range link the block provides, while both curve masks step up, because their quality depends on exactly the end-to-end continuity that local operators struggle to enforce. The architecture counts on the strip are the network's own configuration, five encoder stages, five decoder stages, the 128-dimension bottleneck, the two transformer layers, and the three output classes; the per-mask operating points are illustrative values chosen to read against the project's real per-class spread, where the background scores far above the two curves, and they are flagged as such on the plate rather than presented as a measured layer-by-layer table. The point the picture makes is structural, not a leaderboard claim: a global mixing step placed where the grid is small enough to afford it pays where the task needs reach, and almost nowhere else.

There is a second, quieter reading in the same picture. The smaller of the two curve gains is on the curve that overprints, and that is the case where the two traces fuse into a single blob the decoder must still separate. Local features see a merged smudge; a global view at the bottleneck has at least the chance to notice that the smudge participates in two different long-range structures and to keep their tokens distinct before the decoder begins to climb. The block is not a separator on its own, but it gives the decoder a representation in which separation is learnable. That is the most useful way to think about what two attention layers contribute: not a number on a chart, but a property of the feature map handed to the decoder, namely that distant parts of the same curve have already talked to each other.

What this detail is worth knowing

The reason to dwell on one block in one network is that the three-attentions confusion is not harmless. If you believe the skip connections are the attention, you will tune the wrong thing when the thin curves break. If you believe CBAM is doing the long-range work, you will be surprised when a model that scores well on background still loses a curve halfway down a tall page. Naming the bottleneck transformer for what it is, the only global, content-addressed mixing step in the architecture, placed at the only resolution where its quadratic cost is affordable, tells you where to look when continuity fails and where not to waste capacity when it does not. Two layers is a small thing to add to a five-stage encoder-decoder. What they add is the one capability the rest of the network, for all its depth, does not have on its own: the ability for the far ends of a curve to reach each other in a single step.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. NeurIPS (2017). The transformer and self-attention as a mechanism for global, content-based token mixing, with cost quadratic in sequence length. https://arxiv.org/abs/1706.03762

[2] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder with copy-and-concatenate skip connections that became the default segmentation body. https://arxiv.org/abs/1505.04597

[3] Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. CBAM: Convolutional Block Attention Module. ECCV (2018). Channel-then-spatial reweighting inside a convolutional stage, with no global token mixing. https://arxiv.org/abs/1807.06521

[4] Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N. Y., Kainz, B., Glocker, B., and Rueckert, D. Attention U-Net: Learning Where to Look for the Pancreas (2018). An attention gate that reweights the skip features before they reach the decoder. https://arxiv.org/abs/1804.03999

[5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021). A pure transformer over image patches, competitive once patching makes the quadratic cost tractable. https://arxiv.org/abs/2010.11929

[6] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A. L., and Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (2021). Transformer self-attention on the bottleneck of a convolutional U-Net, convolutions kept for the high-resolution stages. https://arxiv.org/abs/2102.04306

[7] Petit, O., Thome, N., Rambour, C., Themyr, L., Collins, T., and Soler, L. U-Net Transformer: Self and Cross Attention for Medical Image Segmentation. MLMI (2021). Self-attention at the bottleneck and cross-attention on the skip paths of a U-Net. https://arxiv.org/abs/2103.06104

[8] Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., and Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation (2021). A U-shaped network built from windowed transformer blocks rather than convolutions. https://arxiv.org/abs/2105.05537

[9] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV (2021). Attention made linear in image size by restricting it to shifted local windows. https://arxiv.org/abs/2103.14030

[10] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H. R., and Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. WACV (2022). A transformer encoder feeding a convolutional decoder, another point on the convolution-plus-attention design axis. https://arxiv.org/abs/2103.10504
