A Plain Tour of Segmentation Architectures, From FCN to Modern U-Nets

Pixel-wise segmentation is one of those problems that sounds settled until you try to build a network for it. The goal is plain enough: hand the model an image, get back a label for every pixel. The difficulty is hidden in a tension that every design in this lineage is really an answer to. A network that understands an image well has thrown away its spatial detail, because understanding comes from pooling and striding the resolution down until a few hundred channels summarise the whole frame. A network that keeps every pixel sharp has never zoomed out far enough to know what it is looking at. Segmentation needs both at once, fine location and coarse meaning, and the family tree of architectures is a sequence of increasingly clever ways to have them together.

This is a guided tour of that tree. We will not chase benchmark numbers or stack a leaderboard. We will follow one thread instead: how each design carries high-resolution detail forward, and how it climbs back from a tiny feature map to a full-resolution mask. Read in that frame, the jump from a 2015 fully convolutional network to a modern attention-gated U-Net stops looking like a pile of unrelated tricks and starts looking like a single conversation, each paper answering the one before it.

The image that became a label map

The first move that made everything else possible was almost embarrassingly simple to describe and genuinely hard to see at the time. A classification network ends in dense layers that collapse the whole image into one verdict. Long, Shelhamer, and Darrell replaced those dense layers with convolutions, so the network stopped emitting a single class and started emitting a coarse grid of classes, one per region. [1] That reframing, from "what is in this image" to "what is at each location," is the founding act of modern segmentation, and the architecture got a name to match: the fully convolutional network.
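The reframing can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes a toy feature grid stored as nested lists, uses global average pooling to stand in for the classifier's flattening step, and the names `dense_scores`, `classify_image` and `label_map` are hypothetical. The point it shows is that the same class weights, applied at every grid location the way a 1x1 convolution would apply them, yield a coarse grid of labels where the classifier yields one.

```python
def dense_scores(weights, features):
    """One dense layer: a score per class for a single feature vector."""
    return [sum(w * f for w, f in zip(class_weights, features))
            for class_weights in weights]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def classify_image(weights, feature_grid):
    """Classifier view: pool the whole grid into one vector, emit one label."""
    cells = [cell for row in feature_grid for cell in row]
    pooled = [sum(channel) / len(cells) for channel in zip(*cells)]
    return argmax(dense_scores(weights, pooled))


def label_map(weights, feature_grid):
    """Fully convolutional view: apply the same weights at every location."""
    return [[argmax(dense_scores(weights, cell)) for cell in row]
            for row in feature_grid]


# Two classes, two feature channels; hand-picked illustrative weights.
weights = [[1.0, 0.0],   # class 0 responds to channel 0
           [0.0, 1.0]]   # class 1 responds to channel 1

# A 2x2 grid of feature vectors; only the bottom-right cell favours class 1.
feature_grid = [[[3.0, 1.0], [3.0, 1.0]],
                [[3.0, 1.0], [0.0, 2.0]]]

single_label = classify_image(weights, feature_grid)  # one verdict for the image
coarse_labels = label_map(weights, feature_grid)      # one verdict per region
```

The classifier reports class 0 for the whole image, while the label map keeps the one region that disagrees.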

There was an immediate problem, and it is the problem the rest of this story keeps revisiting. The output grid was coarse, because the network had downsampled hard to build up meaning. Upsampling that grid back to the input size with a single crude step produced a blurry, blobby mask that lost every thin edge. The fully convolutional network's own answer was the first skip connection: take the coarse, semantically rich prediction near the end of the network, and add it to one or two earlier feature maps that still held some spatial precision. The combination was sharper than either alone. Pause on this, because the skip connection introduced here, the idea of routing early high-resolution information around the bottleneck and fusing it back in late, is the single most important structural device in everything that follows.
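The fusion step described above can be sketched as follows. It is a simplified illustration under stated assumptions: score maps are nested lists for a single class, nearest-neighbour upsampling stands in for the learned transposed convolution the fully convolutional network actually used, and the function names are hypothetical.

```python
def upsample2x(score_map):
    """Nearest-neighbour 2x upsampling of a 2-D grid (a list of rows)."""
    out = []
    for row in score_map:
        wide = [value for value in row for _ in range(2)]  # repeat horizontally
        out.append(wide)
        out.append(list(wide))                             # repeat vertically
    return out


def fuse_skip(coarse_scores, fine_scores):
    """Upsample the coarse scores to the finer grid and add them element-wise."""
    up = upsample2x(coarse_scores)
    assert len(up) == len(fine_scores) and len(up[0]) == len(fine_scores[0])
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine_scores)]


# Coarse 2x2 scores for one class: confident about the right half of the image.
coarse = [[0.0, 4.0],
          [0.0, 4.0]]

# Scores from an earlier 4x4 layer: weaker, but they place the boundary
# one pixel further left in the top two rows.
fine = [[0.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]

fused = fuse_skip(coarse, fine)
```

The fused map keeps the coarse map's confidence about the right half and gains the finer layer's evidence about where the edge actually sits.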

Two ways to climb back to full resolution

Once you accept that the network must shrink the image and then grow it back, the design question becomes concrete: how, exactly, do you grow it back? Two answers appeared almost together in 2015, and the difference between them is genuinely instructive.

One school built a heavy, learned decoder. Noh and colleagues mirrored the encoder with a deep stack of deconvolution and unpooling layers, letting the network learn its own way back up to full resolution rather than relying on a fixed interpolation. [4] The decoder became as substantial as the encoder, a full inverse of the contraction.

SegNet took a leaner, more memory-conscious route to the same goal. Badrinarayanan, Kendall, and Cipolla noticed that the expensive part of carrying spatial detail across to a decoder is storing whole encoder feature maps. So they stored almost nothing: only the indices of where each max-pooling operation had picked its winner. [2] On the way back up, the decoder used those stored indices to place activations back in exactly the spots they came from, then ran convolutions to fill in the now-sparse map. The payoff was crisp object boundaries at a fraction of the memory, because the precise location of every pooled feature was preserved without keeping the feature itself. Two philosophies, then, for the upsampling half: learn a decoder as heavy as the encoder, or record only pooling positions and reconstruct from them with something lighter. Hold that distinction, because the architecture that won this corner of the field chose a third option.
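The index trick is easy to see on a small grid. The sketch below is illustrative and not SegNet's implementation: it assumes a single-channel 4x4 feature map held as nested lists with even height and width, 2x2 pooling with stride 2, and hypothetical function names.

```python
def max_pool_with_indices(grid):
    """2x2 max pooling with stride 2.

    Returns the pooled values and, for each one, the (row, col) it came from.
    """
    pooled, indices = [], []
    for r in range(0, len(grid), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(grid[0]), 2):
            window = [(grid[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(2) for dc in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


features = [[1.0, 9.0, 2.0, 0.0],
            [3.0, 4.0, 0.0, 7.0],
            [5.0, 0.0, 0.0, 0.0],
            [0.0, 6.0, 8.0, 1.0]]

pooled, indices = max_pool_with_indices(features)
sparse = max_unpool(pooled, indices, height=4, width=4)
```

Only four positions are stored for the sixteen-value map, and the unpooled result is sparse: the decoder's convolutions are what fill in the zeros around each restored winner.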

The shape that the field settled on

U-Net is the design most people picture when they hear "segmentation network," and it earned that status by answering the skip-connection question more generously than anyone before it. Ronneberger, Fischer, and Brox, working on biomedical microscopy where labelled data is painfully scarce, built a symmetric architecture: a contracting encoder on the way down, a mirror-image expanding decoder on the way up, and a skip connection at every single stage. [3] Crucially, those skips do not merely add a coarse prediction late, as the fully convolutional network did, and they do not pass only indices, as SegNet did. They copy the entire encoder feature map and concatenate it to the matching decoder stage, so the decoder at each resolution has both the upsampled semantic context from below and the full, untouched high-resolution detail from the encoder side.
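A little shape bookkeeping makes the copy-and-concatenate idea concrete. The sketch below only traces tensor shapes and trains nothing. Its assumptions are illustrative: a square 256-pixel input, padded convolutions so that encoder and decoder resolutions match without cropping, each deeper stage halving the resolution, and an upsampling block that outputs as many channels as the skip it meets. The channel widths follow the familiar 64-to-1024 progression, and `trace_unet_shapes` is a hypothetical helper.

```python
def trace_unet_shapes(size, widths):
    """Trace channel counts and resolutions through a U-Net-style network.

    size   -- input height and width (assumed square)
    widths -- channels at each encoder stage, shallowest first
    Returns (encoder, decoder), where encoder is a list of
    (channels, resolution) and decoder is a list of
    (resolution, upsampled_channels, skip_channels, concatenated_channels).
    """
    encoder = []
    resolution = size
    for stage, channels in enumerate(widths):
        if stage > 0:
            resolution //= 2  # each deeper stage halves height and width
        encoder.append((channels, resolution))

    decoder = []
    _, resolution = encoder[-1]  # start from the bottleneck
    for skip_channels, skip_resolution in reversed(encoder[:-1]):
        resolution *= 2  # upsample back to the matching encoder stage
        assert resolution == skip_resolution
        upsampled_channels = skip_channels  # assumption: up-block matches the skip
        concatenated = upsampled_channels + skip_channels
        decoder.append((resolution, upsampled_channels, skip_channels, concatenated))
    return encoder, decoder


encoder, decoder = trace_unet_shapes(256, [64, 128, 256, 512, 1024])

for resolution, up, skip, cat in decoder:
    print(f"{resolution:>3} px: {up} upsampled + {skip} skipped = {cat} channels")
```

At every decoder stage half of the concatenated channels come straight from the encoder, which is the sense in which the decoder sees untouched high-resolution detail alongside the upsampled context.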

Two consequences made U-Net the workhorse. First, thin structure survives. A one-pixel-wide membrane, or a faint curve on a scanned sheet, is exactly the kind of detail that vanishes in a pooled bottleneck and reappears intact through a concatenated skip. Second, it learns from very few examples. The original paper leaned on heavy data augmentation to stretch its scarce annotations, and the architecture helps too: it encodes such a strong prior about how location and meaning relate that the network does not have to learn that relationship from scratch. The biomedical origin is not a quirk to skip over. A neuron membrane and a printed log curve are the same target in different clothing, a thin connected line winding through a noisy field, which is why U-Net travelled so readily from microscopy into document-image and geoscience work.

Teaching the skips to pay attention

By 2018 the U-Net frame was settled enough that the interesting work moved inside it rather than around it. The skip connection that made U-Net great also carried a liability: it copied everything, including large stretches of irrelevant background, and then asked the decoder to sort the useful detail from the noise on its own. The refinement was to make the skip selective.

Oktay and colleagues introduced the attention gate. Before a skipped feature map reaches the decoder, a small learned gate inspects it alongside a signal from the coarser decoder stage and produces a soft spatial mask, suppressing the regions that the deeper, more semantic part of the network considers irrelevant. [6] The skip still carries high-resolution detail, but now it carries weighted detail, focused on the object and quiet about the background. Around the same time, CBAM offered a complementary and even cheaper form of the same instinct: a small module that reweights a feature map first by channel and then by space, sharpening what a convolutional stage emphasises without changing its shape. [7] Both are descendants of the broader idea that a network should be able to weigh some of its own activations more heavily than others, the principle that self-attention had formalised for sequences the year before. [8]
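The gate described above can be sketched per pixel. This is a simplified illustration of an additive attention gate in the general form used in [6], not the published implementation. It assumes the gating signal has already been resampled to the skip's resolution, writes the 1x1 convolutions as per-pixel matrix-vector products, and uses hand-picked weights where a real network would learn them. All names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product; stands in for a 1x1 convolution at one pixel."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def attention_gate(skip, gate, w_x, w_g, bias, psi, psi_bias):
    """Additive attention gate over grids of per-pixel channel vectors.

    alpha = sigmoid(psi . relu(W_x x + W_g g + bias) + psi_bias)
    Returns the gated skip features and the soft spatial mask.
    """
    gated, alphas = [], []
    for skip_row, gate_row in zip(skip, gate):
        out_row, alpha_row = [], []
        for x, g in zip(skip_row, gate_row):
            hidden = [max(0.0, a + b + c)
                      for a, b, c in zip(matvec(w_x, x), matvec(w_g, g), bias)]
            logit = sum(p * h for p, h in zip(psi, hidden)) + psi_bias
            alpha = 1.0 / (1.0 + math.exp(-logit))
            alpha_row.append(alpha)
            out_row.append([alpha * value for value in x])
        gated.append(out_row)
        alphas.append(alpha_row)
    return gated, alphas


# One row, two pixels, two skip channels: identical detail at both pixels.
skip = [[[1.0, 2.0], [1.0, 2.0]]]
# One gating channel from the coarser decoder stage: only the left pixel
# is considered part of the object.
gate = [[[3.0], [0.0]]]

gated, alphas = attention_gate(
    skip, gate,
    w_x=[[0.5, 0.0]], w_g=[[2.0]], bias=[-1.0],  # one hidden unit
    psi=[2.0], psi_bias=-3.0,
)
```

With these weights the left pixel passes through almost unchanged and the right pixel is suppressed to a small fraction of its value, even though the skip carried the same detail at both locations.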

These attention variants did not replace U-Net; they upgraded one specific component of it, the skip, from a faithful copier into a discriminating one. That is the pattern of the whole period. Nothing in this lineage gets thrown away. The encoder-decoder spine from the fully convolutional network, the symmetric skips from U-Net, the learned-versus-index decoder debate from SegNet and the deconvolution networks, the selective gating from the attention work: each refinement keeps the previous skeleton and sharpens one joint.

Figure: A click-through tour of the segmentation architecture family tree, framed by the two design questions every dense-prediction network answers: how it carries high-resolution detail forward (the skip-connection design) and how it climbs back to full resolution (the upsampling design). Fully convolutional networks (2015) sum a coarse map with a couple of earlier feature maps and upsample with transposed convolution; SegNet (2015) passes only max-pool indices and unpools into them; U-Net (2015) copies and concatenates the full encoder feature map at every stage; attention U-Nets (2018) gate those skips so the decoder attends to what matters. The line ends at the encoder-decoder we use for raster well-log digitisation, which we call VeerNet: five encoder stages, five decoder stages, a 128-channel feature depth, and two self-attention layers on the bottleneck. Method names, years and design notes are documented prior art; the small encoder and decoder block glyphs are illustrative node positions, not a measured layer dump.

Where this thread runs out, and into our work

Trace the family tree to its tip and a practical architecture falls out of it rather than being invented against it. For reading curves off scanned raster well logs, the design we use, which we call VeerNet, is squarely a member of this family and inherits its answers wholesale. It keeps the U-Net answer to the detail question, copy-and-concatenate skips at every stage, because a printed log curve is precisely the thin, connected target that those skips were built to preserve. It keeps the symmetric encoder-decoder answer to the resolution question, a stack of stride-2 convolutions descending and a mirror stack of upsample blocks climbing back. The one addition is where the lineage was already pointing: a short pair of self-attention layers placed on the compressed bottleneck, where the feature map is small enough that all-pairs comparison is affordable, giving the network the long-range continuity reasoning that lets a single curve stay coherent down a metres-long sheet. Five encoder stages, five decoder stages, a feature depth of 128 channels, and two attention layers, which is U-Net's body with an attention-refined heart.
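A quick back-of-envelope calculation shows why the bottleneck is the affordable place for all-pairs attention. The numbers rest on assumptions the text does not state: a 512 by 512 input tile, and each of the five encoder stages halving the resolution. The helper names are hypothetical.

```python
def bottleneck_positions(height, width, num_stride2_stages):
    """Spatial positions left after repeated stride-2 downsampling."""
    factor = 2 ** num_stride2_stages
    return (height // factor) * (width // factor)


def all_pairs(positions):
    """Pairwise comparisons made by one self-attention layer."""
    return positions * positions


TILE = 512    # assumed input tile size in pixels, not taken from the article
STAGES = 5    # five encoder stages, assumed to halve the resolution each

full_positions = TILE * TILE
small_positions = bottleneck_positions(TILE, TILE, STAGES)

full_cost = all_pairs(full_positions)      # attention at input resolution
small_cost = all_pairs(small_positions)    # attention on the bottleneck

print(f"bottleneck positions: {small_positions}")
print(f"pairs at full resolution: {full_cost:,}")
print(f"pairs on the bottleneck:  {small_cost:,}")
print(f"reduction factor: {full_cost // small_cost:,}")
```

Under these assumptions the bottleneck holds 256 positions, so one attention layer compares about 65 thousand pairs where full resolution would need tens of billions.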

We are not claiming a new branch of the tree. We are claiming a sensible position on an existing one, chosen because the problem and the lineage line up. A free-form curve on a degraded scan is a thin-structure segmentation target whose meaning at any point depends on points far away, and the field had already produced exactly the components that target needs: skips that protect fine detail, an encoder-decoder that reconciles location with meaning, and attention that reaches across the page. Knowing the family tree is what tells you that, and it is also what saves you from reinventing a decade of careful, composable work that already solved the parts you were about to solve again.

Read end to end, the tour resolves into one continuous argument about a single trade-off rather than a catalogue of named models. The fully convolutional network posed the question, SegNet and the deconvolution networks explored the upsampling half, U-Net answered the skip half so well that it became the default body, and the attention work taught that body to look where it mattered. Anything you build today, for microscopy or for a filing cabinet full of paper logs, is a choice about which of those answers to keep and which one joint to sharpen next.

Key takeaways

Segmentation is one trade-off in disguise: a network needs coarse meaning (from downsampling) and fine location (from full resolution) at once, and every architecture in this lineage is an answer to how to have both.
The 2015 fully convolutional network reframed classification into dense per-pixel labelling and introduced the founding device, the skip connection, fusing a coarse semantic prediction with earlier high-resolution feature maps.
The upsampling half split into two ideas in 2015: learn a deep deconvolution-and-unpooling decoder (deconvolution networks), or keep only max-pooling indices and unpool into them with a lighter decoder (SegNet), which saves the memory of stored feature maps while keeping boundaries crisp.
U-Net answered the skip question most generously, copying and concatenating the full encoder feature map at every stage, which preserves thin structure and learns from scarce labels, making it the default segmentation body.
Attention U-Net (2018) and CBAM upgraded the skip from a faithful copier into a selective one with learned gating; none of the prior structure is discarded. Our raster-log network, VeerNet, sits at the tip of this thread: a five-stage encoder-decoder with copy-and-concatenate skips and two self-attention layers on the bottleneck.

References

[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The shift from per-image classification to dense per-pixel labelling, with skip fusion of coarse and fine activations. https://arxiv.org/abs/1411.4038

[2] Badrinarayanan, V., Kendall, A., and Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2015). The decoder that upsamples with stored max-pooling indices instead of copied feature maps. https://arxiv.org/abs/1511.00561

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder with copy-and-concatenate skip connections that learns from scarce labels. https://arxiv.org/abs/1505.04597

[4] Noh, H., Hong, S., and Han, B. Learning Deconvolution Network for Semantic Segmentation. ICCV (2015). A deep stack of learned deconvolution and unpooling layers as the decoder. https://arxiv.org/abs/1505.04366

[5] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (DeepLabv3+). ECCV (2018). Atrous convolution for multi-scale context inside the encoder-decoder frame. https://arxiv.org/abs/1802.02611

[6] Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., and Rueckert, D. Attention U-Net: Learning Where to Look for the Pancreas (2018). The attention gate that reweights skip features before they reach the decoder. https://arxiv.org/abs/1804.03999

[7] Woo, S., Park, J., Lee, J.-Y., and Kweon, I.S. CBAM: Convolutional Block Attention Module. ECCV (2018). A lightweight channel-then-spatial attention block that drops into a convolutional stage. https://arxiv.org/abs/1807.06521

[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. NeurIPS (2017). Self-attention as a mechanism for long-range dependency, later folded onto vision backbones. https://arxiv.org/abs/1706.03762
