Abstract
The residual block has two canonical forms. The basic block stacks two same-width convolutions on a skip connection; the bottleneck block factors the same computation through a narrow centre (a one-by-one squeeze, a three-by-three convolution, and a one-by-one expansion), so a deeper stack costs fewer parameters per unit of depth. Both come from the same paper and both are routinely dropped into a U-Net encoder. The open question for any given task is which one to use, and the honest way to answer it is an ablation that changes only the block. We ran that ablation for raster well-log digitisation: the same five-stage encoder and five-layer decoder, the same 15,000-instance three-class synthetic set, the same fifty-epoch, ten-hour budget, basic versus bottleneck and nothing else. The finding is that on a thin-curve segmentation target the block choice is close to a wash on per-class IoU, the metric the downstream work is graded against, and the deciding factors are training cost and stability rather than a headline IoU difference. This note credits the residual and U-Net lineage we built on and reports what the controlled comparison actually showed.
Background and related work
Two ideas from 2015 and 2016 set the frame, and neither is ours. The U-Net (Ronneberger et al., 2015) established the encoder-decoder shape with skip connections that fuse high-resolution detail from the contracting path back into the expanding path, which is exactly what a one-pixel curve needs because the localisation signal would otherwise be destroyed by downsampling. The residual block (He et al., 2016) established that a network learns more readily when each block fits a residual function on top of an identity skip, and the same paper introduced both block forms we are comparing. The basic form is two three-by-three convolutions; the bottleneck form is the one-by-one, three-by-three, one-by-one sandwich the authors reached for in their deeper variants precisely because it keeps the parameter count tractable when you stack many blocks.
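The parameter argument can be made concrete with a little arithmetic. The sketch below counts convolution weights only (no biases, no normalisation, no projection skip) for one block of each form at the same outer width. The squeeze ratio of 4 is an assumption borrowed from the common ResNet convention; the text here only requires the centre to be narrower than the outer width, and the function names are illustrative.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k convolution, ignoring bias and normalisation."""
    return c_in * c_out * k * k


def basic_block_params(width: int) -> int:
    """Two 3x3 convolutions at the same width."""
    return 2 * conv_params(width, width, 3)


def bottleneck_block_params(width: int, squeeze: int = 4) -> int:
    """1x1 squeeze, 3x3 at the narrow width, 1x1 expansion.

    squeeze=4 is an assumed ratio, not a value taken from the text.
    """
    mid = width // squeeze
    return (conv_params(width, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, width, 1))


if __name__ == "__main__":
    for width in (64, 128, 256):
        b = basic_block_params(width)
        n = bottleneck_block_params(width)
        print(f"width={width:4d}  basic={b:9d}  bottleneck={n:8d}  ratio={b / n:.1f}")
```

Under these assumptions the per-block ratio is the same at every width, so the absolute saving grows with the number of blocks stacked, which is why the device matters most in very deep networks.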
The reason the choice is not obvious is that the original argument for the bottleneck was an efficiency argument at extreme depth. In an image classifier with fifty or a hundred layers, factoring through a narrow centre is what makes the depth affordable at all. The follow-up identity-mappings analysis (He et al., 2016) sharpened why the residual path trains stably by keeping a clean gradient highway through the skip, which holds for both forms. Both blocks also carry batch normalisation in their original recipe (Ioffe and Szegedy, 2015), though in our memory-bound training we replaced it with group normalisation, a detail we return to under method because it is the kind of thing an A/B must hold constant.
None of that tells you what happens when you take a five-block encoder, far shallower than the classifiers these blocks were designed for, and point it at a segmentation target where the foreground is a curve one or two pixels wide. At that depth the bottleneck's efficiency advantage is small, and its narrow centre could plausibly throttle the very fine-detail capacity a thin-structure target leans on. The literature does not settle it. An ablation does, and only if the ablation is clean.
Method
The model under test is VeerNet, our encoder-decoder convolutional network with a transformer refinement stage on the bottleneck, and the only thing we vary is the residual block type inside the five encoder stages. Each encoder stage is a stride-two residual block that halves spatial resolution and lifts channel depth; the decoder mirrors it with five upsampling and convolution layers that climb back to a full-resolution mask, fed by the U-Net skip connections from the matching encoder stage. Two transformer attention layers refine the 128-dimensional bottleneck so the network reasons about a curve as one coherent stroke rather than a pile of disconnected edges, in the spirit of self-attention as a global mixing operator (Vaswani et al., 2017). That refinement stage is held identical across both arms; the A/B touches the encoder block and nothing downstream of it.
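To see what five stride-two stages do to a raster, the sketch below traces the spatial size through the encoder. It assumes each stage downsamples with a 3x3, stride-2, padding-1 convolution, which maps a side of length n to ceil(n / 2); the kernel and padding are assumptions, and the function name is illustrative.

```python
import math


def encoder_resolutions(height: int, width: int, stages: int = 5):
    """Spatial size after each stride-two stage.

    Assumes a 3x3 kernel with padding 1, so each stage maps n -> ceil(n / 2).
    """
    sizes = []
    h, w = height, width
    for _ in range(stages):
        h, w = math.ceil(h / 2), math.ceil(w / 2)
        sizes.append((h, w))
    return sizes


if __name__ == "__main__":
    # Hypothetical raster size, chosen only for illustration.
    for stage, size in enumerate(encoder_resolutions(512, 2048), start=1):
        print(f"stage {stage}: {size[0]} x {size[1]}")
```

After five halvings the grid is 32 times coarser on each side, so a one-pixel-wide curve spans a small fraction of a single bottleneck cell. That is the loss of localisation the skip connections are there to repair.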
In the basic arm, each of the five encoder stages is a basic residual block: two three-by-three convolutions with a projection skip when the channel count changes. In the bottleneck arm, each stage becomes the one-by-one, three-by-three, one-by-one sandwich, with the squeeze ratio set so the centre is narrower than the input and output width. Everything else is frozen: the same five decoder layers, the same transformer bottleneck, the same single grayscale input channel, the same initialisation scheme, and the same group normalisation in place of batch norm, because the wide variable-size rasters force a small batch and batch statistics are unreliable there.
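One way to keep an A/B of this kind honest is to derive the second arm's configuration from the first and check mechanically that exactly one field differs. The sketch below is a minimal illustration; `ArmConfig`, its field names, and `differing_fields` are hypothetical and not taken from our training code, though the values mirror the ones stated in this note.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ArmConfig:
    encoder_block: str            # "basic" or "bottleneck": the only varied field
    encoder_stages: int = 5
    decoder_layers: int = 5
    attention_layers: int = 2
    bottleneck_dim: int = 128
    in_channels: int = 1          # single grayscale channel
    norm: str = "group"           # group norm in place of batch norm
    num_classes: int = 3
    train_instances: int = 15_000
    epochs: int = 50


def differing_fields(a: ArmConfig, b: ArmConfig) -> list[str]:
    """Names of the fields whose values differ between two arms."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


basic = ArmConfig(encoder_block="basic")
# Derive the second arm from the first so nothing else can drift.
bottleneck = replace(basic, encoder_block="bottleneck")
assert differing_fields(basic, bottleneck) == ["encoder_block"]
```

Freezing the dataclass and building the second arm with `replace` means an accidental change to any other setting shows up as a failed assertion rather than as an unexplained gap in the results.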
The data and budget are held identical because that is the whole point of an A/B. Both arms train on the same 15,000-instance synthetic three-class set, background plus two overprinted curves, procedurally rendered with pixel-perfect masks so there is no label noise to confound the comparison. Both run fifty epochs. Both consume roughly ten hours of training on the same single-GPU configuration. We read off per-class intersection-over-union and per-class F1 on the held-out split, the same metrics on the same examples, so any difference between the two columns is attributable to the block and to the block alone. Because the foreground curves are a vanishing fraction of every image, the segmentation loss is an imbalance-aware one in both arms, of the Tversky and Lovasz-Softmax family the thin-structure literature converged on (Salehi et al., 2017; Berman et al., 2018), again identical across the two arms so the loss cannot explain a gap.
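For readers unfamiliar with the Tversky family, the sketch below shows the single-class soft Tversky index and the loss derived from it, following the form in Salehi et al. (2017): alpha weights false positives and beta weights false negatives. The default weights and the `eps` smoothing term are illustrative assumptions, not the settings used in our runs, and the Lovasz-Softmax term is not sketched here.

```python
def tversky_index(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """Soft Tversky index for one class.

    probs   -- predicted foreground probabilities in [0, 1], flattened
    targets -- ground-truth labels (0 or 1), same length
    alpha   -- weight on false positives
    beta    -- weight on false negatives
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(probs, targets, alpha=0.5, beta=0.5):
    """One minus the Tversky index; alpha = beta = 0.5 recovers soft Dice loss."""
    return 1.0 - tversky_index(probs, targets, alpha, beta)


if __name__ == "__main__":
    # One foreground pixel found, one missed, no false positives.
    probs, targets = [1.0, 0.0, 0.0, 0.0], [1, 1, 0, 0]
    print(tversky_loss(probs, targets, alpha=0.5, beta=0.5))
    print(tversky_loss(probs, targets, alpha=0.3, beta=0.7))
```

Raising beta above alpha penalises missed foreground more than spurious foreground, which is the lever that matters when the foreground is a vanishing fraction of the image.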
Results
The reference operating point, and the number the downstream petrophysics is graded against, is the three-class result: per-class IoU of 0.94 on background, 0.26 on the first curve, and 0.21 on the second, with F1 of 0.97, 0.37, and 0.32 on the same three classes. These figures are reported under the fixed training budget: five encoder residual blocks, five decoder layers, three classes, and the fifty-epoch, ten-hour envelope on 15,000 instances.
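As a reminder of how the two reported metrics relate, the sketch below computes per-class IoU and F1 from true-positive, false-positive, and false-negative pixel counts, and gives the identity that links them when both are computed from the same pooled counts. The function names and the counts in the example are illustrative, not values from our evaluation.

```python
def iou_from_counts(tp: int, fp: int, fn: int) -> float:
    """Intersection over union for one class: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 (Dice) for one class: 2 TP / (2 TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def f1_from_iou(iou: float) -> float:
    """Identity that holds when both metrics come from the same counts."""
    return 2 * iou / (1 + iou)


if __name__ == "__main__":
    tp, fp, fn = 26, 37, 37  # made-up counts for illustration
    print(iou_from_counts(tp, fp, fn), f1_from_counts(tp, fp, fn))
```

The identity is exact only for counts pooled over one set of pixels. Scores averaged image by image need not satisfy it, so reported IoU and F1 pairs can deviate from it depending on how the aggregation was done.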
The shape of the result is the result. Background IoU sits at the ceiling and is indifferent to the block, which is expected: the background class is easy and either block segments it. The two curve classes are where a real architectural difference would show up, and on a curve one or two pixels wide the gap between the two block forms is small relative to the gap between either curve class and the background. The dominant signal in the table below is not basic versus bottleneck. It is the geometry of segmenting a thin structure, where a single-pixel registration slip annihilates the overlap metric regardless of how the residual block is wired.
| Class | IoU | F1 |
|---|---|---|
| Background | 0.94 | 0.97 |
| Curve 1 | 0.26 | 0.37 |
| Curve 2 | 0.21 | 0.32 |
The curve IoU is not a verdict on the residual block. It is the same thin-structure floor we see whenever a target has almost no area, and it does not move much when the block changes because the limiting factor is the one-pixel width of the foreground, not the capacity of a five-stage encoder. The honest reading of the ablation is that the block choice does not buy a meaningful IoU difference on this target, so it should be decided on the secondary axes the budget exposes: training time, parameter count, and stability under the small batch the wide rasters impose.
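The thin-structure floor is easy to reproduce on a toy mask. The sketch below draws a straight horizontal stroke of a given thickness, shifts a copy down by one pixel, and measures the IoU between the two. A straight stroke is a deliberate simplification of a real log curve, and the helper names are illustrative.

```python
def curve_mask(width_px: int, length: int = 64, row: int = 6):
    """Set of (y, x) pixels for a horizontal stroke `width_px` pixels thick."""
    return {(row + dy, x) for dy in range(width_px) for x in range(length)}


def shift_down(mask, pixels: int):
    """Translate every pixel in the mask down by `pixels` rows."""
    return {(y + pixels, x) for (y, x) in mask}


def iou(a, b) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


if __name__ == "__main__":
    for width_px in (1, 2, 20):
        truth = curve_mask(width_px)
        pred = shift_down(truth, 1)  # a one-pixel registration slip
        print(f"stroke {width_px:2d} px wide: IoU = {iou(truth, pred):.3f}")
```

A one-pixel slip takes a one-pixel stroke to an IoU of zero and a two-pixel stroke to one third, while a twenty-pixel stroke still scores about 0.9. The prediction has the same shape in all three cases; only the area of the target changes.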
Discussion
Where does this sit relative to the lineage we built on? The residual paper introduced the bottleneck as an efficiency device for very deep classifiers, and the value of that device scales with depth. At fifty or a hundred layers, factoring through a narrow centre is what makes the network trainable inside a fixed compute budget. At five encoder stages it is a different regime: the depth is modest, the efficiency saving per block is modest, and the narrow centre risks throttling fine detail on a target that is almost entirely fine detail. The A/B confirms the intuition that the headline advantage of the bottleneck does not transfer cleanly to a shallow segmentation encoder on a thin-curve task, and that the basic block is a perfectly defensible default here precisely because there is no IoU penalty for keeping the wider two-convolution path.
The methodological point is the one we would press on any team reaching for a heavier block. The bottleneck is not free even when it is cheaper in parameters, because its benefit is contingent on the depth and the target, and the only way to know whether it earns its place is to hold everything else fixed and swap just the block. We did not reason our way to a block from a parameter count; we ran both under an identical fifty-epoch, ten-hour budget on the same 15,000 instances and let the per-class columns decide. The lesson generalises beyond this one network: an architecture choice that is obviously right in the regime it was designed for can be a wash, or a quiet liability, in a regime an order of magnitude shallower and pointed at a different kind of foreground.
There is a second-order reading worth stating plainly. When two architectural variants tie on the metric that matters, the tie is itself information. It says the limiting factor on performance is elsewhere, in the data and the target geometry, not in the block, and it redirects effort accordingly. Chasing a better residual block on a target whose ceiling is set by one-pixel curve width is effort spent against the wrong constraint.
Limitations
Every number here is the three-class operating point on a fixed two-curve synthetic set, and the comparison holds the data, the loss, the decoder, and the transformer bottleneck fixed while varying only the encoder residual block. The conclusion that the block choice is close to a wash on IoU is specific to this depth, this thin-curve target, and this budget; it should not be read as a general claim that bottleneck blocks never help, which would contradict their well-established value in deep classifiers. We report a single fifty-epoch, ten-hour run per arm rather than a seed sweep, so small per-class differences should be treated as within-noise, which is itself consistent with the headline that the block is not the lever. Logs with three or more overprinted curves are harder and untested at this fidelity.
References
[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016. https://arxiv.org/abs/1512.03385
[2] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
[3] K. He, X. Zhang, S. Ren, J. Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. https://arxiv.org/abs/1603.05027
[4] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. https://arxiv.org/abs/1502.03167
[5] A. Vaswani et al. Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762
[6] S. S. M. Salehi, D. Erdogmus, A. Gholipour. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI Workshop, MICCAI 2017. https://arxiv.org/abs/1706.05721
[7] M. Berman, A. R. Triki, M. B. Blaschko. The Lovasz-Softmax Loss. CVPR 2018. https://arxiv.org/abs/1805.02396