We spent most of the project teaching the model the world through synthetic data: a procedural generator that drew clean paper logs across a wide range of widths and degradations, fifteen thousand of them, so the network would meet variance in training rather than in production. It worked well on the inputs it had imagined. Then a box of real field photographs arrived, scans taken with a phone over a desk lamp, the paper yellowed and curling, the light falling unevenly across the sheet, and the model that aced its own held-out renders started missing pieces of the curve. The synthetic pipeline had covered a great deal. It had not covered this.
The instinct in that situation is to go back to training: add the missing corruptions to the generator, render more, retrain. We did some of that. But retraining is slow and expensive, and we had a faster lever available that does not touch the weights at all. It changes only how a single prediction is assembled at inference. This is test-time augmentation, and on exactly these messy field photographs it bought back accuracy that the training set alone had left on the floor.
Two different levers that share a name
Augmentation at training time and augmentation at inference time are easy to conflate because they use the same transforms, and they are not the same thing. Training-time augmentation expands the distribution the model learns from: you flip and shift and warp the training images so the network sees more variety and generalises better. The effect lives in the learned weights, and once training is done it is fixed.
Test-time augmentation leaves the weights untouched. It takes one input you want a prediction for, generates several transformed copies of it, runs each through the same frozen model, transforms each output back into the original coordinate frame, and combines the results. The model never learns anything new. What changes is that a single, possibly unlucky forward pass is replaced by a small vote, and the vote is steadier than any one member of it. Training-time augmentation is about what the model knows. Test-time augmentation is about how much you trust a single look, and what you do to hedge it.
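The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not our production code: `tta_predict`, `toy_predict` and the `(forward, inverse)` pairing are hypothetical names, and the toy model simply treats pixel intensity as curve probability so the example runs without any trained weights.

```python
import numpy as np

def hflip(x):
    """Mirror the width (last) axis; the flip is its own inverse."""
    return x[..., ::-1]

def tta_predict(predict, image, views):
    """Average per-pixel class probabilities over transformed views.

    predict: callable mapping an (H, W) image to (C, H, W) probabilities.
             Hypothetical stand-in for the frozen model.
    views:   list of (forward, inverse) callables acting on the width axis.
    """
    votes = []
    for forward, inverse in views:
        probs = predict(forward(image))  # same frozen model for every view
        votes.append(inverse(probs))     # register the vote in the input frame
    return np.mean(votes, axis=0)

def toy_predict(image):
    """Toy model (assumption): curve probability equals pixel intensity."""
    curve = np.clip(image, 0.0, 1.0)
    return np.stack([1.0 - curve, curve])  # (2, H, W): background, curve

def identity(x):
    return x

image = np.zeros((4, 6))
image[:, 1] = 1.0  # a vertical trace at column 1

probs = tta_predict(toy_predict, image, [(identity, identity), (hflip, hflip)])
mask = probs.argmax(axis=0)  # 1 where the curve class wins
```

Nothing in `tta_predict` touches the model: the only state is the list of votes, which is why the technique can be bolted onto an already trained network.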
Where the recipe comes from
The idea is old enough that it predates the phrase. The convolutional network that started the modern era averaged its predictions over several patches of the input and their horizontal reflections at test time, treating the average as the answer. [1] A few years later the practice was named and standardised: deep recognition networks routinely reported a test-time number that came from averaging multiple crops and their flips, an evaluation-time gain understood to be separate from whatever augmentation went into training. [2] The ten-crop protocol, ten cropped and flipped views per image, folded into one prediction, became boilerplate in the residual-network era. [3]
For our problem the relevant lineage is the segmentation one, because we are not averaging class scores over an image; we are averaging masks. The careful formulation here comes from medical imaging, where test-time augmentation was given a clean definition for dense prediction: apply a transform to the input, predict, then apply the inverse transform to the predicted mask before aggregating, so every vote is registered back into the same pixel grid. [4] That un-transformation step is the part people forget, and it is the part that makes the difference between a sharper mask and a smeared one. Our segmenter sits in the U-Net family that this dense-prediction work assumes, an encoder-decoder whose skip connections are what let it hold a one-pixel curve in the first place. [5]
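A toy example makes the cost of skipping the un-transformation step concrete. The sketch below assumes a perfect stand-in predictor (`toy_predict`, hypothetical) and a single horizontal flip; the only difference between the two averages is whether the flipped output is flipped back before it is added.

```python
import numpy as np

truth = np.zeros((3, 6))
truth[:, 1] = 1.0  # trace at column 1

def toy_predict(image):
    """Stand-in model (assumption): returns the curve probability unchanged."""
    return image.copy()

direct = toy_predict(truth)
flipped_out = toy_predict(truth[:, ::-1])  # prediction lives in the flipped frame

unregistered = (direct + flipped_out) / 2         # forgot to un-flip
registered = (direct + flipped_out[:, ::-1]) / 2  # un-flipped before averaging

print(unregistered[0].tolist())  # [0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
print(registered[0].tolist())    # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The unregistered average splits the trace into two half-confident ghosts, while the registered one keeps a single full-confidence column.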
How we actually used it
Our transforms were deliberately conservative, because a well log is not a natural photograph and most augmentations would corrupt the thing we care about. A vertical flip is meaningless: depth runs down the page, and reversing it inverts the measurement. Rotation past a degree or two shears the depth axis. So we kept to the two transforms that genuinely preserve a curve's identity on a log sheet. The first is a horizontal flip, which mirrors the curve across the track centre and is a legitimate alternate view because the network should read a curve the same whether it leans left or right. The second is a small horizontal pixel shift, a few pixels either way, which probes whether the prediction is stable under the kind of small registration jitter a hand-held phone capture always introduces.
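As a sketch of what such a transform set might look like, the snippet below pairs each forward transform with its inverse. The helper names (`shift_x`, `make_views`), the zero padding, and the shift magnitudes of two pixels are all assumptions for illustration; the exact values we used are not reproduced here.

```python
import numpy as np

def hflip(x):
    """Mirror the width (last) axis. Applying it twice restores the input."""
    return x[..., ::-1]

def shift_x(x, dx, fill=0.0):
    """Shift along the width axis by dx pixels (positive = right), padding with `fill`."""
    out = np.full_like(x, fill)
    if dx > 0:
        out[..., dx:] = x[..., :-dx]
    elif dx < 0:
        out[..., :dx] = x[..., -dx:]
    else:
        out[...] = x
    return out

def make_views(shifts=(-2, 2)):
    """Return (forward, inverse) pairs: identity, horizontal flip, small shifts."""
    views = [(lambda x: x, lambda y: y), (hflip, hflip)]
    for dx in shifts:
        # Bind dx as a default argument so each lambda keeps its own shift.
        views.append((lambda x, dx=dx: shift_x(x, dx),
                      lambda y, dx=dx: shift_x(y, -dx)))
    return views

image = np.zeros((4, 8))
image[:, 3] = 1.0  # a vertical trace away from the border

views = make_views()
round_trips = [np.array_equal(inverse(forward(image)), image)
               for forward, inverse in views]
```

Both transforms act only on the width axis, so the depth axis is never touched. One caveat of the padded shift: anything pushed off the edge is lost, so the round trip is exact only for content at least a shift's width away from the border.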
For each variant we ran the frozen model, flipped or shifted the output mask back to undo the input transform, and averaged the per-pixel class probabilities across all the variants before taking the argmax. On a clean synthetic render this does almost nothing, because the single pass was already confident and every vote agrees. On a degraded field photograph it does real work: where one pass wavers between curve and background along a faded stretch, the votes from the flipped and shifted passes rarely waver in the same place, and the average pulls the consensus back onto the curve.
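The aggregation step, averaging probabilities first and taking the argmax second, can be illustrated on a single row of pixels. The probabilities below are invented for the example and the votes are assumed to be already registered back into the input frame.

```python
import numpy as np

# Curve-class probability per pixel for three registered votes (invented values).
curve_votes = np.array([
    [0.1, 0.9, 0.4, 0.9, 0.1],  # single pass: wavers at pixel 2
    [0.1, 0.8, 0.7, 0.4, 0.1],  # flipped pass: wavers at pixel 3
    [0.2, 0.9, 0.8, 0.8, 0.1],  # shifted pass: confident along the trace
])
votes = np.stack([1.0 - curve_votes, curve_votes], axis=1)  # (views, classes, width)

single = votes[0].argmax(axis=0)        # what one forward pass would report
mean_probs = votes.mean(axis=0)         # average the probabilities first...
aggregated = mean_probs.argmax(axis=0)  # ...then take the argmax

print(single.tolist())      # [0, 1, 0, 1, 0]
print(aggregated.tolist())  # [0, 1, 1, 1, 0]
```

The single pass drops pixel 2; the flipped pass alone would drop pixel 3. Because the two passes waver in different places, the averaged probabilities stay above one half along the whole trace.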
The bench below makes that visible. Drag the variant count from a single forward pass up to the full eight, watch the aggregated curve firm up on the noisy photograph as the votes accumulate, and read the multiclass curve error move off its single-pass baseline.
What it moved, and what it did not
The honest accounting is split by curve, and it is the most interesting thing about the result. We measure each predicted curve by mean absolute error against the ground-truth trace, in the multiclass setting that reads two curves at once, on the eighty-twenty train and validation split we held throughout. The single-pass baseline under our Dice-trained model sat at a mean absolute error of 0.0367 for the first curve and 0.0774 for the second. As we aggregated more variants and the operating point shifted toward the recall-favoured Tversky regime, the first curve improved to 0.0277, a clean gain on the curve the petrophysicist cares about most. The second curve moved the other way, to 0.1241.
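The two moves are easier to compare as relative changes. The snippet below only does arithmetic on the four figures quoted above; `relative_change` is a hypothetical helper, and nothing here recomputes the errors themselves.

```python
def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before

baseline = {"curve_1": 0.0367, "curve_2": 0.0774}    # Dice, single pass
aggregated = {"curve_1": 0.0277, "curve_2": 0.1241}  # recall-favoured, aggregated

for name in baseline:
    change = relative_change(baseline[name], aggregated[name])
    print(f"{name}: {baseline[name]:.4f} -> {aggregated[name]:.4f} ({change:+.1%})")
# curve_1: 0.0367 -> 0.0277 (-24.5%)
# curve_2: 0.0774 -> 0.1241 (+60.3%)
```

In relative terms the second curve's error grew by more than the first curve's shrank, which is why the per-curve split matters more than any pooled number.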
That second number is not a failure of the method; it is the method showing you a trade you were already making. Favouring recall, which is what both the Tversky operating point and a recall-leaning vote aggregation do, recovers more of a faint curve by being willing to claim more pixels. On a well-behaved curve that recovers signal. On the harder, fainter second curve it also claims pixels that belong to background or to the other curve, and the absolute error rises even as visually the trace looks more complete. Test-time augmentation did not repeal that asymmetry. It made the first curve more robust on inputs the training set never covered, and it inherited the second curve's recall-precision tension rather than fixing it.
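The asymmetry can be seen directly in the Tversky index, which weights false positives by alpha and false negatives by beta and reduces to Dice when both are 0.5. The sketch below scores two hypothetical masks for a faint curve of 100 true pixels; the pixel counts and the recall-leaning weights (alpha 0.3, beta 0.7) are invented for illustration and are not the settings of our model.

```python
def tversky_index(tp, fp, fn, alpha, beta):
    """Tversky index: alpha weights false positives, beta weights false negatives."""
    return tp / (tp + alpha * fp + beta * fn)

# Two hypothetical masks for a curve with 100 true pixels (invented counts).
tight = dict(tp=70, fp=5, fn=30)   # misses faint stretches, claims little extra
loose = dict(tp=90, fp=40, fn=10)  # recovers more of the curve, claims more background

dice_tight = tversky_index(**tight, alpha=0.5, beta=0.5)    # 0.8
dice_loose = tversky_index(**loose, alpha=0.5, beta=0.5)    # about 0.783
recall_tight = tversky_index(**tight, alpha=0.3, beta=0.7)  # about 0.757
recall_loose = tversky_index(**loose, alpha=0.3, beta=0.7)  # about 0.826
```

Under Dice the tight mask scores higher; under the recall-leaning weights the loose mask does. A model trained toward the second objective is being rewarded for exactly the extra pixels that raise absolute error on a faint curve.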
We were comfortable shipping that trade, because in the field the first curve is the one a downstream interpreter reads first and the cost of missing a faint stretch of it is higher than the cost of a slightly loose second trace. But we did not pretend the aggregate was a free lunch. A vote is only as good as the disagreement it resolves, and on a curve where every variant agrees on the wrong answer, averaging changes nothing.
What we tell the next team to try first
If you have a model that is solid on its own validation set and soft on real captures, reach for test-time augmentation before you reach for the training pipeline, because it costs you a few extra forward passes and an afternoon, not a retraining cycle. Pick the two or three transforms that genuinely preserve your target's identity, refuse the ones that do not, and never forget to un-transform the output before you aggregate, because an un-registered vote averages your prediction into mush. Then measure per class, not in aggregate, and expect the gain to land unevenly. [6] The lever is cheap precisely because it asks nothing of the weights, and it is honest precisely because it cannot hide the trade-offs the loss function already chose; it can only steady the hand that reads them.
Key takeaways
- Training-time and test-time augmentation share the same transforms but are different levers. Training augmentation changes what the weights learn, and that effect is fixed once training ends; test-time augmentation changes how one prediction is assembled at inference, by voting a single input across several transformed views with the weights untouched.
- The recipe is old and public: averaging over crops and horizontal flips at test goes back to the first modern convolutional networks, was standardised as multi-crop and ten-crop evaluation, and was given a clean dense-prediction definition in medical segmentation, where each predicted mask is un-transformed back into the original frame before aggregation.
- We kept the transforms conservative for a log sheet: a horizontal flip and a small horizontal pixel shift only. A vertical flip inverts depth and rotation shears the depth axis, so both were refused. The un-transformation step before averaging is the part that separates a sharper mask from a smeared one.
- The gain landed unevenly across curves. On the eighty-twenty split, the first curve's multiclass mean absolute error improved from a 0.0367 Dice single-pass baseline to 0.0277 at the recall-favoured Tversky operating point, while the second, fainter curve moved the other way, from 0.0774 to 0.1241, as favouring recall claimed more pixels.
- Reach for it before retraining when a model is solid on validation and soft on real captures: it costs a few forward passes, not a training cycle. But it cannot repeal the precision-recall trade the loss already chose, and a vote where every variant agrees on the wrong answer recovers nothing.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. The network that popularised averaging predictions over patches and their horizontal reflections at test time. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[2] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG). ICLR 2015. Multi-crop and horizontal-flip averaging at evaluation as a standard accuracy lever. https://arxiv.org/abs/1409.1556
[3] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016. The ten-crop test protocol, ten flipped and cropped views averaged per image. https://arxiv.org/abs/1512.03385
[4] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren. Aleatoric Uncertainty Estimation with Test-Time Augmentation for Medical Image Segmentation. Neurocomputing 2019. The canonical formulation of test-time augmentation for dense segmentation, with explicit un-transformation of each predicted mask before aggregation. https://arxiv.org/abs/1807.07356
[5] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. The encoder-decoder with skip connections that preserves thin-structure detail. https://arxiv.org/abs/1505.04597
[6] S. S. M. Salehi, D. Erdogmus, A. Gholipour. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI 2017. The asymmetric loss that lets a segmenter trade precision for recall. https://arxiv.org/abs/1706.05721