The cheapest reliability you can buy in machine learning is almost embarrassingly old. Before deep networks, before GPUs, the standard move when one classifier was not good enough was to train several and let them vote, and it worked for the same reason a panel of imperfect judges beats a single one: their mistakes are not all in the same place. We reach for that move when we run VeerNet, the encoder-decoder EarthScan uses to lift a one-to-three-pixel ink trace off a scanned well log, and this post is about why a trick that old is still the right one for a problem that new. It is a primer, not a results paper. The field worked all of this out long before we touched a raster log, and the point here is to credit that work, then show the one place where a per-pixel vote is unusually well matched to the geometry of a thin curve.
Where the idea comes from
The case for ensembling was made cleanly a long time ago, and it is worth stating in its original terms rather than ours. Dietterich's survey gives three reasons a combination of models beats any single member, and all three apply to a segmenter [1]. The statistical reason is that with limited data many different models fit the training set about equally well, and you cannot know in advance which one generalises, so averaging hedges the bet. The computational reason is that gradient descent lands in different local optima from different seeds and different losses, and a combination smooths over the particular optimum any one run happened to find. The representational reason is that the true decision boundary may sit outside what a single model in the family can express, and a combination can reach shapes no member can. A thin-curve segmenter is squarely in all three situations: limited real data, a non-convex loss with several reasonable minima, and a target so geometrically awkward that no single mask is ever quite right.
The aggregation rule itself, voting, is older still. Breiman's work on bagging made the foundational argument that aggregating predictors trained on perturbed versions of the data reduces variance, and that for classifiers the natural aggregation is a vote [2]. We pull the same lever with one substitution: instead of perturbing the data we perturb the loss, training several variants that disagree in instructive ways, and then we vote per pixel. The deep-learning era did not overturn any of this; it confirmed it. Lakshminarayanan and colleagues showed that a handful of independently trained networks, simply averaged, beat the single best one and produce well-calibrated uncertainty estimates at no cost beyond training the extra members, with none of the machinery a Bayesian treatment would demand [3]. The lesson the field kept relearning is that independence between members matters more than the cleverness of any one member.
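The point about independence can be made concrete with a standard identity. It is stated here under a simplifying assumption of ours, not one taken from the cited papers: the M members' errors e_m all have the same variance σ² and the same pairwise correlation ρ.

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} e_m\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}
```

Adding members shrinks only the second term. The first term, ρσ², is a floor set by how correlated the members are, which is one way to see why decorrelating the members tends to pay more than adding another near-copy.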
The version that matters for masks
For our problem the most relevant prior art is not the general ensemble literature but the part of it that targets segmentation specifically, because a segmentation ensemble has a structure a classifier ensemble does not. When you ensemble image classifiers you average a vector of class scores once per image. When you ensemble segmenters you have a decision at every pixel, which means the vote happens millions of times per scan and the geometry of where the members agree and disagree becomes the whole story. Kamnitsas and colleagues made the case directly in the medical-imaging setting, building an ensemble that spanned several models and architectures and showing it was substantially more robust than any single member, precisely because the failure modes of different members did not coincide pixel for pixel [4]. That is the result we lean on, and the credit is theirs. Every member of our own ensemble sits in the U-Net encoder-decoder lineage [5], so they are not architecturally diverse the way that work's were; our diversity comes from the loss instead, which turns out to be enough to make the votes disagree where it counts.
There is a second branch worth naming because it changes the cost arithmetic. You do not strictly need several trained models to get an ensemble. Wang and colleagues showed you can take one model and ensemble it over augmented versions of the input at test time, trading a second training run for several forward passes and getting an uncertainty estimate as a bonus [6]. That matters for anyone weighing the inference bill, because it means "ensemble" is a spectrum: several models voting once each, or one model voting several times over perturbed inputs, or any blend. We use the several-models form because we already had the variants in hand, but the trade we describe below applies to both.
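A minimal sketch of the one-model form, assuming a hypothetical `predict` callable that maps a 2-D image to a same-shaped probability map. The only augmentation used here is a horizontal flip, chosen because it is trivial to invert; which augmentations are actually safe for a log scan is a judgment call this sketch does not settle. `toy_predict` is a stand-in with a deliberate left-to-right bias, not VeerNet.

```python
import numpy as np

def tta_predict(predict, image):
    """Average per-pixel probabilities over the identity view and a horizontal flip.

    Each view is a (forward, inverse) pair: the input is transformed, predicted on,
    and the prediction is mapped back so all views line up pixel for pixel.
    """
    views = [
        (lambda x: x, lambda y: y),  # identity
        (np.fliplr, np.fliplr),      # a horizontal flip is its own inverse
    ]
    probs = [inverse(predict(forward(image))) for forward, inverse in views]
    return np.mean(probs, axis=0)

def toy_predict(image):
    # Hypothetical stand-in model: confidence grows from left to right,
    # so the flipped view disagrees with the unflipped one.
    weights = np.linspace(0.5, 1.0, image.shape[1])
    return image * weights

image = np.array([[0.0, 1.0, 0.0, 1.0]])
print(toy_predict(image))               # biased single pass
print(tta_predict(toy_predict, image))  # the bias averages out across the two views
```

The cost is one forward pass per view rather than one training run per member, which is the arithmetic the paragraph above describes.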
Why a one-pixel curve is the perfect case for a vote
Here is the part that is specific to us, and it is geometric. A well-log curve in a raster scan is one to three pixels wide, and the thing that ruins a digitised trace is not a stray pixel but a gap. A single missed pixel punches a hole; string a few together and the curve breaks into fragments that no honest interpolation can rejoin, and the petrophysicist reading the result gets phantom flat spots where the ink was simply never detected. Now consider what a vote does to that failure. The misses that break a single model's trace are not the same misses another model makes, because each variant was trained under a different loss and learned to be cautious in a different place. The recall-tilted variant catches a faint curve segment that the precision-tilted one dropped; the precision-tilted one rejects a smear the recall-tilted one painted. Union them and the gaps fill in, because a gap survives a union only if every member missed the same pixel, and that is rare.
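A toy version of that argument, with made-up masks rather than real model outputs. Three hypothetical members each trace the same 12-column curve and each drops different columns; column 11 is the one place all three miss. The closing arithmetic assumes members miss independently, which is optimistic for variants that share an architecture and training data.

```python
import numpy as np

def gap_columns(mask):
    """Columns where a 1-D trace mask has no ink detected."""
    return np.flatnonzero(~mask).tolist()

def with_misses(length, missed):
    """A fully detected trace of `length` columns with the listed columns dropped."""
    mask = np.ones(length, dtype=bool)
    mask[list(missed)] = False
    return mask

# Hypothetical members: each is cautious in a different place.
member_a = with_misses(12, [2, 3, 11])
member_b = with_misses(12, [7, 11])
member_c = with_misses(12, [3, 10, 11])

# Union: keep a column if any member detected it.
union = np.stack([member_a, member_b, member_c]).any(axis=0)

print(gap_columns(member_a))  # [2, 3, 11]
print(gap_columns(union))     # [11], the only column every member missed

# Under the (optimistic) independence assumption, a pixel each member misses
# with probability p survives as a gap in an M-member union with probability p**M.
p, M = 0.1, 3
survival = p ** M
print(survival)
```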
This is the asymmetry that makes the vote pay. For a thick blob you might worry that a union inflates the mask and hurts more than it helps. For a one-pixel curve the downstream stage does a column-wise read that tolerates a slightly fat trace far better than a broken one, so the recall the vote recovers is almost pure upside and the precision it costs is largely cleaned up afterward. The vote is steadiest exactly where a single model is most fragile.
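To see why a fat trace is cheap and a broken one is not, here is a sketch of one plausible column-wise read. The rule used, taking the mean row index of the fired pixels in each column, is our assumption for illustration and is not a description of the production stage.

```python
import numpy as np

def column_read(mask):
    """One row position per column: mean row index of fired pixels, NaN where none fired."""
    rows = np.arange(mask.shape[0], dtype=float)[:, None]
    counts = mask.sum(axis=0)
    sums = (mask * rows).sum(axis=0)
    out = np.full(mask.shape[1], np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

thin = np.zeros((5, 4), dtype=bool)
thin[2, :] = True            # a one-pixel curve on row 2

fat = np.zeros((5, 4), dtype=bool)
fat[1:4, :] = True           # the same curve, three pixels thick

broken = thin.copy()
broken[:, 1] = False         # the same curve with one column missed

print(column_read(thin))     # row 2 in every column
print(column_read(fat))      # still row 2: the extra thickness averages away
print(column_read(broken))   # NaN in column 1: nothing to read
```

In this toy the thickening is symmetric, so the read is unchanged; lopsided thickening would shift the value by a fraction of a pixel, which is still a far milder failure than a column with no value at all.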
Reading the cost-versus-recall trade
The honest framing of an ensemble is never "is it better" but "is it worth it," because every extra member is another model to train, store, and run at inference. The lever that controls the trade is the vote threshold: how many of the members must fire on a pixel for the fused mask to keep it. At one extreme, a threshold of one is a union, where any single member's vote keeps the pixel; recall is highest and the gaps fill in, but so does every member's private smear. At the other extreme, full consensus keeps only the pixels every member agreed on; precision is highest and the mask is clean, but recall can be no higher than that of the most cautious member, and so falls below the best single model, because one cautious member can veto a real pixel the others caught. The useful settings live in between, and the instrument below lets you walk the whole range.
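The threshold rule itself is a few lines. The sketch below uses three made-up member masks over ten pixels, not measured outputs, and the helper names `fuse_by_votes` and `precision_recall` are ours.

```python
import numpy as np

def fuse_by_votes(masks, k):
    """Keep a pixel when at least k of the stacked member masks fire on it."""
    return masks.sum(axis=0) >= k

def precision_recall(pred, truth):
    """Pixel-wise precision and recall of a boolean mask against a boolean truth."""
    tp = np.logical_and(pred, truth).sum()
    precision = tp / pred.sum() if pred.sum() else 1.0
    recall = tp / truth.sum() if truth.sum() else 1.0
    return float(precision), float(recall)

# Toy data: the first six pixels are curve, the last four are background.
truth = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
members = np.array([
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],  # misses two curve pixels, paints one stray
    [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],  # misses two others, paints a different stray
    [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],  # misses one, no strays
], dtype=bool)

for k in range(1, len(members) + 1):
    precision, recall = precision_recall(fuse_by_votes(members, k), truth)
    print(f"k={k}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy the union (k=1) recovers every curve pixel but keeps both strays, and full consensus (k=3) is clean but keeps only the two pixels all three members agreed on. Recall can only fall as k rises, because each higher threshold keeps a subset of the pixels the lower one kept.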
Drag the threshold from union to consensus and watch the fused recall trace fall while the fused precision climbs, against the dashed line that marks the best single model's curve recall. The orange bracket is the thing the whole post is about: the recall lift the ensemble buys over any one model by catching curve pixels that the best single model missed. The right rail shows which of the five loss-trained variants keep or drop one representative curve pixel at the current threshold, so you can see the consensus tighten as you raise the bar. The two numbers the rail anchors are the ones we measured and stand behind: a per-curve F1 of 0.37 on curve 1 and 0.32 on curve 2 under the multiclass Dice setting. Those F1 figures are sourced and so is the single-model curve recall the bracket is drawn against; the shape of the recall-and-precision response across the threshold is an illustrative read-out of how a majority vote trades the two, not a measured sweep, and the instrument says so on its face.
What the spread between the curves tells you
The per-curve F1 gap, 0.37 against 0.32, is small but it is not noise, and it is the reason an ensemble helps unevenly. Curve 2 is the harder of the two to recover, which means it is the one where individual models are most likely to disagree, which in turn means it is the one where a vote has the most room to add value. A useful way to think about an ensemble is that it converts disagreement into recall: where the members all agree, the vote changes nothing and you have paid for inference you did not need; where they disagree, the vote is doing real work. The curve with the lower F1 is the curve with more disagreement to harvest, so if you are deciding whether the extra inference cost is justified, look at your hardest class, not your average. An ensemble that barely moves your easy class can still be the difference between a usable and an unusable trace on your hard one.
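One way to look at disagreement directly is to count, inside each curve's region, the pixels where the members are not unanimous. The sketch below is an assumption about how one might measure it, with toy masks and a hypothetical `disagreement_rate` helper; it is not the measurement behind the F1 figures above.

```python
import numpy as np

def disagreement_rate(masks, region):
    """Fraction of pixels in `region` where some members fire and others do not."""
    if not region.any():
        return 0.0
    votes = masks.sum(axis=0)
    split = (votes > 0) & (votes < masks.shape[0])
    return float(split[region].mean())

region = np.ones(8, dtype=bool)         # eight pixels belonging to one curve

easy = np.ones((3, 8), dtype=bool)      # three members, unanimous everywhere

hard = np.ones((3, 8), dtype=bool)      # three members that split on three pixels
hard[0, 1] = False
hard[1, 4] = False
hard[2, 4] = False
hard[2, 6] = False

print(disagreement_rate(easy, region))  # 0.0: a vote changes nothing here
print(disagreement_rate(hard, region))  # 0.375: three of eight pixels are contested
```

A rate near zero on a class is a sign that the extra members are mostly repeating each other there.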
So when do you actually pay for it
A primer owes you a decision rule rather than an endorsement, so here is the one we use. Run the single best model first and look at where it breaks, not at its average score. If the breakages are concentrated, a few systematic blind spots, fix the model or the data, because an ensemble will only paper over a problem you could have solved cheaper. If the breakages are scattered, different gaps on different scans with no pattern you can train away, that scatter is exactly what a vote eats, and the second and third member earn their inference cost by filling holes the first one could never have closed alone. Set the threshold low enough to recover the gaps and let the downstream column read clean up the strays, then stop adding members the moment a new one stops moving your hardest curve. The voting trick is Dietterich's and Breiman's and the segmentation form is Kamnitsas and colleagues' [1] [2] [4]; what we can add from a working digitisation pipeline is that on a one-pixel trace, where a miss is fatal and a stray is cheap, the oldest move in machine learning is still the one that turns a fragile mask into a steady one.
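The stopping half of that rule can be sketched as a greedy loop. Everything specific here is an assumption for illustration: the fusion is a plain union, members are tried in the order given, `min_gain` is an arbitrary tolerance, and `truth` stands for a labelled mask of the hardest curve with at least one positive pixel.

```python
import numpy as np

def members_worth_keeping(masks, truth, min_gain=0.01):
    """Add members in order; stop at the first one whose union adds < min_gain recall.

    Returns (number of members kept, union recall on `truth`). The first member is
    always kept. A later member that would have helped is never tried once the loop
    stops, which mirrors "stop the moment a new one stops moving the curve".
    """
    fused = np.zeros_like(truth, dtype=bool)
    best_recall, kept = 0.0, 0
    for mask in masks:
        candidate = fused | mask
        recall = float((candidate & truth).sum() / truth.sum())
        if kept > 0 and recall - best_recall < min_gain:
            break
        fused, best_recall, kept = candidate, recall, kept + 1
    return kept, best_recall

def covering(length, columns):
    """A boolean mask of `length` pixels that fires only on `columns`."""
    mask = np.zeros(length, dtype=bool)
    mask[list(columns)] = True
    return mask

truth = np.ones(10, dtype=bool)
masks = [
    covering(10, range(0, 7)),          # recall 0.7 alone
    covering(10, [3, 4, 5, 6, 7, 9]),   # adds columns 7 and 9
    covering(10, range(0, 6)),          # adds nothing new, so the loop stops here
    covering(10, [8]),                  # never reached
]

kept, recall = members_worth_keeping(masks, truth)
print(kept, recall)
```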
Key takeaways
- Ensembling is the oldest reliability trick in machine learning: train several imperfect models and let them vote, because their mistakes are not in the same place. Dietterich's three reasons (statistical, computational, representational) all apply to a thin-curve segmenter, and the voting rule itself traces to Breiman's bagging.
- A segmentation ensemble has structure a classifier ensemble lacks: the vote happens at every pixel, so where the members agree and disagree is the whole story. Kamnitsas and colleagues' multi-model, multi-architecture ensemble is the closest public analogue; our diversity comes from the loss, not the architecture, and that is enough to make the votes disagree usefully.
- A one-pixel well-log curve is the ideal case for a vote because the failure that ruins a trace is a gap, and a gap survives a union only if every member missed the same pixel. Different loss-trained variants miss in different places, so union fills the holes; the recovered recall is near-pure upside since a column read tolerates a fat trace far better than a broken one.
- The cost-versus-recall trade is governed by the vote threshold: union (k=1) maximises recall and fills gaps, consensus (k=5) maximises precision but drops recall below the best single model. The useful settings are in between, and the recall lift over any one model is the value the ensemble actually buys.
- Decide per-class, not on the average: an ensemble converts disagreement into recall, so the hard curve (F1 0.32 on curve 2 versus 0.37 on curve 1) has the most to gain. Run the single best model first; if its breakages are scattered rather than systematic, the vote earns its inference cost, and you stop adding members once a new one stops moving your hardest curve.
References
[1] Dietterich, T. G. Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS vol. 1857 (2000). The survey that frames why combining imperfect models beats any one of them. https://doi.org/10.1007/3-540-45014-9_1
[2] Breiman, L. Bagging Predictors. Machine Learning, 24(2), 123-140 (1996). The foundational variance-reduction argument with voting as the aggregation rule for classifiers. https://doi.org/10.1007/BF00058655
[3] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS (2017). A handful of independently trained networks, averaged, beat the single best and calibrate for free. https://arxiv.org/abs/1612.01474
[4] Kamnitsas, K., et al. Ensembles of Multiple Models and Architectures for Robust Brain Lesion Segmentation. BrainLes Workshop, MICCAI (2017). The segmentation-specific result that a multi-model, multi-architecture ensemble is far more robust than any member. https://arxiv.org/abs/1711.01468
[5] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder lineage every ensemble member sits in. https://arxiv.org/abs/1505.04597
[6] Wang, G., et al. Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing, 338 (2019). The one-model, many-augmented-passes form of an ensemble, which reframes the inference-cost trade. https://arxiv.org/abs/1807.07356