Label Noise in Real-World Segmentation: Sources, Measurement, and Robust Training

Abstract

Every supervised segmenter is trained against labels someone drew, and on real scientific images those labels are wrong more often than the headline accuracy number admits. This note asks three connected questions of the public literature: where do pixel labels go wrong, how is that corruption measured, and which training strategies keep working when a fraction of the labels are bad. We separate the noise into three sources that recur across real datasets: annotation disagreement, label propagation from sparse marks, and rasterisation of a vector ground truth. We also review the noise-robust loss and sample-reweighting families published through 2022, crediting each to its origin. We then ground the survey in a worked example from our own VeerNet well-log pipeline, a weighted binary segmenter whose recall holds at 0.96 to 0.97 across three curve masks while its F1 collapses to 0.37, 0.26, and 0.55. That gap is the diagnostic the survey is built around: it is the shape a high-recall, low-precision model takes when the labels it trusts include labels it should not, and it is the failure the robust-training literature exists to soften. The finding is simple to state and uncomfortable to act on: a loss that weights every label equally spends its largest gradients on the labels most likely to be wrong.

Background and the shape of the problem

The study of learning from imperfect labels predates the deep-learning era, and the cleanest map of it remains the classification survey that named the three regimes still used today: noise that is uniform across classes, noise that depends on the true class, and noise that depends on the instance itself (Frenay and Verleysen, 2014). The taxonomy was written for whole-image classification, where each example carries one label, but it transfers to segmentation almost without change once you treat each pixel as its own labelled example. The difference is one of scale and structure. A segmentation mask is millions of correlated pixel labels, and its errors are not sprinkled at random across the image; they cluster on the boundaries of the object, exactly where the signal the model most needs to learn also lives.

That structural fact is what makes segmentation label noise its own problem rather than a footnote to the classification survey. In a real scientific image the boundary between an object and its background is rarely a clean step. A curve drawn on a paper well log has a finite ink width, a microscope image has diffraction blur, a satellite scene has mixed pixels at every field edge. When a human or an automated tool commits that fuzzy boundary to a hard binary mask, it must choose a side for every ambiguous pixel, and different annotators choose differently. The result is a ground truth whose interior is reliable and whose edge is a band of disagreement, and for a thin target the edge is most of the object.

Three sources of this corruption recur often enough across real datasets to be worth naming. The first is annotation disagreement: two qualified people label the same image and produce masks that differ by several percent of pixels, almost all of them on boundaries. The second is propagation error, which arises when sparse human marks, a few clicks or scribbles, are grown into a dense mask by an algorithm whose mistakes then become training labels with the full authority of ground truth. The third, specific to the document and scientific-rendering setting, is rasterisation error: when a vector ground truth is rasterised onto the pixel grid, a one-pixel-wide curve lands on whichever pixels the renderer rounds to, and a slightly different rounding would have produced a slightly different mask. The U-Net lineage met the first two of these in biomedical imaging from the very beginning, which is why its original recipe leaned so heavily on aggressive deformation augmentation to keep the model from memorising the exact, partly-arbitrary boundaries it was shown (Ronneberger et al., 2015). The well-log digitisation literature met the third directly, since the gridlines and printed curves it digitises are themselves renderings, and removing the grid cleanly is half the battle before a single label is even drawn (Yuan and Yang, 2019).
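
The rasterisation source is the easiest of the three to reproduce on a desk. The sketch below is an illustration only, not our renderer: the curve, the canvas size, and the helper name rasterise are all assumptions made for the example. It commits the same vector curve to the pixel grid under two rounding rules and counts how many mask pixels the two results disagree on.

```python
import numpy as np

def rasterise(xs, ys, shape, rounding):
    """Mark one pixel per sample of a vector curve, using the given rounding rule."""
    mask = np.zeros(shape, dtype=bool)
    rows = rounding(ys).astype(int)
    cols = rounding(xs).astype(int)
    # Drop any sample that rounds to a position outside the canvas.
    keep = (rows >= 0) & (rows < shape[0]) & (cols >= 0) & (cols < shape[1])
    mask[rows[keep], cols[keep]] = True
    return mask

# Hypothetical curve: a gentle sinusoid across a 64 x 256 canvas.
xs = np.linspace(0, 255, 2048)
ys = 32 + 20 * np.sin(xs / 40.0)

mask_floor = rasterise(xs, ys, (64, 256), np.floor)  # truncate toward the grid origin
mask_round = rasterise(xs, ys, (64, 256), np.rint)   # round to the nearest pixel

disagree = np.logical_xor(mask_floor, mask_round).sum()
union = np.logical_or(mask_floor, mask_round).sum()
print(f"pixels on: floor={mask_floor.sum()}, round={mask_round.sum()}")
print(f"disagreement as a share of the union: {disagree / union:.2f}")
```

Neither mask is wrong in any useful sense, which is the point: for a one-pixel-wide target the choice of rounding rule alone moves a visible share of the positive labels.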

Method

We surveyed the public literature on learning under label noise published through the third quarter of 2022 and organised it along two axes that a practitioner actually has to choose between. The first axis is the loss function: does the objective itself down-weight the influence of a probably-wrong label. The second axis is sample selection: does the training procedure decide, per example or per pixel, whether to trust the label at all. For each family we recorded the assumption it makes about the noise and the regime in which the original work demonstrated it, so that the survey credits where each idea was first shown to hold rather than where it was later popularised.

To keep the survey honest we required a worked example with measured numbers rather than a purely bibliographic comparison. The example is a binary segmentation run from our own VeerNet pipeline, the encoder-decoder convolutional network with a transformer attention stage on the bottleneck that we built to read curves off raster well-log images. The relevant run trained a weighted binary cross-entropy objective with the positive class weighted by a factor of 42 to fight the overwhelming background, and it reports recall of 0.96, 0.97, and 0.97 on three curve masks while the corresponding F1 scores are 0.37, 0.26, and 0.55. We use this run not as a result to be celebrated but as a measured instance of the exact failure the robust-training literature addresses, and we anchor the survey instrument's worked-example row to those sourced numbers.

The classification of strategies into noise-robust and noise-sensitive is the survey instrument's spine. Each row carries a label-noise tolerance read off the strategy's originating literature and an error trace that rises as the assumed pixel-label noise rate climbs. Those tolerance bands and the noise-to-error mapping are schematic, encoding the survey's reading of how each family behaves rather than a re-run benchmark, and the instrument flags that on its own canvas. Only the worked example's recall and F1 numbers are sourced.

Results

The literature sorts into two groups that behave very differently as the label-noise rate rises, and the worked example sits at the painful end of the spectrum the robust families are built to rescue.

The first group changes the loss so that a confidently-wrong label produces a smaller gradient than it would under ordinary cross-entropy. Generalised cross-entropy interpolates between the mean-absolute-error loss, which is provably robust to symmetric label noise but slow to converge, and ordinary cross-entropy, which converges fast but trusts every label completely; the interpolation buys most of the robustness without surrendering the convergence (Zhang and Sabuncu, 2018). Symmetric cross-entropy adds a reverse term that explicitly penalises the model for being dragged toward a noisy label, on the argument that ordinary cross-entropy underfits the clean classes while overfitting the noisy ones (Wang et al., 2019). Both degrade gracefully as the noise rate grows because each limits what a single confidently-wrong label can contribute: the generalised loss is bounded above by construction, and the reverse term of the symmetric loss is bounded by the constant substituted for the logarithm of zero.
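
To make the first group concrete, here is a minimal per-pixel sketch of both losses written against the probability the model assigns to the given label. It follows the published definitions as we read them, but the function names and the values of q, alpha, beta, and log_zero are illustrative assumptions rather than settings taken from either paper or from our pipeline.

```python
import numpy as np

def cross_entropy(p_label):
    """Ordinary cross-entropy on the probability assigned to the given label."""
    return -np.log(p_label)

def generalised_ce(p_label, q=0.7):
    """L_q = (1 - p^q) / q: tends to cross-entropy as q -> 0, equals 1 - p at q = 1."""
    return (1.0 - p_label ** q) / q

def symmetric_ce(p_label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Cross-entropy plus a reverse term, with log(0) replaced by log_zero (assumed)."""
    # For a one-hot label the reverse term reduces to -log_zero * (1 - p_label),
    # which can never exceed -log_zero however wrong the label is.
    reverse = -log_zero * (1.0 - p_label)
    return alpha * cross_entropy(p_label) + beta * reverse

# Compare the losses as the model grows more confident that the label is wrong.
for p in (0.9, 0.5, 0.01):
    print(f"p_label={p:<5} CE={cross_entropy(p):.3f}  "
          f"GCE={generalised_ce(p):.3f}  SCE={symmetric_ce(p):.3f}")
```

As p_label falls toward zero the cross-entropy grows without limit, while the generalised loss flattens out below 1/q. That cap is what stops a handful of confidently-contradicted boundary pixels from dominating a batch.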

The second group keeps the loss but changes which examples it is computed on. MentorNet learns a data-driven curriculum that down-weights examples a separate network judges likely to be mislabelled, so the main network spends its capacity on examples it can trust (Jiang et al., 2018). Co-teaching trains two networks in parallel and lets each select the small-loss examples for the other, exploiting the observation that a network learns clean patterns before it memorises noisy ones, so an example with a small loss early in training is probably correctly labelled (Han et al., 2018). These selection methods tolerate even extreme noise rates, at the cost of training two models or maintaining a separate scoring network.
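
The selection step at the heart of co-teaching is short enough to sketch. The version below is a simplification under stated assumptions: it works on precomputed per-example losses rather than live networks, it uses a fixed forget rate where the original schedules it over training, and the function names are ours.

```python
import numpy as np

def small_loss_indices(losses, forget_rate):
    """Indices of the (1 - forget_rate) share of examples with the smallest loss."""
    n_keep = max(1, int(round((1.0 - forget_rate) * len(losses))))
    return np.argsort(losses)[:n_keep]

def co_teaching_exchange(losses_a, losses_b, forget_rate):
    """Each network picks its small-loss examples; its peer is updated on them."""
    picked_by_a = small_loss_indices(losses_a, forget_rate)
    picked_by_b = small_loss_indices(losses_b, forget_rate)
    # Network A trains on B's picks and network B trains on A's picks.
    return {"train_a_on": picked_by_b, "train_b_on": picked_by_a}

# Hypothetical per-example losses from two networks on the same mini-batch.
losses_a = np.array([0.10, 0.20, 2.50, 0.15, 3.00, 0.30])
losses_b = np.array([0.12, 0.25, 2.80, 0.10, 0.20, 2.90])

exchange = co_teaching_exchange(losses_a, losses_b, forget_rate=0.33)
print("A is updated on:", sorted(exchange["train_a_on"].tolist()))
print("B is updated on:", sorted(exchange["train_b_on"].tolist()))
```

The cross-over is the design choice that matters. Because each network is updated on examples its peer selected, a mistake one network has started to memorise is not automatically fed back into its own training signal.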

Against that backdrop, two widely-used losses do the opposite of what label-noise robustness requires: they deliberately concentrate gradient on the hardest examples, which under label noise are disproportionately the mislabelled ones. Focal loss reshapes cross-entropy to up-weight hard examples so that easy background pixels stop dominating the gradient, a fix designed for class imbalance, not label noise (Lin et al., 2017). Plain weighted cross-entropy, which simply multiplies the rare positive class's loss by a large constant, does the same thing more bluntly. Both are excellent at the problem they were built for and structurally fragile to label noise, because the boundary pixels they up-weight are exactly the pixels whose labels are least certain.
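
A small numerical sketch shows how lopsided the concentration can be. The pixel counts and predicted probabilities below are invented for illustration, the focal loss is written without its optional class-balancing factor, and only the positive-class weight of 42 comes from our run. The quantity printed is the share of the total loss carried by a few boundary pixels whose labels the model confidently contradicts.

```python
import numpy as np

def bce(p, y):
    """Plain binary cross-entropy per pixel."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def weighted_bce(p, y, pos_weight=42.0):
    """Binary cross-entropy with the positive-class term multiplied by pos_weight."""
    return -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

def focal(p, y, gamma=2.0):
    """Focal loss without the class-balancing factor: (1 - p_t)^gamma * CE."""
    p_t = np.where(y == 1, p, 1 - p)
    return -((1 - p_t) ** gamma) * np.log(p_t)

# Hypothetical patch: 1000 easy background pixels, 20 easy curve pixels, and
# 5 boundary pixels labelled positive that the model scores at only 0.10.
p = np.concatenate([np.full(1000, 0.05), np.full(20, 0.95), np.full(5, 0.10)])
y = np.concatenate([np.zeros(1000), np.ones(20), np.ones(5)])
boundary = np.arange(p.size) >= 1020

def boundary_share(loss_fn):
    """Fraction of the summed loss that comes from the boundary pixels."""
    per_pixel = loss_fn(p, y)
    return per_pixel[boundary].sum() / per_pixel.sum()

for name, fn in [("plain BCE", bce), ("weighted BCE", weighted_bce), ("focal", focal)]:
    print(f"{name:>12}: boundary pixels carry {boundary_share(fn):.0%} of the loss")
```

If those five labels are right, this is the behaviour the imbalance literature wants. If they are artefacts of where an annotator or a renderer happened to put the edge, the same behaviour spends most of the update on fitting the artefact.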

The instrument makes the trade-off operable. Each row is a strategy with its noise tolerance and an error trace; drag the assumed label-noise rate and watch the robust families hold while the worked example climbs.

A survey strip, not a result of ours. Each row is one training strategy from the public label-noise literature through 2022, with a tolerance band read off that literature and an error trace that climbs as the assumed pixel-label noise rate rises. Generalised and symmetric cross-entropy, loss reweighting, and small-loss co-teaching are classed as noise-robust and degrade gracefully; focal loss and plain weighted BCE up-weight exactly the pixels noise corrupts and climb steeply. Drag the assumed label-noise rate and every row's marker slides along its trace. The orange row is the worked example: our own weighted BCE run with class_weight 42, where recall holds at 0.96 to 0.97 across the three masks while F1 collapses to 0.37, 0.26, and 0.55, the textbook high-recall, low-precision signature that label noise amplifies. Those weighted-BCE numbers are sourced from the engagement archive; the per-strategy tolerance bands and the noise-to-error mapping encode the survey's reading of the literature and are schematic, as flagged on the canvas.

The worked example is the whole argument in one measurement. A weighted binary cross-entropy with the positive class weighted by 42 produced recall of 0.96, 0.97, and 0.97 across three masks and F1 of 0.37, 0.26, and 0.55. Recall that high with F1 that low can only mean one thing: the model is predicting the positive class generously, catching nearly every true curve pixel and a great many false ones, so precision is poor and the harmonic mean that is F1 is dragged down by it. That is the canonical high-recall low-precision regime, and label noise on the boundary is one of its principal causes, because a model rewarded for never missing a true pixel learns to claim the uncertain boundary band wholesale rather than risk a miss. A robust objective from the first survey group, or a selection method from the second, attacks this directly by refusing to let the uncertain boundary labels command the same gradient as the confident interior ones.
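
The precision the run does not report can be recovered from the two numbers it does. The sketch below solves the F1 definition for precision and applies it to the three masks, assuming the recall and F1 figures are paired in the order they are listed. The function name is ours, and because the inputs are rounded to two decimals the outputs are approximate.

```python
def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    return f1 * recall / (2.0 * recall - f1)

# Recall and F1 for the three curve masks, paired in the order reported.
reported = [(0.96, 0.37), (0.97, 0.26), (0.97, 0.55)]

for recall, f1 in reported:
    p = implied_precision(recall, f1)
    # (1 - P) / P is the number of false-positive pixels per true-positive pixel.
    print(f"recall={recall:.2f}  F1={f1:.2f}  implied precision={p:.2f}  "
          f"false positives per true positive={(1 - p) / p:.1f}")
```

On the reported figures the implied precision comes out at roughly 0.23, 0.15, and 0.38, so on the worst mask the model paints between five and six false curve pixels for every true one it finds.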

Strategy	Noise assumption	Verdict
Generalised cross-entropy	Bounded gradient under symmetric noise	Robust [2]
Symmetric cross-entropy	Penalises being pulled toward noisy labels	Robust [3]
Loss reweighting / MentorNet	Down-weights likely-mislabelled examples	Robust [4]
Co-teaching	Small-loss examples are probably clean	Robust [5]
Focal loss	Up-weights hard examples	Sensitive to label noise [6]
Weighted BCE (worked example)	Up-weights the rare positive class	Sensitive; recall 0.96 to 0.97, F1 0.37 / 0.26 / 0.55

Discussion

The survey lands on a distinction that the imbalance and noise literatures often blur together. Class imbalance and label noise look similar from a distance, since both make the rare class hard to learn, and both are commonly treated with the same tool of re-weighting the rare class upward. But the correct response to each is opposite at the boundary. The right move for pure imbalance is to up-weight the rare class so its few pixels are not drowned out, which is exactly what weighted cross-entropy and focal loss do (Lin et al., 2017). The right move for label noise is to down-weight the uncertain pixels so the model does not over-trust them, which is what the robust losses and selection methods do. When both problems are present at once, as they almost always are on a thin curve against a vast background, the two corrections fight each other on the boundary band: imbalance handling wants to shout there, noise handling wants to whisper. The worked example is a model that resolved that conflict entirely in favour of imbalance, weighting the positive class by 42 and saying nothing about noise, and its recall-precision split is the predictable consequence.

Where our own work sits in this map is at the rasterisation-noise end, which is in one respect the friendliest case for the robust-training literature. Because our training masks come from a procedural renderer rather than a human annotator, we know the noise model exactly: it is the rounding the renderer applies when it commits a vector curve to the pixel grid, plus the degradation model we apply to imitate the scanning channel. That knowledge is leverage the natural-image researcher does not have, because it lets the noise be characterised rather than merely assumed. A region-overlap loss such as Tversky, which can be tuned to trade false positives against false negatives, becomes a principled instrument once the boundary uncertainty is quantified rather than guessed at (Salehi et al., 2017). The honest framing is that the well log is an unusually measurable case of label noise, and the value of saying so is that it marks where the robust-training recipes are easiest to apply with confidence: anywhere the labels are rendered from known source data, so their noise model is a fact rather than a hypothesis.
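
For readers who have not met it, the Tversky loss fits in a few lines. The sketch below is a NumPy illustration on soft predictions, with alpha weighting false positives and beta weighting false negatives as in the cited paper; the toy masks, the smoothing constant, and the particular alpha and beta values are assumptions chosen only to show the direction of the trade.

```python
import numpy as np

def tversky_loss(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - TP / (TP + alpha * FP + beta * FN), computed on soft predictions."""
    tp = np.sum(probs * target)
    fp = np.sum(probs * (1 - target))
    fn = np.sum((1 - probs) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Toy target: a one-pixel-wide horizontal curve on an 8 x 8 canvas.
target = np.zeros((8, 8))
target[3, :] = 1.0

# Toy prediction that claims the curve plus one row either side of it.
over = np.zeros((8, 8))
over[2:5, :] = 1.0

print("penalise false positives:", round(tversky_loss(over, target, 0.7, 0.3), 3))
print("penalise false negatives:", round(tversky_loss(over, target, 0.3, 0.7), 3))
```

With alpha and beta both at 0.5 the expression reduces to the Dice loss. Moving weight onto alpha makes an over-painted boundary band more expensive, which is the direction a high-recall, low-precision model needs to be pushed, and a known rendering noise model is what would let that weight be set rather than guessed.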

The deeper point the survey kept returning to is about measurement. None of the robust strategies can be chosen well without first estimating the noise rate, and estimating the noise rate on a segmentation boundary is itself an annotation problem, since it requires a second, more careful labelling to compare against. The field has largely sidestepped this by assuming a noise rate or by inferring it from the small-loss dynamics during training, and both are reasonable, but both are guesses. The setting where the guess becomes a measurement is the one where the labels are generated rather than collected, and that is the setting our pipeline happens to occupy.
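
When a more careful reference labelling does exist, the measurement itself is simple. The sketch below is a toy: the masks and the function name flip_rates are assumptions for the example. It reports the overall disagreement rate alongside the two class-conditional flip rates, and it shows why the overall rate alone is misleading for a thin target.

```python
import numpy as np

def flip_rates(noisy, reference):
    """Overall disagreement plus flip rates conditioned on the reference class."""
    noisy = noisy.astype(bool)
    reference = reference.astype(bool)
    overall = np.logical_xor(noisy, reference).mean()
    # Share of reference-background pixels that the noisy mask marks as curve.
    background_flipped = (noisy & ~reference).sum() / max((~reference).sum(), 1)
    # Share of reference-curve pixels that the noisy mask marks as background.
    curve_flipped = (~noisy & reference).sum() / max(reference.sum(), 1)
    return overall, background_flipped, curve_flipped

# Toy reference: a one-pixel-wide curve on row 10 of a 32 x 32 canvas.
reference = np.zeros((32, 32), dtype=bool)
reference[10, :] = True

# Toy noisy mask: the same curve committed one row lower.
noisy = np.zeros((32, 32), dtype=bool)
noisy[11, :] = True

overall, background_flipped, curve_flipped = flip_rates(noisy, reference)
print(f"overall disagreement:       {overall:.3f}")
print(f"background pixels flipped:  {background_flipped:.3f}")
print(f"curve pixels flipped:       {curve_flipped:.3f}")
```

A one-row shift disagrees with the reference on about six percent of the canvas, yet it flips every single curve pixel. That is why a single assumed noise rate says little about how badly the positive class is corrupted.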

Limitations

This is a survey, and a survey carries its sources' assumptions forward. The noise-robust and noise-sensitive verdicts above are our reading of the public literature through the third quarter of 2022, organised for a practitioner choosing between families, not a controlled head-to-head benchmark under a fixed noise model. The tolerance bands and the noise-to-error traces in the instrument are schematic by design: they express how each family is expected to behave as the noise rate rises, not measured degradation curves, and the instrument says so on its own face. Only the worked example's numbers are sourced, the weighted binary cross-entropy run with the positive class weighted by 42 reporting recall of 0.96, 0.97, and 0.97 and F1 of 0.37, 0.26, and 0.55 across three masks, and even those are a single run on one dataset, so they illustrate the high-recall low-precision regime rather than prove a general rate. The rasterisation-noise framing that makes our own case so tractable is, by the same token, what limits how far our experience generalises: a dataset whose labels were drawn by hand carries an instance-dependent, partly-unknowable noise model that no renderer can characterise for it, and the comfortable measurability we describe does not transfer to that world.

References

[1] B. Frenay, M. Verleysen. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 2014. https://ieeexplore.ieee.org/document/6685834

[2] Z. Zhang, M. Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. NeurIPS 2018. https://arxiv.org/abs/1805.07836

[3] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey. Symmetric Cross Entropy for Robust Learning with Noisy Labels. ICCV 2019. https://arxiv.org/abs/1908.06112

[4] L. Jiang, Z. Zhou, T. Leung, L. Li, L. Fei-Fei. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. ICML 2018. https://arxiv.org/abs/1712.05055

[5] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NeurIPS 2018. https://arxiv.org/abs/1804.06872

[6] T. Lin, P. Goyal, R. Girshick, K. He, P. Dollar. Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002

[7] S. Salehi, D. Erdogmus, A. Gholipour. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI Workshop, MICCAI 2017. https://arxiv.org/abs/1706.05721

[8] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

[9] B. Yuan, Q. Yang. Digitization of Well-Logging Parameter Graphs Based on a Gridlines-Elimination Approach. J. Pet. Explor. Prod. Technol., 2019. http://www.jsoftware.us/show-409-JSW15423.html
