There is a small dishonesty buried in almost every segmentation ground truth, and it lives at the edges. When a human annotator draws the boundary of a curve trace on a scanned well log, the mask they produce is a clean binary thing: this pixel is curve, that pixel is background, with a hard line between them. But the scan does not work that way. The ink of an old raster log does not stop at a pixel boundary; it fades, it antialiases, it bleeds into the paper, and the column of pixels crossing the edge of the trace is a graded transition from definitely-ink to definitely-paper, with a band in the middle where any honest annotator would shrug. A hard 0/1 label paves over that shrug. It tells the network that the boundary pixels are certain when they are demonstrably not, and the network, being an obedient function approximator, learns to be confidently wrong exactly where the data is most uncertain. Soft cross-entropy and label smoothing are the two standard answers to that dishonesty, and this piece is a survey of what the literature says they do, read through the lens of the thin-curve digitisation problem. Soft cross-entropy was one of the five losses we evaluated when we built VeerNet, our encoder-decoder for digitising raster logs, and the place where the argument for it is sharpest is the fuzzy edge.
The hard label asserts a certainty the pixel does not have
Start with what a per-pixel cross-entropy loss actually charges for. Under a one-hot target, the loss for a pixel is the negative log of the probability the model assigns to the single correct class. If the target says this edge pixel is curve with probability one, then the only way to drive the loss to zero is for the model to predict curve with probability one too. The gradient keeps pushing the logit for curve toward infinity, long after the model has correctly identified the pixel, because a one-hot target is never satisfied by anything short of total confidence. That is the mechanism Szegedy and colleagues identified when they introduced label smoothing as a regulariser for image classification: a hard target encourages the largest logit to become arbitrarily larger than the others, which makes the model overconfident and hurts its ability to generalise [1].
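The "never satisfied" behaviour can be checked numerically. The sketch below is only an illustration, not VeerNet code: it assumes a softmax output over three classes in the order background, curve-1, curve-2, and uses the standard result that the gradient of cross-entropy with respect to the logits is the predicted distribution minus the target. The function names are ours.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(logits, target):
    """Gradient of cross-entropy w.r.t. the logits: softmax(logits) - target."""
    return [p - q for p, q in zip(softmax(logits), target)]

# Assumed class order: [background, curve-1, curve-2]; the pixel is labelled curve-1.
one_hot = [0.0, 1.0, 0.0]

for curve_logit in (2.0, 5.0, 10.0):
    grad = ce_grad([0.0, curve_logit, 0.0], one_hot)
    # The curve-1 entry stays negative for any finite logit, so gradient
    # descent keeps raising that logit even though the pixel is already
    # classified correctly.
    print(f"logit={curve_logit:5.1f}  d(loss)/d(curve-1 logit)={grad[1]:.6f}")
```

The gradient shrinks as the logit grows but never changes sign, which is the unbounded-logit pressure described above.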
On a well-defined interior pixel that overconfidence is mostly harmless, because the pixel really is curve and pushing the model to say so is correct. The problem is the boundary band. There, the one-hot target is not just demanding confidence, it is demanding confidence about a fact that is genuinely uncertain. The annotator had to put the line somewhere, and wherever they put it, the pixels on either side of that line are nearly identical mixtures of ink and paper. Charging the model full penalty for hesitating on those pixels trains it to manufacture a sharpness the underlying image does not contain. That is the specific failure soft targets are designed to prevent.
Label smoothing: blunt, uniform, and surprisingly effective
The simplest soft target is label smoothing, and its simplicity is the point. Instead of assigning the correct class a probability of one and every other class a probability of zero, you shave a small amount, often written epsilon, off the correct class and spread it uniformly across the others. A three-class pixel that was one-hot at curve-1 becomes, say, curve-1 at 0.9 with background and curve-2 at 0.05 each. Nothing about the image is consulted; the smoothing is a flat, content-blind redistribution applied to every label in the dataset.
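As a minimal sketch of that redistribution, assuming the "shave epsilon and share it among the other classes" variant described in this section: the first function below reproduces the 0.9 / 0.05 / 0.05 example. Note that the formulation in Szegedy et al. [1] mixes the one-hot label with a uniform distribution over all K classes, so the correct class there ends up at 1 - epsilon + epsilon/K; the second function shows that variant for comparison. Both function names are ours.

```python
def smooth_label(num_classes, correct, eps=0.1):
    """Take eps off the correct class and share it equally among the others."""
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if k == correct else off_value for k in range(num_classes)]

def smooth_label_uniform(num_classes, correct, eps=0.1):
    """Mix the one-hot label with a uniform distribution over all classes."""
    uniform_share = eps / num_classes
    return [
        (1.0 - eps) + uniform_share if k == correct else uniform_share
        for k in range(num_classes)
    ]

# Assumed class order: [background, curve-1, curve-2]; index 1 is curve-1.
print(smooth_label(3, correct=1))
print(smooth_label_uniform(3, correct=1))
```

Either way, the same target is produced for every pixel with that label; the pixel's position relative to the edge never enters the calculation.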
What that buys you is calibration. Muller, Kornblith, and Hinton studied label smoothing carefully and found that it consistently improves the calibration of the trained network, meaning the confidence the model reports lines up better with the accuracy it actually achieves, and it does so by pulling the representations of correct examples into tighter, more equidistant clusters [2]. For a digitisation pipeline that is not a cosmetic property. The mask is not the deliverable; downstream we threshold the predicted probabilities and trace a centreline through the surviving pixels, and a well-calibrated probability field is one where the threshold means what you think it means. A model that has been trained to scream confidence at every pixel gives you a probability map that is saturated near zero and one with nothing useful in between, and the threshold becomes a coin toss exactly in the boundary band where you needed it to be informative.
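One common way to put a number on "confidence lines up with accuracy" is a binned expected calibration error. The sketch below is a plain-Python illustration under our own assumptions (equal-width bins, per-pixel confidence for the predicted class, and made-up inputs); it is not a description of how any of the cited papers or our pipeline computed calibration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical boundary-band pixels: 7 of 10 predictions are right.
hits = [True] * 7 + [False] * 3
print(expected_calibration_error([0.99] * 10, hits))  # overconfident model
print(expected_calibration_error([0.70] * 10, hits))  # confidence matches accuracy
```

The overconfident case scores far worse than the matched case, even though both make exactly the same predictions; only the reported probabilities differ, and those are what the downstream threshold consumes.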
But the same paper is careful to name the cost, and we would be misreading it to skip that. Label smoothing erases information. Because it pulls every correct example toward the same smoothed target regardless of how that example actually sits relative to the other classes, it discards the fine structure of how confident the model should be on a given input. Muller and colleagues show this is precisely why label smoothing hurts knowledge distillation: a smoothed teacher has flattened away the relative-similarity signal a student needs to learn from [2]. The lesson we took is that uniform smoothing is a blunt instrument. It treats a pixel deep inside the curve, where the model should be certain, exactly like a pixel straddling the edge, where it should not. It is regularisation by uniform self-doubt, and on a task where the uncertainty is concentrated in a known place, the uniformity is leaving value on the table.
Soft cross-entropy: let the target carry the ambiguity
The more targeted answer is to abandon the binary ground truth entirely and train against soft targets that vary per pixel, which is what a soft cross-entropy loss consumes. Here the boundary pixel is not labelled curve-with-a-little-smoothing; it is labelled with a fractional target that reflects how much ink it actually contains, a value between zero and one that slides smoothly across the transition band. The interior pixels keep their near-certain targets, the deep-background pixels keep theirs, and only the genuinely ambiguous band carries graded labels. The loss then charges the model proportionally: full penalty for disagreeing in the certain regions, and a softened, tolerant penalty in the band where the truth is itself uncertain.
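A minimal sketch of that idea for a single two-class pixel, assuming the fractional target is simply the measured ink fraction. The ink values, the 0.5 annotation threshold, and the function names are all illustrative assumptions, and this is not the SoftSeg recipe or the VeerNet training code.

```python
import math

def soft_cross_entropy(probs, target, floor=1e-12):
    """Per-pixel cross-entropy -sum_k target[k] * log(probs[k]); target may be fractional."""
    return -sum(q * math.log(max(p, floor)) for p, q in zip(probs, target))

def soft_target(ink_fraction):
    """Hypothetical [paper, ink] target built from a measured ink fraction."""
    return [1.0 - ink_fraction, ink_fraction]

def hard_target(ink_fraction, threshold=0.5):
    """The binary label an annotator's line would give the same pixel."""
    return [0.0, 1.0] if ink_fraction >= threshold else [1.0, 0.0]

# Illustrative ink fractions along a column of pixels crossing one trace.
column = [0.0, 0.2, 0.7, 1.0, 0.6, 0.1, 0.0]

for ink in column:
    honest = soft_target(ink)  # a model that reports the ink fraction exactly
    # With a softmax output, d(loss)/d(ink logit) = predicted - target.
    pull_soft = honest[1] - soft_target(ink)[1]
    pull_hard = honest[1] - hard_target(ink)[1]
    print(f"ink={ink:.1f}  soft pull={pull_soft:+.1f}  hard pull={pull_hard:+.1f}")

# Against the soft target, the loss on a band pixel is lowest at the true fraction.
band = soft_target(0.7)
for guess in (0.5, 0.7, 0.9):
    print(guess, round(soft_cross_entropy([1.0 - guess, guess], band), 3))
```

In the interior and deep background both targets agree and the pull is zero. In the band, the hard target keeps pulling an honest prediction toward 0 or 1, while the soft target leaves it alone.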
This is the case Gros, Lemay, and Cohen-Adad make directly in the medical-imaging literature, and it transfers almost word for word to scanned logs. Their argument is that the binary black-and-white approach is too constraining because the contrast between two tissues is often ill-defined, so voxels on an object's edge contain a mixture of tissues, and forcing a single hard label there is a detrimental approximation [3]. Swap tissues for ink and paper, voxels for pixels, and that is exactly the edge of a raster-log curve. Their SoftSeg recipe reframes the problem as a regression toward soft targets rather than a hard classification, and reports that the soft formulation beat the binary one on the segmentation metric across several datasets [3]. The mechanism is easy to state: when the target stops claiming false certainty, the loss landscape over the boundary pixels flattens from a punishing spike into a tolerant valley, and the model is free to express the partial-volume reality of the edge instead of being whipped toward a sharpness the scan never had.
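The spike-versus-valley picture can be made concrete for a single two-class pixel. The derivation below is a sketch under our own notation, not taken from [3]: t is the target ink fraction, p is the predicted ink probability, and the loss is ordinary two-class cross-entropy.

```latex
\begin{aligned}
\mathcal{L}(p) &= -\,t \log p \;-\; (1 - t)\log(1 - p), \qquad 0 < p < 1,\\[4pt]
\frac{d\mathcal{L}}{dp} &= -\frac{t}{p} + \frac{1 - t}{1 - p} \;=\; \frac{p - t}{p\,(1 - p)},\\[4pt]
\frac{d^{2}\mathcal{L}}{dp^{2}} &= \frac{t}{p^{2}} + \frac{1 - t}{(1 - p)^{2}} \;>\; 0 .
\end{aligned}
```

For a fractional target 0 < t < 1 the loss is convex with an interior minimum at p = t, where it equals the entropy of the target, -t log t - (1 - t) log(1 - t). For a hard target t = 1 the loss reduces to -log p, whose derivative -1/p is negative everywhere on the interval, so there is no interior minimum and the optimum sits at p = 1, reachable only as the logit grows without bound.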
Where soft labels are honest and where they are merely soft
It would be easy to read all of this as soft-always-wins, and the survey literature does not support that reading. The honesty of a soft label depends entirely on whether the softness corresponds to real uncertainty. SoftSeg's soft targets are honest because they are derived from the partial-volume content of each edge voxel: the fractional label is a measurement, not a guess [3]. Label smoothing's softness is, by contrast, a flat regulariser that has no idea where the real ambiguity lives [1][2]. Both are softer than a one-hot target, but only one of them is softer in the right places. A soft target that smears uncertainty uniformly across pixels that are not uncertain is not modelling anything; it is just adding noise to the labels and hoping the regularisation pays for it. Sometimes it does, as the calibration results show, and sometimes the information it destroys costs more than the overconfidence it prevents, as the distillation results show [2].
That is why we did not treat soft cross-entropy as a free win. It sat in the same controlled five-loss comparison as Dice, Focal, Lovasz, and Tversky, trained under identical conditions on the same synthetic multiclass dataset with its three output classes, background plus two curves, so that any difference we read off was attributable to the loss and not to a luckier schedule. The discipline that the survey by Jadon catalogues, of treating the loss as a hypothesis to be tested against the structure of the task rather than a default to be inherited, is the one that matters here [4]. Each of the five candidates encodes a different theory of what a good segmentation is, and soft cross-entropy's theory is specifically a theory about the labels: that the ground truth itself is uncertain at the edges and the objective should say so. Whether that theory pays off is an empirical question about your particular foreground, not a thing you can settle from first principles.
The thin-curve twist the medical analogy does not quite capture
There is one way scanned logs differ from the medical case, and it sharpens the argument rather than weakening it. A curve trace is one to three pixels wide, which means the boundary band is not a thin rim around a large object; it is a substantial fraction of the foreground itself. On a fat anatomical structure the ambiguous edge is a small minority of the pixels and the soft-versus-hard decision is a refinement. On a one-to-three-pixel curve, almost every foreground pixel is within a pixel of an edge, so the edge band is not a detail of the label, it is most of the label. A hard target's overconfidence problem, which is mild on fat objects, becomes the dominant failure mode, because there is barely any well-defined interior to anchor the model. That is the structural reason soft labels have more to offer on thin curves than the general segmentation literature would lead you to expect, and it is the reason we kept soft cross-entropy in the comparison rather than dismissing it as a classification-era trick.
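The geometry is easy to check on toy masks. The sketch below assumes a perfectly straight horizontal trace and a filled square as a stand-in for a "fat" object. The shapes and sizes are illustrative, not drawn from our dataset. It counts the share of foreground pixels that touch background through a 4-connected neighbour.

```python
def boundary_fraction(mask):
    """Share of foreground pixels with at least one 4-connected background neighbour."""
    height, width = len(mask), len(mask[0])
    foreground = on_edge = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            foreground += 1
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                # Neighbours outside the image are ignored, not counted as background.
                if 0 <= ny < height and 0 <= nx < width and not mask[ny][nx]:
                    on_edge += 1
                    break
    return on_edge / foreground if foreground else 0.0

def horizontal_strip(size, thickness):
    """A straight trace of the given thickness running across a size x size image."""
    top = (size - thickness) // 2
    return [[top <= y < top + thickness for _ in range(size)] for y in range(size)]

def filled_square(size, side):
    """A centred solid square standing in for a large anatomical structure."""
    start = (size - side) // 2
    inside = range(start, start + side)
    return [[y in inside and x in inside for x in range(size)] for y in range(size)]

for thickness in (1, 2, 3):
    print("strip", thickness, boundary_fraction(horizontal_strip(64, thickness)))
print("square 40", boundary_fraction(filled_square(64, 40)))
```

On these toy shapes every pixel of a one- or two-pixel strip touches background, and two thirds of a three-pixel strip does, while under a tenth of the 40-pixel square's area is edge.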
It also explains why label smoothing and soft cross-entropy can pull in different directions on this task. Uniform smoothing spends its regularisation budget everywhere, but on a thin curve there is almost no certain interior for that budget to protect, so most of it lands on pixels that were already uncertain and a fair amount lands on the background, where it can quietly encourage the model to hedge on pixels that are unambiguously paper. Per-pixel soft cross-entropy spends the budget only in the band, which on a thin curve is where nearly all of it belongs. The blunt tool and the targeted tool converge to similar behaviour on a fat object and diverge sharply on a sliver, and the sliver is what we have.
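To illustrate where each scheme spends its budget, the sketch below uses a hypothetical column of ink fractions crossing a one-pixel trace and measures how far each target moves from the hard label, pixel by pixel. The ink values, the 0.5 threshold, epsilon = 0.1, and the two-class simplification are all assumptions made for the illustration.

```python
def hard_labels(ink, threshold=0.5):
    """Binary ink labels from thresholding the measured ink fraction."""
    return [1.0 if value >= threshold else 0.0 for value in ink]

def smoothed_labels(hard, eps=0.1):
    """Two-class label smoothing applied to the ink probability of every pixel."""
    return [h * (1.0 - eps) + (1.0 - h) * eps for h in hard]

def share_spent_on_clean_paper(targets, hard, ink):
    """Fraction of total label softening that lands on pixels with no ink at all."""
    moved = [abs(t - h) for t, h in zip(targets, hard)]
    total = sum(moved)
    if total == 0.0:
        return 0.0
    on_paper = sum(m for m, value in zip(moved, ink) if value == 0.0)
    return on_paper / total

# Hypothetical column crossing a one-pixel trace: six clean paper pixels,
# one mostly-ink pixel, and two ambiguous neighbours.
ink = [0.0, 0.0, 0.0, 0.3, 0.9, 0.4, 0.0, 0.0, 0.0]
hard = hard_labels(ink)

print("label smoothing:", share_spent_on_clean_paper(smoothed_labels(hard), hard, ink))
print("per-pixel soft :", share_spent_on_clean_paper(ink, hard, ink))
```

In this toy column about two thirds of the uniform smoothing lands on pixels that are unambiguously paper, while the per-pixel soft target spends nothing there and all of it in the band.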
The rule the survey leaves behind
Strip away the well logs and the general statement is about the epistemics of the label, not the geometry of the loss. A loss function is a contract about what the model should be confident about, and a hard label signs that contract for every pixel including the ones where no honest annotator is confident. Soft cross-entropy and label smoothing are two ways of refusing to sign that contract at the edges, one targeted and measured, one uniform and blunt. The literature is clear that softening helps when the softness tracks real uncertainty and calibration matters downstream [1][2][3], and equally clear that it can erase useful signal when it does not [2]. The right move is not to adopt soft labels because a paper liked them, nor to keep hard labels because they are simpler, but to ask where the genuine ambiguity in your ground truth actually lives, and to choose a target that is soft precisely there and nowhere else. On a thin-curve digitisation task the answer to that question is unusually concentrated, which is the whole reason this loss earned a seat at the five-way table.
None of this displaces the architecture work or the data work; soft labels are not a substitute for more wells or a better encoder. They are a correction to a quiet lie in the ground truth, and on a task where the lie sits on top of nearly every foreground pixel, correcting it is worth the experiment. The contribution we claim is narrow and honest: VeerNet is ours, the controlled five-loss comparison is ours, and the reading above of why soft targets matter at a fuzzy edge is built on prior art we have tried to credit at every step rather than reinvent.
Key takeaways
- A hard 0/1 label asserts a certainty the boundary pixel does not have: the ink of a scanned curve fades across a transition band, and a one-hot target trains the network to be confidently wrong exactly where the image is most ambiguous (Szegedy et al. on label smoothing as a regulariser).
- Label smoothing is the blunt fix, shaving a uniform epsilon off the correct class for every pixel. It improves calibration (Muller, Kornblith, Hinton) which matters because the mask is thresholded downstream, but it erases relative-similarity information and treats certain interior pixels the same as ambiguous edge pixels.
- Soft cross-entropy is the targeted fix, training against per-pixel fractional targets that carry the real partial-volume content of each edge pixel. SoftSeg (Gros et al.) makes exactly this case for ill-defined tissue boundaries; ink-versus-paper edges on a raster log are the same problem.
- Softness is only honest where it tracks real uncertainty: a measured fractional edge label models something, a flat uniform smear does not. Whether soft beats hard is an empirical question about your foreground, which is why soft cross-entropy sat in the same controlled five-loss comparison (Dice, Focal, Lovasz, Soft-CE, Tversky) over three output classes rather than being assumed.
- On a one-to-three-pixel curve the edge band is most of the foreground, not a thin rim, so the hard-label overconfidence problem dominates and targeted soft cross-entropy has more to offer than the general segmentation literature would predict. Choose a target that is soft precisely where the ambiguity lives and nowhere else.
References
[1] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception Architecture for Computer Vision. CVPR (2016). Introduces label smoothing as a regulariser that discourages overconfident logits. https://arxiv.org/abs/1512.00567
[2] Muller, R., Kornblith, S., and Hinton, G. When Does Label Smoothing Help? NeurIPS (2019). Shows label smoothing improves calibration and clustering but erases relative-similarity information, which is why it hurts knowledge distillation. https://arxiv.org/abs/1906.02629
[3] Gros, C., Lemay, A., and Cohen-Adad, J. SoftSeg: Advantages of soft versus binary training for image segmentation. Medical Image Analysis, 71, 102038 (2021). Argues that binary labels are too constraining at ill-defined edges where pixels are mixtures, and trains against soft targets instead. https://arxiv.org/abs/2011.09041
[4] Jadon, S. A survey of loss functions for semantic segmentation. IEEE CIBCB (2020). The catalogue of loss families and the discipline of matching the objective to the task. https://arxiv.org/abs/2006.14822