Loss Functions for Imbalanced Segmentation: A Quantitative Meta-Review

Abstract

Which loss function should a practitioner reach for when the foreground is a sliver of the frame? The published literature on imbalanced semantic segmentation has answered this question many times over since 2015, but the answers are scattered across the papers that introduced each objective and the two surveys that catalogue them, and they are rarely read against the same task. This meta-review surveys the five loss families that the field most often evaluates for heavy class imbalance, namely soft cross-entropy, the Dice overlap loss, the Focal reshaping of cross-entropy, the Lovasz-Softmax intersection-over-union surrogate, and the Tversky generalisation of Dice, and synthesises what the literature reports about each one's behaviour as the background majority grows. We credit each objective to its originating work and locate the families on the two axes the field has organised itself around: overlap-aware versus per-pixel, and symmetric versus precision-recall-tunable. We then read the synthesised picture against a real three-class raster-log segmentation set in which the background occupies roughly 97 percent of every frame, where a Dice baseline scores an intersection over union of 0.94 on the background but only 0.26 and 0.21 on the two thin curves. The survey's central finding, consistent with what the field reports, is that no single family dominates: the right choice is dictated by which error the downstream task pays for, and the overlap-aware and tunable families separate from the plain per-pixel baseline precisely in the high-imbalance regime where the curves live.

Background and the shape of the literature

The modern story begins with the move to dense, end-to-end pixel prediction. Fully convolutional networks made it practical to assign a label to every pixel of an image in a single forward pass [1], and with that capability came the question this review is about, because the obvious training objective, per-pixel cross-entropy summed over the frame, behaves badly when one class swamps the others. On an image that is overwhelmingly background, a model can drive the average cross-entropy down to a comfortable-looking value by predicting background almost everywhere, never learning the minority class at all. The literature's response to that failure is the family of loss functions this survey covers, and each member is best understood as a different answer to the same complaint.
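The trap is easy to reproduce on paper. The following is a minimal sketch, assuming a synthetic 10,000-pixel binary frame with 3 percent foreground and a hypothetical predictor that outputs a foreground probability of 0.03 everywhere; the frame, the predictor, and the function name are illustrative assumptions, not taken from any of the surveyed papers.

```python
import numpy as np

def mean_cross_entropy(p_fg, y):
    """Mean binary cross-entropy over all pixels; p_fg is P(foreground)."""
    p_fg = np.clip(p_fg, 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p_fg) + (1 - y) * np.log(1 - p_fg))))

# Synthetic frame (assumed for illustration): 3 percent foreground.
y = np.zeros(10_000)
y[:300] = 1.0

# Hypothetical "lazy" predictor: background almost everywhere.
lazy = np.full_like(y, 0.03)

ce = mean_cross_entropy(lazy, y)
accuracy = float(np.mean((lazy > 0.5) == (y > 0.5)))
foreground_found = int(np.sum((lazy > 0.5) & (y > 0.5)))

print(f"mean CE {ce:.3f}, pixel accuracy {accuracy:.2f}, "
      f"foreground pixels found {foreground_found}")
```

Under these assumptions the mean cross-entropy lands near 0.13, well below the roughly 0.69 of a coin-flip predictor, while not a single foreground pixel is recovered.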

The first major answer was to stop summing per-pixel errors and start scoring set overlap. The Dice loss, introduced in differentiable form by Milletari and colleagues for volumetric medical segmentation, optimises the Dice coefficient directly, which measures the overlap between the predicted mask and the ground truth normalised by their combined size [2]. Because the score is a ratio of overlap to the combined size of the two masks rather than a count of correct pixels, correctly rejected background pixels never enter it, so it is largely indifferent to the size of the background, and that indifference is exactly what makes it the default reach for imbalanced foregrounds. Sudre and colleagues generalised the idea to the multiclass setting with a class-frequency weighting, the generalised Dice loss, which up-weights rarer classes so that a tiny structure is not drowned out by a large one in the same frame [3]. These two papers define the overlap-aware corner of the design space.
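A minimal sketch of the overlap idea follows, assuming a single class and NumPy arrays of per-pixel probabilities. Published variants differ in detail, for example in whether the denominator terms are squared, so this is one common form rather than the exact objective of [2]; the function name and the smoothing constant are illustrative.

```python
import numpy as np

def soft_dice_loss(p, y, eps=1e-7):
    """1 - soft Dice for one class; p are probabilities, y is a 0/1 mask."""
    intersection = np.sum(p * y)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(y) + eps)

# A small frame: two foreground pixels, two background pixels.
p_small = np.array([0.8, 0.6, 0.1, 0.0])
y_small = np.array([1.0, 1.0, 0.0, 0.0])

# The same frame padded with 1,000 background pixels the model rejects outright.
p_large = np.concatenate([p_small, np.zeros(1000)])
y_large = np.concatenate([y_small, np.zeros(1000)])

loss_small = soft_dice_loss(p_small, y_small)
loss_large = soft_dice_loss(p_large, y_large)
print(loss_small, loss_large)
```

Both calls return the same loss of about 0.2: pixels that are background in the ground truth and assigned zero foreground probability add nothing to the numerator or the denominator, which is the indifference to background size described above.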

A parallel answer kept the per-pixel form of cross-entropy but reshaped its gradient. The Focal loss multiplies the cross-entropy of each pixel by a modulating factor that shrinks toward zero as the pixel becomes easy and already-correct, so the gradient concentrates on the hard, usually minority, pixels rather than being diluted by the legion of trivially-correct background ones [5]. Where Dice changes what is scored, Focal changes which pixels the score listens to, and the two strategies are not mutually exclusive, a point the later literature returns to.
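The modulating factor can be sketched in a few lines. This is a simplified binary form, assuming the focusing parameter gamma = 2 and omitting the optional class-balancing weight; the function name and the two example pixels are illustrative.

```python
import numpy as np

def focal_loss_per_pixel(p_fg, y, gamma=2.0):
    """Per-pixel binary focal loss; gamma = 0 recovers plain cross-entropy."""
    p_fg = np.clip(p_fg, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p_fg, 1 - p_fg)  # probability given to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

# One easy background pixel (p_t = 0.99) and one hard foreground pixel (p_t = 0.2).
p = np.array([0.01, 0.2])
y = np.array([0, 1])

ce = focal_loss_per_pixel(p, y, gamma=0.0)  # plain cross-entropy
fl = focal_loss_per_pixel(p, y, gamma=2.0)  # focal reshaping
scale = fl / ce                             # equals (1 - p_t) ** gamma
print(scale)
```

The easy pixel's loss is scaled by 0.0001 and the hard pixel's by 0.64, so the hard pixel's share of the total grows without the scored quantity itself changing.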

The third answer addressed the gap between the objective and the metric directly. Practitioners report intersection over union, but per-pixel and even Dice losses are proxies for it rather than the thing itself. The Lovasz-Softmax loss optimises a tractable, differentiable surrogate of the intersection-over-union measure, so the gradient is aligned with the number that ends up in the table [6]. And the fourth answer made the overlap loss tunable: the Tversky loss generalises Dice by splitting the denominator into separately-weighted penalties for false positives and false negatives, turning a fixed overlap score into a precision-recall dial [4], an idea that Abraham and Khan later combined with the Focal reshaping into a focal Tversky variant for small-lesion segmentation [7]. Two surveys have since tried to put the whole landscape in order, Jadon's catalogue of segmentation losses [8] and the broader Loss Odyssey of Ma and colleagues [9], and this meta-review leans on both rather than reinventing their taxonomy.
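The precision-recall dial can be sketched as follows, assuming the convention in which alpha weights false positives and beta weights false negatives. The 0.3 and 0.7 settings, the function names, and the toy masks are illustrative choices, not values drawn from the reference task.

```python
import numpy as np

def tversky_loss(p, y, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index; alpha weights false positives, beta false negatives."""
    tp = np.sum(p * y)
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def dice_loss(p, y, eps=1e-7):
    """Plain soft Dice loss, for comparison."""
    return 1.0 - (2.0 * np.sum(p * y) + eps) / (np.sum(p) + np.sum(y) + eps)

y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
miss = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # one false negative
stray = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])  # one false positive

# With alpha = beta = 0.5 the Tversky loss coincides with Dice.
same_as_dice = np.isclose(tversky_loss(miss, y), dice_loss(miss, y))

# A recall-tilted setting raises the cost of the miss and lowers that of the stray.
miss_tilted = tversky_loss(miss, y, alpha=0.3, beta=0.7)
stray_tilted = tversky_loss(stray, y, alpha=0.3, beta=0.7)
print(same_as_dice, miss_tilted, stray_tilted)
```

On these toy masks the tilt moves the missed-pixel loss from 0.20 to about 0.26 and the stray-pixel loss from about 0.14 to about 0.09, which is the asymmetry a fixed overlap score cannot express.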

Method of the survey

The synthesis here is a structured reading, not a new experiment, and it is worth being precise about how it was assembled so the reader can judge what it does and does not establish. We took the five loss families that the two surveys [8] [9] treat as the canonical candidates for imbalanced segmentation and traced each one back to its originating paper to recover the exact objective and the regime its authors claimed it for. For each family we extracted two things the literature consistently reports on: its position on the overlap-aware-versus-per-pixel axis, and whether it exposes any control over the precision-recall trade. We then organised the families on those two axes and asked what the published evidence says happens to each as class imbalance worsens.

To keep the synthesis from floating free of any real task, we read it against a concrete reference point: a three-class raster-log segmentation problem, background plus two well-log curves, in which the curves are one to three pixels wide and the background occupies on the order of 97 percent of every frame. That reference is not the contribution of this review, which is a survey, but it grounds the abstract claims in a setting where the imbalance is extreme and measured. The numbers we quote from it, a Dice baseline and a Tversky comparison, are real and sourced from the engagement archive; the family-by-family sensitivity behaviour we discuss is the literature's reported behaviour, illustrated against that baseline rather than re-measured for every family. The interactive arena below is built on the same basis: real anchor numbers, illustrative response shapes.

Results: what the field reports, read against a real baseline

The synthesised picture has a clear top-level shape, and it matches what both surveys conclude: under mild imbalance the five families are nearly interchangeable, and they separate only as the background majority grows. That separation is the whole reason the literature exists, so the interesting region is the high-imbalance one, which is where the well-log reference sits.

The reference baseline makes the cost of imbalance vivid. With a plain Dice loss on the three-class set, the per-class intersection over union is 0.94 on the background but collapses to 0.26 on the first curve and 0.21 on the second, and the F1 scores tell the same story at 0.97, 0.37, and 0.32. The background class, which is easy and enormous, is essentially solved; the minority curves, which are the entire point of the task, are where every loss function's behaviour actually matters. A model could report a respectable frame-averaged score while being nearly useless on the only class a petrophysicist cares about, which is the imbalance trap the whole loss literature is built to escape.
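How a frame-averaged score hides a minority-class failure can be shown with a synthetic example. The sketch below assumes a 100 by 100 label map with a one-pixel-wide and a two-pixel-wide vertical "curve" (3 percent foreground) and a hypothetical prediction that draws both curves one column too far right; it does not reproduce the reference numbers, and every name in it is illustrative.

```python
import numpy as np

def per_class_iou_f1(pred, target, n_classes):
    """Per-class (IoU, F1) computed from hard label maps."""
    scores = {}
    for c in range(n_classes):
        tp = int(np.sum((pred == c) & (target == c)))
        fp = int(np.sum((pred == c) & (target != c)))
        fn = int(np.sum((pred != c) & (target == c)))
        iou = tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else float("nan")
        scores[c] = (iou, f1)
    return scores

# Ground truth: class 0 background, class 1 in column 30, class 2 in columns 70-71.
target = np.zeros((100, 100), dtype=int)
target[:, 30] = 1
target[:, 70:72] = 2

# Hypothetical prediction: both curves shifted right by one column.
pred = np.zeros((100, 100), dtype=int)
pred[:, 31] = 1
pred[:, 71:73] = 2

pixel_accuracy = float(np.mean(pred == target))
scores = per_class_iou_f1(pred, target, n_classes=3)
print(pixel_accuracy, scores)
```

Pixel accuracy is 0.96, yet the one-pixel curve scores an IoU of zero and the two-pixel curve an IoU of one third: a single-column offset is enough to erase a thin structure from the per-class table while the frame-level number barely moves.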

Figure (interactive arena): the five published loss families for imbalanced segmentation, laid out on one arena whose driving axis is the class imbalance itself. The horizontal track sweeps the share of the frame that is background, from a balanced regime up to the real 97 percent background of the three-class well-log set, which is marked on the track. Each teal curve is one family's reported sensitivity, the share of the learning signal it still spends on the thin minority curve as the background majority grows; the single orange curve is Tversky, which can tilt its penalty toward the false negatives a one-pixel curve produces and so holds the most foreground signal at the heaviest imbalance. The right-hand panel anchors the arena in real measured numbers: the multiclass Dice baseline per class, with a solid bar for intersection over union (background 0.94, curve 1 0.26, curve 2 0.21) and a dashed bar for F1 (0.97, 0.37, 0.32), making the collapse on the minority curves plain. A single orange annotation records the measured Tversky improvement over Dice on the harder curve, curve 1: mean absolute error 0.0277 against 0.0367. The IoU, F1 and MAE figures and the 97 percent background reading are sourced from the engagement archive; the shape of each family's sensitivity response across the imbalance axis is illustrative of its loss mechanism, not a measured trace.

Read across the families, the literature's reported ordering on the minority class is consistent. Soft cross-entropy, the per-pixel baseline, degrades the hardest under imbalance because its gradient is dominated by the background majority, and both surveys note this as the motivating failure [8] [9]. Dice resists the imbalance far better because its overlap normalisation is indifferent to background size [2], with the generalised-Dice weighting designed to push that resistance further on multiclass frames [3]. Focal holds signal on the minority by down-weighting the easy background pixels rather than by changing what is scored [5]. Lovasz-Softmax, by optimising the intersection-over-union surrogate directly, tracks the reported metric most closely and stays comparatively flat as imbalance rises [6]. And Tversky, alone among the five, can be tilted to penalise the missed-curve-pixel error harder than the stray-background error, which is why the field reports it holding the most sensitivity on a thin minority class at the most extreme skew [4] [7].
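The sense in which the Lovasz surrogate tracks the reported metric can be checked numerically. The sketch below is a NumPy rendering of the single-class term, assuming per-pixel errors |y - p| sorted in decreasing order and weighted by the discrete gradient of the Jaccard loss; the full multiclass loss averages such terms over classes, and the function names and toy masks here are illustrative.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Jaccard loss with respect to the sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]  # discrete differences
    return jaccard

def lovasz_one_class(p, y):
    """Lovasz extension of the Jaccard loss for one class."""
    errors = np.abs(y - p)
    order = np.argsort(-errors, kind="stable")  # largest error first
    return float(np.dot(errors[order], lovasz_grad(y[order])))

y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
hard_pred = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])  # TP=2, FP=1, FN=1, IoU=0.5

loss_hard = lovasz_one_class(hard_pred, y)
loss_perfect = lovasz_one_class(y.copy(), y)
print(loss_hard, loss_perfect)
```

For a hard 0/1 prediction the surrogate equals one minus the IoU exactly, here 0.5, and it falls to zero for a perfect mask; in between it interpolates, which is what makes the quantity in the results table available to the optimiser as a gradient.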

That last point is where the survey can be pinned to a measured number from the reference task. On the harder of the two curves, curve 1, a recall-tilted Tversky configuration left a mean absolute error of 0.0277 on the recovered curve against the Dice baseline's 0.0367, the single clearest piece of evidence that the tunable family converts its extra expressiveness into a real downstream gain under this imbalance. It is one task and one configuration, not a universal verdict, but it is exactly the direction the published literature predicts.
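For scale, the size of that gap can be worked out directly from the two quoted figures. The snippet below only does the arithmetic; the variable names are illustrative.

```python
dice_mae = 0.0367     # Dice baseline, as quoted above
tversky_mae = 0.0277  # recall-tilted Tversky, as quoted above

absolute_gain = dice_mae - tversky_mae
relative_gain = absolute_gain / dice_mae

print(f"{absolute_gain:.4f} absolute, {relative_gain:.1%} relative")
# prints "0.0090 absolute, 24.5% relative"
```

That is a reduction of roughly a quarter in mean absolute error, from a single configuration on a single task.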

Discussion: two axes, no universal winner

Laid out on the two axes the field has argued over, the families stop looking like a list and start looking like a design space. The first axis is overlap-aware against per-pixel. Soft cross-entropy is purely per-pixel and pays the full imbalance penalty; Dice, generalised Dice, and Lovasz-Softmax are overlap- or metric-aware and largely escape it; Focal is a per-pixel loss that buys back some of that robustness by reweighting rather than rescoring. The second axis is symmetric against precision-recall-tunable. All of the above weigh a false positive and a false negative the same way once the class weighting is fixed; only Tversky, and the focal-Tversky hybrid built on it, lets the practitioner state in the gradient that one kind of mistake costs more than the other.

Where our own work sits in this landscape is worth stating plainly, because it is the boundary between this survey and our prior writing. This meta-review is a reading of the public field. Our separate, controlled Tversky-versus-Dice ablation, run under identical conditions on the same architecture and dataset, is the experiment; this is the map that experiment lives on. The value of the map is that it explains why that ablation came out the way it did: Tversky and Dice share an equation until the false-positive and false-negative weights are split, so on a one-pixel curve where a miss is geometrically more expensive than a stray, the only family that can express that asymmetry is the one that wins. The survey predicts the result; the ablation confirms it; neither replaces the other.

The practical reading for someone choosing a loss is not to memorise an ordering but to locate their task on the two axes. If the foreground is heavily outnumbered, leave the per-pixel corner. If the metric you will be judged on is intersection over union itself, the surrogate that optimises it directly is the principled choice. And if your two error types carry genuinely different downstream costs, as they do whenever a missed thin structure is worse than a smudge, only the tunable family lets you say so to the optimiser.
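The same reading can be written down as a small lookup, purely as a way of making the two axes explicit. This is a hypothetical helper, not a published procedure: the function name, the order in which the questions are asked, and in particular the 0.9 threshold for "heavily outnumbered" are assumptions made for illustration.

```python
def suggest_loss_families(background_share, judged_on_iou, miss_costs_more,
                          heavy_imbalance_threshold=0.9):
    """Locate a task on the two axes and return candidate loss families.

    heavy_imbalance_threshold is an arbitrary illustrative cut-off,
    not a value reported in the literature.
    """
    if not 0.0 <= background_share <= 1.0:
        raise ValueError("background_share must lie in [0, 1]")
    if miss_costs_more:
        # Asymmetric error costs: only the tunable family can express them.
        return ["Tversky", "focal Tversky"]
    if judged_on_iou:
        # The surrogate that optimises the reported metric directly.
        return ["Lovasz-Softmax"]
    if background_share >= heavy_imbalance_threshold:
        # Heavy imbalance: leave the per-pixel corner.
        return ["Dice", "generalised Dice", "Lovasz-Softmax"]
    # Mild imbalance: the families are reported as nearly interchangeable.
    return ["soft cross-entropy", "Dice", "Focal", "Lovasz-Softmax", "Tversky"]

# The well-log reference: about 97 percent background, misses cost more than strays.
print(suggest_loss_families(0.97, judged_on_iou=True, miss_costs_more=True))
```

The output of such a helper is a shortlist to test on one's own data rather than a verdict, since the ordering of the questions is itself a judgment about which downstream cost matters most.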

Limitations

This is a survey, and it inherits the limits of one. It synthesises what the published papers and the two catalogue surveys report; it does not re-run all five losses under a common protocol, and where it quotes measured numbers, those come from a single reference task and a single architecture rather than from a fresh multi-family benchmark. The family-by-family sensitivity behaviour discussed in the results, and rendered in the arena, is the literature's reported qualitative behaviour illustrated against that reference, not a freshly measured response curve for each loss. The reference task is also narrow: a three-class raster-log problem with one-to-three-pixel curves and roughly 97 percent background, which is precisely the extreme-imbalance regime, so the relative ordering observed there may compress or reshuffle on tasks with milder skew or thicker foregrounds. The Tversky improvement we quote is from one recall-tilted configuration on the harder curve, not an exhaustive sweep of its alpha-beta dial. And the survey deliberately scopes itself to the five families the two reference surveys treat as canonical; compound and learned objectives that postdate the period, and the many hybrid combinations the field has since explored, are out of its frame. A reader should treat the synthesis as a map of the published landscape and a guide to where to look, not as a substitute for running the controlled comparison on their own data.

Key takeaways

The imbalanced-segmentation loss literature is best read as a design space on two axes: overlap-aware versus per-pixel, and symmetric versus precision-recall-tunable. The five surveyed families (soft cross-entropy, Dice, Focal, Lovasz-Softmax, Tversky) each occupy a different position on those axes.
Under mild imbalance the families are nearly interchangeable; they separate only as the background majority grows, which is exactly why the literature exists. The real three-class raster-log reference sits at roughly 97 percent background, deep in the regime where the choice matters.
The reference Dice baseline makes the trap vivid: IoU of 0.94 on the easy background but only 0.26 and 0.21 on the two thin curves, and F1 of 0.97 against 0.37 and 0.32. A high frame-averaged score can hide near-failure on the only class that matters.
The field reports that soft cross-entropy degrades worst under imbalance while the overlap-aware (Dice, generalised Dice, Lovasz-Softmax) and the tunable (Tversky) families hold signal on the minority class. Our reference measures only the Dice-versus-Tversky end of that ordering, and it points the same way: a recall-tilted Tversky cut curve-1 MAE to 0.0277 against Dice's 0.0367.
This is a survey of the published field, distinct from our own controlled Tversky-versus-Dice ablation. The map explains why that experiment came out as it did; choosing a loss means locating your own task on the two axes, not inheriting a default.

References

[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The dense per-pixel prediction setting in which the imbalance problem this review surveys first becomes acute. https://arxiv.org/abs/1411.4038

[2] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016). Introduces the Dice loss in its now-standard differentiable form, the overlap-aware answer to imbalance. https://arxiv.org/abs/1606.04797

[3] Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., and Cardoso, M. J. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. DLMIA Workshop, MICCAI (2017). The class-frequency-weighted multiclass extension of Dice for severe imbalance. https://arxiv.org/abs/1707.03237

[4] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI (2017). Splits the overlap denominator into weighted false-positive and false-negative penalties, the precision-recall dial. https://arxiv.org/abs/1706.05721

[5] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal Loss for Dense Object Detection. ICCV (2017). The modulating factor that down-weights easy, already-correct pixels so the gradient concentrates on the hard minority. https://arxiv.org/abs/1708.02002

[6] Berman, M., Triki, A. R., and Blaschko, M. B. The Lovasz-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR (2018). Direct optimisation of the IoU measure the field reports. https://arxiv.org/abs/1705.08790

[7] Abraham, N., and Khan, N. M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. ISBI (2019). Combines the Tversky tunability with the Focal reshaping for small-structure segmentation. https://arxiv.org/abs/1810.07842

[8] Jadon, S. A survey of loss functions for semantic segmentation. IEEE CIBCB (2020). One of the two catalogue surveys this meta-review's taxonomy leans on. https://arxiv.org/abs/2006.14822

[9] Ma, J., Chen, J., Ng, M., Huang, R., Li, Y., Li, C., Yang, X., and Martel, A. L. Loss Odyssey in Medical Image Segmentation. Medical Image Analysis, 71 (2021). The broader comparative survey of segmentation losses, including the compound and tunable families. https://doi.org/10.1016/j.media.2021.102035
