Most of the loss-function conversation in segmentation is a fight about the denominator. Dice, Tversky, and their relatives all score a prediction by some flavour of overlap and argue over how to weight the two ways a mask can be wrong. But two of the five candidates we ran through VeerNet, the encoder-decoder EarthScan built to digitise raster well logs, are not in that fight at all. Focal keeps cross-entropy and reshapes its gradient. Lovasz-Softmax abandons per-pixel scoring and goes straight for the intersection-over-union number itself. Both are well-known, both predate this work by years, and neither was covered in our earlier head-to-head between Tversky and Dice. This is the piece that closes that gap, and it is worth writing because the two losses fail and succeed for reasons that have nothing to do with the overlap-versus-overlap argument.
The setting that makes the comparison sharp is the same one that makes the whole task hard. The foreground is a curve one to three pixels wide on a scanned page, the rest of the frame is background, and the peak intersection-over-union we reached on the multiclass set was a sober 0.51. When the best achievable overlap is that low, the choice of objective is not a tie-breaker between near-perfect models. It is the thing deciding whether the optimiser spends its gradient where the signal is.
Focal keeps cross-entropy but redistributes its attention
Focal loss is the gentlest departure from the per-pixel baseline. It takes ordinary cross-entropy and multiplies each pixel's contribution by a modulating factor, one minus the predicted probability of the correct class, raised to a focusing power gamma [1]. The arithmetic is almost trivial and the consequence is not. A pixel the model is already confident and correct about, a background pixel far from any ink, gets its loss multiplied by a number very close to zero. A pixel the model is unsure about, the marginal ink at the edge of a faint trace, keeps almost all of its weight. The gradient is quietly re-pointed away from the millions of easy pixels and toward the thin band where the decision is genuinely in doubt.
The Focal modulating factor is (1 minus p) raised to the power gamma, where p is the model's predicted probability for the correct class. As p climbs toward 1 on an easy, correctly-classified pixel the factor collapses toward 0, so that pixel contributes almost nothing to the gradient; gamma controls how aggressively the easy pixels are suppressed. The factor was designed for dense object detection, where the background-to-foreground ratio is brutal, and a one-pixel curve on a full page is exactly that kind of imbalance. So Focal should be a natural fit, and on the segmentation metrics it behaves like one. The catch is that down-weighting the easy pixels does nothing to express a preference between the two error types on the hard ones. Focal will happily concentrate on the ambiguous band and still treat a missed curve pixel and a stray background pixel inside that band as equally costly. It sharpens where the model looks without saying what the model should prefer when it looks there.
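To make the arithmetic concrete, here is a minimal NumPy sketch of the per-pixel Focal term. The function name `focal_loss`, the toy probabilities, and the choice of gamma = 2 are illustrative assumptions; the gamma used in the VeerNet runs is not stated here, and the optional class-balancing weight from [1] is left out.

```python
import numpy as np

def focal_loss(p_correct, gamma=2.0, eps=1e-12):
    """Per-pixel focal loss.

    p_correct: predicted probability of the correct class for each pixel.
    gamma:     focusing power (2.0 is an assumed value for illustration).
    """
    p = np.clip(np.asarray(p_correct, dtype=float), eps, 1.0)
    modulating = (1.0 - p) ** gamma   # near 0 for easy, correct pixels
    cross_entropy = -np.log(p)        # the ordinary per-pixel baseline
    return modulating * cross_entropy

# One easy background pixel and one uncertain pixel at the edge of a trace.
p = np.array([0.99, 0.6])
ce = -np.log(p)
fl = focal_loss(p, gamma=2.0)

# Fraction of its cross-entropy weight each pixel keeps under Focal.
ratio = fl / ce
print(ratio)
```

The easy pixel keeps about one ten-thousandth of its cross-entropy weight while the uncertain pixel keeps 16 percent, and setting gamma to 0 recovers plain cross-entropy. Nothing in the factor distinguishes a missed curve pixel from a stray background pixel at the same confidence.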
Lovasz optimises the score you actually report
Lovasz-Softmax starts from a different complaint. Intersection-over-union is the number everyone reports, but it is not differentiable, so the field has historically trained on a per-pixel proxy and hoped the proxy and the metric move together. They do not always. The Lovasz approach is to build a tractable, piecewise-linear surrogate of the IoU loss using the Lovasz extension of a submodular set function, which gives a convex objective, differentiable almost everywhere, whose gradient is genuinely aligned with the overlap measure you are graded on [2]. The idea that one should optimise IoU directly rather than through a per-pixel proxy had been circulating for a couple of years by then, both as a relaxation of the IoU itself [3] and as a detection objective [4]; Lovasz is the version that made it practical for multiclass segmentation.
The mechanical difference from Focal is the interesting part. Focal decides a pixel's weight from that pixel's own confidence in isolation. Lovasz decides a pixel's contribution from its rank in the sorted order of prediction errors across the whole image, because the IoU surrogate is a function of the error vector as a set, not pixel by pixel. A confident, correct pixel does not get suppressed the way Focal suppresses it; it still carries whatever charge its rank in the IoU error implies. The two losses can therefore disagree most exactly where you might expect them to agree, on the pixels the model is already sure about.
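The rank-based weighting can be sketched in a few lines of NumPy, following the construction described in [2] for a single foreground class. The function names, the four-pixel toy image, and the use of absolute error between label and foreground probability are assumptions for illustration, not the VeerNet training code.

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Weights from the Lovasz extension of the Jaccard loss.

    gt_sorted: binary ground truth, ordered by decreasing prediction error.
    """
    gt_sorted = np.asarray(gt_sorted, dtype=float)
    total_fg = gt_sorted.sum()
    intersection = total_fg - np.cumsum(gt_sorted)
    union = total_fg + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    # Each weight is the increase in Jaccard loss from adding that pixel's error.
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax_one_class(prob_fg, labels):
    """Lovasz surrogate for one class: errors sorted, then weighted by rank."""
    prob_fg = np.asarray(prob_fg, dtype=float).ravel()
    fg = np.asarray(labels, dtype=float).ravel()
    errors = np.abs(fg - prob_fg)
    order = np.argsort(-errors, kind="stable")   # largest error first
    weights = lovasz_grad(fg[order])
    return float(np.dot(errors[order], weights)), order, weights

# Toy image: two curve pixels followed by two background pixels.
labels = [1, 1, 0, 0]
prob_fg = [0.9, 0.4, 0.3, 0.05]
loss, order, weights = lovasz_softmax_one_class(prob_fg, labels)
print(order, weights, loss)

# With hard 0/1 predictions the surrogate equals 1 - IoU (here IoU = 1/2).
hard_loss, _, _ = lovasz_softmax_one_class([1, 0, 0, 0], labels)
print(hard_loss)
```

In the toy case the missed curve pixel (index 1) is ranked first and takes half of the total weight, while the confidently correct background pixel is ranked last and takes none. The weight comes from the ordering and the label pattern, not from any single pixel's confidence.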
Why an overlap loss and a surrogate loss are not interchangeable
It is tempting to file Lovasz next to Dice and call them both overlap losses, but the instrument above is meant to show why that filing is wrong. Dice computes a single overlap ratio and back-propagates through it; its gradient on any given pixel is smeared through that one global ratio. Lovasz constructs the gradient from the sorted margin, so a pixel that sits at the boundary between getting counted as intersection or not receives a much larger, sharper push than a pixel deep inside a confidently-correct region. On a thick, blob-like foreground that distinction is academic. On a one-pixel curve, where almost every foreground pixel is a boundary pixel, it changes which pixels move the loss at all. The surrogate concentrates its force on precisely the pixels whose flip would change the IoU, which on a thin structure is nearly all of them.
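The "one global ratio" point about Dice can be checked directly. The sketch below assumes a soft Dice loss of the form 1 - 2*sum(p*g) / (sum(p) + sum(g)) with no smoothing term; the exact Dice formulation used in the benchmark is not specified here, and variants with squared terms in the denominator behave differently. The function name and toy values are hypothetical.

```python
import numpy as np

def soft_dice_grad(prob_fg, labels):
    """Analytic gradient of 1 - 2*sum(p*g)/(sum(p)+sum(g)) with respect to p."""
    p = np.asarray(prob_fg, dtype=float)
    g = np.asarray(labels, dtype=float)
    inter = np.sum(p * g)
    denom = np.sum(p) + np.sum(g)
    # d(inter)/dp_i = g_i and d(denom)/dp_i = 1, so only the label g_i
    # distinguishes one pixel's gradient from another's.
    return -(2.0 * g * denom - 2.0 * inter) / denom ** 2

labels = np.array([1, 1, 0, 0])
prob_fg = np.array([0.9, 0.4, 0.3, 0.05])
grad = soft_dice_grad(prob_fg, labels)
print(grad)
```

Under this form every foreground pixel receives the same gradient and every background pixel receives the same gradient, however confident or doubtful the model is about each one. A rank-based surrogate, by contrast, assigns different weights to pixels of the same class depending on where their errors fall in the sorted order.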
That is also why neither loss is a free lunch on this task. Focal's modulating factor is computed against the label, so it trusts the label completely: on a noisy scan, a pixel where the model confidently agrees with a wrong annotation is suppressed as easy and never revisited, while a smudge the model cannot fit stays hard and keeps drawing gradient it does not deserve. Lovasz's direct IoU pressure is only as good as the IoU it is chasing, and when the ceiling is 0.51 the surrogate is optimising toward a target that is itself modest, so it sharpens the model against a metric that cannot reward it much. The honest read is that both are improvements on a naive per-pixel objective for different reasons, and that the reasons do not compose into a clear winner.
Reading them on the curve, not the mask
The number a petrophysicist cares about is not IoU. It is how closely the digitised curve tracks the real log, which we measure as the mean absolute error on the recovered signal. Of the two losses in this piece, only Focal had its regression-stage error logged through to that final curve in our run: an MAE of 0.0405 on curve-1 and 0.1027 on curve-2. Set against the Dice baseline of 0.0367 and 0.0774 on the same two curves, Focal is behind on both, by 0.0038 on curve-1 and 0.0253 on curve-2, which is the quiet lesson of the comparison. Concentrating the gradient on the hard pixels helped the segmentation without translating into a lower error on the deliverable, because the deliverable rewards continuity of the trace more than it rewards correctness on individually hard pixels. Lovasz was evaluated in the same sweep, but its per-curve MAE was not logged in this run, so the instrument shows it as evaluated rather than inventing a figure. We would rather leave the cell honest than fill it.
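For reference, the deliverable metric and the gaps quoted above can be reproduced with a short sketch. It assumes the recovered and reference curves are arrays sampled on the same depth grid and already on the same scale, which the text does not spell out; `mean_absolute_error` and the three-sample toy curve are hypothetical, while the four MAE figures are the ones reported in this run.

```python
import numpy as np

def mean_absolute_error(recovered, reference):
    """Mean absolute difference between a digitised curve and the real log."""
    recovered = np.asarray(recovered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(recovered - reference)))

# Toy curve, purely illustrative.
toy_mae = mean_absolute_error([0.1, 0.5, 0.9], [0.0, 0.5, 1.0])

# MAE figures reported in this run.
focal = {"curve-1": 0.0405, "curve-2": 0.1027}
dice = {"curve-1": 0.0367, "curve-2": 0.0774}

absolute_gap = {k: focal[k] - dice[k] for k in focal}
relative_gap = {k: absolute_gap[k] / dice[k] for k in focal}
print(absolute_gap, relative_gap)
```

Read relative to the Dice baseline, the gap is roughly a tenth on curve-1 and roughly a third on curve-2, so the two curves do not tell quite the same story about how far behind Focal lands.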
The choice this leaves you with
Set side by side, Focal and Lovasz make plain that a loss can refuse to be ordinary cross-entropy along more than one axis. Focal moves along the difficulty axis, spending its gradient on the pixels still in doubt. Lovasz moves along the metric-alignment axis, spending its gradient on the pixels that would change the score you report. Those are orthogonal moves, and on a thin-foreground digitisation task with a low IoU ceiling, each buys you something real and neither buys you everything. The practical advice is unglamorous: do not assume the loss that wins the segmentation leaderboard wins the metric your downstream stage is judged on, and run the candidate you are tempted by all the way through to the deliverable before you commit to it.
Key takeaways
- Focal and Lovasz are the two of VeerNet's five benchmarked losses that depart from per-pixel cross-entropy, and they were not covered in the earlier Tversky-versus-Dice comparison; they fail and succeed for reasons unrelated to the overlap-denominator argument.
- Focal keeps cross-entropy and multiplies each pixel by (1 minus p) to the gamma, collapsing the gradient on easy, already-correct pixels toward zero and concentrating it on the ambiguous band; it sharpens where the model looks but says nothing about which error type to prefer once it looks there.
- Lovasz-Softmax abandons per-pixel scoring and optimises a tractable surrogate of intersection-over-union built from the sorted-margin order, so a pixel's contribution comes from its rank in the IoU error rather than its own confidence; on a one-pixel curve where almost every foreground pixel is a boundary pixel, that changes which pixels move the loss.
- On the metric that ships, the mean absolute error on the recovered curve, Focal measured 0.0405 (curve-1) and 0.1027 (curve-2), slightly behind the Dice reference of 0.0367 and 0.0774, because the deliverable rewards trace continuity more than correctness on individually hard pixels. Lovasz regression-stage MAE was not logged this run and is shown as evaluated, not guessed.
- With a peak multiclass IoU of only 0.51, the objective decides where the gradient is spent rather than separating near-perfect models. Focal and Lovasz move along orthogonal axes (difficulty versus metric-alignment); each buys something real and neither is a free lunch, so run the loss through to the deliverable before committing.
References
[1] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal Loss for Dense Object Detection. ICCV (2017). The modulating factor that down-weights easy examples under heavy foreground-background imbalance. https://arxiv.org/abs/1708.02002
[2] Berman, M., Triki, A. R., and Blaschko, M. B. The Lovasz-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. CVPR (2018). Direct optimisation of IoU through a convex, piecewise-linear surrogate built from the Lovasz extension. https://arxiv.org/abs/1705.08790
[3] Rahman, M. A., and Wang, Y. Optimizing Intersection-over-Union in Deep Neural Networks for Image Segmentation. ISVC (2016). An early differentiable relaxation of the IoU for training segmentation networks. https://link.springer.com/chapter/10.1007/978-3-319-50835-1_22
[4] Yu, J., Jiang, Y., Wang, Z., Cao, Z., and Huang, T. UnitBox: An Advanced Object Detection Network. ACM Multimedia (2016). The IoU loss as a detection objective, an early argument for optimising the reported metric directly. https://arxiv.org/abs/1608.01471
[5] Jadon, S. A survey of loss functions for semantic segmentation. IEEE CIBCB (2020). The catalogue of loss families this comparison draws from. https://arxiv.org/abs/2006.14822