The number we kept staring at was not the average. Averaged across examples, the multiclass segmenter looked fine, and averages are how you convince yourself a model works when it does not. What told the real story was the spread. On the cleanest examples the digitised curve tracked the ground truth almost exactly, and on the hard ones it fell apart, and the gap between those two was not noise. It was a pattern, and the pattern pointed at the loss function. We were training with Dice, and Dice, for all its virtues on balanced masks, was quietly teaching the model to abandon the very pixels the whole product exists to recover.
The failure had a shape, and the shape was omission
Start with what the model is actually asked to do. A raster well log is mostly white paper. The signal is a curve, or on the multiclass task two curves, each often a single pixel wide, winding down a page that is overwhelmingly background. We had three classes to predict per pixel: background, curve 1, and curve 2. Background is easy because it is everywhere. The curves are hard because they are scarce, thin, and on the difficult scans faint or crossing.
Dice loss scores the overlap between prediction and truth, and it treats a false positive and a false negative as the same kind of mistake [2]. On a balanced mask that symmetry is a feature. On ours it was a trap. When the foreground is a thin curve against a sea of background, any pixel with weak evidence is far more likely to be paper than curve, so under a symmetric penalty the safe bet on an ambiguous pixel is to leave it blank. The cheapest way for a Dice-trained model to lower its expected loss on a hard example is therefore to predict a little less curve. Leave the faint stretch blank, skip the pixel where two curves nearly touch, and the symmetric penalty charges no more for that miss than it would for a stray mark. The result was a segmenter that drew confident curves where the signal was strong and gave up where it was weak, which is the exact opposite of what a digitiser is for. Nobody needs help reading the clear part of the log.
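The symmetry is easy to check numerically. The sketch below is a minimal NumPy illustration, not our training code: `dice_score` is a hypothetical helper and the toy one-dimensional "page" is made up for the example. It shows that a prediction that drops ten curve pixels and one that invents ten receive the same score, because Dice weighs the two kinds of error equally.

```python
import numpy as np

def dice_score(pred, truth):
    """Hard Dice overlap between two boolean masks: 2*TP / (2*TP + FP + FN)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # curve pixels correctly drawn
    fp = np.sum(pred & ~truth)   # curve drawn where there is none
    fn = np.sum(~pred & truth)   # curve that was missed
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Toy 1-D "page" (illustrative only): 1000 pixels, 100 of them on a curve.
long_curve = np.zeros(1000, dtype=bool)
long_curve[100:200] = True
short_curve = long_curve.copy()
short_curve[190:200] = False  # same curve with a 10-pixel stretch removed

# Omission: truth is the long curve, the prediction drops 10 pixels (FN = 10).
dice_miss = dice_score(pred=short_curve, truth=long_curve)
# Commission: truth is the short curve, the prediction adds 10 pixels (FP = 10).
dice_false_alarm = dice_score(pred=long_curve, truth=short_curve)

# Both cases have TP = 90 and 10 errors, so the two scores are identical.
print(dice_miss, dice_false_alarm)
```

Swapping the roles of prediction and truth swaps false positives for false negatives, and the score does not move. That is the sense in which Dice cannot be told which error we mind more.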
We had already met this imbalance in the binary phase, where a weighted binary cross-entropy with a class weight of 42 pushed recall up to 0.97 but left F1 stranded in the 0.3 range because precision collapsed. That told us the imbalance was real and that brute-force reweighting was not the answer. Cross-entropy weighting scales the whole class; it does not let you say which direction of error you are willing to tolerate. We wanted a knob that penalised the missed curve specifically, not the class as a whole.
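To see how severe that precision collapse has to be, the F1 definition can be inverted. The arithmetic below is a back-of-envelope sketch: the recall of 0.97 is the reported figure, but the F1 of 0.30 is an assumed representative value for "the 0.3 range", and `precision_from_f1` is a hypothetical helper.

```python
def precision_from_f1(f1, recall):
    """Invert F1 = 2*P*R / (P + R) to solve for precision P."""
    return f1 * recall / (2.0 * recall - f1)

recall = 0.97       # reported recall of the weighted-BCE binary run
assumed_f1 = 0.30   # ASSUMPTION: a stand-in for "the 0.3 range"

precision = precision_from_f1(assumed_f1, recall)

# Sanity check: plugging the result back in recovers the assumed F1.
f1_check = 2.0 * precision * recall / (precision + recall)
print(round(precision, 3), round(f1_check, 3))
```

Under that assumption precision lands a little below 0.18, meaning roughly four in five pixels the model labelled as curve were not curve.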
Why Tversky, and not just heavier Dice
The Tversky loss is the fix that keeps the part of Dice that works and adds the part it lacks. It generalises the Dice overlap by putting two separate weights, alpha on false positives and beta on false negatives, in the denominator [1]. Set alpha and beta both to 0.5 and you get Dice back exactly (set both to 1 and you get the Jaccard, or Tanimoto, index), so Tversky is a strict generalisation of Dice. Tilt beta above alpha and you tell the optimiser that a missed foreground pixel hurts more than a spurious one. That is the entire idea, and it is the right idea for our data, because our foreground is scarce and our failure mode is omission. The formulation comes from Salehi and colleagues, who built it for exactly this problem, small and imbalanced foreground in medical segmentation [1], and it traces back to Tversky's asymmetric account of similarity, where the features one object has and another lacks can weigh differently from the reverse [3]. We did not invent the loss. What we did was recognise that our problem was theirs wearing a different hat, and turn the dial the way scarce thin curves demand.
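A minimal NumPy sketch of the index, following the form in [1], makes the dial concrete. This is an illustration under stated assumptions rather than our training code: `tversky_index` and `tversky_loss` are hypothetical helper names, the toy masks are invented, and alpha = 0.3, beta = 0.7 are example values only, not the settings used in the run. For the three-class task a loss like this would be computed per class on the class probabilities.

```python
import numpy as np

def tversky_index(pred, truth, alpha, beta, eps=1e-7):
    """Soft Tversky index: TP / (TP + alpha*FP + beta*FN).

    pred holds foreground probabilities in [0, 1]; truth is a 0/1 mask.
    alpha weights false positives, beta weights false negatives.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    tp = np.sum(pred * truth)
    fp = np.sum(pred * (1.0 - truth))
    fn = np.sum((1.0 - pred) * truth)
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, truth, alpha, beta):
    return 1.0 - tversky_index(pred, truth, alpha, beta)

# Toy 1-D masks (illustrative only): a 100-pixel curve and the same curve
# with a 10-pixel stretch removed.
long_curve = np.zeros(1000)
long_curve[100:200] = 1.0
short_curve = long_curve.copy()
short_curve[190:200] = 0.0

# alpha = beta = 0.5 reduces to Dice: 90 / (90 + 5) = 180 / 190.
as_dice = tversky_index(short_curve, long_curve, alpha=0.5, beta=0.5)

# Recall-tilted example weights (assumed values, beta > alpha).
alpha, beta = 0.3, 0.7
loss_miss = tversky_loss(short_curve, long_curve, alpha, beta)         # 10 FN
loss_false_alarm = tversky_loss(long_curve, short_curve, alpha, beta)  # 10 FP

print(as_dice, loss_miss, loss_false_alarm)
```

With the tilted weights the ten missed pixels cost more than twice what the ten spurious pixels cost, whereas at alpha = beta = 0.5 the two cases are charged the same.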
Choosing Tversky over simply cranking Dice's class weights harder was a deliberate call. Reweighting a symmetric loss makes every error on the rare class louder, false positives included, so past a point you trade the model's omission problem for a commission problem: it starts hallucinating curve where there is only stain and grid. The binary run had already shown us that cliff. Tversky lets you move only the term you mean to move. You raise the cost of the miss without equally raising the cost of the false alarm, so the model learns to reach for the faint pixel without being rewarded for inventing pixels that are not there.
What actually moved, curve by curve
We ran the comparison as one controlled swap on the multiclass model, holding everything else fixed, and read it on the metric that ships: the error left on the digitised curve, plus goodness-of-fit against the ground-truth trace. Tversky was one of five losses we evaluated in that sweep, alongside Dice, Focal, Lovász, and soft cross-entropy, and it is the one this piece is about because it is the one that recovered the hard examples.
The headline sits on curve 1. On the cleanest example the Tversky model reached R-squared 0.9891, a curve that lies almost on top of the truth. On a genuinely harder example the same curve-1 recovery held at R-squared 0.8126, which is not the peak but is a real, usable fit on a scan that Dice had been leaving half-drawn. Averaged across examples, curve-1 mean absolute error fell from 0.0367 under Dice to 0.0277 under Tversky, and mean squared error fell from 0.0091 to 0.0021. That MSE drop is the tell. MSE punishes the large, isolated miss far more than the small, diffuse one, so cutting it by more than a factor of four is the quantitative signature of exactly the thing we were chasing: the model stopped dropping whole stretches of curve on the hard pages.
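The reasoning about MSE can be checked with a toy case. The sketch below uses invented error profiles, not our evaluation data: two traces with the same mean absolute error, one spread thinly along the whole curve and one concentrated in a short dropped stretch. The last lines recompute the ratio of the two reported curve-1 MSE figures.

```python
import numpy as np

def mae(err):
    return float(np.mean(np.abs(err)))

def mse(err):
    return float(np.mean(np.square(err)))

n = 100  # number of sampled depths on a hypothetical digitised curve

# Diffuse error: the trace is off by a small amount everywhere.
diffuse = np.full(n, 0.02)

# Isolated error: the trace is exact except for a short dropped stretch.
isolated = np.zeros(n)
isolated[:4] = 0.5

print(mae(diffuse), mae(isolated))  # the two MAE values match
print(mse(diffuse), mse(isolated))  # the isolated miss has far higher MSE

# Ratio of the reported curve-1 MSE under Dice to that under Tversky.
reported_ratio = 0.0091 / 0.0021
print(round(reported_ratio, 2))
```

In this toy case the two profiles are indistinguishable by MAE, yet the concentrated miss has an MSE about 25 times larger, which is why a large MSE drop alongside a modest MAE drop points at fewer big, localised misses.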
Curve 2 is where the honesty of the trade shows. It is the harder of the two classes, more often the fainter or the more frequently crossed, and Tversky did not make it easy. On the harder example curve-2 goodness-of-fit reached only R-squared 0.5461, and curve-2 mean absolute error under Tversky was 0.1241, higher than Dice's 0.0774 on that class. We could have hidden that by reporting only the average, and we are choosing not to. The recall tilt spends part of its budget, and it spends it on curve 2. Pushing the model to never miss curve 1 makes it slightly more willing to commit on curve 2 in places it should not, and that shows up as more curve-2 error. For this product that is the correct trade. Curve 1 is the primary track the operator needs recovered first, and a curve-1 fit of 0.99 on the clean examples with a real fit on the hard ones was worth accepting a rougher curve 2 that a human reviewer can clean in the loop.
The precision-recall trade, stated plainly
Every one of these numbers is one operating point on a single trade. Dice sits at the balanced point where a miss and a false alarm cost the same. Tversky lets us walk toward the recall-weighted end, where the miss costs more, and we walked toward it on purpose because the value in a well-log digitiser is asymmetric. A curve the model failed to draw is a hole in the deliverable that someone has to notice is missing, which is the hardest kind of error to catch. A curve the model drew slightly too eagerly is a mark a reviewer can see and erase. Given that asymmetry, tilting the loss toward recall is not a hack to inflate a metric. It is aligning what the model minimises with what the deliverable is worth, and the curve-1 recovery on the hard example, from a trace Dice had been leaving half-drawn to an R-squared of 0.8126, is what that alignment bought.
Limitations
These figures are per-example and per-curve, not a benchmark. The curve-1 R-squared of 0.9891 is the cleanest example; the 0.8126 is a harder one, and the difficulty axis in the exhibit is a reading aid, not a calibrated scale of example hardness. The Dice-era height the exhibit ghosts for the hard curve-1 point is illustrative, because we archived Dice as mean metrics across examples rather than a matched per-example R-squared for that same case; the sourced facts are the mean errors under each loss and the Tversky per-example R-squared values. The mean errors summarise a specific evaluation set of synthetic multiclass logs and will not transfer unchanged to a different distribution of scans. Curve 2 remaining hard, at R-squared 0.5461 with higher mean absolute error under Tversky than Dice, is a real cost of the recall tilt and not an artefact we tuned away. And the win is a loss-function win only: it says nothing about failure modes upstream of segmentation, such as page reassembly or depth calibration, that a good curve on a mis-assembled image would still get wrong.
References
[1] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. (2017). Tversky loss function for image segmentation using 3D fully convolutional deep networks. MLMI Workshop, MICCAI. https://arxiv.org/abs/1706.05721
[2] Milletari, F., Navab, N., and Ahmadi, S.-A. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV. https://arxiv.org/abs/1606.04797
[3] Tversky, A. (1977). Features of Similarity. Psychological Review, 84(4), 327-352. https://psycnet.apa.org/record/1978-09287-001