The Right Way to Report a Segmentation Ablation

Swapping the loss function on a segmentation head is one of the cleanest experiments a computer-vision team runs, and one of the easiest to report dishonestly without meaning to. The setup is disciplined by nature: hold the architecture, the data, and the optimiser fixed, change exactly one thing, and read off what moved. That is the textbook shape of a controlled comparison. What wrecks it is not the experiment but the write-up, and the wreck is usually quiet. A team tries five losses, one posts a beautiful number on the one metric it happens to care about that week, and the report becomes a single figure with the other four arms and the other three metrics left off the page. The experiment was a controlled comparison. The report is a headline. This note is about the gap between those two, using the loss ablation behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned well logs, as the worked case.

We should be precise about what this is not. It is not a retelling of the VeerNet result, which lives in its own whitepaper and stands on its own numbers. It is a note about the protocol that made those numbers safe to publish. None of the discipline below is ours to claim. The case that ablation and controlled comparison, rather than leaderboard rank, are what carry an empirical claim is Sculley and colleagues' [1], the catalogue of reporting habits that inflate a result past its evidence is Lipton and Steinhardt's [2], and the demonstration that method gaps evaporate once every arm shares a tuning budget is Melis, Dyer, and Blunsom's [3]. What we add is one honest grid on one oddly shaped subsurface dataset, and an argument for why the grid, not the winner, is the deliverable.

The one thing that must stay fixed is the budget

A loss ablation is only a controlled comparison if the loss is the only thing that changed. The trap is that a loss function interacts with training time. Some losses converge fast and then overfit; others need more epochs to reach their best held-out point. If you let each arm run until it personally looks best, you are no longer comparing losses, you are comparing loss-plus-schedule, and the schedule is doing work you are not reporting. Melis and colleagues made the sharp version of this point in a different domain: much of the apparent difference between methods disappears once every method is handed the identical budget, because the reported gap was partly a tuning gap in disguise [3]. The fix is dull and non-negotiable. Every arm gets the same budget, decided before the runs, and the number you report is the number at that budget, not the number at each arm's private best.

For this ablation the budget was fifty epochs on an 80/20 split, applied identically to all five losses. Fifty epochs is not a magic figure; it is the ceiling the engagement already used, and the only thing that matters is that it was the same ceiling for every arm. Dice did not get sixty because it was still climbing; Tversky did not get forty because it peaked early. Holding the budget fixed is what lets the final table be read as a loss comparison at all, and it is the first thing a skeptical reader should check before believing any cell in it.
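A minimal sketch of how a fixed budget can be enforced in code rather than by convention. It assumes a hypothetical training harness: `Budget`, `run_arm`, `run_ablation`, and the `train_one_epoch` and `evaluate` callables are illustrative names, not the engagement's actual code. The only values taken from this note are the fifty epochs and the 80/20 split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Training budget, fixed before any run and shared by every arm."""
    epochs: int
    train_fraction: float


# Values from this note: fifty epochs on an 80/20 split.
BUDGET = Budget(epochs=50, train_fraction=0.8)


def run_arm(loss_name, budget, train_one_epoch, evaluate):
    """Train one arm for exactly budget.epochs and score it at the end.

    There is deliberately no early stopping and no best-epoch selection:
    the reported score is the score at the budget, not at the arm's
    private best.
    """
    for epoch in range(budget.epochs):
        train_one_epoch(loss_name, epoch)
    return {"loss": loss_name, "epochs": budget.epochs, "score": evaluate(loss_name)}


def run_ablation(loss_names, budget, train_one_epoch, evaluate):
    """Run every arm under the same immutable budget object."""
    return [run_arm(name, budget, train_one_epoch, evaluate) for name in loss_names]


if __name__ == "__main__":
    # Stub callables stand in for a real training loop (assumption).
    calls = []
    results = run_ablation(
        ["dice", "tversky"],
        BUDGET,
        train_one_epoch=lambda name, epoch: calls.append((name, epoch)),
        evaluate=lambda name: 0.0,
    )
    print(len(calls), [r["epochs"] for r in results])
```

Making the budget a frozen object passed into every arm means a per-arm exception has to be written down in code, where a reviewer can see it, instead of happening silently in a notebook.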

Every metric on the page, because the metrics disagree

The second rule is that you report every metric you measured, next to every loss, in one place. This sounds like a courtesy and is actually the whole argument, because segmentation losses are optimised against different notions of correct and the metrics inherit that disagreement. Jadon's survey is blunt about it: across the Dice, Focal, Lovasz, cross-entropy and Tversky families, no single loss dominates every task, which is exactly why you run the ablation and exactly why a one-metric report is a lie of omission [5]. Tversky in particular exists to buy a tunable precision-recall trade-off on imbalanced masks [4], so it is designed to win some columns and not others. A report that shows you only the column it won has not shown you the loss; it has shown you the column.
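To make the trade-off concrete, here is a small pure-Python sketch of the Tversky index as defined by Salehi and colleagues [4], where `alpha` weights false positives and `beta` weights false negatives. The function names, the flat-list inputs, and the `eps` smoothing term are assumptions for illustration; this is not the loss implementation used in the ablation.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index for flat sequences.

    pred:   predicted foreground probabilities in [0, 1]
    target: ground-truth labels, 0 or 1
    alpha:  penalty weight on false positives
    beta:   penalty weight on false negatives
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: one minus the index, so a perfect overlap gives zero."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


if __name__ == "__main__":
    # Toy mask with one true positive, one false positive, two false negatives.
    pred = [1, 1, 0, 0]
    target = [1, 0, 1, 1]
    print(tversky_index(pred, target))                        # alpha = beta = 0.5 is the Dice coefficient
    print(tversky_index(pred, target, alpha=0.3, beta=0.7))   # misses penalised harder
    print(tversky_index(pred, target, alpha=0.7, beta=0.3))   # false alarms penalised harder
```

With `alpha = beta = 0.5` the index reduces to the Dice coefficient, so Dice is one point on the Tversky family. Moving the weights changes which kind of error the optimiser is pushed to avoid, and therefore which metric column the trained model is likely to favour.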

Our grid makes the disagreement concrete. On the curve1 regression read-out, Tversky posts the lowest mean absolute error at 0.0277, beating Dice at 0.0367 and Focal at 0.0405, and the lowest mean squared error at 0.0021 against Dice at 0.0091. If curve fidelity were the only thing on the page, Tversky wins and the story ends. But the mask-quality metrics tell a different and incomplete story: the only run with an archived IoU and F1 on the curve1 mask is the Dice run, at 0.26 and 0.37. That gap is not a defeat for Tversky and it is not a win for Dice. It is the honest state of the archive, and the discipline is to show it rather than paper over it.

Exhibit: a loss-function ablation reported the honest way, with every evaluated loss as a row, every metric as a column, and every cell trained on the identical 50-epoch, 80/20 budget. The orange marker is the only element that argues, and it moves between columns: Tversky posts the lowest curve1 MAE (0.0277) and MSE (0.0021), but the only run with an archived mask IoU and F1 is Dice (0.26 and 0.37), so no single loss can be crowned across curve error and mask quality at once. The exhibit's Cherry-picked view shows the failure mode: one loss reported on the one metric that flatters it, with the other four losses and the mask columns left off the page. Cells with no archived number stay blank rather than being imputed, because a gap is information too. Every plotted figure (the five losses, the curve1 MAE and MSE values, the Dice mask IoU and F1, the 50-epoch budget, the 80/20 split) is sourced from the engagement archive; nothing here is illustrative.

A blank cell is a finding, not an embarrassment

The instrument above does something that looks like an admission and is really the point: where a number was never logged, the cell stays blank and says so. Two of the five losses have no archived curve-error entry, and four of the five have no archived mask IoU or F1. The tempting move is to backfill those cells, to rerun quietly until the table is full and rectangular, or worse to impute a plausible value so the grid looks complete. Both are the reporting failure Lipton and Steinhardt describe, where the presentation implies more evidence than the experiment produced [2]. A blank cell is information. It tells the reader precisely which comparisons the study actually licenses and which it does not, and it stops anyone downstream from citing a mask score for Tversky that no run ever measured.
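One way to keep blanks honest in practice is to store a missing measurement as an absent key and render it as an empty cell, never as zero or a filled-in guess. The sketch below does that for the figures quoted in this note. The `GRID` layout and `render` helper are assumptions for illustration; the two losses this note does not name are given placeholder labels, and any value not quoted in the text is left out rather than guessed.

```python
METRICS = ["curve1 MAE", "curve1 MSE", "mask IoU", "mask F1"]

# Only figures quoted in this note appear here. A missing key means
# "not quoted or not archived", and is never replaced by a default.
GRID = {
    "Tversky": {"curve1 MAE": 0.0277, "curve1 MSE": 0.0021},
    "Dice": {"curve1 MAE": 0.0367, "curve1 MSE": 0.0091,
             "mask IoU": 0.26, "mask F1": 0.37},
    "Focal": {"curve1 MAE": 0.0405},  # MSE is not quoted in the text
    "(loss 4, unnamed here)": {},     # placeholder label, not a real name
    "(loss 5, unnamed here)": {},     # placeholder label, not a real name
}


def render(grid, metrics):
    """Render the grid as text, leaving unlogged cells visibly empty."""
    width = max(len(name) for name in grid)
    header = ["loss".ljust(width)] + [m.rjust(10) for m in metrics]
    lines = [" | ".join(header)]
    for name, cells in grid.items():
        row = [name.ljust(width)]
        for metric in metrics:
            value = cells.get(metric)
            # Blank, not 0.0 and not "n/a imputed": the gap stays a gap.
            text = "" if value is None else format(value, "g")
            row.append(text.rjust(10))
        lines.append(" | ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(GRID, METRICS))
```

Using an absent key rather than a sentinel number matters downstream: any code that averages or ranks the column has to handle the gap explicitly, so a missing measurement cannot be mistaken for a bad one.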

This is why the honest matrix and the cherry-picked view in the exhibit are not two aesthetics of the same data. They are two different claims. The full grid claims exactly what the runs support: Tversky leads on curve error at a fixed budget, the mask metrics were only logged for Dice, and no loss can be crowned across both families of metric at once. The single-number view claims something the runs never established, that one loss is simply best, by hiding every cell that would complicate it. The difference between them is the difference between an ablation and an advertisement.

What the grid actually licenses

Read correctly, the table supports a narrow and useful conclusion and refuses a broad and flattering one. It licenses the statement that, at fifty epochs on an 80/20 split, Tversky produced the most accurate curve1 reconstruction among the losses with a logged curve error: an MAE of 0.0277 against 0.0367 for Dice, the next best, and an MSE of 0.0021 against 0.0091. It does not license the statement that Tversky is the best loss for this task, because the mask-quality evidence needed to make that call was only ever recorded for Dice. Those are different sentences, and the reporting discipline forces the writer to say the first and forbids the second. Sculley and colleagues frame this as the difference between pace and progress: it is easy to produce a result that moves fast and means little, and the correction is the unglamorous work of controlled comparison reported in full [1].
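The distinction between the licensed and the unlicensed sentence can be checked mechanically. The sketch below ranks a metric only among the losses that have a logged value for it, and declines to name a leader when fewer than two do. The helper names `leader` and `licensed_claim` are hypothetical, the two-arm threshold is an assumption, and the two unnamed losses carry placeholder labels; the numbers are the ones quoted in this note.

```python
# Subset of the grid: figures quoted in this note, gaps left as missing keys.
GRID = {
    "Tversky": {"curve1 MAE": 0.0277},
    "Dice": {"curve1 MAE": 0.0367, "mask IoU": 0.26},
    "Focal": {"curve1 MAE": 0.0405},
    "(loss 4, unnamed here)": {},  # placeholder label
    "(loss 5, unnamed here)": {},  # placeholder label
}


def leader(grid, metric, lower_is_better=True):
    """Return (best loss or None, number of losses with a logged value)."""
    logged = {name: cells[metric] for name, cells in grid.items() if metric in cells}
    if len(logged) < 2:
        # A single logged arm has nothing to be compared against.
        return None, len(logged)
    pick = min if lower_is_better else max
    return pick(logged, key=logged.get), len(logged)


def licensed_claim(grid, metric, lower_is_better=True):
    """Phrase the strongest sentence the logged cells support."""
    name, n = leader(grid, metric, lower_is_better)
    if name is None:
        return f"No ranking on {metric}: logged for {n} of {len(grid)} losses."
    return f"{name} leads on {metric} among the {n} of {len(grid)} losses with a logged value."


if __name__ == "__main__":
    print(licensed_claim(GRID, "curve1 MAE"))
    print(licensed_claim(GRID, "mask IoU", lower_is_better=False))
```

Carrying the coverage count into the sentence itself keeps the scope of the claim attached to the claim, so it survives being copied out of the table and into a slide.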

The habit is portable, which is the reason to bother. Any team running a loss sweep, on masks or anything else, can apply the same three rules: fix the budget before the runs so the comparison is of losses and not schedules, put every metric beside every arm so the reader sees the disagreements you saw, and leave the gaps visibly empty so no one mistakes a missing measurement for a bad one. The rules cost nothing, and they separate a table you can build an engineering decision on from a figure you can only nod at.

Limitations

This is a report about reporting, calibrated on one ablation from one engagement, and it should not be read as a benchmark of loss functions. The curve1 MAE and MSE figures, the Dice-run mask IoU and F1, the count of five evaluated losses, the fifty-epoch budget, and the 80/20 split are the real archive numbers, but the archive is itself incomplete: curve-error was logged for three of the five losses and mask metrics for only one, which is why the grid has honest blanks rather than a full rectangle. Those blanks limit what can be concluded, and deliberately so, but they also mean this study cannot rank all five losses against each other on any single metric, let alone across metrics. The right budget, the right metric to prioritise, and the right loss are all properties of a specific dataset's class imbalance and a specific team's tolerance for false positives, so our finding that Tversky led on curve error here does not transfer as a constant to a different operator's logs or a different curve count. And a fixed-budget comparison answers only which loss did best in the time allowed; it says nothing about which loss would win with more epochs, which is a separate experiment we did not run.

The report is part of the experiment

The lesson that outlasts this particular grid is that the write-up is not a summary bolted onto the science; it is the last controlled variable. An ablation run cleanly and reported carelessly is indistinguishable, to the reader, from an ablation run carelessly, because the reader only ever sees the report. Fix the budget, show every metric, keep the blanks honest, and the same five runs that could have been spun into a one-line boast instead support a claim narrow enough to be true and specific enough to act on. That is the entire trade, and it is a good one.

References

[1] Sculley, D., Snoek, J., Wiltschko, A., and Rahimi, A. Winner's Curse? On Pace, Progress, and Empirical Rigor. ICLR 2018 Workshop. The case that controlled comparison and honest ablation, not leaderboard rank, are what make an empirical claim mean something. https://openreview.net/forum?id=rJWF0Fywf

[2] Lipton, Z. C., and Steinhardt, J. Troubling Trends in Machine Learning Scholarship. Communications of the ACM 62(6), 2019 (arXiv 2018). The catalogue of reporting habits that let a result look stronger than its evidence, including comparisons that vary more than the one factor under study. https://arxiv.org/abs/1807.03341

[3] Melis, G., Dyer, C., and Blunsom, P. On the State of the Art of Evaluation in Neural Language Models. ICLR 2018. Evidence that method gaps shrink or vanish once every arm shares a tuning budget, the reason a fixed budget belongs in the protocol. https://arxiv.org/abs/1707.05589

[4] Salehi, S. S. M., Erdogmus, D., and Gholipour, A. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI, MICCAI 2017. Introduces the Tversky loss and its tunable precision-recall trade-off on imbalanced masks. https://arxiv.org/abs/1706.05721

[5] Jadon, S. A Survey of Loss Functions for Semantic Segmentation. IEEE CIBCB 2020. The survey of Dice, Focal, Lovasz, cross-entropy and Tversky-family losses, and the observation that no single loss dominates across tasks. https://arxiv.org/abs/2006.14822
