How to Read a Benchmark Without Being Fooled by It

A field note on benchmark literacy, grounded in one of our own runs. A single headline segmentation score can read as excellent or as poor depending only on how you slice it, and the slice is a choice the reader makes, not a property of the model. This post walks through three tests a score has to survive before it means anything: subtract the base rate, so a class that owns the pixels does not score high for free; split the number by class, because the mean drowns the classes you built the model for; and check that held-out really means unseen rather than merely later. We use our multiclass Dice-loss curve-segmentation numbers as the worked example, where a background IoU of 0.94 and F1 of 0.97 sit next to curve masks at 0.26 and 0.21 IoU, and where recall near 0.97 hides precision collapsing on the foreground. The point is not that the model is bad; it is that the aggregate was never the thing to trust.

EarthScan insight

A benchmark score is a compression. It takes a model's behaviour across thousands of pixels or rows and squeezes it down to one number you can put on a slide, and like any compression it throws things away. The trouble starts when the reader forgets that and treats the surviving number as the whole story. We have done this to ourselves, and the fastest way to teach benchmark literacy is to show the exact place where our own numbers could have fooled us. This is a note on reading scores without self-deception, using the curve-segmentation runs behind VeerNet, the encoder-decoder EarthScan uses to lift well-log curves off scanned paper. It is deliberately not the VeerNet build story; it is the reader's-side discipline that any of those scores demands before you are allowed to believe it.

The single fact that organises everything below is this: the same trained model reads as excellent or as poor depending only on how you slice its numbers. Nothing about the weights changes between the two readings. The slice is a decision the reader makes, and most of the ways a benchmark misleads are just a slice chosen, usually by accident, to flatter.

Subtract the base rate before you admire the score

The first test is the oldest one, and it is about what a score gets for free. In our multiclass segmentation task there are three classes: the paper background and the two plotted curves. The background is almost all of every image. A model that labels every pixel "background" and gives up entirely on the curves would still be right about the overwhelming majority of pixels, and any pixel-weighted summary would reward it handsomely for that. So when the background class comes back at an IoU of 0.94 and an F1 of 0.97, the honest reaction is not admiration. It is the question: how much of that did the base rate hand over before the model did anything clever?

Powers made the general version of this case: precision, recall and F-measure all carry biases tied to how common the positive class is, and a headline that ignores the base rate can flatter a model that has mostly learned the prior [1]. On our data the prior is overwhelming. The background score is close to what you would get for correctly guessing "this pixel is blank paper," which is not the skill anyone paid for. The curves are the product. Judge the model on the thing it was built to find, and subtract the credit the easy majority class was always going to earn.
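To make the "for free" part concrete, here is a minimal sketch of the do-nothing baseline: a predictor that calls every pixel background, scored on a toy label map. The 95/3/2 class mix and the names `truth` and `all_background` are assumptions for illustration, not figures from our archive.

```python
def iou(pred, truth, cls):
    """Intersection over union for one class, over flat lists of labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else float("nan")

# Hypothetical image: 1000 pixels, 95% background (class 0), the rest
# split between two curves. The mix is assumed, not measured.
truth = [0] * 950 + [1] * 30 + [2] * 20

# The laziest possible "model": label every pixel as background.
all_background = [0] * len(truth)

pixel_accuracy = sum(p == t for p, t in zip(all_background, truth)) / len(truth)
baseline = {cls: iou(all_background, truth, cls) for cls in (0, 1, 2)}

print(f"pixel accuracy: {pixel_accuracy:.2f}")
for cls, score in baseline.items():
    print(f"class {cls} IoU: {score:.2f}")
```

Under this assumed mix the constant predictor earns a background IoU equal to the background's share of the pixels and exactly zero on both curves. That is the floor a background score should be compared against, rather than zero.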

Split the number by class, because the mean hides the argument

The second test is what turns the base-rate worry into a specific, damning picture, and it is the reason per-class reporting became the segmentation convention in the first place. If you report one mean IoU over the three classes, the background's 0.94 pulls the average up so far that the summary looks respectable, and the two numbers that actually matter disappear into it. Split it, and the model's real skill on the foreground is exposed: curve 1 comes in at IoU 0.26 and F1 0.37, and curve 2 at IoU 0.21 and F1 0.32, under the same Dice-loss run that produced the beautiful background figures.

This is not a subtle gap. The foreground IoU scores are less than a third of the background's, and the foreground F1 scores less than two-fifths. The PASCAL VOC challenge institutionalised mean-over-classes and per-class breakdowns for exactly this reason, so that one dominant class could not silently carry a benchmark [4], and the IoU coefficient it standardised on traces back to Jaccard's overlap measure from botany [2]. The lesson from both is the same: an overlap score is only meaningful against a stated class, and a mean across classes with wildly different base rates is a number that describes no one.
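The arithmetic behind that claim is short enough to run. The sketch below takes the per-class figures quoted above and compares the unweighted mean over all three classes with the mean over the two curves alone. The helper name `summarise` is ours for illustration, and the unweighted mean is one possible aggregate, not a metric we logged.

```python
from statistics import mean

# Per-class scores from the multiclass Dice-loss run quoted above.
iou = {"background": 0.94, "curve 1": 0.26, "curve 2": 0.21}
f1 = {"background": 0.97, "curve 1": 0.37, "curve 2": 0.32}

def summarise(scores):
    """Return (mean over all classes, mean over foreground classes only)."""
    all_classes = mean(scores.values())
    foreground = mean(v for k, v in scores.items() if k != "background")
    return all_classes, foreground

for name, scores in (("IoU", iou), ("F1", f1)):
    all_classes, foreground = summarise(scores)
    print(f"{name}: mean over classes {all_classes:.3f}, "
          f"foreground only {foreground:.3f}")
```

The three-class mean IoU comes out near 0.47, about double the foreground-only mean near 0.235, and nothing but the background term accounts for the difference. A pixel-weighted aggregate would lean even harder toward the background figure.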

[Interactive figure: "One run, two honest readings." Headline stat: 0.34 F1 on the curves that matter; a beautiful aggregate can hide a model that cannot trace a curve. A toggle switches between two slices of the same run: "As reported" (background carries it) and "Per class" (foreground exposed). Grouped bars on a 0.00 to 1.00 axis show IoU and F1 per class: Background 0.94 and 0.97, Curve 1 0.26 and 0.37, Curve 2 0.21 and 0.32. A foreground reference line marks F1 0.34 (this reading: IoU 0.23, F1 0.34; peak across runs: IoU 0.51, F1 0.55). A side checklist, "Before you trust a score," lists: 1. Subtract the base rate (a class that owns the pixels scores high for free); 2. Split it by class (the mean drowns the classes you built it for); 3. Check for leakage (held-out must mean unseen, not merely later). Source note: multiclass Dice loss, 80/20 split; the "as reported" headline is the background score by construction.]
One multiclass Dice-loss run, read two honest ways. The toggle switches between the number a headline would quote, which is the background score because that class owns almost every pixel, and the per-class split that separates background from the two curve masks the model was actually built to trace. Nothing about the model changes between the readings; only the aggregation does. The grouped bars carry the sourced per-class IoU and F1: background at 0.94 and 0.97, against curve 1 at 0.26 and 0.37 and curve 2 at 0.21 and 0.32. The three-test checklist on the left states what a score must survive before you trust it: subtract the base rate, split by class, and check for leakage. The orange element is the only one that argues: the honest foreground reference line that snaps down onto the collapsed curve scores when you switch to the per-class reading, exposing the gap the aggregate hid. The per-class scores, the recall figures, the peak IoU of 0.51 and F1 of 0.55 across runs, and the 80/20 train-validation split are sourced from the engagement archive; the aggregate headline is not a separately logged metric but the background score relabelled as the pixel-weighted number a summary would report.

There is a second-order trap hiding in the same run, and it is worth naming because it is where a careless reader recovers false confidence. Recall on the curve masks looks reassuring: around 0.97 on one and 0.96 on another in the binary framing, and it is tempting to quote that and move on. But recall alone is the model's willingness to shout "curve" wherever a curve might be. Precision is whether it was right when it shouted, and on the curve masks precision is what collapses. A model can score near-perfect recall and still be useless if it drapes curve labels across half the page; the F1 is low precisely because precision fell through the floor while recall stood tall. Quote the recall on its own and you have fooled yourself again, with the model's help.
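Because F1 is the harmonic mean of precision and recall, any two of the three pin down the third. The sketch below solves F1 = 2PR / (P + R) for P. Pairing the binary-framing recall figures with the multiclass F1 figures is our assumption for illustration only: they come from different framings, and this post does not quote the archive's precision values, so read the outputs as an order of magnitude rather than a measurement.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1 * recall / denom

# Illustrative pairings only: recall from the binary framing,
# F1 from the multiclass run. These are not logged precision values.
for label, f1, recall in (("curve 1", 0.37, 0.97), ("curve 2", 0.32, 0.96)):
    p = implied_precision(f1, recall)
    print(f"{label}: recall {recall:.2f}, F1 {f1:.2f} -> precision ~{p:.2f}")
```

With those pairings the implied precision lands around 0.23 and 0.19. If recall is that high and F1 that low, roughly four out of five pixels the model labels as curve are not curve.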

Check that held-out means unseen, not merely later

The third test is the one that survives even a perfect per-class breakdown, because it attacks the split itself. We trained on an 80/20 train-validation partition, which is the standard move, and the standard move is safe only if the validation set is genuinely independent of the training set. When your data is synthetic, generated from the same procedure, the risk is that the validation logs are not a fresh distribution but near-duplicates of training logs produced by the same generator with the same conventions. A held-out score computed on that is measuring memorisation dressed as generalisation.

Kaufman, Rosset and Perlich give the general anatomy: leakage is any information available at scoring time that would not be available in genuine deployment, and it is the most common reason a model that benchmarks well fails in the field [3]. For a raster-log model the deployment distribution is other operators' scans, with their own printing quirks, grid styles, smudges and overlaps. A validation score built only on our own synthetic curves does not test that transfer, however cleanly the split was drawn. "Held out" has to mean unseen in the sense that matters, not merely a slice we set aside from the same barrel. This is the test a benchmark cannot pass on its own; only knowing where the data came from can settle it.
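One partial guard against the near-duplicate problem is to split by generator configuration rather than by image, so that every image rendered from one configuration lands on the same side of the partition. The sketch below is not the split we used; the `config_id` field, the sample records and `split_by_group` are hypothetical names for illustration.

```python
import hashlib

def split_by_group(samples, group_key, val_fraction=0.2):
    """Assign whole groups to train or validation; never split a group."""
    train, val = [], []
    for sample in samples:
        # Hash the group id so the assignment is stable across runs.
        digest = hashlib.sha256(str(sample[group_key]).encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        (val if bucket < val_fraction else train).append(sample)
    return train, val

# Hypothetical synthetic logs: four rendered images per generator config.
samples = [
    {"image": f"log_{c}_{i}.png", "config_id": c}
    for c in range(50)
    for i in range(4)
]
train, val = split_by_group(samples, "config_id")

train_groups = {s["config_id"] for s in train}
val_groups = {s["config_id"] for s in val}
print(len(train), len(val), "shared configs:", train_groups & val_groups)
```

Because groups are assigned whole, the validation share only approximates the requested fraction. And this addresses only duplicates within our own generator: it says nothing about transfer to other operators' scans, which needs data from those scans.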

What the three tests leave you believing

Run our numbers through the three tests and the honest summary is neither "excellent" nor "poor." It is precise: on synthetic multiclass data, the model has learned the background trivially, has real but weak skill on the foreground curves, with peak IoU of 0.51 and peak F1 of 0.55 across runs sitting well below what the background figures might have implied, and its transfer to unseen operator scans is unproven by this validation set. That is a far more useful thing to carry into a decision than a single flattering number, and it is exactly what the headline would have hidden.

None of this is our invention, and that is the point. The base-rate caution is Powers' [1], the overlap coefficient is Jaccard's [2], the per-class reporting convention is PASCAL VOC's [4], and the leakage anatomy is Kaufman and colleagues' [3]. What we contribute is the discipline of turning those on our own results before we quote them, and the willingness to publish the gap between the aggregate and the foreground rather than the aggregate alone. A benchmark is worth reading only after you have subtracted the base rate, split it by class, and asked where the held-out data really came from. Before that, the number is not a measurement. It is a mood.

Limitations

This is a reading discipline illustrated on one engagement's runs, not a general benchmark result. The per-class IoU and F1 figures, the recall values, the peak scores and the 80/20 split are the real archive numbers from the multiclass Dice-loss runs, but the "as reported" headline the instrument contrasts against the per-class split is not a separately logged metric; it is the background score relabelled as the pixel-weighted number a summary would quote, because on this data those are effectively the same thing. Whether they coincide as tidily on a differently balanced dataset is not something these numbers can tell you. The leakage argument is a caution about synthetic-to-real transfer, not a measurement of leakage; we are flagging a risk the validation split cannot rule out, not quantifying one we detected. And the three tests here are necessary, not sufficient: a score that survives all three can still mislead through a mismatched objective, an unrepresentative test set, or a metric that does not match what a petrophysicist would actually complain about. The tests narrow the ways a benchmark can fool you; they do not promise it will not.

References

[1] Powers, D. M. W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies 2(1), 2011, pp. 37-63. The case that precision, recall and F-measure carry base-rate biases, so a headline can flatter a model that has mostly learned the prior. https://arxiv.org/abs/2010.16061

[2] Jaccard, P. The Distribution of the Flora in the Alpine Zone. New Phytologist 11(2), 1912, pp. 37-50. The origin of the intersection-over-union coefficient that IoU segmentation scores are built on. https://nph.onlinelibrary.wiley.com/doi/10.1111/j.1469-8137.1912.tb05611.x

[3] Kaufman, S., Rosset, S., and Perlich, C. Leakage in Data Mining: Formulation, Detection, and Avoidance. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 556-563. The systematic account of how information from outside the training frame contaminates a held-out score. https://dl.acm.org/doi/10.1145/2020408.2020496

[4] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88(2), 2010, pp. 303-338. The benchmark that made per-class scoring and mean-over-classes standard so a single class could not carry the number. https://link.springer.com/article/10.1007/s11263-009-0275-4

© 2026 Copyright. Earthscan