Two Opposite Failure Modes, One Identical Score: Stress-Testing F1 Before Trusting It

Before we wired a single metric into model selection, we tried to break it. Two toy detections from our own error workbook say why: one model predicts 100 sinusoids against 10 real, another predicts 10 against 100 real. Precision alone crowns the first, recall alone crowns the second, and each hands its favourite a misleading 100%. F1, the harmonic mean, collapses both opposite failures to exactly 18.18%. This is a short field note on adversarially probing a domain metric with pathological cases before you let it rank models across validation and blind sets, a habit worth keeping for any custom score.

By Narendra Patwardhan · 7 min read
EarthScan insight

Every custom metric is a small piece of code that will decide which model ships. Before we let one make that call across five model configurations, on both a validation set and a held-out blind set, we tried to fool it. The cleanest attempt lives as two rows on the 'Reference' sheet of our error workbook, put there so that anyone reading the model-selection ledger could see why the headline number is F1 and not something simpler. Those two rows are the subject of this note.

The setup is deliberately extreme. We detect sinusoids in borehole-image logs, where each detection is a planar feature the model claims to have found. Imagine a section with 10 real sinusoids in it. Now imagine two very different bad models turned loose on it.

The two rows that were built to break the metric

The first model is trigger-happy. It predicts 100 sinusoids where 10 are real. Give it credit for coverage: it does catch all 10 true features, so there are 10 true positives and nothing is missed. But 90 of its predictions are spurious. In detection terms that is 10 true positives, 90 false positives, 0 false negatives. A borehole-image interpreter handed this output would spend the afternoon deleting nine wrong picks for every right one.

The second model has the opposite pathology. Flip the section around so there are 100 real sinusoids, and let the model predict only 10. Every one of those 10 is correct, so precision is spotless: 10 true positives, 0 false positives. But it walked past 90 real features. That is 90 false negatives, a model that would quietly leave most of the fractured interval uninterpreted.

These are opposite failure modes. One buries you in false alarms; the other misses most of the section. No sane review would treat them as interchangeable. Yet look at what a single metric does with them.

Where a single metric gets fooled

Precision asks: of the features you predicted, what fraction were real? For the trigger-happy model that is 10 out of 100, or 10 percent. For the conservative model it is 10 out of 10, a perfect 100 percent. So precision alone crowns the model that missed 90 real sinusoids.

Recall asks the mirror question: of the real features, what fraction did you find? For the trigger-happy model that is 10 out of 10, a perfect 100 percent. For the conservative model it is 10 out of 100, back to 10 percent. So recall alone crowns the model that raised 90 false alarms.

Each single-number lens awards one of the two pathological cases full marks. If model selection had leaned on precision alone, it would have preferred the model that finds almost nothing. If it had leaned on recall alone, it would have preferred the model that cries wolf ninety times. Neither is the model you want in front of an interpreter, and neither metric, on its own, is honest about that.
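The two single-metric verdicts can be reproduced from the confusion counts alone. The following is a minimal sketch, not the workbook's own code: the `Counts` class and the function names are illustrative, and returning 0.0 for an empty denominator is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one detection run (detection has no true negatives)."""
    tp: int  # predicted features that match a real one
    fp: int  # predicted features with no real counterpart
    fn: int  # real features the model missed


def precision(c: Counts) -> float:
    # Of the features predicted, what fraction were real?
    predicted = c.tp + c.fp
    return c.tp / predicted if predicted else 0.0  # assumed convention for "nothing predicted"


def recall(c: Counts) -> float:
    # Of the real features, what fraction were found?
    real = c.tp + c.fn
    return c.tp / real if real else 0.0  # assumed convention for "nothing to find"


case_a = Counts(tp=10, fp=90, fn=0)   # trigger-happy: 100 predicted vs 10 real
case_b = Counts(tp=10, fp=0, fn=90)   # conservative: 10 predicted vs 100 real

for name, c in [("A", case_a), ("B", case_b)]:
    print(f"Case {name}: precision={precision(c):.2%} recall={recall(c):.2%}")
# Case A: precision=10.00% recall=100.00%
# Case B: precision=100.00% recall=10.00%
```

Each lens gives one of the two cases a perfect score, which is the blind spot described above.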

[Figure: Adversarial metric probe, two toy cases under three metric lenses (precision only, recall only, F1 as harmonic mean). Case A, 100 predicted vs 10 real: TP 10, FP 90, FN 0, precision 10.00%, recall 100%, F1 18.18%. Case B, 10 predicted vs 100 real: TP 10, FP 0, FN 90, precision 100%, recall 10.00%, F1 18.18%. Axis: score under the chosen lens, 0% to 100%.]
Two adversarial toy detections from the team's error-workbook 'Reference' sheet, run through three metric lenses. Case A predicts 100 sinusoids against 10 real (every real one caught, 90 false alarms), Case B predicts 10 against 100 real (all 10 correct, 90 missed). They are opposite failures, yet both score exactly F1 = 18.18%. Toggle the lens: precision alone hands Case B a misleading 100%, recall alone hands Case A a misleading 100%, and only F1 collapses both to the same honest 18.18% line. The orange element is always the single-metric award that flatters a pathological case; under F1 nothing is orange, which is the reason the team wired F1, not precision or recall alone, into model selection. The F1 = 18.18% figure is sourced from the workbook 'Reference' sheet; the confusion counts (TP = 10 in each, FP = 90 / FN = 90 mirrored) are the illustrative decomposition that reproduces it and the two 100% awards.

What the harmonic mean does instead

F1 is the harmonic mean of precision and recall,

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

and the harmonic mean is unforgiving of imbalance in a way the arithmetic mean is not. Feed it a precision of 100 percent and a recall of 10 percent and it does not settle near the middle. It is dragged down toward the smaller of the two:

$$F_1 = \frac{2 \cdot 1.00 \cdot 0.10}{1.00 + 0.10} = \frac{0.20}{1.10} \approx 0.1818$$

The trigger-happy model has precision 0.10 and recall 1.00; the conservative model has precision 1.00 and recall 0.10. Swap the two numbers and the harmonic mean does not move, because the formula is symmetric in precision and recall. Both land on the same F1: 2/11, or 18.18 percent to two decimals. The arithmetic mean of the same pair would have been 55 percent, a score that still flatters both models. Where precision or recall alone could be fooled into awarding 100 percent, F1 says, correctly, that both models are poor, and poor to the same degree. That is the whole argument for the choice on one sheet: the two rows sit side by side, both reading 18.18 percent, next to the two 100 percent awards that precision and recall would have handed out. (This is a different failure from aggregate scores hiding a weak class, which we treated separately in How to Read a Benchmark Without Being Fooled by It; here the single numbers are not averaged over classes, they are simply gameable on their own.)
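The symmetry is easy to check with exact arithmetic. This sketch uses Python's `fractions` module so that no rounding enters the comparison; the function names are illustrative, and the count form 2·TP / (2·TP + FP + FN) is the standard algebraic equivalent of the harmonic-mean formula.

```python
from fractions import Fraction


def f1_from_counts(tp: int, fp: int, fn: int) -> Fraction:
    """F1 as an exact fraction, using the count form 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return Fraction(2 * tp, denom) if denom else Fraction(0)


def harmonic_mean(p: Fraction, r: Fraction) -> Fraction:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else Fraction(0)


f1_a = f1_from_counts(tp=10, fp=90, fn=0)   # Case A: 100 predicted vs 10 real
f1_b = f1_from_counts(tp=10, fp=0, fn=90)   # Case B: 10 predicted vs 100 real
p, r = Fraction(1, 10), Fraction(1)         # Case A's precision and recall; Case B swaps them

print(f1_a, f1_b)                                          # 2/11 2/11
print(f"{float(f1_a):.2%}")                                # 18.18%
print(harmonic_mean(p, r) == harmonic_mean(r, p) == f1_a)  # True
print(f"{float((p + r) / 2):.2%}")                         # 55.00% (arithmetic mean, for contrast)
```

Both routes give 2/11, and swapping the arguments changes nothing, while the arithmetic mean of the same pair sits at 55 percent.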

Why we did this before, not after

It would have been easy to write f1_score(...) into the selection loop, watch the numbers look reasonable, and move on. The reason we did not is that a domain metric is rarely the textbook one. Ours counts a correctly shaped sinusoid predicted at the wrong depth as a false positive, matches true positives by least depth error within a tolerance, and computes dip and azimuth error on matched pairs only. Once a metric has that much project-specific logic wired into it, "it looks reasonable on the real data" is not evidence that it behaves well. Real data does not include the adversarial corner where the model catches everything by predicting everything, or scores a flawless precision by predicting almost nothing.
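To make that project-specific logic concrete, here is one way such a matcher could look. It is a sketch under stated assumptions, not the project's implementation: the `Sinusoid` fields and their units, the greedy closest-first strategy, and the 0.05 m default tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sinusoid:
    depth_m: float      # depth along the borehole (assumed metres)
    dip_deg: float      # dip angle (assumed degrees)
    azimuth_deg: float  # dip azimuth, 0 to 360 (assumed degrees)


def azimuth_error(a: float, b: float) -> float:
    """Smallest angular difference between two azimuths, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def match_by_depth(preds, truths, tol_m):
    """Greedy one-to-one matching: smallest depth error first, within tol_m."""
    candidates = sorted(
        (abs(p.depth_m - t.depth_m), i, j)
        for i, p in enumerate(preds)
        for j, t in enumerate(truths)
        if abs(p.depth_m - t.depth_m) <= tol_m
    )
    used_p, used_t, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_p and j not in used_t:
            used_p.add(i)
            used_t.add(j)
            pairs.append((i, j))
    return pairs


def score(preds, truths, tol_m=0.05):  # tolerance value is a placeholder
    pairs = match_by_depth(preds, truths, tol_m)
    tp = len(pairs)
    fp = len(preds) - tp   # includes well-shaped picks at the wrong depth
    fn = len(truths) - tp
    # Dip and azimuth errors are computed on matched pairs only.
    dip_errs = [abs(preds[i].dip_deg - truths[j].dip_deg) for i, j in pairs]
    az_errs = [azimuth_error(preds[i].azimuth_deg, truths[j].azimuth_deg) for i, j in pairs]
    return {"tp": tp, "fp": fp, "fn": fn, "dip_errs": dip_errs, "az_errs": az_errs}


truths = [Sinusoid(100.00, 30.0, 350.0), Sinusoid(101.00, 45.0, 90.0)]
preds = [
    Sinusoid(100.02, 32.0, 10.0),   # within tolerance of the first truth: matched
    Sinusoid(103.00, 45.0, 90.0),   # right shape, wrong depth: false positive
]
result = score(preds, truths)
print(result["tp"], result["fp"], result["fn"])  # 1 1 1
```

Even in a sketch this small there are several places a bug could hide (the tolerance comparison, the one-to-one constraint, the azimuth wrap-around), which is the reason real-data plausibility is weak evidence for a metric like this.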

Toy cases do include those corners, on purpose. The 100-versus-10 and 10-versus-100 rows are not data we collected; they are inputs we constructed to sit exactly where a naive metric breaks. Running them cost nothing and told us something the validation set could not: that a single-metric ranking would have a blind spot precisely at the two failure modes an interpreter cares about most. F1 passed the probe, so F1 is what ranked the five configurations, on both the validation and the blind sets.

The transferable habit is smaller than the borehole-image detail and outlasts it. When you are about to let a custom metric choose models, write down the two or three ways a lazy model could score well without being good, hand-build the inputs that trigger each, and check the metric refuses to be fooled. If it awards full marks to a case you would reject on sight, you have found the bug in the ruler before it warped every measurement downstream.
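That habit fits in a few lines of test code. The sketch below is one possible shape for such a probe, assuming the metric can be called on raw confusion counts; the probe list, the `fooled_by` helper, and the 0.5 ceiling are hypothetical, and a real probe would feed hand-built predictions through the full domain metric.

```python
def precision(tp, fp, fn):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hand-built pathological inputs: (label, tp, fp, fn).
PROBES = [
    ("predict everything: 100 predicted vs 10 real", 10, 90, 0),
    ("predict almost nothing: 10 predicted vs 100 real", 10, 0, 90),
]

# Hypothetical ceiling: a model we would reject on sight must not score above it.
MAX_ACCEPTABLE = 0.5


def fooled_by(metric, probes=PROBES, ceiling=MAX_ACCEPTABLE):
    """Return the labels of the probes on which the metric scores above the ceiling."""
    return [label for label, tp, fp, fn in probes if metric(tp, fp, fn) > ceiling]


for metric in (precision, recall, f1):
    failures = fooled_by(metric)
    verdict = "FOOLED by " + "; ".join(failures) if failures else "passes"
    print(f"{metric.__name__}: {verdict}")
# precision: FOOLED by predict almost nothing: 10 predicted vs 100 real
# recall: FOOLED by predict everything: 100 predicted vs 10 real
# f1: passes
```

Kept in the test suite, a check like this fails loudly if a later change to the metric reopens either corner.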

Limitations

The 18.18 percent figure holds for the two constructed cases and nothing more; it is a property of the metric under adversarial input, not a performance number for any model we shipped. The confusion counts here (10 true positives in each case, with 90 false positives or 90 false negatives) are the illustrative decomposition that reproduces the sourced score and the two misleading 100 percent awards. Surviving these two toy cases does not make F1 the right metric for every task. It makes F1 a defensible default for a detection problem where both false alarms and misses carry real interpretation cost, which is why we still report precision and recall alongside it, not in place of it. A metric that passes an adversarial probe has cleared one bar, not every bar.

© 2026 Copyright. Earthscan