Two Opposite Failure Modes, One Identical Score: Stress-Testing F1 Before Trusting It

Every custom metric is a small piece of code that will decide which model ships. Before we let one make that call across five model configurations, on both a validation set and a held-out blind set, we tried to fool it. The cleanest attempt lives as two rows on the 'Reference' sheet of our error workbook, put there so that anyone reading the model-selection ledger could see why the headline number is F1 and not something simpler. Those two rows are the subject of this note.

The setup is deliberately extreme. We detect sinusoids in borehole-image logs, where each detection is a planar feature the model claims to have found. Imagine a section with 10 real sinusoids in it. Now imagine two very different bad models turned loose on it.

The two rows that were built to break the metric

The first model is trigger-happy. It predicts 100 sinusoids where 10 are real. Give it credit for coverage: it does catch all 10 true features, so there are 10 true positives and nothing is missed. But 90 of its predictions are spurious. In detection terms that is 10 true positives, 90 false positives, 0 false negatives. A borehole-image interpreter handed this output would spend the afternoon deleting nine wrong picks for every right one.

The second model has the opposite pathology. Flip the section around so there are 100 real sinusoids, and let the model predict only 10. Every one of those 10 is correct, so precision is spotless: 10 true positives, 0 false positives. But it walked past 90 real features. That is 90 false negatives, a model that would quietly leave most of the fractured interval uninterpreted.

These are opposite failure modes. One buries you in false alarms; the other misses most of the section. No sane review would treat them as interchangeable. Yet look at what a single metric does with them.

Where a single metric gets fooled

Precision asks: of the features you predicted, what fraction were real? For the trigger-happy model that is 10 out of 100, or 10 percent. For the conservative model it is 10 out of 10, a perfect 100 percent. So precision alone crowns the model that missed 90 real sinusoids.

Recall asks the mirror question: of the real features, what fraction did you find? For the trigger-happy model that is 10 out of 10, a perfect 100 percent. For the conservative model it is 10 out of 100, back to 10 percent. So recall alone crowns the model that raised 90 false alarms.

Each single-number lens awards one of the two pathological cases full marks. If model selection had leaned on precision alone, it would have preferred the model that finds almost nothing. If it had leaned on recall alone, it would have preferred the model that cries wolf ninety times. Neither is the model you want in front of an interpreter, and neither metric, on its own, is honest about that.
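The arithmetic in this section is small enough to check in a few lines. What follows is a minimal Python sketch, not the project's metric code: the `Counts` type and the `precision` and `recall` helpers are names invented here for illustration, and the numbers are the two toy cases described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one detection run (illustrative type)."""
    tp: int  # predictions that match a real sinusoid
    fp: int  # predictions with no real counterpart
    fn: int  # real sinusoids the model missed


def precision(c: Counts) -> float:
    """Fraction of predictions that were real; 0.0 if nothing was predicted."""
    predicted = c.tp + c.fp
    return c.tp / predicted if predicted else 0.0


def recall(c: Counts) -> float:
    """Fraction of real features that were found; 0.0 if there were none."""
    real = c.tp + c.fn
    return c.tp / real if real else 0.0


trigger_happy = Counts(tp=10, fp=90, fn=0)   # 100 predicted, 10 real
conservative = Counts(tp=10, fp=0, fn=90)    # 10 predicted, 100 real

for name, c in [("trigger-happy", trigger_happy), ("conservative", conservative)]:
    print(f"{name}: precision={precision(c):.2f} recall={recall(c):.2f}")
```

Each model takes a perfect score on one lens and a 10 percent score on the other, which is the mirror-image pattern the two rows were built to produce.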

Figure: Two adversarial toy detections from the team's error-workbook 'Reference' sheet, run through three metric lenses. Case A predicts 100 sinusoids against 10 real (every real one caught, 90 false alarms); Case B predicts 10 against 100 real (all 10 correct, 90 missed). They are opposite failures, yet both score the same F1 = 18.18%. Across the lenses: precision alone hands Case B a misleading 100%, recall alone hands Case A a misleading 100%, and only F1 collapses both to the same honest 18.18% line. The orange element is always the single-metric award that flatters a pathological case; under F1 nothing is orange, which is the reason the team wired F1, not precision or recall alone, into model selection. The F1 = 18.18% figure is sourced from the workbook 'Reference' sheet; the confusion counts (TP = 10 in each, FP = 90 / FN = 90 mirrored) are the illustrative decomposition that reproduces it and the two 100% awards.

What the harmonic mean does instead

F1 is the harmonic mean of precision and recall,

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

and the harmonic mean is unforgiving of imbalance in a way the arithmetic mean is not. Feed it a precision of 100 percent and a recall of 10 percent and it does not settle near the middle. It is dragged down toward the smaller of the two:

F_1 = \frac{2 \cdot 1.00 \cdot 0.10}{1.00 + 0.10} = \frac{0.20}{1.10} \approx 0.1818
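As a cross-check, the same value falls out of the raw confusion counts. The count form below is the standard identity obtained by substituting precision = TP / (TP + FP) and recall = TP / (TP + FN) into the harmonic mean; the numbers are the toy counts, where FP + FN = 90 in either case.

```latex
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
    = \frac{2 \cdot 10}{2 \cdot 10 + 90}
    = \frac{20}{110}
    = \frac{2}{11}
    \approx 0.1818
```

Because false positives and false negatives enter the denominator only through their sum, 90 false alarms and 90 misses are penalised identically.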

The trigger-happy model has precision 0.10 and recall 1.00; the conservative model has precision 1.00 and recall 0.10. Swap the two numbers and the harmonic mean does not move, because the formula is symmetric in precision and recall. Both land on the same F1 of 2/11, or 18.18 percent to two decimal places. Where precision or recall alone could be fooled into awarding 100 percent, F1 says, correctly, that both models are poor, and poor to the same degree. That is the whole argument for the choice on one sheet: the two rows sit side by side, both reading 18.18 percent, next to the two 100 percent awards that precision and recall would have handed out. (This is a different failure from aggregate scores hiding a weak class, which we treated separately in 'How to Read a Benchmark Without Being Fooled by It'; here the single numbers are not averaged over classes, they are simply gameable on their own.)
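The symmetry, and the contrast with an arithmetic mean, can be checked directly. This is a small Python sketch under the toy numbers above; `f1` and `arithmetic_mean` are illustrative helpers written for this note, not functions from the selection code.

```python
import math


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average, shown only for contrast with the harmonic mean."""
    return (precision + recall) / 2.0


case_a = (0.10, 1.00)  # trigger-happy: (precision, recall)
case_b = (1.00, 0.10)  # conservative: (precision, recall)

f1_a = f1(*case_a)
f1_b = f1(*case_b)

# Swapping precision and recall leaves the harmonic mean unchanged.
assert math.isclose(f1_a, f1_b)

print(f"F1, case A:      {f1_a:.4f}")
print(f"F1, case B:      {f1_b:.4f}")
print(f"arithmetic mean: {arithmetic_mean(*case_a):.2f}")
```

An arithmetic mean would report 0.55 for either case, a middling score for a model no reviewer would accept; the harmonic mean reports roughly 0.18.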

Why we did this before, not after

It would have been easy to write f1_score(...) into the selection loop, watch the numbers look reasonable, and move on. The reason we did not is that a domain metric is rarely the textbook one. Ours counts a correctly shaped sinusoid predicted at the wrong depth as a false positive, matches true positives by least depth error within a tolerance, and computes dip and azimuth error on matched pairs only. Once a metric has that much project-specific logic wired into it, "it looks reasonable on the real data" is not evidence that it behaves well. Real data does not include the adversarial corner where the model catches everything by predicting everything, or scores a flawless precision by predicting almost nothing.
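To make that project-specific logic concrete, here is one way such matching could look. This is a hedged sketch, not our production metric: the `Sinusoid` fields, the greedy closest-first strategy, the 0.5 m tolerance, and the wrap-around azimuth error are all assumptions chosen for illustration.

```python
from typing import NamedTuple


class Sinusoid(NamedTuple):
    depth_m: float      # depth along the borehole (metres assumed)
    dip_deg: float
    azimuth_deg: float


def match_by_depth(preds, truths, tol_m):
    """Greedy one-to-one matching: closest depth pairs first, within tol_m.

    Returns (matched_pairs, n_false_positives, n_false_negatives).
    """
    candidates = sorted(
        (abs(p.depth_m - t.depth_m), i, j)
        for i, p in enumerate(preds)
        for j, t in enumerate(truths)
        if abs(p.depth_m - t.depth_m) <= tol_m
    )
    used_p, used_t, pairs = set(), set(), []
    for _, i, j in candidates:
        if i in used_p or j in used_t:
            continue  # each prediction and each truth is matched at most once
        used_p.add(i)
        used_t.add(j)
        pairs.append((preds[i], truths[j]))
    fp = len(preds) - len(pairs)   # unmatched predictions
    fn = len(truths) - len(pairs)  # unmatched real features
    return pairs, fp, fn


def azimuth_error_deg(a, b):
    """Smallest angular difference, so 350 vs 10 degrees counts as 20."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


truths = [Sinusoid(100.0, 30.0, 350.0), Sinusoid(105.0, 45.0, 90.0)]
preds = [
    Sinusoid(100.1, 32.0, 10.0),  # close in depth: matches truths[0]
    Sinusoid(110.0, 45.0, 90.0),  # right shape, wrong depth: false positive
]

pairs, fp, fn = match_by_depth(preds, truths, tol_m=0.5)

# Dip and azimuth errors are computed on matched pairs only.
dip_errors = [abs(p.dip_deg - t.dip_deg) for p, t in pairs]
az_errors = [azimuth_error_deg(p.azimuth_deg, t.azimuth_deg) for p, t in pairs]
print(len(pairs), fp, fn, dip_errors, az_errors)
```

The second prediction has exactly the dip and azimuth of the second real feature, yet it counts as a false positive and the real feature as a miss. Every choice here (greedy versus optimal assignment, the tolerance, how ties break) changes the counts that feed precision, recall, and F1, which is why the metric itself needs testing.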

Toy cases do include those corners, on purpose. The 100-versus-10 and 10-versus-100 rows are not data we collected; they are inputs we constructed to sit exactly where a naive metric breaks. Running them cost nothing and told us something the validation set could not: that a single-metric ranking would have a blind spot precisely at the two failure modes an interpreter cares about most. F1 passed the probe, so F1 is what ranked the five configurations, on both the validation and the blind sets.

The transferable habit is smaller than the borehole-image detail and outlasts it. When you are about to let a custom metric choose models, write down the two or three ways a lazy model could score well without being good, hand-build the inputs that trigger each, and check the metric refuses to be fooled. If it awards full marks to a case you would reject on sight, you have found the bug in the ruler before it warped every measurement downstream.
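That habit fits in a short script. The sketch below is one possible shape for such a probe, assuming metrics that take raw confusion counts; the probe names, the helper functions, and the 0.5 ceiling (the highest score we would tolerate for a model we would reject on sight) are illustrative choices, not values from the workbook.

```python
def precision_from_counts(tp, fp, fn):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hand-built adversarial inputs: (name, tp, fp, fn, highest acceptable score).
PROBES = [
    ("predict everything", 10, 90, 0, 0.5),
    ("predict almost nothing", 10, 0, 90, 0.5),
]


def fooled_by(metric):
    """Names of probes on which `metric` scores above the acceptable ceiling."""
    return [
        name
        for name, tp, fp, fn, ceiling in PROBES
        if metric(tp, fp, fn) > ceiling
    ]


for label, metric in [
    ("precision", precision_from_counts),
    ("recall", recall_from_counts),
    ("F1", f1_from_counts),
]:
    print(f"{label} fooled by: {fooled_by(metric)}")
```

Precision is fooled by the model that predicts almost nothing, recall by the model that predicts everything, and F1 by neither. Adding a new lazy-model strategy means adding one row to `PROBES`.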

Limitations

The 18.18 percent figure describes the two constructed cases and nothing more; it is a property of the metric under adversarial input, not a performance number for any model we shipped. The confusion counts here (10 true positives in each case, with 90 false positives or 90 false negatives) are the illustrative decomposition that reproduces the sourced score and the two misleading 100 percent awards. Surviving these two toy cases does not make F1 the right metric for every task. It makes F1 the defensible default for a detection problem where both false alarms and misses carry real interpretation cost, which is why we still report precision and recall alongside it rather than in place of it. A metric that passes an adversarial probe has cleared one bar, not every bar.

References

[1] Van Rijsbergen, C. J. Information Retrieval, 2nd ed. Butterworth-Heinemann, 1979. The standard reference for the F-measure as the harmonic mean of precision and recall, and for why the harmonic mean penalises imbalance between the two.
