Every custom metric is a small piece of code that will decide which model ships. Before we let one make that call across five model configurations, on both a validation set and a held-out blind set, we tried to fool it. The cleanest attempt lives as two rows on the 'Reference' sheet of our error workbook, put there so that anyone reading the model-selection ledger could see why the headline number is F1 and not something simpler. Those two rows are the subject of this note.
The setup is deliberately extreme. We detect sinusoids in borehole-image logs, where each detection is a planar feature the model claims to have found. Imagine a section with 10 real sinusoids in it. Now imagine two very different bad models turned loose on it.
The two rows that were built to break the metric
The first model is trigger-happy. It predicts 100 sinusoids where 10 are real. Give it credit for coverage: it does catch all 10 true features, so there are 10 true positives and nothing is missed. But 90 of its predictions are spurious. In detection terms that is 10 true positives, 90 false positives, 0 false negatives. A borehole-image interpreter handed this output would spend the afternoon deleting nine wrong picks for every right one.
The second model has the opposite pathology. Flip the section around so there are 100 real sinusoids, and let the model predict only 10. Every one of those 10 is correct, so precision is spotless: 10 true positives, 0 false positives. But it walked past 90 real features. That is 90 false negatives, a model that would quietly leave most of the fractured interval uninterpreted.
These are opposite failure modes. One buries you in false alarms; the other misses most of the section. No sane review would treat them as interchangeable. Yet look at what a single metric does with them.
Where a single metric gets fooled
Precision asks: of the features you predicted, what fraction were real? For the trigger-happy model that is 10 out of 100, or 10 percent. For the conservative model it is 10 out of 10, a perfect 100 percent. So precision alone crowns the model that missed 90 real sinusoids.
Recall asks the mirror question: of the real features, what fraction did you find? For the trigger-happy model that is 10 out of 10, a perfect 100 percent. For the conservative model it is 10 out of 100, back to 10 percent. So recall alone crowns the model that raised 90 false alarms.
Each single-number lens awards one of the two pathological cases full marks. If model selection had leaned on precision alone, it would have preferred the model that finds almost nothing. If it had leaned on recall alone, it would have preferred the model that cries wolf ninety times. Neither is the model you want in front of an interpreter, and neither metric, on its own, is honest about that.
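The two verdicts above are easy to check from the raw counts. The following is a minimal sketch, not code from our selection loop: the function name `precision_recall` and the row labels are hypothetical, chosen only for this illustration, and the zero-denominator convention (return 0.0) is an assumption.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from detection counts.

    Assumption for this sketch: an empty denominator yields 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# The two constructed rows, written as (TP, FP, FN).
rows = {
    "trigger_happy": (10, 90, 0),   # predicts 100 where 10 are real
    "conservative": (10, 0, 90),    # predicts 10 where 100 are real
}

results = {name: precision_recall(*counts) for name, counts in rows.items()}

for name, (p, r) in results.items():
    print(f"{name}: precision={p:.0%} recall={r:.0%}")
```

Each row gets a 100 percent from exactly one of the two lenses, which is the blind spot the rest of this note is about.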
What the harmonic mean does instead
F1 is the harmonic mean of precision P and recall R [1],
$$F_1 = \frac{2PR}{P + R},$$
and the harmonic mean is unforgiving of imbalance in a way the arithmetic mean is not. Feed it a precision of 100 percent and a recall of 10 percent and it does not settle near the middle. It is dragged down toward the smaller of the two:
$$F_1 = \frac{2 \times 1.00 \times 0.10}{1.00 + 0.10} = \frac{0.20}{1.10} \approx 0.1818.$$
The trigger-happy model has precision 0.10 and recall 1.00; the conservative model has precision 1.00 and recall 0.10. Swap the two numbers and the harmonic mean does not move, because the formula is symmetric in precision and recall. Both land on the same F1 = 2/11, or 18.18 percent to two decimal places. Where precision and recall could each be fooled into awarding 100 percent, F1 says, correctly, that both models are poor, and poor to the same degree. That is the whole argument for the choice on one sheet: the two rows sit side by side, both reading 18.18 percent, next to the two 100 percent awards that precision and recall would have handed out. (This is a different failure from aggregate scores hiding a weak class, which we treated separately in How to Read a Benchmark Without Being Fooled by It; here the single numbers are not averaged over classes, they are simply gameable on their own.)
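To see the symmetry and the pull toward the smaller value side by side with the arithmetic mean, a short sketch helps. The function names `f1` and `arithmetic_mean` and the case labels are hypothetical and used only for this illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average, shown only for contrast with the harmonic mean."""
    return (precision + recall) / 2


# (precision, recall) for the two constructed rows.
cases = {
    "trigger_happy": (0.10, 1.00),
    "conservative": (1.00, 0.10),
}

scores = {name: (f1(p, r), arithmetic_mean(p, r)) for name, (p, r) in cases.items()}

for name, (harmonic, arithmetic) in scores.items():
    print(f"{name}: F1={harmonic:.2%} arithmetic mean={arithmetic:.2%}")
```

The arithmetic mean would report 55 percent for both rows, which sounds like a mediocre but usable model. The harmonic mean reports about 18 percent, much closer to how an interpreter would judge either output.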
Why we did this before, not after
It would have been easy to write f1_score(...) into the selection loop, watch the numbers look reasonable, and move on. The reason we did not is that a domain metric is rarely the textbook one. Ours counts a correctly shaped sinusoid predicted at the wrong depth as a false positive, matches true positives by least depth error within a tolerance, and computes dip and azimuth error on matched pairs only. Once a metric has that much project-specific logic wired into it, "it looks reasonable on the real data" is not evidence that it behaves well. Real data does not include the adversarial corner where the model catches everything by predicting everything, or scores a flawless precision by predicting almost nothing.
Toy cases do include those corners, on purpose. The 100-versus-10 and 10-versus-100 rows are not data we collected; they are inputs we constructed to sit exactly where a naive metric breaks. Running them cost nothing and told us something the validation set could not: that a single-metric ranking would have a blind spot precisely at the two failure modes an interpreter cares about most. F1 passed the probe, so F1 is what ranked the five configurations, on both the validation and the blind sets.
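To make the idea of constructed inputs concrete, here is a rough sketch of how the two rows could be expressed as depth lists and pushed through a depth-tolerance matcher. It is not our production metric. The greedy closest-pair-first strategy, the function name `match_by_depth`, the 0.5 tolerance, and the depth values are all assumptions made for this example, and dip and azimuth error are left out entirely.

```python
def match_by_depth(pred_depths: list[float], true_depths: list[float],
                   tolerance: float) -> tuple[int, int, int]:
    """Greedy one-to-one matching by least depth error within a tolerance.

    Returns (tp, fp, fn). Unmatched predictions are false positives,
    unmatched true features are false negatives.
    """
    # Every (error, pred_index, true_index) pair that falls within tolerance,
    # sorted so the smallest depth errors are considered first.
    candidates = sorted(
        (abs(p - t), i, j)
        for i, p in enumerate(pred_depths)
        for j, t in enumerate(true_depths)
        if abs(p - t) <= tolerance
    )
    used_pred: set[int] = set()
    used_true: set[int] = set()
    for _, i, j in candidates:
        if i not in used_pred and j not in used_true:
            used_pred.add(i)
            used_true.add(j)
    tp = len(used_pred)
    return tp, len(pred_depths) - tp, len(true_depths) - tp


# Hypothetical depths: a sparse set of 10 and a dense set of 100
# that contains the sparse set.
sparse = [1000.0 + 10.0 * k for k in range(10)]
dense = [1000.0 + 1.0 * k for k in range(100)]

# Trigger-happy row: 100 predictions against 10 real features.
trigger_happy = match_by_depth(pred_depths=dense, true_depths=sparse, tolerance=0.5)
# Conservative row: 10 predictions against 100 real features.
conservative = match_by_depth(pred_depths=sparse, true_depths=dense, tolerance=0.5)

print("trigger_happy (tp, fp, fn):", trigger_happy)
print("conservative  (tp, fp, fn):", conservative)
```

Building the rows as inputs to the matcher, rather than typing in the counts, exercises the matching logic as well as the final formula.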
The transferable habit is smaller than the borehole-image detail and outlasts it. When you are about to let a custom metric choose models, write down the two or three ways a lazy model could score well without being good, hand-build the inputs that trigger each, and check the metric refuses to be fooled. If it awards full marks to a case you would reject on sight, you have found the bug in the ruler before it warped every measurement downstream.
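One way to turn that habit into a repeatable check is a small probe harness. The sketch below is hedged accordingly: the names `Counts`, `PROBES`, and `fooled_by`, and the 0.5 ceiling above which a score counts as "fooled", are all hypothetical choices for illustration, and a real project would substitute its own metric functions and probe cases.

```python
from typing import Callable, NamedTuple


class Counts(NamedTuple):
    tp: int
    fp: int
    fn: int


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: Counts) -> float:
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Hand-built cases a reviewer would reject on sight.
PROBES = {
    "predict_everything": Counts(tp=10, fp=90, fn=0),
    "predict_almost_nothing": Counts(tp=10, fp=0, fn=90),
}


def fooled_by(metric: Callable[[Counts], float],
              probes: dict[str, Counts],
              ceiling: float = 0.5) -> list[str]:
    """Names of probes on which the metric scores above the ceiling."""
    return [name for name, counts in probes.items() if metric(counts) > ceiling]


for metric in (precision, recall, f1):
    print(f"{metric.__name__}: fooled by {fooled_by(metric, PROBES)}")
```

A metric that returns an empty list here has cleared these particular probes and nothing more; the list of probes is only as good as the lazy strategies someone thought to write down.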
Limitations
The 18.18 percent figure is exact for the two constructed cases and nothing more; it is a property of the metric under adversarial input, not a performance number for any model we shipped. The confusion counts here (10 true positives in each case, with 90 false positives or 90 false negatives) are the illustrative decomposition that reproduces the sourced score and the two misleading 100 percent awards. F1 surviving these two toy cases does not make it the right metric for every task; it made it the defensible default for a detection problem where both false alarms and misses carry real interpretation cost, which is why we still report precision and recall alongside it rather than in place of it. A metric that passes an adversarial probe has cleared one bar, not every bar.
References
[1] Van Rijsbergen, C. J. Information Retrieval, 2nd ed. Butterworth-Heinemann, 1979. The standard reference for the F-measure as the harmonic mean of precision and recall, and for why the harmonic mean penalises imbalance between the two.