Are the Model's Confidences Honest? A Per-Curve Calibration Study

Abstract

An overlap score answers one question: did the predicted mask land in the right place? It is silent on a second question that matters just as much once a model's output has to be reviewed by a person, namely whether the probability the model attached to each pixel can be believed. This study asks that second question of a three-class raster-log segmenter, background plus two well-log curves, and asks it per class rather than over the frame as a whole. We survey the calibration literature the field has assembled since 2005, from the reliability diagram and the expected-calibration-error estimator through the finding that modern networks are systematically over-confident. We then read that literature against our own Dice baseline, in which the background class reaches an intersection over union of 0.94 and an F1 of 0.97 while the two thin curve classes sit at IoU 0.26 and 0.21 and F1 0.37 and 0.32. The central observation is that a single frame-averaged calibration number hides the very split that matters: the easy, enormous background class is close to honest and only mildly over-confident at the top, exactly as the literature reports for high-accuracy classes, while the minority curve classes are over-confident to a far greater degree, claiming a certainty their pixels do not earn. Calibration, like accuracy, is a per-class property on an imbalanced task, and on this task it is the thin classes whose confidences a petrophysicist should distrust.

Why an overlap score is the wrong place to look for trust

Begin with what intersection over union and F1 actually measure, because the gap they leave is the whole subject of this piece. Both are functions of a thresholded, hard prediction: every pixel has already been committed to a class before the score is computed, and the score counts how well those committed labels overlap the truth. Nothing in either number depends on the probability the model held before it committed. A model that predicts a curve pixel with a trembling 0.51 and a model that predicts it with a confident 0.99 produce the identical mask and therefore the identical F1, even though one of them is telling you it is unsure and the other is telling you it is certain. The moment a human has to decide which of the model's picks to trust without re-checking every pixel, that discarded probability is exactly the information they need, and the overlap score has thrown it away.
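
A minimal sketch makes the point concrete. It is not the engagement's evaluation code: the six-pixel strip, the two probability vectors, and the helper `f1_and_iou` are hypothetical, and the three-class decision is simplified to a single curve-versus-rest threshold at 0.5.

```python
import numpy as np

def f1_and_iou(pred_mask, true_mask):
    """F1 and IoU for one binary class, computed from hard masks only."""
    tp = np.sum(pred_mask & true_mask)
    fp = np.sum(pred_mask & ~true_mask)
    fn = np.sum(~pred_mask & true_mask)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return float(f1), float(iou)

# Hypothetical ground truth for a strip of six pixels (True = curve).
truth = np.array([True, True, False, False, True, False])

# Two hypothetical models that fall on the same side of 0.5 at every pixel
# but hold very different confidences.
hesitant = np.array([0.51, 0.55, 0.45, 0.52, 0.49, 0.40])
certain = np.array([0.99, 0.98, 0.02, 0.97, 0.03, 0.01])

# Thresholding discards the probabilities; only the hard masks survive.
mask_hesitant = hesitant > 0.5
mask_certain = certain > 0.5

scores_hesitant = f1_and_iou(mask_hesitant, truth)
scores_certain = f1_and_iou(mask_certain, truth)
```

Both models yield the same mask, so `scores_hesitant` and `scores_certain` are equal even though the probability vectors behind them have little in common.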

Calibration is the property that makes the discarded probability usable. A model is calibrated when, among all the pixels it labels with confidence 0.8, about 80 percent are in fact correct; the confidence behaves like a frequency you can bet on. The idea predates deep learning: the supervised-learning calibration study of the mid-2000s already showed that different classifier families distort their probability estimates in characteristic, repeatable ways, and that a probability emitted by a model is not automatically a probability you can trust ^[1]. The tool that made the distortion visible is the reliability diagram, which plots predicted confidence against empirical correctness so that a perfectly calibrated model traces the 45-degree line and any departure from it is a measurable, signed error. The expected-calibration-error estimator turned that picture into a scalar: it bins the predictions by confidence and sums, across the bins, the absolute gap between mean confidence and empirical accuracy, each gap weighted by the share of predictions that fall in its bin ^[2].
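
The binning behind a reliability diagram and its scalar summary can be sketched in a few lines. This is an illustration under stated assumptions, not the estimator used in any run described here: the function names, the equal-width bins, and the default of ten bins are our choices, `confidence` is taken to be the probability of the predicted class, and `correct` marks whether that prediction matched the truth.

```python
import numpy as np

def reliability_bins(confidence, correct, n_bins=10):
    """Per-bin (mean confidence, empirical accuracy, count), equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so indices run 0..n_bins-1 and 1.0 lands in the last bin.
    idx = np.digitize(confidence, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():  # empty bins contribute nothing
            rows.append((float(confidence[in_bin].mean()),
                         float(correct[in_bin].mean()),
                         int(in_bin.sum())))
    return rows

def expected_calibration_error(confidence, correct, n_bins=10):
    """Count-weighted mean absolute gap between confidence and accuracy."""
    rows = reliability_bins(confidence, correct, n_bins)
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(acc - conf) for conf, acc, n in rows)

# Hypothetical pixels: ten called at 0.8 with eight right (calibrated),
# ten called at 0.95 with four right (over-confident).
honest = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.95] * 10, [1] * 4 + [0] * 6)
```

Plotting the first two entries of each row from `reliability_bins` against each other gives the reliability diagram; running the same functions on the pixels of one predicted class at a time gives the per-class view this piece argues for.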

The reason the question became urgent for modern segmentation is a single, much-cited finding: the networks that drove accuracy up through the 2010s also became markedly more over-confident than their shallower predecessors, reporting probabilities clustered near one that their true correctness did not support, with temperature scaling offered as a cheap post-hoc correction ^[3]. That work studied whole-image classifiers, but the lesson transfers directly to dense prediction, where every pixel is its own classification and the over-confidence is replicated millions of times per frame.
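
Temperature scaling divides the logits by one scalar before the softmax. The sketch below shows what that does to a handful of pixels; the logits and the temperature of 3.0 are hypothetical values chosen for illustration, not quantities from the baseline run.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for four pixels over (background, curve-1, curve-2).
logits = np.array([
    [9.0, 1.0, 0.5],
    [0.2, 6.0, 1.0],
    [0.1, 1.5, 5.5],
    [4.0, 3.0, 0.0],
])

raw = softmax(logits)                         # temperature 1: the model as trained
softened = softmax(logits, temperature=3.0)   # hypothetical temperature above 1

raw_confidence = raw.max(axis=-1)
softened_confidence = softened.max(axis=-1)
same_labels = np.array_equal(raw.argmax(axis=-1), softened.argmax(axis=-1))
```

A temperature above one pulls every confidence toward uniform without changing which class wins at any pixel, so the hard mask, and with it IoU and F1, is untouched. In practice the temperature is fitted on held-out data rather than set by hand.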

What the calibration literature has settled, and what it has not

The field's vocabulary for what a confidence even means was sharpened by the distinction between two sources of uncertainty: the aleatoric uncertainty that lives in the data itself, the irreducible ambiguity of a blurred or torn boundary, and the epistemic uncertainty that lives in the model, the part that more data could in principle remove ^[4]. A scanned curve edge is a textbook case of the first: the raster genuinely does not resolve where the one-pixel trace ends and the paper begins, so a calibrated model should report middling confidence there, and a model that reports 0.99 on an ambiguous edge is not merely wrong, it is dishonest about a kind of uncertainty it cannot escape.

Two strands of the later literature bear directly on our setting. The first is that the way calibration is measured is itself fragile: the standard equal-width binning of the expected-calibration-error estimator is sensitive to the number of bins and can both mask and manufacture miscalibration, which is why measured calibration numbers should be read as estimates with their own error bars rather than as exact quantities ^[5]. The second is that the training objective shapes calibration as much as it shapes accuracy: focal loss, by down-weighting the easy, already-confident pixels, was shown to produce better-calibrated networks than plain cross-entropy precisely because it stops the majority class from driving every probability toward one ^[6]. That result is the bridge from the imbalance literature to the calibration literature, and it is the one most relevant to a frame that is roughly 97 percent background.
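
The down-weighting is easy to see numerically. The sketch below uses the standard focal-loss form, cross-entropy multiplied by (1 - p) raised to a power gamma, where p is the probability assigned to the true class. The value gamma = 2 is an illustrative setting only; the baseline discussed in this piece was trained with Dice loss, not focal loss.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one pixel, given the probability on its true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by (1 - p_true) ** gamma; gamma = 0 recovers cross-entropy."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)

# Fraction of the cross-entropy loss that survives the focal weighting.
easy_ratio = focal_loss(0.99) / cross_entropy(0.99)   # already-confident pixel
hard_ratio = focal_loss(0.50) / cross_entropy(0.50)   # uncertain pixel
```

With gamma = 2 an already-confident pixel at 0.99 keeps about one ten-thousandth of its cross-entropy loss, while an uncertain pixel at 0.5 keeps a quarter, which is the mechanism by which the easy majority stops dominating the gradient.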

Segmentation-specific work made the per-class point explicit. A study of deep medical image segmentation found that confidence calibration there is not a single frame-level property but varies sharply by structure, with the large, easy regions calibrated very differently from the small, hard ones, and argued that calibration has to be assessed and corrected per structure rather than per image ^[7]. The most recent broad re-examination complicated the original over-confidence story further, showing that the relationship between architecture, accuracy, and calibration is not monotone and that some newer model families are better calibrated out of the box than the 2017 picture suggested ^[8]. What none of this literature settles is the specific shape calibration takes on a one-to-three-pixel curve drowning in background, which is the gap this study reads its own numbers into.

How the study was assembled

This is a structured reading grounded in a real baseline, not a fresh calibration benchmark, and it is worth stating the boundary precisely so the reader can judge what it does and does not establish. The contribution of the underlying engagement is the segmenter and its measured overlap performance; the contribution of this piece is to ask the calibration question of that segmenter and to locate the answer in the published literature.

The baseline is the three-class Dice-loss run from the engagement archive: background and two well-log curves, trained and evaluated on synthetic raster logs and read against real scans, with per-class metrics recorded for intersection over union, F1, precision, and recall. The single quantity from that run that speaks most directly to calibration is per-class precision, because precision is by definition the empirical correctness of the pixels the model commits to a given class: of every pixel the model called a curve, the fraction that truly was one. That makes precision the natural anchor for the confident end of each class's reliability curve, and it is the sourced number the interactive probe below is pinned to. The full reliability shape across the lower-confidence bins, the per-bin expected-calibration-error figures, and the signed gap band are illustrative geometry built to make the calibration argument legible; they are not a measured per-bin reliability histogram, and the probe says so on its own face. The reliability shapes we draw follow the qualitative pattern the literature reports for a high-accuracy majority class versus low-accuracy minority classes under heavy imbalance, anchored at each curve's sourced precision rather than invented.
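
The definition of precision used here can be written down directly. The sketch assumes integer label maps with classes numbered 0 (background), 1, and 2; the function name and the toy arrays are hypothetical and are not taken from the engagement archive.

```python
import numpy as np

def per_class_precision(pred_labels, true_labels, n_classes=3):
    """For each class k: of the pixels predicted as k, the fraction that truly are k."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    precisions = []
    for k in range(n_classes):
        called_k = pred_labels == k
        if not called_k.any():
            precisions.append(float("nan"))  # class never predicted: precision undefined
        else:
            precisions.append(float((true_labels[called_k] == k).mean()))
    return precisions

# Hypothetical six-pixel example.
pred = np.array([0, 0, 0, 1, 1, 2])
true = np.array([0, 0, 1, 1, 0, 2])
precisions = per_class_precision(pred, true)
```

Because the denominator is the set of pixels the model committed to a class, the result is the empirical correctness of those commitments, which is why it serves as the anchor for the confident end of each reliability curve.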

Reading the reliability per class

The sourced precisions already foreshadow the finding before any reliability curve is drawn. Background precision is 0.96, against an F1 of 0.97 and an intersection over union of 0.94: the background class is, for practical purposes, solved, and a pixel the model calls background is almost always background. Curve-1 precision is 0.41 against an F1 of 0.37 and an IoU of 0.26; curve-2 precision is 0.36 against an F1 of 0.32 and an IoU of 0.21. The brute fact is that fewer than half of the pixels the model commits to either curve class are actually curve pixels. Any confidence the model reports on those commitments has to be read against that floor.

A reliability-curve probe that asks whether the model's per-pixel confidences are honest, one curve per class. The x axis is the predicted confidence; the y axis is the empirical correctness of pixels at that confidence; the dashed 45-degree line is perfect calibration, where a pixel called 0.80 is right 80 percent of the time. The background class hugs the diagonal and only droops slightly at the very top, the mild over-confidence of an easy majority class. The two thin curve classes sit well below the diagonal across the range: when the model says it is sure a pixel belongs to a curve, it is right far less often than the stated confidence promises. Drag the confidence lever to scan a confidence level and read each class's empirical correctness, its signed reliability gap, and its expected calibration error; the worst-calibrated class is drawn in orange. The per-class precision, F1, and IoU that anchor the curves are sourced from the Dice three-class run (background precision 0.96 / F1 0.97 / IoU 0.94; curve-1 0.41 / 0.37 / 0.26; curve-2 0.36 / 0.32 / 0.21); the smooth reliability shapes between bins, the per-bin calibration errors, and the orange gap band are illustrative geometry built to argue the calibration story, not a measured reliability histogram.

The probe makes the per-class split visible as three separated curves on the reliability plane. The background curve hugs the diagonal across most of the range and only droops below it near the very top, where its empirical correctness caps at its 0.96 precision rather than reaching a true one: this is the mild over-confidence the broad literature reports for easy, high-accuracy classes ^[3], present but small, because background is so nearly always right that even its over-claims are nearly true. The two thin-curve reliability lines tell a very different story. They sit well below the diagonal across the confidence range, because the empirical correctness of the pixels the model is sure about tops out near 0.41 and 0.36, not near one. When the model says it is confident a pixel is a curve, the frequency with which it is right falls far short of the confidence it states. That is the dishonesty the title asks about, and it lives almost entirely in the minority classes.

Scan the confidence lever to a high value, where the over-confidence is sharpest, and the signed reliability gap on the worst curve class opens to roughly half of the confidence scale, since a stated confidence near 0.9 meets an empirical correctness near 0.36, while the background gap stays within a few points of the diagonal. Background and curve classes err in the same direction, toward over-confidence, but to very different degrees, and a frame-averaged calibration number, dominated by the 97 percent of pixels that are background, would report the model as nearly honest and bury the curve-class problem completely. This is the segmentation-specific finding the medical-imaging calibration work warned about, reproduced on raster logs: calibration is a per-structure property, and the easy structure's good behaviour masks the hard structure's bad behaviour whenever the two are averaged together ^[7].
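
The masking effect is plain arithmetic. In the sketch below every number is hypothetical: the source states only that background is roughly 97 percent of the frame, so the even split of the remainder between the two curves is an assumption, and the per-class gaps are round illustrative values rather than measurements.

```python
# Hypothetical share of pixels per class (only the 0.97 figure echoes the text;
# the even split between the curves is an assumption).
share = {"background": 0.97, "curve-1": 0.015, "curve-2": 0.015}

# Hypothetical per-class calibration gaps (stated confidence minus empirical
# correctness). Round illustrative numbers, not measured values.
gap = {"background": 0.02, "curve-1": 0.50, "curve-2": 0.50}

def frame_average(share, gap):
    """Pixel-weighted mean of the per-class gaps, as a frame-level scalar sees them."""
    return sum(share[k] * gap[k] for k in share)

frame_gap = frame_average(share, gap)
worst_class = max(gap, key=gap.get)
```

With these made-up inputs the frame-level figure comes to about 0.034, a number that reads as a nearly honest model, while each curve class carries a gap of 0.50.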

What this means for trusting a per-pixel probability

Laid out this way, the practical reading is sharper than any single calibration scalar would give. The background confidences are usable more or less as reported; a high-confidence background pixel is a safe bet, and the small top-end over-confidence is the routine kind every accurate class shows. The curve-class confidences are not usable at face value, and the danger is specifically the confident pick: a curve pixel the model labels at 0.9 is right closer to four times in ten than nine times in ten, so a reviewer who triages by trusting the model's most certain curve picks and re-checking only the uncertain ones would be triaging exactly backwards. The information the overlap score discarded, the probability, is recoverable and informative, but only after it is recalibrated per class, and the literature points to two non-exclusive routes: a post-hoc correction such as temperature scaling fitted per class rather than globally ^[3], or a training objective that does not manufacture the over-confidence in the first place, which is the calibration argument for the focal family on imbalanced data ^[6].
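
One possible reading of "fitted per class" is to group held-out pixels by the class the model predicted and fit a separate temperature to each group. The sketch below follows that reading and is an assumption throughout: the function names are ours, the grid search stands in for the gradient-based fit normally used, and nothing of the kind was run on this task.

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid):
    """Grid value that minimises negative log-likelihood on the given pixels."""
    def nll(t):
        p = softmax(logits, t)
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return float(min(grid, key=nll))

def fit_per_class_temperatures(logits, labels, n_classes=3, grid=None):
    """One temperature per predicted class, fitted on held-out pixels."""
    grid = np.linspace(0.5, 10.0, 96) if grid is None else grid
    predicted = logits.argmax(axis=-1)
    temps = {}
    for k in range(n_classes):
        group = predicted == k
        if group.any():  # skip classes the model never predicts
            temps[k] = fit_temperature(logits[group], labels[group], grid)
    return temps

# Hypothetical held-out pixels: 20 called background (19 right),
# 10 called curve-1 (4 right), all with the same logit margin.
logits = np.zeros((30, 3))
logits[:20, 0] = 4.0
logits[20:, 1] = 4.0
labels = np.array([0] * 19 + [1] + [1] * 4 + [0] * 6)

temps = fit_per_class_temperatures(logits, labels)
```

On this toy data the curve-1 group receives a much larger temperature than the background group, which is the per-class correction in miniature: the class whose confident picks are least often right is the one whose confidences are pulled down hardest.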

Where our own work sits in this map is at the boundary between measurement and method. The measured artefact is the segmenter and its per-class overlap and precision; everything about reliability shape in this piece is a reading of that artefact through the public calibration literature, not a new calibration experiment. The value of the reading is diagnostic: it explains why the curve classes are the ones to distrust, names the direction of the distrust, and points at the two corrections the field has validated, without claiming to have measured a recalibration on this task. The honest summary is that overlap told us where the model looks, and calibration, read this way, tells us how much to believe what it says when it looks there.

Limitations

The evidence here is a reading of the published calibration literature anchored on a single baseline, and it inherits the limits of both halves. From the literature side, every calibration estimator it leans on is itself approximate: the expected-calibration-error figures the probe reports are binned quantities, and the binning is exactly the source of artefact the measurement-pitfalls work documents, so the per-bin numbers should be read as illustrative of the shape rather than as exact calibration measurements. From the baseline side, only the per-class precision, F1, and IoU are sourced; the reliability curves between the confident endpoint and the origin, the signed gap band, and the per-bin calibration errors are illustrative geometry chosen to follow the qualitative pattern the literature predicts for this imbalance regime, not a measured reliability histogram from the run. The anchor itself is partial: precision pins the confident end of each curve, but a full reliability assessment would need the model's per-pixel probability distribution across all bins, which this study reasons about rather than re-measures. The reference task is also narrow, a three-class raster-log problem with one-to-three-pixel curves and roughly 97 percent background, which is the extreme-imbalance corner where the per-class calibration split is sharpest; on tasks with milder skew or thicker foregrounds the gap between the majority and minority classes should compress. And the piece diagnoses miscalibration without demonstrating a fix: it points at per-class temperature scaling and at calibration-aware training objectives as the literature's validated routes, but it does not fit either on this data, so the claim is about where the dishonesty lives and which direction it runs, not about how far a correction would close it.

Key findings

Overlap metrics such as IoU and F1 score a thresholded hard mask and are blind to the per-pixel probability behind it. Whether that probability can be trusted is the separate property of calibration, which a reliability diagram makes visible as departure from the 45-degree line.
Calibration on an imbalanced segmentation task is a per-class property, not a frame-level one. A single frame-averaged number is dominated by the 97 percent background majority and hides the split that matters.
The background class is near-honest and only mildly over-confident at the top, where its empirical correctness caps at its 0.96 precision rather than reaching one, the routine over-confidence the broad literature reports for easy high-accuracy classes.
The two thin curve classes sit well below the diagonal: their confident picks are right only about as often as their precisions of 0.41 and 0.36 allow, so a high-confidence curve pixel is right closer to four times in ten than nine, and triaging by trusting the model's most certain curve picks would be backwards.
The discarded probability is recoverable but only after per-class correction. The literature points to per-class temperature scaling or a calibration-aware training objective such as focal loss; this study locates the dishonesty in the minority classes without measuring how far a fix would close it.

References

[1] Niculescu-Mizil, A., and Caruana, R. Predicting Good Probabilities with Supervised Learning. ICML (2005). Early evidence that a model's emitted probabilities are not automatically trustworthy and distort in family-specific ways. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf

[2] Naeini, M. P., Cooper, G. F., and Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI (2015). The binning estimator behind the expected-calibration-error number this study reads per class. https://ojs.aaai.org/index.php/AAAI/article/view/9602

[3] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Calibration of Modern Neural Networks. ICML (2017). Documents the systematic over-confidence of modern networks and proposes temperature scaling as a cheap post-hoc correction. https://arxiv.org/abs/1706.04599

[4] Kendall, A., and Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS (2017). The aleatoric-versus-epistemic distinction that frames what a confidence on an ambiguous edge can honestly mean. https://arxiv.org/abs/1703.04977

[5] Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., and Tran, D. Measuring Calibration in Deep Learning. CVPR Workshops (2019). On the binning artefacts that make calibration estimators fragile, which is why the per-bin numbers here are read as shape, not as exact values. https://arxiv.org/abs/1904.01685

[6] Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H. S., and Dokania, P. K. Calibrating Deep Neural Networks using Focal Loss. NeurIPS (2020). Shows that down-weighting easy pixels improves calibration, the bridge from the imbalance literature to the calibration literature. https://arxiv.org/abs/2002.09437

[7] Mehrtash, A., Wells, W. M., Tempany, C. M., Abolmaesumi, P., and Kapur, T. Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation. IEEE TMI (2020). Establishes that segmentation calibration varies by structure and must be assessed per structure, the per-class point this study reproduces on raster logs. https://arxiv.org/abs/1911.13273

[8] Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., and Lucic, M. Revisiting the Calibration of Modern Neural Networks. NeurIPS (2021). Complicates the 2017 over-confidence story, showing the architecture-accuracy-calibration relationship is not monotone. https://arxiv.org/abs/2106.07998
