There is a number that should bother you more than it usually does. When we trained a multiclass segmentation model to recover the curves on scanned well logs and split the Intersection-over-Union out by class, the background mask scored 0.94 and the two curve masks scored 0.26 and 0.21. Same model, same loss, same training run, same forward pass. One number says the model is excellent. Two numbers say it is barely working. The reflex is to call this a class-imbalance problem or a model-capacity problem and reach for the usual fixes. Both of those are real, and both matter, but neither is the deepest reason the curve numbers are low. The deepest reason is geometry, and it is baked into the metric before the model ever sees a gradient. A one-pixel-wide target is a worst case for any overlap metric, and the curves on a well log are about as one-pixel as targets get.
What IoU actually measures, and why width is destiny
Intersection-over-Union is a ratio. You take the pixels your prediction and the ground truth agree on, the intersection, and divide by every pixel either of them claims, the union. A perfect prediction scores 1. A prediction that overlaps nothing scores 0. It is the right metric for segmentation because it punishes both kinds of error at once: claiming pixels you should not have, and missing pixels you should have caught. There is nothing wrong with IoU. The problem is what happens to that ratio when the target is thin.
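Written as a formula, the ratio looks like this. The symbols are chosen here for illustration: P is the set of pixels the prediction claims for a class, G is the ground-truth set, and TP, FP, and FN are the usual per-pixel true-positive, false-positive, and false-negative counts.

```latex
\[
\mathrm{IoU}(P, G) \;=\; \frac{|P \cap G|}{|P \cup G|}
\;=\; \frac{TP}{TP + FP + FN}
\]
```

Both error types sit in the denominator, which is why neither over-claiming nor under-claiming can be hidden.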
Consider a target that is a solid blob, ten pixels across. Slide your prediction one pixel sideways. You lose a one-pixel-wide strip on one edge and gain a one-pixel-wide strip on the other; the bulk of the shape still overlaps the truth. For a ten-pixel-wide rectangle, nine of the ten columns still intersect, the union grows to eleven columns, and the IoU is 9/11, about 0.82. The shape has a body, and the body absorbs a small error. Now consider a target that is a curve, one pixel wide. Slide your prediction one pixel sideways and the prediction and the truth no longer touch at all. There is no body to absorb the error, because the whole shape is edge. The intersection drops to zero, the union is twice the size of the target, and the IoU is zero. The error was identical in both cases, one pixel. The score went from "barely moved" to "total miss" purely because of how wide the target was.
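The comparison is easy to reproduce on a toy grid. The sketch below is a minimal illustration, not the code behind our numbers: it assumes both targets are straight vertical bands, represents masks as sets of (row, col) coordinates, and uses helper names (`iou`, `vertical_band`, `shift`) invented for this example.

```python
def iou(pred, truth):
    """Region IoU between two sets of (row, col) pixel coordinates."""
    union = pred | truth
    if not union:
        return 1.0  # both masks empty: treated as agreement (a convention assumed here)
    return len(pred & truth) / len(union)


def vertical_band(height, left, width):
    """Pixels of a vertical band `width` columns wide, starting at column `left`."""
    return {(r, c) for r in range(height) for c in range(left, left + width)}


def shift(pixels, dx):
    """Move every pixel `dx` columns to the right."""
    return {(r, c + dx) for r, c in pixels}


height = 20  # toy grid height, assumed for illustration
blob = vertical_band(height, left=5, width=10)   # thick target, ten pixels across
curve = vertical_band(height, left=5, width=1)   # thin target, one pixel across

blob_iou = iou(shift(blob, 1), blob)     # 9 shared columns out of 11 claimed
curve_iou = iou(shift(curve, 1), curve)  # no shared pixels at all

print(f"blob  IoU after a one-pixel shift: {blob_iou:.3f}")
print(f"curve IoU after a one-pixel shift: {curve_iou:.3f}")
```

The blob lands at 9/11, roughly 0.818, and the curve at exactly 0, from the same one-pixel error.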
That is the entire argument, and it is worth sitting with because it is so counterintuitive. The model that produced the curve prediction may have understood the curve almost perfectly. It may have placed the trace within a single pixel of the truth down the entire length of the log. On a thick target that is a triumph. On a one-pixel target it can read as a complete failure, because IoU has no notion of "close." It only knows "overlapping" and "not overlapping," and a one-pixel shape gives it almost no room to be the former.
The exhibit above is the intuition pump. Two pixel grids, one shift slider. The left grid holds a thin one-pixel curve; the right grid holds a thick blob. Drag the slider to nudge both predictions sideways by the same amount and watch the live IoU, which is computed exactly from the drawn cells rather than scripted. A single pixel of shift sends the curve toward zero while the blob barely flinches. The grid pushes the geometry to its limit, a perfectly thin curve against a perfectly aligned truth, so the curve hits a clean zero where a real, slightly-fuzzy mask lands at 0.26 instead. The shape of the failure is the same; the grid just shows it without the noise.
This is the thin-structure problem, and it is everywhere
None of this is unique to well logs. The literature on thin structures has been wrestling with exactly this for years, because the same geometry shows up wherever the thing you care about is a line rather than a region. Retinal-vessel segmentation, road-network extraction from satellite imagery, crack detection in concrete and steel, neuron tracing in microscopy: all of them share the property that the foreground is a thin, branching, mostly-boundary structure, and all of them find that region-overlap metrics behave badly on it. Vessel segmentation benchmarks learned early that a pixel-counting score rewards a fat, blurry vessel prediction over a thin, slightly-misplaced one, even though the thin prediction is geometrically closer to the truth [3].
The community's response has been to change what gets measured. Boundary IoU was introduced precisely because standard mask IoU is dominated by the interior of large objects and is insensitive to boundary quality on exactly the objects where boundaries matter; it restricts the IoU computation to a band around the contour so that getting the edge right is what scores, rather than getting the bulk right [1]. For tubular and curvilinear structures specifically, clDice measures how much of each mask's morphological skeleton falls inside the other mask, so two curves that trace the same path score as agreeing even if their masks differ in thickness or are offset by less than a mask's half-width, and it doubles as a topology-preserving training loss that rewards connectivity rather than raw pixel count [2]. Both of these are admissions of the same fact: for thin structures, plain region IoU is measuring the wrong thing, and a low number is as likely to indict the metric as the model.
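For reference, the two scores can be written out as follows. This is a summary of the definitions in the cited papers, with notation that follows them only loosely: G and P are the ground-truth and predicted masks, G_d and P_d are the pixels within distance d of each mask's contour, V_L and V_P are the label and prediction masks, and S_L and S_P are their skeletons.

```latex
\[
\text{Boundary IoU}(G, P) =
\frac{|(G_d \cap G) \cap (P_d \cap P)|}{|(G_d \cap G) \cup (P_d \cap P)|}
\]
\[
T_{\text{prec}} = \frac{|S_P \cap V_L|}{|S_P|}, \qquad
T_{\text{sens}} = \frac{|S_L \cap V_P|}{|S_L|}, \qquad
\text{clDice} = 2 \cdot \frac{T_{\text{prec}} \cdot T_{\text{sens}}}{T_{\text{prec}} + T_{\text{sens}}}
\]
```

In both cases the quantity being counted is no longer the full region, which is what loosens the dependence on target width.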
That reframing is the practical takeaway. When our curve IoU came in at 0.26 and 0.21, the honest interpretation was not "the model has learned nothing about curves." It was "region IoU is the wrong lens for a one-pixel target, and the number it reports is partly an artifact of target width, not model quality." The background number, 0.94, is high in part for the symmetric reason: background is a vast solid region, and region IoU flatters solid regions. Neither number is lying, but neither means quite what its face value suggests.
Why the loss inherits the same problem
The metric is only half of it. Most segmentation losses are differentiable relatives of the metric, which means they inherit its geometry. A Dice loss is essentially one minus a soft IoU-like overlap, designed so that optimising it pushes the overlap up [4]. That is a feature on thick targets and a liability on thin ones. If a one-pixel-off prediction has near-zero overlap with a one-pixel truth, then the loss gradient from that example can be large and unstable: the model is told it got everything wrong even though it was a hair away from getting everything right. The optimisation surface for a thin target is full of these cliffs, where a one-pixel change in the predicted curve position swings the overlap from almost-zero to almost-one. Training is harder not because the pattern is harder to recognise but because the loss landscape the metric induces is brutal.
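The cliff shows up even in a one-dimensional cross-section. The sketch below uses a common soft-Dice formulation (sums in the denominator plus a small smoothing term); that exact form is an assumption for illustration and may differ from the variant in [4] or in any particular training setup. The names `soft_dice_loss` and `band` are hypothetical.

```python
def soft_dice_loss(pred, truth, eps=1e-6):
    """One minus the soft Dice overlap of two equal-length sequences of values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def band(length, start, width):
    """1-D cross-section: ones on `width` cells beginning at `start`, zeros elsewhere."""
    return [1.0 if start <= i < start + width else 0.0 for i in range(length)]


length = 32  # toy cross-section length, assumed for illustration
losses_by_width = {}
for width in (1, 10):
    truth = band(length, start=10, width=width)
    # Loss when the prediction is offset by 0, 1, 2, and 3 pixels from the truth.
    losses_by_width[width] = [
        soft_dice_loss(band(length, start=10 + offset, width=width), truth)
        for offset in range(4)
    ]
    print(f"width {width:2d}:", [round(loss, 3) for loss in losses_by_width[width]])
```

For the ten-pixel band the loss climbs by about 0.1 per pixel of offset, a slope an optimiser can follow. For the one-pixel band it jumps from 0 to roughly 1 at the first pixel and then stays flat, so offsets of one, two, or three pixels all look equally wrong.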
This is also why the fixes that help are the ones that soften the geometry rather than the ones that just push harder on the same surface. Reweighting the loss toward the rare curve class lifts recall, because it makes ignoring curves expensive, but it cannot change the fact that a one-pixel target has no margin. A skeleton-based or boundary-band objective helps more directly, because it changes the geometry of what counts as overlap so that "close" starts to score like "correct" [1][2]. And the most underrated lever is resolution: a curve that is one pixel wide at the training resolution might be two or three pixels wide at twice the resolution, and even that much width is enough body to give the metric something to hold onto. Thinness is partly a property of the target and partly a property of how you sampled it.
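The effect of width can be checked with a closed form. The sketch below assumes an idealised case, a straight band displaced perpendicular to its length, and the function name `shifted_band_iou` is invented for this example.

```python
def shifted_band_iou(width, shift):
    """IoU of a straight band `width` pixels wide with a copy moved `shift` pixels sideways."""
    overlap = max(width - shift, 0)   # columns the two bands still share
    union = 2 * width - overlap       # columns either band claims
    return overlap / union


for width in (1, 2, 3, 5, 10):
    print(f"width {width:2d}: IoU after a one-pixel shift = {shifted_band_iou(width, 1):.2f}")
```

Under that assumption a one-pixel shift scores 0 at width one, 1/3 at width two, 1/2 at width three, and 9/11 at width ten. Note that doubling the resolution also turns a one-pixel positional error into a two-pixel one, so the gain depends on whether the model's placement error scales with the image or stays roughly fixed in pixels.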
How to read thin-structure numbers without being fooled
The discipline that falls out of all this is a short list, and it is the same discipline whether your thin structures are well-log curves, retinal vessels, or hairline cracks.
First, never average a thin class with a thick class and report one number. The unweighted mean of 0.94, 0.26, and 0.21 is 0.47, a figure that describes none of the three classes and hides both the 0.94 and the 0.26. The per-class split is the only honest view, and the gap between the classes is itself a signal about target width, not just about model quality.
Second, when a thin-class IoU looks bad, check the geometry before you blame the model. Overlay a few predictions on the truth at full resolution and look. If the predicted curve tracks the true curve within a pixel or two down the whole log, the model has learned the structure and the low IoU is the metric punishing thinness. If the prediction wanders, breaks, or hallucinates, that is a real failure. The number alone cannot tell these apart; your eyes and a boundary-aware metric can.
Third, reach for a thin-structure-aware metric when the structure is thin. Boundary IoU and skeleton-based scores like clDice exist because the field already learned this lesson the hard way; you do not have to relearn it on your own data [1][2]. Reporting a boundary or centerline score alongside region IoU turns "the model failed" into "the model places the curve within tolerance but the masks are offset," which is a completely different conversation with a completely different fix.
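The first two checks can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our evaluation code: label maps are plain lists of lists of integer class ids, each curve is summarised as one column index per image row, and the helper names `per_class_iou` and `fraction_within_tolerance` are hypothetical. The tolerance check is a simple stand-in for a centerline-style score, not an implementation of Boundary IoU or clDice.

```python
def per_class_iou(pred_labels, true_labels, num_classes):
    """IoU for each class from two same-shaped 2-D grids of integer labels."""
    intersections = [0] * num_classes
    unions = [0] * num_classes
    for pred_row, true_row in zip(pred_labels, true_labels):
        for p, t in zip(pred_row, true_row):
            if p == t:
                intersections[p] += 1
                unions[p] += 1
            else:
                unions[p] += 1  # prediction claims this pixel for class p
                unions[t] += 1  # truth claims it for class t
    # A class absent from both grids has no defined IoU; report None instead of a number.
    return [i / u if u else None for i, u in zip(intersections, unions)]


def fraction_within_tolerance(pred_cols, true_cols, tolerance=1):
    """Share of rows where the predicted curve is within `tolerance` pixels of the true one.

    Each argument holds one column index per image row; None marks a row with no curve.
    """
    pairs = [(p, t) for p, t in zip(pred_cols, true_cols) if t is not None]
    if not pairs:
        return None
    hits = sum(1 for p, t in pairs if p is not None and abs(p - t) <= tolerance)
    return hits / len(pairs)


# Toy example (assumed data): a straight one-pixel curve predicted one column off.
rows, cols = 6, 40
true_cols = [3] * rows
pred_cols = [4] * rows
true_labels = [[1 if c == true_cols[r] else 0 for c in range(cols)] for r in range(rows)]
pred_labels = [[1 if c == pred_cols[r] else 0 for c in range(cols)] for r in range(rows)]

background_iou, curve_iou = per_class_iou(pred_labels, true_labels, num_classes=2)
tracking = fraction_within_tolerance(pred_cols, true_cols, tolerance=1)
print(f"background IoU {background_iou:.2f}, curve IoU {curve_iou:.2f}, within 1 px {tracking:.2f}")
```

On this toy case the background scores 0.95, the curve scores 0, and the tolerance check reports that every row is within one pixel. The region number and the placement number describe the same prediction and disagree completely.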
Key takeaways
- IoU is a ratio of overlap to union, and its behaviour depends brutally on target width. A thick blob has a body that absorbs a one-pixel error; a one-pixel curve is all edge, so a one-pixel shift drops the overlap toward zero. The error is identical; only the geometry differs.
- On our own raster-log run the same multiclass model under one Dice loss scored IoU 0.94 on the thick background mask and 0.26 and 0.21 on the two thin curve masks. The spread is largely geometry, not a difference in how well the model understood each class.
- A low thin-class IoU does not by itself mean the model failed. A prediction can track the true curve within a pixel down the whole log and still score near zero, because region IoU has no notion of close, only of overlapping versus not.
- Losses inherit the metric's geometry. A Dice-style loss creates a cliff-filled optimisation surface for one-pixel targets, where a one-pixel position change swings overlap from near-zero to near-one, which is why thin-structure training is unstable.
- Fix it by softening the geometry, not just pushing harder: boundary-band metrics like Boundary IoU, skeleton-based scores and losses like clDice, and simply training at higher resolution so a one-pixel curve becomes a two- or three-pixel one with a body the metric can hold onto. Always split per-class and look at the overlay before blaming the model.
The uncomfortable lesson the curve numbers taught us is that a metric you trust on most problems can quietly mislead you on a specific one, and the warning sign is not a crash or an error but a number that looks bad for an honest reason. Thin structures are that specific problem. The moment your target is as wide as the error you are measuring against it, overlap stops being a fair judge, and you have to either change the metric, change the resolution, or change how you read the score. The curves were never the hard part. The ruler was.
References
[1] Cheng, B., Girshick, R., Dollár, P., Berg, A. C., and Kirillov, A. Boundary IoU: Improving Object-Centric Image Segmentation Evaluation. CVPR (2021). Shows standard mask IoU is dominated by object interiors and insensitive to boundary quality, and proposes a contour-band IoU instead. https://arxiv.org/abs/2103.16562
[2] Shit, S., Paetzold, J. C., Sekuboyina, A., et al. clDice - a Novel Topology-Preserving Loss Function for Tubular Structure Segmentation. CVPR (2021). A skeleton-based overlap measure and loss for thin tubular structures that rewards connectivity over raw pixel count. https://arxiv.org/abs/2003.07311
[3] Maninis, K.-K., Pont-Tuset, J., Arbeláez, P., and Van Gool, L. Deep Retinal Image Understanding. MICCAI (2016). Vessel segmentation, a canonical thin-structure benchmark where pixel-counting metrics misbehave. https://arxiv.org/abs/1609.01103
[4] Milletari, F., Navab, N., and Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV (2016). Introduces the Dice loss, a differentiable overlap objective that inherits the metric's geometry. https://arxiv.org/abs/1606.04797