The segmenter had stopped improving and we had run out of things to blame. VeerNet was reading scanned well logs, tracing the curves off the paper into vectors, and on the curve-heavy sheets it had settled onto a number and refused to move. The best of the curve masks scored an F1 of 0.55. The other two sat at 0.37 and 0.26. We had cycled the optimiser, reweighted the loss, and grown the training set, and the plateau held through all of it. The mistake we eventually found was not in any of those knobs. It was in the shape of the thing we were asking the network to predict.
We wrote up the diagnosis of that plateau separately, in the companion account of how the recall-precision signature revealed the binary framing to be the wrong problem statement, and we will not re-run the metric autopsy here. This piece takes the diagnosis as settled and asks the next question: given that the target shape was the defect, what is it about the shape that a stack of binary masks gets wrong, and what does the fix actually cost? The answer is a story about loss geometry, about who owns the boundary between two traces, and about the training budget we spent to buy a target the objective could hold.
The defect was in the geometry of the loss
The first version framed segmentation as a set of independent binary decisions. For a sheet with two curves we produced three masks, one per class of stroke, and each mask answered a single yes-or-no question at every pixel: is this pixel part of my curve or not. Three masks, three separate sigmoid outputs, three separate binary losses summed together. On paper this is a reasonable place to start, and it matches how a lot of detection pipelines decompose a scene into per-object masks [1]. Reweighting the thin foreground bought recall and capped the masks at 0.55, 0.37, and 0.26, and the companion piece walks through why that trade is forced. What matters here is the geometry underneath the trade.
Three independent binary masks share no rule about who owns a pixel. Each mask lives in its own loss with its own background, and the objective is a sum of three terms that never reference one another. Nothing in that sum says a pixel belongs to at most one curve. Two masks can both light up on the same stroke, and because each term is computed as if the other mask did not exist, neither is charged for the overlap. The loss surface has no ridge separating curve one from curve two, because the two never appear in the same expression. They are parallel problems that happen to share a backbone, not a single problem with a shared constraint.
That absent constraint is not a tuning gap; it is a missing term. You cannot reweight your way to a rule the loss does not contain. The place where two curves run close together, cross, or nearly touch is exactly the place that matters for reading a log correctly, and it is where the summed binary objective is silent. Where the traces separate cleanly the three masks agree by default and the framing looks fine. Where they collide, the framing has nothing to say, and the collision is the entire job.
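To make the missing term concrete, here is a minimal single-pixel sketch of the summed binary objective. It is an illustration in plain Python, not VeerNet code: the probabilities are made up, and the helper names `bce` and `mask_terms` are hypothetical. It checks two things the prose claims: the curve-one term is identical whether or not a second mask also fires on the pixel, and nothing stops the three outputs from summing to more than one.

```python
import math

def bce(p, y):
    """Binary cross-entropy at one pixel: p is a sigmoid output, y is 0 or 1."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mask_terms(probs, targets):
    """One independent loss term per mask; the objective is their plain sum."""
    return [bce(p, y) for p, y in zip(probs, targets)]

# A pixel that truly belongs to the first mask only (illustrative values).
targets = [1, 0, 0]

clean = mask_terms([0.9, 0.1, 0.1], targets)     # only mask one fires
overlap = mask_terms([0.9, 0.9, 0.1], targets)   # mask two fires as well

# Mask one's term never sees mask two's output, so it cannot change.
curve_one_term_unchanged = clean[0] == overlap[0]

# Independent sigmoids are not a distribution: the claims can exceed 1.
total_claim = sum([0.9, 0.9, 0.1])
```

Mask two does pay for its own false positive in the second case, but only against its own background. No term is a function of two outputs at once, so the loss has no cross term through which the overlap itself could be penalised.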
One question with three answers
The reformulation was to stop predicting a stack of independent masks and predict a single target instead. Rather than three binary maps, the network now produces one map with three mutually exclusive classes: background, curve one, curve two. Every pixel is assigned to exactly one of them through a softmax, and the loss scores the whole assignment at once rather than three yes-or-no calls in isolation. This is the ordinary multiclass framing of dense prediction, the same target shape that fully convolutional segmentation and the encoder-decoder architectures built on it assume [1] [2], and moving to it changed the objective in a way that no amount of reweighting on the binary side could.
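The exclusive assignment can be sketched at a single pixel. This is a plain-Python illustration with made-up logits, not the model's code; the class order in `CLASSES` and the helper names `softmax` and `cross_entropy` are assumptions for the example. It shows that the three outputs always form one distribution, and that raising the logit for one curve necessarily lowers the probability of the other two classes.

```python
import math

# Assumed class order for this sketch only.
CLASSES = ("background", "curve_one", "curve_two")

def softmax(logits):
    """Turn three per-pixel logits into one distribution over the classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Multiclass loss at one pixel: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])

# Same pixel, before and after the curve-one logit is pushed up.
before = softmax([0.0, 1.0, 1.0])
after = softmax([0.0, 3.0, 1.0])

# If the pixel truly is curve one (index 1), the push lowers the loss,
# and it does so by taking probability away from the other two classes.
loss_before = cross_entropy([0.0, 1.0, 1.0], 1)
loss_after = cross_entropy([0.0, 3.0, 1.0], 1)
```

In a real pipeline this would be a framework's built-in softmax cross-entropy applied over a whole label map; the per-pixel form is used here only so the coupling between classes is easy to see.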
The change is that the classes now compete. Because a pixel can go to only one class, giving it to curve one is the same act as taking it away from curve two and from background, and the loss feels all three consequences of a single decision. The boundary between two adjacent traces stops being invisible and becomes a thing the objective has a stake in, because putting the boundary in the wrong place is now a cost paid on both sides at once. That is the property the independent masks never had and could not be tuned into having. The imbalance did not vanish, and we still leaned on region-overlap losses of the Dice and Tversky family, and on focal weighting, to keep the thin foreground classes from being swamped [3] [4]. But those tools were now working with the grain of the target instead of against it: they were shaping how a single coherent assignment traded errors, not trying to reconcile three separate assignments after the fact.
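As a rough sketch of the region-overlap family mentioned above, the following computes a per-class Tversky loss on softmax outputs. With `alpha = beta = 0.5` the Tversky index reduces to the soft Dice coefficient; raising `beta` weights missed foreground more heavily than false alarms. The function name, the flat-list pixel layout, and every numeric value here are assumptions for illustration; the weights the project actually used are not stated in this piece, and focal weighting is not shown.

```python
def tversky_loss(probs, labels, cls, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for one class.

    probs:  list of per-pixel softmax distributions
    labels: list of true class indices, same length as probs
    alpha weights soft false positives, beta weights soft false negatives.
    """
    tp = fp = fn = 0.0
    for p, y in zip(probs, labels):
        is_cls = 1.0 if y == cls else 0.0
        tp += p[cls] * is_cls
        fp += p[cls] * (1.0 - is_cls)
        fn += (1.0 - p[cls]) * is_cls
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Four illustrative pixels: two background, one per curve.
probs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
labels = [0, 0, 1, 2]
dice_like = tversky_loss(probs, labels, cls=1)  # alpha = beta = 0.5

# Two curve-one pixels that the model under-predicts (no false positives).
missed_probs = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
missed_labels = [1, 1]
balanced = tversky_loss(missed_probs, missed_labels, cls=1)
recall_leaning = tversky_loss(missed_probs, missed_labels, cls=1,
                              alpha=0.3, beta=0.7)
```

Because the inputs are softmax probabilities, mass that leaks from one curve to the other shows up as a false negative for the first class and a false positive for the second, so the per-class losses are tied to the same assignment rather than to three separate ones.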
The exhibit below is the whole argument in one frame. On the left, the three binary masks sit under the plateau line at 0.55, and no mask clears it. On the right is the reframed target: one column split into three exclusive classes, one softmax answering one question. Toggle the plateau line and confirm for yourself that under the old framing the ceiling was real.
What it cost and what it bought
Nothing about this was free, and the ledger is worth reading in full. The binary regime trained on two thousand instances and finished fifty epochs in about two hours. The single three-class target ran on fifteen thousand instances and took about ten hours for the same fifty epochs. That is seven and a half times the data and five times the wall-clock, and the two figures move together because the reformulation is what makes the larger corpus worth having: separating one curve from another, rather than each curve from background alone, needs far more crossings and near-misses in the training set for the objective to learn the seam. We paid that willingly, because the binary plateau was not a budget problem and could not be spent out of. More epochs and more data on the old target shape would have bought a slightly better version of the same stalled model. The reformulation bought a different model, and the cost line is the price of a target the loss could actually hold.
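The ratios in that ledger can be checked with a few lines of arithmetic. The dictionary layout and the helper `seconds_per_instance_epoch` are hypothetical names for this sketch, and the inputs are the rounded figures quoted above ("about two hours", "about ten hours"), so the derived per-instance number is arithmetic on approximations rather than a measurement.

```python
# Rounded figures from the text; both durations are approximate.
binary = {"instances": 2_000, "hours": 2.0, "epochs": 50}
multiclass = {"instances": 15_000, "hours": 10.0, "epochs": 50}

data_ratio = multiclass["instances"] / binary["instances"]  # 7.5
time_ratio = multiclass["hours"] / binary["hours"]          # 5.0

def seconds_per_instance_epoch(run):
    """Wall-clock seconds spent per training instance per epoch."""
    return run["hours"] * 3600 / (run["instances"] * run["epochs"])

binary_cost = seconds_per_instance_epoch(binary)          # about 0.072 s
multiclass_cost = seconds_per_instance_epoch(multiclass)  # about 0.048 s
```

Taken at face value, the wall-clock grew more slowly than the data did. The text does not say whether batch size, hardware, or input pipeline changed between the two runs, so that gap should not be attributed to the reformulation itself.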
What we got for the cost was separation that stayed separated. Under the old framing, improvement on one mask tended to come at the expense of another, because the masks were coupled only through the shared backbone and not through the loss, so there was no consistent pressure holding the curves apart where they met. Under the single target, the per-class scores stopped behaving like three unrelated numbers stranded at a wall and started behaving like one system that could be pushed, because the objective finally had a reason to keep curve one and curve two distinct rather than letting both drift toward whatever the shared features found easiest.
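For readers who want to track per-class scores under the single target, here is a minimal sketch of pixelwise F1 computed from one predicted label map. It assumes F1 is measured per pixel and per class, which this piece does not spell out, and the function name and toy labels are hypothetical. The label maps are flattened to lists with 0 for background and 1 and 2 for the two curves.

```python
def per_class_f1(pred, truth, num_classes=3):
    """Pixelwise F1 for each class from two flat label maps of equal length."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, truth) if p != c and t == c)
        denom = 2 * tp + fp + fn
        # A class absent from both maps gets 0.0 here by convention.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy example: six pixels, two per class in the ground truth.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
scores = per_class_f1(pred, truth)
```

Because every pixel carries exactly one predicted label, a false positive for one class is always a false negative for another, which is the bookkeeping form of the coupling described above.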
Why the target shape is the real design decision
The lesson is not specific to well logs. When a segmentation model plateaus and the usual levers do nothing, the first suspect should be the shape of the label, not the capacity of the network. A target made of independent binary masks quietly encodes an assumption that the classes do not interact, and for anything where the interesting behaviour lives at the boundaries between classes, that assumption is false in the one place it matters most. Curves in a well log interact constantly. They cross, they crowd, they run parallel a hair apart, and the entire value of the digitiser is getting those exact situations right.
Framing the target as a single multiclass assignment did not add a clever module or a new loss nobody had seen. It changed what question the model was being asked, from three separate calls that could not see each other into one call that had to be internally consistent. The plateau at 0.55 was the sound of three answers that never had to agree. Once the target forced them to be one answer, the wall stopped being a wall.
Limitations
This account describes a target reformulation on one operator's raster-log archive, and the specific numbers, the 0.55 plateau on the best binary mask and the 0.37 and 0.26 on the others, are from that engagement and should not be read as benchmarks for segmentation in general. The multiclass framing here used three classes because the sheets in scope carried at most two curves plus background; a sheet with more overlapping traces would need more classes and would not necessarily behave the same way, since mutual exclusivity gets harder to satisfy as classes crowd. The training figures, two thousand instances at roughly two hours for the binary regime and fifteen thousand instances at roughly ten hours for the multiclass regime, both across fifty epochs, reflect our data and hardware at the time and are not a claim about what the reformulation costs anywhere else. Finally, the point of this piece is the target shape, not the surrounding pipeline. Reassembly of multi-page scans, curve tracing from the mask, and depth indexing are separate stages with their own failure modes, and moving from binary to multiclass fixed the plateau in the segmenter without touching any of them.
References
[1] Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR 2015. https://arxiv.org/abs/1411.4038
[2] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597
[3] Milletari, F., Navab, N., and Ahmadi, S. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV 2016. https://arxiv.org/abs/1606.04797
[4] Lin, T., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017. https://arxiv.org/abs/1708.02002