Before you pick a loss, a backbone, or an augmentation policy, there is an earlier fork that quietly decides most of them: are you doing instance segmentation or semantic segmentation? The two get filed under one heading because both produce masks, but they answer different questions, and choosing the wrong one can waste months building machinery for a problem you do not have. This is a plain guide to that fork for anyone who has to slice an image into parts, written from the raster-log-digitisation work behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper logs. It is deliberately the primer, not the published research comparison; the point here is the decision, and why for our task it is not close.
The two questions, stated plainly
Semantic segmentation labels every pixel with a class drawn from a set you fix before training. The output is one map the same size as the input, and each pixel says which of your named classes it belongs to. There is no notion of separate objects inside a class: every pixel that is road is simply road, whether it is one road or five. The framing that made this the standard shape of the task is the fully convolutional network [1], and the encoder-decoder that produces a full-resolution map for a small, known class set is the U-Net family [3].
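As a minimal sketch of what "one map the same size as the input" means in practice, the snippet below turns per-pixel class scores into a label map by taking the highest-scoring class at each pixel. The class names, the `label_map` helper, and the score values are illustrative assumptions, not anything taken from VeerNet or the cited papers.

```python
# Illustrative class set; any list fixed before training works the same way.
CLASSES = ["background", "road", "building"]

def label_map(scores):
    """Turn per-pixel class scores into one label map of the same height and width.

    scores[r][c] is a list with one score per class, in CLASSES order.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]

# A toy 2x3 "image" of scores (hypothetical values, not model output).
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]],
]
labels = label_map(scores)
names = [[CLASSES[i] for i in row] for row in labels]
```

Three pixels come out as road here, and the map carries no information about whether they belong to one road or several. That is the property the rest of this guide leans on.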
Instance segmentation asks a different thing: find each individual object and give it its own mask, even when several objects share a class. Two cars are two instances, each with a separate outline, and the method has to detect them as distinct before it masks them. The reference approach detects each object as a region first, then predicts a mask inside that detection [2]. The unifying vocabulary that names this split cleanly comes from panoptic segmentation, which distinguishes "stuff", the uncountable regions semantic labelling handles, from "things", the countable instances that instance methods exist for [4].
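To make the difference in output shape concrete, here is a rough sketch that splits one class mask into one pixel set per object. It is not the detect-then-mask method of [2]: it uses plain connected-component labelling, a far simpler stand-in chosen only to show that an instance-level result is a list of separate masks. The `split_instances` name and the toy mask are assumptions for illustration.

```python
from collections import deque

def split_instances(mask):
    """Split a binary mask into one pixel set per 4-connected blob.

    mask[r][c] is truthy where the class is present. Returns a list of sets of
    (row, col) coordinates, one set per separate object.
    """
    height, width = len(mask), len(mask[0])
    seen = set()
    instances = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            blob, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                blob.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            instances.append(blob)
    return instances

# One semantic "car" mask that actually contains two separate cars (toy data).
car_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
cars = split_instances(car_mask)
```

Connected components only work while the objects do not touch or overlap. Once they do, the blobs merge, which is one reason dedicated instance methods detect each object before masking it.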
So the fork is not about which model is more accurate. It is a question about your data: are the parts you care about a fixed set of named categories, or an open, countable population of separable objects? Answer that, and the method follows.
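The data question can be written down as a very small rule. The sketch below is only a restatement of the paragraph above in code; `TaskProfile`, its two fields, and `choose_framing` are hypothetical names invented for this example, and the rule is a rule of thumb rather than a measured threshold.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a segmentation task, for illustration only."""
    class_set_fixed: bool      # are the categories named before training?
    separate_same_class: bool  # must objects sharing a class get their own masks?

def choose_framing(task: TaskProfile) -> str:
    """Map the data question onto a branch of the fork."""
    if task.separate_same_class:
        return "instance"
    if task.class_set_fixed:
        return "semantic"
    return "undecided"  # open class set, nothing to separate: outside this guide

# Labelling every road pixel as road: named classes, no objects to tell apart.
road_scene = TaskProfile(class_set_fixed=True, separate_same_class=False)
# Outlining each car separately: same class, distinct objects.
car_outlines = TaskProfile(class_set_fixed=True, separate_same_class=True)
```

Calling `choose_framing(road_scene)` returns `"semantic"` and `choose_framing(car_outlines)` returns `"instance"`. Model accuracy never enters the function, which is the point.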
What the raster log actually contains
A scanned well log, in the form our multiclass model was trained on, contains two curves per image, and the pair never varies. They are not anonymous objects to be discovered and counted. They are the same two named quantities on every log, and the output vocabulary is fixed before we see a single pixel: three classes, background, curve1, and curve2. Nothing about the task changes that count from image to image. There is no image where a third curve of the same kind might appear and need to be told apart from the first two as a separate instance. The population is closed, named, and small.
That is the exact profile semantic segmentation is built for. We already know what is in the picture; the only question left is which pixels belong to which of the three known classes. Reaching for instance segmentation here would be answering a question the data never asks, "how many separate curve-objects are there and where does each begin", when the honest answer is always the same: two, these two, every time. The exhibit below turns that reasoning into a fork you can drive. Drag the lever for how many distinct objects you actually need to tell apart, and watch where the verdict lands. For a small, fixed, named class set, it lands on semantic, and it stays there until the count climbs into the range where unlabelled duplicates are the real problem.
The numbers behind the fork
The choice is not only conceptual; we have the measurements. Trained as one three-class per-pixel model, the semantic setup peaks at F1 0.55 and IoU 0.51 across the classes it labels jointly. Those are honest numbers for thin, sparse structures on noisy scans, and the whitepaper covers why thin-structure overlap is hard, but the shape that matters here is that a single model carries a shared understanding of the whole picture: it knows that a pixel is curve1 partly because it knows the pixel next to it is background and the one below is curve2. The classes are decided together, in one pass, against one another.
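For readers who want the metrics spelled out, the sketch below computes F1 and IoU for one class from a predicted and a ground-truth label map, using the standard pixel-count definitions. How the archive figures were averaged across classes is not stated here, so this shows only the per-class building block; the `per_class_scores` helper and the toy label maps are assumptions, not archive data.

```python
def per_class_scores(pred, truth, class_id):
    """Return (f1, iou) for one class, comparing two label maps pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == class_id and t == class_id:
                tp += 1      # predicted and present
            elif p == class_id:
                fp += 1      # predicted but absent
            elif t == class_id:
                fn += 1      # present but missed
    # Convention assumed here: a class absent from both maps scores 0.0.
    if tp + fp + fn == 0:
        return 0.0, 0.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou

# 0 = background, 1 = curve1, 2 = curve2 (toy label maps).
truth = [
    [0, 1, 1, 0],
    [0, 0, 2, 2],
]
pred = [
    [0, 1, 0, 0],
    [0, 2, 2, 2],
]
curve1_f1, curve1_iou = per_class_scores(pred, truth, 1)
curve2_f1, curve2_iou = per_class_scores(pred, truth, 2)
```

On thin structures a single missed or extra pixel moves both scores a long way, because the foreground count in the denominator is so small.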
Contrast the instance-flavoured alternative for this task, which is to train separate binary masks, one per curve, and stitch them back together afterward. That path gives three independently trained masks scoring F1 0.37, 0.26, and 0.55, and the deeper problem is not just the lower average. It is that the masks have no shared notion of the scene. Each is solved in isolation, so nothing in the setup prevents two of them from claiming the same pixel or both going quiet in the same gap, because none of them was ever told the others exist. The joint three-class model gets that mutual exclusivity for free, by construction, which is the practical reason the semantic framing wins for a fixed class set and not merely a headline-metric reason.
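The structural difference shows up even on a three-pixel toy example. In the sketch below, thresholding each class on its own can leave a pixel claimed twice or not at all, while a single per-pixel argmax always assigns exactly one class. The scores, the 0.5 threshold, and the helper names are hypothetical. A jointly trained model would also produce different scores from separately trained ones, so reusing one set of numbers for both paths is a simplification.

```python
CLASSES = ["background", "curve1", "curve2"]

def independent_masks(probs, threshold=0.5):
    """One binary decision per class per pixel, each made without seeing the others."""
    return [[[p >= threshold for p in pixel] for pixel in row] for row in probs]

def joint_labels(scores):
    """One decision per pixel: the single highest-scoring class wins."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]

def count_conflicts(masks):
    """Count pixels claimed by more than one mask, and pixels claimed by none."""
    overlaps = gaps = 0
    for row in masks:
        for claims in row:
            n = sum(claims)
            overlaps += n > 1
            gaps += n == 0
    return overlaps, gaps

# Hypothetical per-class outputs for a 1x3 strip of pixels. Because each binary
# model is trained alone, its outputs need not sum to 1 across classes.
probs = [[
    [0.9, 0.1, 0.2],   # clean background
    [0.2, 0.7, 0.6],   # both curve models fire
    [0.4, 0.3, 0.45],  # no model is confident
]]
overlaps, gaps = count_conflicts(independent_masks(probs))
labels = joint_labels(probs)
```

With these numbers the independent path produces one doubly claimed pixel and one unclaimed pixel, and any stitching step has to invent a tie-break for both. The argmax path has neither case to handle.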
When the other branch is right
None of this makes instance segmentation the weaker tool. It makes it the tool for a different job. If the task were "count and separate every vug in a carbonate image" or "outline each of an unknown number of grains", the objects would be countable, duplicated, and unlabelled in advance, and semantic labelling alone would collapse them into one undifferentiated region. That is precisely the "things" case the panoptic framing names [4], and it is where detecting each instance first [2] earns its cost. The lever in the exhibit is built to show this: push the object count up, and the fork tips over to the instance branch, because at that point you genuinely do not know how many separate objects there are, and telling them apart is the whole task.
The discipline is to answer the data question honestly rather than defaulting to whichever method is fashionable. For raster logs the answer is unambiguous. Two named curves, three fixed classes, the same set on every image: that is a semantic problem, and the machinery of instance detection would be effort spent solving a counting question that our data never poses.
Limitations
This is a decision guide grounded in one engagement, not a benchmark, and it should be read as such. The F1 and IoU figures are the real archive numbers for our specific curves, scans, and class set, so they characterise the difficulty of this task, not segmentation in general; a different operator's logs with more curves, crossing traces, or a genuinely variable curve count could shift the split point or even tip the decision to the instance branch. The object-count lever in the exhibit is an illustrative control that dramatises the decision rule, not a measured series, and the split point on it is a rule of thumb rather than a learned threshold. The framing also assumes the class set really is known and stable ahead of time; where that assumption fails, where new kinds of curve can appear without warning, the tidy semantic case weakens and the honest answer becomes more mixed. And picking the right branch only settles the framing. It does not decide the backbone, the loss under foreground scarcity, or whether a mask that scores well actually reconstructs a usable curve, which are separate questions this note does not try to answer.
References
[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 3431-3440. The paper that fixed per-pixel dense labelling as the standard shape of semantic segmentation. https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html
[2] He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. IEEE International Conference on Computer Vision (ICCV 2017), pp. 2961-2969. The reference instance-segmentation method: detect each object, then mask inside each detection. https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html
[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015, Lecture Notes in Computer Science 9351, Springer, pp. 234-241. The encoder-decoder that produces a full-resolution per-pixel map for a small, known class set. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28
[4] Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. Panoptic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), pp. 9404-9413. The framing that separates uncountable "stuff" from countable "things", the distinction that decides the fork. https://openaccess.thecvf.com/content_CVPR_2019/html/Kirillov_Panoptic_Segmentation_CVPR_2019_paper.html