Region Proposal versus Per-Pixel Labeling: Instance and Semantic Segmentation Compared

Abstract

This is a survey, not a benchmark. We ask which segmentation paradigm fits a specific target geometry: overlapping thin curves, the shape to which recovering numeric logs from scanned well-log images reduces. Two lineages answer different questions. Region-proposal instance segmentation, running from Faster R-CNN to Mask R-CNN, localises each object with an axis-aligned bounding box before masking inside it [3][4]; per-pixel semantic segmentation, running from FCN through U-Net and DeepLab, labels every pixel directly with no box at all [1][2][5]. Crediting each lineage for what it introduced and when, we argue from representation, not from a measured shoot-out, that when two long thin curves cross, their bounding boxes become nearly coincident and the box abstraction discards the very information needed to tell the curves apart, whereas per-pixel labeling keeps overlapping curves distinct by construction. We illustrate the box collapse geometrically with an interactive exhibit, and we ground the per-pixel side in our own setting, the 2 constant curves per log that we cast as a 3-class problem, reporting the sourced per-class intersection-over-union (IoU) of 0.94 for background, 0.26 for curve 1, and 0.21 for curve 2. The finding is qualitative as much as quantitative: per-pixel labeling is the right hypothesis class for crossing thin structures, and the residual difficulty lives in the thin foreground classes rather than in the choice of paradigm.

Background and related work

The semantic and instance lineages diverged from a common ancestor, and it is worth tracing both in historical order, because the comparison only makes sense once each is credited for what it actually solves.

The semantic lineage starts with the fully convolutional network, which replaced the fully connected classification head of an image classifier with convolutions, so that the output is a spatial map of class scores at every pixel rather than a single label for the whole image [1]. That single change, dense prediction, is the foundation of everything that followed. U-Net added a symmetric decoder with skip connections from the encoder, which restored the fine spatial detail that pooling throws away and made the approach work on small datasets and delicate structures; this is why it became the default in biomedical and scientific imaging [2]. DeepLab attacked the same resolution problem from a different angle, using atrous convolution to widen the receptive field without downsampling away the detail, and pairing it with a fully connected conditional random field to sharpen boundaries [5]. The through-line of this family is that the prediction is per-pixel and the design effort goes into recovering high-resolution detail. None of these methods has a notion of an object instance; they label stuff, not things.

The instance lineage starts from object detection. Faster R-CNN introduced a learned region proposal network that scans the image and emits candidate object boxes, which a second stage then classifies and refines [3]. Mask R-CNN extended that two-stage detector by adding a small mask-prediction branch that, for each proposed and classified box, predicts a binary mask of the object inside it [4]. This is an elegant and influential design, and for countable, compact objects such as people, cars, and animals it is excellent: the box is a tight, informative summary of where the object is, and predicting a mask within a well-localised box is a comparatively easy problem. Panoptic segmentation later formalised the relationship between the two families, naming the semantic part stuff and the instance part things and proposing a metric that scores both at once [6]. That formalisation is exactly the right lens for the question this survey asks: which abstraction, the box-owned thing or the per-pixel stuff, fits a target that is a thin curve.

Method

The comparison we set up is deliberately about the target geometry rather than about any one network's tuning, so we describe it as a hypothesis-class question. A region-proposal instance method represents each curve by an axis-aligned bounding box plus a mask sampled on a fixed grid inside that box. A per-pixel semantic method represents the scene as a class label at every pixel, with no per-object container at all. We then ask what each representation does as two thin curves move from well separated to fully crossing.

In our own setting the per-pixel formulation is concrete. The scanned log presents 2 constant curves per track, and we cast recovery as a 3-class per-pixel segmentation: class 0 background, class 1 curve 1, class 2 curve 2. Every pixel of the image is assigned exactly one of those three labels, and the curves are recovered by reading off the pixels of each foreground class. This is the FCN formulation applied directly to the problem [1], realised with a U-Net-style encoder-decoder so the thin foreground survives the downsampling [2]. There is no box anywhere in the pipeline, which is the whole point of the comparison.
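The read-off step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `trace_curve`, the per-column mean used as the curve position, and the toy label map are all assumptions introduced here. Only the class ids (0 background, 1 curve 1, 2 curve 2) come from the text.

```python
import numpy as np

# Class ids as stated in the text.
BACKGROUND, CURVE_1, CURVE_2 = 0, 1, 2


def trace_curve(label_map: np.ndarray, class_id: int) -> np.ndarray:
    """Return one row coordinate per column for the given class.

    The position is the mean row of that class's pixels in the column
    (an assumed read-off rule). Columns with no such pixel give NaN.
    """
    mask = label_map == class_id                          # (H, W) boolean
    rows = np.arange(label_map.shape[0], dtype=float)[:, None]
    counts = mask.sum(axis=0)                             # pixels per column
    row_sums = (mask * rows).sum(axis=0)                  # sum of row indices
    trace = np.full(label_map.shape[1], np.nan)
    hit = counts > 0
    trace[hit] = row_sums[hit] / counts[hit]
    return trace


# Hypothetical 6x5 label map: curve 1 descends, curve 2 ascends,
# and the two cross between columns 2 and 3.
label_map = np.full((6, 5), BACKGROUND, dtype=np.uint8)
for col in range(5):
    label_map[col, col] = CURVE_1
    label_map[5 - col, col] = CURVE_2

curve_1_rows = trace_curve(label_map, CURVE_1)
curve_2_rows = trace_curve(label_map, CURVE_2)
print(curve_1_rows, curve_2_rows)
```

Because each pixel carries exactly one label, the two traces come out as separate arrays without any per-object container, which is the property the comparison turns on.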

To make the failure of the box representation visible rather than merely asserted, the interactive exhibit below sweeps the two curves from apart to crossing and draws both paradigms in parallel. The left panel shows the region-proposal view, where each curve gets a box; the right panel shows the per-pixel view, where each pixel gets a class. The slider is the curve separation.

Two thin curves swept from well separated to fully crossing, with the two segmentation paradigms shown side by side. On the left, region-proposal instance methods (the Mask R-CNN lineage) own a whole box per curve; as the curves approach, the boxes swell into each other, their overlap balloons, and the instance question of which curve owns the shared ink breaks down at the junction. On the right, per-pixel semantic labelling (the FCN lineage) assigns every pixel independently to one of 3 classes (background plus curve 1 plus curve 2), so the two curves keep distinct labels straight across the crossing. The bars show the sourced multiclass IoU for that 3-class target (background 0.94, curve 1 0.26, curve 2 0.21). The curve count, class count and IoU values are sourced from the engagement archive; the crossing geometry and the box-overlap percentage are illustrative, drawn to make the failure mode legible.

What the exhibit makes plain is a geometric fact, not an artefact of training. A long thin curve that rises and falls across the width of the image has a bounding box that spans almost the full frame: the box is wide because the curve is wide, and tall because the curve undulates. Two such curves have two nearly identical, nearly frame-filling boxes. As you drag them together, the boxes do not just overlap, they become indistinguishable, and at the crossing the box-based question of which instance owns a given pixel has no good answer, because both boxes claim almost every pixel. The per-pixel panel is untroubled by the same motion: each pixel is labelled on its own evidence, so the two curves keep distinct labels straight through the junction. The box is a summary that is informative for a compact object and nearly empty for a frame-spanning curve, and that is the structural reason the instance paradigm struggles here.
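The geometric fact can be checked with a few lines of arithmetic. The sketch below uses two hypothetical mirrored sine curves spanning a 100-unit-wide frame and computes the IoU of their axis-aligned bounding boxes as the vertical separation shrinks. The amplitude, the frame width, and the helper names are assumptions for illustration, not the exhibit's geometry.

```python
import numpy as np

FRAME_WIDTH = 100.0  # hypothetical frame width
AMPLITUDE = 20.0     # hypothetical undulation of each curve


def bounding_box(xs, ys):
    """Axis-aligned box (x_min, y_min, x_max, y_max) of a sampled curve."""
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_iou_at(separation, samples=501):
    """Box IoU of two mirrored sine curves offset vertically by `separation`."""
    xs = np.linspace(0.0, FRAME_WIDTH, samples)
    phase = 2.0 * np.pi * xs / FRAME_WIDTH
    y1 = AMPLITUDE * np.sin(phase)
    y2 = -AMPLITUDE * np.sin(phase) + separation  # mirrored, shifted copy
    return box_iou(bounding_box(xs, y1), bounding_box(xs, y2))


for separation in (60.0, 30.0, 10.0, 0.0):
    print(f"separation {separation:5.1f} -> box IoU {box_iou_at(separation):.2f}")
```

For this toy geometry each box is 100 wide and 40 tall, so the box IoU is (40 - s) / (40 + s) for a separation s below 40: about 0.14 at s = 30, 0.6 at s = 10, and 1.0 at s = 0, where the two boxes are identical although the curves remain different sets of points.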

Results

To be precise about what this section reports: we did not run a controlled bake-off of Mask R-CNN against a per-pixel network on the identical logs, and nothing below should be read as one. The comparison this survey draws is a representational one, argued from how each paradigm encodes a frame-spanning curve and corroborated by the published behaviour of each lineage on thin and overlapping structures [1][2][4]. What we can report on our own data is the per-pixel side of the argument, the side we built and measured.

For that per-pixel side, the sourced per-class intersection-over-union on the 3-class target is background 0.94, curve 1 0.26, and curve 2 0.21. Those three numbers tell a coherent story. The background class, which holds the overwhelming majority of pixels and is spatially easy, is recovered almost perfectly at 0.94. The two thin foreground classes sit far lower, at 0.26 and 0.21, with a small asymmetry in the expected direction, the curve that spends more of its length near the other being marginally harder. The modest foreground IoU is not evidence against the per-pixel paradigm; it is the residual difficulty of thin-structure recovery. A curve only a few pixels wide has so small a union that an offset of a single pixel removes a large share of the intersection, so even a near-perfect centreline scores low.
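The sensitivity of IoU to thin structures can be made concrete with a small worked example. The sketch below computes per-class IoU on a synthetic 3-class label map in which each curve is a straight band three pixels thick and the prediction is offset by one or two pixels. The helper names, the image size, and the band geometry are assumptions for illustration and are unrelated to the measured logs.

```python
import numpy as np


def per_class_iou(pred, target, num_classes=3):
    """IoU for each class id; NaN where a class is absent from both maps."""
    ious = []
    for class_id in range(num_classes):
        p, t = pred == class_id, target == class_id
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union else float("nan"))
    return ious


def two_band_map(top_1, top_2, shape=(64, 64), thickness=3):
    """Label map with curve 1 and curve 2 drawn as horizontal bands."""
    label_map = np.zeros(shape, dtype=np.uint8)       # class 0: background
    label_map[top_1:top_1 + thickness, :] = 1         # class 1: curve 1
    label_map[top_2:top_2 + thickness, :] = 2         # class 2: curve 2
    return label_map


target = two_band_map(top_1=20, top_2=40)
# Prediction: curve 1 is one pixel low, curve 2 is two pixels low.
pred = two_band_map(top_1=21, top_2=42)

background_iou, curve_1_iou, curve_2_iou = per_class_iou(pred, target)
print(f"background {background_iou:.2f}, "
      f"curve 1 {curve_1_iou:.2f}, curve 2 {curve_2_iou:.2f}")
```

In this toy case the one-pixel offset gives curve 1 an IoU of 0.5 and the two-pixel offset gives curve 2 an IoU of 0.2, while background stays near 0.90 (55/61). These numbers are not a reconstruction of the reported figures; they only show how steeply IoU falls for a structure three pixels thick.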

The reasoned comparison rests on the geometry the exhibit makes visible and on the literature, not on a measured loss for the box-based method. The region-proposal lineage was designed for and is reported strong on compact, countable objects, where a tight box is an informative summary [3][4]; the per-pixel lineage was designed for and is reported strong on delicate, dense structure, which is why U-Net became the default in scientific and biomedical imaging [2]. Mapping those documented strengths onto our target, frame-spanning curves that cross, the predicted outcome is that the box representation collapses at the junction while per-pixel labeling keeps the curves separable. The per-pixel run is consistent with that prediction: it yields two distinct, if thin, foreground classes straight through the crossings, which is exactly the property the box-based representation cannot guarantee.

Discussion

Where does this leave the two lineages in the broader field? The conclusion is not that region-proposal instance segmentation is a worse idea than per-pixel labeling in general; that would be a misreading of a paradigm that is excellent on the objects it was designed for [4]. The conclusion is narrower and, we think, more useful: the suitability of a segmentation paradigm is a function of the target geometry, and the axis-aligned box that makes Mask R-CNN strong on compact things is precisely what makes it weak on frame-spanning thin curves that overlap. The box is a low-dimensional summary of object extent, and a summary is only as good as what it preserves. For a person in a photograph the box throws away almost nothing useful; for a curve that crosses another curve the box throws away the only thing that mattered, which is the local, per-pixel evidence of which ink is which at the junction.

Seen through the panoptic lens, our target is stuff that behaves like things [6]. We want two distinct curves out, which sounds like an instance problem, but the curves are best recovered as semantic classes because the per-pixel representation is the one that survives overlap. The pragmatic synthesis the field has converged on, and that our work is an instance of, is to pose such problems semantically when the number of object types is small and fixed, here two curves, so that distinct classes do the work that instances would otherwise be asked to do. With 2 constant curves per log the class budget is tiny and the assignment is unambiguous, which is exactly the regime where a 3-class semantic formulation cleanly substitutes for instance segmentation. VeerNet, the encoder-decoder we built for this digitisation, takes that route deliberately, and the IoU profile above is the per-pixel paradigm doing its job: separating the curves everywhere, including where they cross.

Limitations

Three limitations bound the claims here. First, the central comparison is geometric and illustrative rather than a controlled benchmark; we argue and draw the box-collapse failure but do not report a tuned Mask R-CNN IoU on the identical logs, so a reader who wants a head-to-head number will not find one in this piece. Second, the substitution of semantic classes for instances works because our curve count is small and fixed at 2; it does not generalise to a setting with many same-class curves per image, where you genuinely need instance identity and a different mechanism would be required. Third, the foreground IoU figures of 0.26 and 0.21 reflect the thin-structure denominator problem and should not be compared naively to whole-object IoU from natural-image benchmarks; the appropriate companion metrics for thin curves are tolerance-band and centreline measures, which a per-pixel IoU understates by design. Within those bounds, the survey's conclusion holds: for overlapping thin curves with a small fixed class count, per-pixel semantic labeling is the right hypothesis class and region-proposal instance segmentation is not.
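One possible form of a tolerance-band measure is sketched below. The definition (the fraction of reference columns whose predicted centreline row falls within a fixed pixel tolerance), the function name `tolerance_band_hit_rate`, and the default tolerance are assumptions made for illustration; the text names the metric family but does not define it.

```python
import numpy as np


def tolerance_band_hit_rate(pred_rows, true_rows, tol=2.0):
    """Fraction of reference columns where the prediction is within `tol` px.

    Both inputs hold one centreline row per column, with NaN marking a
    column where the curve is absent. A column with a reference value but
    no prediction counts as a miss.
    """
    pred_rows = np.asarray(pred_rows, dtype=float)
    true_rows = np.asarray(true_rows, dtype=float)
    has_ref = ~np.isnan(true_rows)
    if not has_ref.any():
        return float("nan")
    comparable = has_ref & ~np.isnan(pred_rows)
    hits = np.abs(pred_rows[comparable] - true_rows[comparable]) <= tol
    return hits.sum() / has_ref.sum()


# Hypothetical centrelines over five columns.
true_rows = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
pred_rows = np.array([10.0, 13.0, 13.0, 19.0, np.nan])

rate = tolerance_band_hit_rate(pred_rows, true_rows, tol=2.0)
print(f"hit rate within 2 px: {rate:.2f}")
```

Here three of the five reference columns fall inside the band, so the hit rate is 0.6. Unlike per-pixel IoU, a measure of this kind does not penalise a one-pixel offset on a curve that is only a few pixels thick.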

Key findings

Two lineages, two questions: region-proposal instance methods (Faster R-CNN to Mask R-CNN) put a box around each object then mask inside it; per-pixel semantic methods (FCN, U-Net, DeepLab) label every pixel directly with no box at all.
The box abstraction collapses on crossing thin curves: a frame-spanning curve has a frame-filling box, so two overlapping curves have two near-identical boxes and the instance question of which curve owns a pixel has no answer at the junction.
Per-pixel labeling is untroubled by overlap: each pixel is classed on its own evidence, so two curves keep distinct labels straight through a crossing. That is the structural advantage this survey argues.
Our digitisation target is a clean fit for the semantic paradigm: 2 constant curves per log cast as a 3-class per-pixel problem (background plus curve 1 plus curve 2), with sourced multiclass IoU of background 0.94, curve 1 0.26 and curve 2 0.21.
Modest foreground IoU is the thin-structure denominator problem, not a verdict on the paradigm: a curve a few pixels wide scores low on IoU even when its centreline is right, which is why tolerance-band and centreline metrics are the proper companions.

References

[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The paper that established dense, per-pixel labeling with fully convolutional networks as the semantic segmentation template. https://arxiv.org/abs/1411.4038

[2] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder with skip connections that made dense labeling work on small data and thin structures. https://arxiv.org/abs/1505.04597

[3] Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS (2015). Introduced the learned region proposal network that the box-based instance lineage is built on. https://arxiv.org/abs/1506.01497

[4] He, K., Gkioxari, G., Dollar, P., and Girshick, R. Mask R-CNN. ICCV (2017). The reference region-proposal instance segmenter: a box and a per-instance mask predicted inside each proposed region. https://arxiv.org/abs/1703.06870

[5] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE TPAMI, 40(4), 834-848 (2018). Atrous convolution for dense labeling at higher effective resolution. https://arxiv.org/abs/1606.00915

[6] Kirillov, A., He, K., Girshick, R., Rother, C., and Dollar, P. Panoptic Segmentation. CVPR (2019). Formalised the split between stuff (semantic) and things (instance) and the metric that unifies them. https://arxiv.org/abs/1801.00868
