Self-Supervised Pretraining for Dense Prediction: A Survey of Pretext Tasks

Abstract

Self-supervised pretraining learns a representation from unlabelled images by inventing a pretext task whose targets come for free from the data itself, then transfers that representation to a downstream task that has few labels. For dense prediction, where the downstream task is a per-pixel mask rather than a single image label, the question of which pretext to choose is sharper than for classification, because a good pretext has to teach the network about local structure and not only about whole-image semantics. This survey reads the pretext literature through three working families: reconstruction pretexts that hide or corrupt part of the input and ask the network to restore it, instance-contrastive methods that pull two augmentations of the same image together and push different images apart, and masked-image modelling that reconstructs heavily masked patches in a transformer. We credit the papers that built each family and characterise what each is reported to teach. We then ask the question that decides the matter for us: how much of this transfers when the target is not internet-scale photographs but a small, document-style scientific dataset. Our raster well-log digitisation work is the test case, with synthetic segmentation training sets of 2,000 binary and 15,000 multiclass instances on an 80/20 train and validation split, and a 0.51 peak multiclass IoU that we read as an initialisation ceiling rather than an architecture ceiling. The short finding is that the families divide cleanly by their appetite for scale, and the family that leads the public benchmarks is the one whose appetite our regime cannot feed.

Why pretraining, and why this is a different question from few-shot

The expensive resource in dense prediction is the label, because the label is a mask a human has to draw. There are two distinct ways to spend less of it, and they are easy to conflate. One is to keep the supervised recipe but require only a handful of labelled examples per class at test time, which is the few-shot setting we surveyed separately. The other, the subject here, is to pretrain the network on a large pool of unlabelled images using a pretext task, so that when the few real labels do arrive the network is already a good feature extractor and needs fewer of them to specialise. Few-shot learning borrows labels; self-supervised pretraining borrows a representation. The two are orthogonal and can be stacked, but they answer different questions, and the rest of this piece is strictly about the second.

The reason the pretext choice is delicate for segmentation specifically is that not every self-supervised representation is dense. A pretext tuned to produce one good vector per image can leave the per-pixel features underdeveloped, which is exactly the part a segmentation head needs. Several of the methods below were designed for image-level transfer and only later adapted, and a smaller line was built dense from the start. Keeping that distinction visible is half the value of a survey.

Family one: reconstruction and handcrafted pretexts

The oldest family invents a pretext by removing information and scoring the network on putting it back. Context prediction asked a network to name the spatial relationship between two image patches, which forces it to learn what objects look like in order to reason about their layout (Doersch et al., 2015). Solving jigsaw puzzles generalised the idea to a permutation of nine tiles, which is a richer signal about part configuration (Noroozi and Favaro, 2016). Context encoders moved from relationships to pixels directly, masking a region and training the network to inpaint it with a combined reconstruction and adversarial loss, which is the conceptual ancestor of everything in the third family (Pathak et al., 2016). Colourful image colourisation removed the chrominance channels and asked the network to predict colour from luminance, a pretext that happens to teach object recognition because colour is correlated with semantics (Zhang et al., 2016). Predicting image rotations is the minimalist endpoint of the family: classify which of four rotations was applied, a near-trivial pretext that nonetheless yields surprisingly strong features (Gidaris et al., 2018).
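The rotation pretext is simple enough to sketch in a few lines, which shows how cheaply this family manufactures its targets. What follows is a minimal NumPy illustration, not the pipeline of Gidaris et al.: the function name make_rotation_batch, the random stand-in images, and the batch shape are assumptions introduced here.

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Build a rotation-prediction batch from square, unlabelled images.

    images: array of shape (N, H, W) with H == W.
    Returns (rotated, labels), where labels[i] in {0, 1, 2, 3} is the number
    of 90-degree turns applied to images[i]. The label is the free target.
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, k=int(k)) for img, k in zip(images, labels)]
    )
    return rotated, labels

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))  # hypothetical stand-in for unlabelled crops
rotated, labels = make_rotation_batch(images, rng)
```

A classifier trained to recover labels from rotated needs no human annotation at all; the only supervision is the transformation we applied ourselves.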

What unites this family is that the pretext is handcrafted and the supervisory signal is local and cheap. That is the property that matters for a small dataset: these methods were demonstrated on modest corpora and do not assume a million images to work at all. Their ceiling on natural-image benchmarks is lower than what came later, but their floor is reachable on a small budget, and for a document-style dataset the floor is the relevant number.

Family two: instance-contrastive learning

The second family changed the field. Instead of handcrafting a pretext, it defines the task as instance discrimination: take two random augmentations of the same image, embed both, and train the representation so the two views agree while views of different images disagree. SimCLR showed that with a strong augmentation policy, a projection head, and a large batch of negatives, this simple contrastive objective can match a supervised ResNet-50 under ImageNet linear evaluation, given a wider encoder (Chen et al., 2020). MoCo avoided the dependence on a single huge batch by keeping a queue of negatives encoded by a momentum-updated key encoder, which made contrastive learning feasible at ordinary batch sizes (He et al., 2020). BYOL then made the striking claim that the negatives are not strictly necessary at all, bootstrapping from a momentum-updated target network with a stop-gradient and a predictor to avoid the collapse that the field assumed required contrast (Grill et al., 2020). DINO carried the self-distillation idea into vision transformers and observed that the learned attention maps segment objects without any segmentation supervision, which is the most directly dense-relevant result the family produced (Caron et al., 2021).
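The instance-discrimination objective can be written down compactly. The sketch below is a simplified, one-directional InfoNCE loss in NumPy, offered as an illustration rather than as the exact loss of any cited paper: SimCLR's NT-Xent, for instance, also draws negatives from within each view and symmetrises the loss. The function name info_nce, the temperature of 0.1, and the random embeddings are assumptions made for this example.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """One-directional InfoNCE between two batches of view embeddings.

    z1, z2: arrays of shape (N, D). Row i of z1 and row i of z2 are the two
    views of image i (the positive pair); every other row of z2 serves as a
    negative for row i of z1.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # positives lie on the diagonal

rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 32))
z2_aligned = z1 + 0.05 * rng.normal(size=z1.shape)  # views that agree
z2_unrelated = rng.normal(size=z1.shape)            # views that do not

loss_aligned = info_nce(z1, z2_aligned)
loss_unrelated = info_nce(z1, z2_unrelated)
```

The loss is low when matching views agree and high when they do not, and its signal depends on the other rows in the batch being usefully different from the positive.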

This family has two properties that pull against a small dataset. First, the objective is image-level by construction, so the dense features that segmentation needs are a side effect rather than the trained target. The dense-from-the-start variants address this directly: DenseCL applies the contrastive loss between corresponding local features across views rather than between pooled image vectors (Wang et al., 2021), and PixPro enforces pixel-level consistency between the two views to learn representations tuned for dense downstream tasks (Xie et al., 2021). Second, and harder to engineer around, contrastive learning is hungry for negatives and for augmentation diversity, both of which scale with the size and variety of the unlabelled pool. On a few thousand near-identical synthetic logs, the negatives are too similar to teach much, and the augmentation policy that works on natural photographs has to be rebuilt from scratch.

Family three: masked-image modelling

The third family is the most recent and the most scale-dependent. It revives the inpainting idea inside a transformer and trains it to reconstruct heavily masked patches. BEiT predicts discrete visual tokens for the masked positions, importing the masked-language-model recipe into vision (Bao et al., 2022). The masked autoencoder simplified this to reconstructing raw pixels with an asymmetric encoder that sees only the visible patches and a lightweight decoder, masking as much as seventy-five per cent of the image, which makes pretraining both effective and cheap to run per step (He et al., 2022). SimMIM showed that the recipe does not need the discrete tokeniser or any heavy machinery, just a direct regression of masked pixels with a simple head, and still transfers strongly (Xie et al., 2022).
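The masking step these methods share can be sketched without a transformer. The example below is a hedged NumPy illustration of MAE-style random patch masking at the seventy-five per cent ratio mentioned above; the 224-pixel image, the 16-pixel patch size, and the helper names patchify, random_mask and masked_mse are assumptions chosen for the example, and the encoder and decoder are omitted entirely.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into non-overlapping patch x patch tiles, flattened."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True at the patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Reconstruction loss scored on the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
image = rng.random((224, 224))               # hypothetical stand-in image
patches = patchify(image, patch=16)          # 196 patches of 256 pixels each
mask = random_mask(len(patches), 0.75, rng)  # hide 147 of the 196 patches
visible = patches[~mask]                     # the 49 patches an encoder would see
```

Only the visible quarter of the patches would pass through the encoder, which is why the asymmetric design is cheap per step, while the loss is scored on the hidden three quarters.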

Masked-image modelling produces excellent dense features, because reconstructing local patches is itself a dense objective, and that is precisely why it is attractive for segmentation. The catch is in the same sentence that makes it attractive. The MAE result is explicitly a scaling result: its advantage over supervised pretraining grows with model and data size, and the regime in which it was demonstrated is large transformers on ImageNet and beyond. A masked autoencoder trained on a few thousand synthetic logs has very little to reconstruct that it has not effectively memorised, and the scale that the method converts into representation quality is the one resource a document-style dataset does not have.

Method: how we read transfer onto our regime

We did not run a full pretrain-then-finetune sweep of every method above on the well-log task; that is a larger study than this survey claims to be. What we did is place our true operating point on the same axis the literature implicitly uses, the size of the unlabelled pool, and reason about which families still have support there. Our segmentation training sets are small, and their sizes are measured rather than estimated: 2,000 instances for binary segmentation and 15,000 for the multiclass setting, both synthetic, both on an 80/20 train and validation split. The unlabelled pool available for any pretraining is of the same order, because a printed log is a deterministic rendering and the synthetic generator is the source of both the images and their masks. That budget sits at the far left of the axis the contrastive and masked families were tuned on.
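The arithmetic behind that placement is short enough to show. The sketch below computes the split sizes implied by the 80/20 ratio and the distance, in decades on a log10 axis, between each of our sets and one reference pool. The reference is an assumption of this example and not a figure from our archive: we take the ImageNet-1k training set of 1,281,167 images, and the helper names are likewise hypothetical.

```python
import math

# Assumed reference point, not a figure from this survey: the ImageNet-1k
# training set, which holds 1,281,167 images.
REFERENCE_POOL = 1_281_167

def split_counts(n, train_fraction=0.8):
    """Return (train, validation) counts for an 80/20 style split."""
    train = int(round(n * train_fraction))
    return train, n - train

def decades_below(n, reference=REFERENCE_POOL):
    """Distance on a log10 budget axis between a pool of n images and the reference."""
    return math.log10(reference / n)

for name, n in (("binary", 2_000), ("multiclass", 15_000)):
    train, val = split_counts(n)
    print(f"{name}: {train} train / {val} validation, "
          f"{decades_below(n):.1f} decades below the reference pool")
```

Against this one reference the two sets sit roughly two to three decades to its left; a larger reference pool would widen the gap accordingly.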

The downstream number we are trying to move is the peak multiclass IoU, which topped out at 0.51 on this task. We read that figure deliberately as an initialisation question rather than only a capacity question, because the per-class breakdown shows the network learning the background mask easily and struggling on the thin curve classes, which is the signature of a feature extractor that has not been taught the right local structure before the labelled stage begins. The instrument below plots the three families against the unlabelled budget and marks our two regimes and that ceiling, so the positional argument is visible rather than asserted.
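For readers who want the metric pinned down, the following is a minimal per-class IoU sketch in NumPy on a toy mask. It is an illustration under stated assumptions, not our evaluation code: the function name per_class_iou, the two-class toy layout, and the choice to report one IoU per class are all introduced here. It shows how a thin curve class can score poorly while the background scores well.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Intersection over union for each class; NaN where a class is absent."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious[c] = np.logical_and(p, t).sum() / union
    return ious

# Toy 4x6 masks: class 0 is background, class 1 is a one-pixel-thick curve.
target = np.zeros((4, 6), dtype=int)
target[1, :] = 1
pred = np.zeros((4, 6), dtype=int)
pred[1, :3] = 1   # half of the curve is recovered
pred[2, 3:] = 1   # the other half is predicted one row too low

ious = per_class_iou(pred, target, num_classes=2)
mean_iou = float(np.nanmean(ious))
```

In this toy case a one-row error on half of the curve costs the curve class two thirds of its IoU while the background class loses far less, which is the asymmetry the per-class breakdown is meant to expose.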

Results: where our budget lands on the map

A map of self-supervised pretext-task families against the unlabelled data budget they were designed for, with our real small-data segmentation regime drawn on top. The teal bands are the three families credited in the article, placed across a log image-budget axis by the regime each was built and reported for: reconstruction pretexts (inpainting, colourisation, jigsaw, rotation) that ask least of scale, instance-contrastive pretexts (SimCLR, MoCo, BYOL, DINO) that climb steeply once negatives and batch sizes grow, and masked-image modelling (BEiT, MAE, SimMIM) whose yield depends most sharply on scale. Drag the lever to sweep the budget; the right-hand column reads how much of the selected family's reported yield is in reach at that point and whether the budget sits inside or below the regime it was tuned for. The orange operating line and the two dashed markers are ours: our segmentation training sets were 2,000 binary and 15,000 multiclass synthetic instances on an 80/20 train and validation split, and the 0.51 peak IoU is the ceiling that better initialisation is meant to lift. The argument is positional: a document-style scientific dataset sits far to the left of where the field's leading pretexts were pressure-tested. The instance counts, the split, and the 0.51 figure are sourced from the engagement archive; the family bands and their transfer-yield shading are illustrative of the cited literature, not a reproduction of any single published table.

The chart makes one claim, and the lever lets the reader test it. Sweep the unlabelled budget from a small scientific corpus toward internet scale and the three families separate by appetite. The reconstruction family carries most of its reported yield at the left of the axis, because its handcrafted pretexts were designed for and demonstrated on modest data. The contrastive family climbs only once the budget is large enough to supply diverse negatives and a rich augmentation distribution. The masked family climbs latest and steepest, because its advantage is a scaling property by design. Our two markers, the 2,000-instance binary set and the 15,000-instance multiclass set, both fall well to the left of where the two leading families reach their reported sweet spots. At our budget the contrastive and masked bands are still in the shallow part of their curves, several decades short of the regime that earned them their benchmark numbers, while the reconstruction band is already near its plateau.

Read against the 0.51 IoU ceiling, the practical reading is not that masked or contrastive pretraining is bad. It is that the public ordering of these methods, which is roughly reconstruction below contrastive below masked, is an ordering at internet scale, and our regime is not at internet scale. The methods that win the leaderboard win it with a resource we do not have, and the method that asks least of scale is the one whose conditions our data actually meets.

Discussion: what this implies for a document-style dataset

The honest synthesis is that the field's progress in self-supervised pretraining has been, to a large degree, progress at exploiting scale, and a survey that ignores that reads the leaderboard backwards for a small-data problem. For natural-image transfer the ranking in the literature is real and well earned. For a document-style scientific dataset, three observations reorder the practical choice. First, denseness of the pretext matters more than its leaderboard rank, which is why the dense-from-the-start contrastive variants and the masked family are conceptually the right shape for segmentation even when their scale assumption fails. Second, the handcrafted reconstruction pretexts are underrated in this regime precisely because their floor is reachable, and inpainting on synthetic logs is a pretext whose targets we can manufacture without limit. Third, and most useful, the renderability of our data cuts both ways: the same generator that makes the scale problem acute also lets us design a domain-specific reconstruction pretext, masking and restoring curve segments rather than generic patches, which is a pretext the literature did not need to invent because the public datasets are not renderable.
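The domain-specific pretext in the third observation can be sketched concretely. The example below is a hypothetical illustration and not our generator: it rasterises a single synthetic curve, erases it over a random depth interval, and keeps the unmasked rendering as the reconstruction target. The sine-wave curve, the image size, the segment length, and the names render_curve and mask_curve_segment are all assumptions made for the example.

```python
import numpy as np

def render_curve(values, width):
    """Rasterise one curve: each row is a depth sample, the lit column its value.

    values: array in [0, 1], one entry per depth sample.
    """
    depth = len(values)
    image = np.zeros((depth, width), dtype=np.float32)
    cols = np.clip(np.round(values * (width - 1)).astype(int), 0, width - 1)
    image[np.arange(depth), cols] = 1.0
    return image

def mask_curve_segment(image, segment_len, rng):
    """Erase the curve over one contiguous depth interval.

    Returns (corrupted, mask), where mask marks the erased rows.
    """
    depth = image.shape[0]
    start = int(rng.integers(0, depth - segment_len + 1))
    mask = np.zeros(depth, dtype=bool)
    mask[start:start + segment_len] = True
    corrupted = image.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

rng = np.random.default_rng(0)
depth, width = 256, 64
values = 0.5 + 0.4 * np.sin(np.linspace(0, 6 * np.pi, depth))  # stand-in curve
target = render_curve(values, width)          # what the network must restore
corrupted, mask = mask_curve_segment(target, segment_len=48, rng=rng)
```

Because both the corrupted input and its target come from the same renderer, pairs like this can be produced in any quantity, and the masked region always contains the thin structure the segmentation head later has to find.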

Where our own work sits in this field, then, is not as a competitor on any of these benchmarks but as a regime the benchmarks do not cover. The contribution of the survey is to make the axis explicit: pretext-task quality is reported against a data budget, and a method's rank changes when the budget does. For us the actionable conclusion is to prefer dense, reconstruction-shaped pretexts that we can feed from the generator, and to treat the contrastive and masked families as targets to revisit only if the unlabelled pool ever grows by the decades the chart says they need.

Limitations

This is a survey with a positional argument, not a controlled benchmark, and three limits follow from that. We did not finetune each surveyed method on the well-log task, so the transfer-yield curves in the instrument are illustrative of the cited literature rather than measured on our data; the only measured figures are our own instance counts, split, and the 0.51 peak IoU. The data-budget axis is a useful simplification but not the only axis that matters: augmentation design, model capacity, and the match between pretext and downstream geometry all move transfer independently of raw image count, and a fuller study would vary them. Finally, the survey is period-bounded to early 2023 and to the families that had stabilised by then; the pretext literature moves quickly, and a later reading would have to fold in methods that postdate this quarter.

References

[1] C. Doersch, A. Gupta, A. A. Efros. Unsupervised Visual Representation Learning by Context Prediction. ICCV 2015. https://arxiv.org/abs/1505.05192

[2] M. Noroozi, P. Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. ECCV 2016. https://arxiv.org/abs/1603.09246

[3] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, A. A. Efros. Context Encoders: Feature Learning by Inpainting. CVPR 2016. https://arxiv.org/abs/1604.07379

[4] R. Zhang, P. Isola, A. A. Efros. Colorful Image Colorization. ECCV 2016. https://arxiv.org/abs/1603.08511

[5] S. Gidaris, P. Singh, N. Komodakis. Unsupervised Representation Learning by Predicting Image Rotations. ICLR 2018. https://arxiv.org/abs/1803.07728

[6] T. Chen, S. Kornblith, M. Norouzi, G. Hinton. A Simple Framework for Contrastive Learning of Visual Representations (SimCLR). ICML 2020. https://arxiv.org/abs/2002.05709

[7] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick. Momentum Contrast for Unsupervised Visual Representation Learning (MoCo). CVPR 2020. https://arxiv.org/abs/1911.05722

[8] J.-B. Grill, F. Strub, F. Altché, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (BYOL). NeurIPS 2020. https://arxiv.org/abs/2006.07733

[9] M. Caron, H. Touvron, I. Misra, et al. Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021. https://arxiv.org/abs/2104.14294

[10] H. Bao, L. Dong, S. Piao, F. Wei. BEiT: BERT Pre-Training of Image Transformers. ICLR 2022. https://arxiv.org/abs/2106.08254

[11] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick. Masked Autoencoders Are Scalable Vision Learners (MAE). CVPR 2022. https://arxiv.org/abs/2111.06377

[12] X. Wang, R. Zhang, C. Shen, T. Kong, L. Li. Dense Contrastive Learning for Self-Supervised Visual Pre-Training (DenseCL). CVPR 2021. https://arxiv.org/abs/2011.09157

[13] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, H. Hu. Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning (PixPro). CVPR 2021. https://arxiv.org/abs/2011.10043

[14] Z. Xie, Z. Zhang, Y. Cao, et al. SimMIM: A Simple Framework for Masked Image Modeling. CVPR 2022. https://arxiv.org/abs/2111.09886