Self-Supervised and Foundation-Model Pretraining for Document Segmentation

Abstract

Self-supervised pretraining now comes in two shapes a practitioner has to tell apart. One is a pretext run on your own unlabelled data whose only value is a representation you then transfer downstream. The other is a foundation model, a large network pretrained by someone else on a very large corpus and offered as a backbone or, in the segmentation case, as a near-ready mask predictor. This survey is about neither in general. It is about the one slice where both meet our work: dense prediction on documents, where the downstream task is a per-pixel mask on a scanned page rather than a single label on a photograph. We separate this deliberately from our earlier pretext-task survey, which asked which pretext to run, and from the applied warm-start primer, which described the recipe. The question here is narrower and more skeptical: does pretraining on unlabelled documents actually transfer into mask accuracy, measured on the mask, and does the foundation-model framing that reshaped document understanding carry over to document segmentation? We credit the transferability studies that first quantified how far a learned feature carries, the document-image self-supervised models that made pretraining a default for layout and reading, and the segmentation-foundation line that promises a generic mask model. We read all three against a real, small baseline from our raster well-log digitiser, CurveNet: a compact encoder-decoder with five encoder residual blocks, a 128-dimensional bottleneck refined by two transformer attention layers, five decoder stages, one grayscale input channel, fine-tuned on 2,000 binary and 15,000 multiclass labelled instances. The finding is that transfer into dense prediction is real but conditional, that it must be verified as a downstream-accuracy result rather than inferred from a low reconstruction loss, and that the foundation-model promise is strongest precisely where a small document-segmentation regime is weakest.

A different question from the two we asked before

We have written twice already on nearby ground and want to fence this piece off from both. The pretext-task survey asked, given a fixed small budget, which family of self-supervised objective is worth running at all, and answered it with a data-budget axis. The applied primer described the reconstruction-first recipe: warm-start the encoder as an autoencoder on unlabelled log crops, then fine-tune the mask head on scarce labels. This piece asks the question that sits underneath both and is easy to wave past: what is the actual evidence that any of this transfers into dense-prediction accuracy on documents, and how much of the modern foundation-model story applies when the document is a scanned scientific log and the task is a thin-curve mask?

The reason to isolate the transfer question is that a reconstruction loss and a mask accuracy are not the same number, and a pretext that drives the first to zero can leave the second untouched. Transfer is a claim about the downstream task, and the honest version of the claim has to be measured there. That is the discipline this survey is built around.

What the transferability studies actually established

The first careful measurement of transfer is older than the current wave and still sets the terms. Yosinski and colleagues took a trained network and asked, layer by layer, how much of its usefulness survived being moved to a new task, and found two things that still hold: early layers transfer broadly because they encode generic structure, while later layers are increasingly specific to the original task, and forcing a transfer of the wrong layers can actively hurt [1]. The shape of that result is the shape of every warm-start argument since. The generic, early part of a network is what a pretext can teach for free; the specific, late part is what the target labels have to teach regardless.
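The layer-wise boundary can be made concrete as a transfer depth: copy pretrained weights into the first few encoder blocks and leave everything deeper at its cold initialisation. The sketch below is a minimal illustration under stated assumptions, not CurveNet's actual loader. The parameter naming scheme `encoder.<i>.` and the plain-dictionary weight format are hypothetical; only the count of five encoder blocks comes from the text.

```python
from typing import Dict, List

ENCODER_BLOCKS = 5  # sourced: CurveNet has five encoder residual blocks


def warm_start(cold: Dict[str, List[float]],
               pretrained: Dict[str, List[float]],
               depth: int) -> Dict[str, List[float]]:
    """Copy pretrained weights into the first `depth` encoder blocks only.

    Parameter names of the form 'encoder.<i>.<param>' are a hypothetical
    naming scheme chosen for this sketch.
    """
    if not 0 <= depth <= ENCODER_BLOCKS:
        raise ValueError("depth must be between 0 and ENCODER_BLOCKS")
    warm_prefixes = tuple(f"encoder.{i}." for i in range(depth))
    merged = {}
    for name, value in cold.items():
        # str.startswith with an empty tuple is False, so depth=0 is a cold start.
        use_pretrained = name.startswith(warm_prefixes) and name in pretrained
        merged[name] = list(pretrained[name] if use_pretrained else value)
    return merged


# Usage with toy one-element "weights": 0.0 marks cold, 1.0 marks pretrained.
cold = {f"encoder.{i}.weight": [0.0] for i in range(ENCODER_BLOCKS)}
cold["decoder.0.weight"] = [0.0]
pretrained = {name: [1.0] for name in cold}
state = warm_start(cold, pretrained, depth=2)
```

With `depth=2` only `encoder.0` and `encoder.1` carry pretrained values; the deeper encoder blocks and the decoder stay cold, which is the split between what a pretext can teach and what the target labels have to teach.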

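The second study is the reason this survey refuses to read a pretraining metric as a transfer promise, although it supports that refusal less bluntly than a skeptic might wish. Kornblith and colleagues found that, across architectures, ImageNet accuracy and transfer accuracy are strongly correlated, but that the relationship is conditional: training choices that improved ImageNet accuracy could make the resulting fixed features transfer worse, and on small fine-grained targets pretraining bought little over training from scratch [2]. A pretraining ranking is therefore informative without being a guarantee, and nothing in the study establishes that a self-supervised pretext loss tracks a downstream metric at all. Applied to our case, this says that a lower reconstruction loss on masked log patches is not evidence of a better mask head. You have to fine-tune and measure.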
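Whether a pretraining metric tracks the downstream one is itself something to measure, and a rank correlation over a handful of pretraining runs is the simplest check. The sketch below computes Spearman's coefficient with the standard library only. The numbers in the usage lines are made-up placeholders, not results from CurveNet or from any cited paper, and using IoU as the downstream score is an assumption.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder values for four hypothetical pretraining runs (NOT measurements).
pretext_loss = [0.10, 0.08, 0.05, 0.03]
mask_iou = [0.61, 0.66, 0.64, 0.62]
# Negate the loss so that "better pretext" and "better mask" point the same way.
rho = spearman([-v for v in pretext_loss], mask_iou)
```

A coefficient near 1 would mean the pretext loss ranks runs the way the mask metric does; a value near 0 or below would mean it does not, which is the case the transfer discipline guards against.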

The third study speaks directly to a small, out-of-domain target like ours. Raghu and colleagues studied transfer learning into medical imaging and found that on small targets far from the pretraining distribution, much of the benefit was not the reuse of high-level features at all but the conditioning effect of a good initialisation, a better-scaled starting point that made optimisation easier [3]. For a scanned log, which is about as far from natural photographs as an image gets, that is the mechanism most likely to be doing the work: not borrowed semantics, but a warm-start that puts the optimiser in a better basin before the scarce curve labels arrive.
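One way to separate conditioning from feature reuse, following the spirit of the mean-and-variance initialisation described in [3] as we read it, is to throw away the pretrained values and keep only their per-layer scale: resample each layer's weights independently from a Gaussian matched to the pretrained mean and standard deviation. If a network initialised this way recovers much of the warm-start benefit, the benefit was mostly scale. The sketch below is a minimal stdlib illustration; the flat list-of-floats layer format is an assumption made for brevity.

```python
import random
import statistics
from typing import List


def mean_var_init(pretrained_layer: List[float], rng: random.Random) -> List[float]:
    """Resample a layer i.i.d. from N(mu, sigma^2) of the pretrained weights.

    Keeps the pretrained scale but none of the learned features.
    """
    mu = statistics.fmean(pretrained_layer)
    sigma = statistics.pstdev(pretrained_layer)
    return [rng.gauss(mu, sigma) for _ in pretrained_layer]


# Usage on a toy "pretrained" layer (hypothetical values).
rng = random.Random(0)
pretrained_layer = [0.5, -0.5] * 500
resampled = mean_var_init(pretrained_layer, rng)
```

Comparing three fine-tuning runs (cold start, full warm start, and this scale-only start) would show how much of any lift on the curve masks is conditioning rather than borrowed structure. That comparison is a proposal, not something the survey ran.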

The document-image self-supervised models

Between generic transfer and our task sits a body of work that pretrained specifically on documents, and it is the most relevant precedent because it shares our substrate even when it does not share our task. The document image transformer applied masked-image modelling to unlabelled document images, learning a representation that transferred to document layout analysis and table detection, and it made the point that a document is a rich enough substrate to pretrain on without any labels at all [5]. The reconstruction-as-pretext lineage it draws on is the same masked-autoencoder idea whose advantage is explicitly a scaling property [4], which is the tension this survey keeps returning to. The document reading models made a parallel case from the other end of the pipeline: the OCR-free document understanding transformer pretrained an encoder-decoder to read a page end to end, a document-specific objective that takes its supervision from the text already on the page rather than from downstream task labels, and that needs no separate OCR stage in the model to learn the structure of a page [6].
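The masked-reconstruction pretext can be sketched in a few lines: hide a random subset of patches, reconstruct the page, and score the reconstruction only on the hidden patches. The sketch below follows the pixel-regression form of [4]; as we understand it, the document image transformer [5] predicts discrete visual tokens instead, so this is an illustration of the family rather than of that model. The 0.75 mask ratio is the value reported in [4] and is an untested assumption for log crops; patches are plain lists of pixel values for brevity.

```python
import random
from typing import List, Sequence


def random_mask(num_patches: int, mask_ratio: float, rng: random.Random) -> List[bool]:
    """Return one flag per patch; True means the patch is hidden from the encoder."""
    num_masked = round(num_patches * mask_ratio)
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [i in hidden for i in range(num_patches)]


def masked_mse(target: Sequence[Sequence[float]],
               recon: Sequence[Sequence[float]],
               mask: Sequence[bool]) -> float:
    """Mean squared error over hidden patches only; visible patches are ignored."""
    total, count = 0.0, 0
    for t_patch, r_patch, is_hidden in zip(target, recon, mask):
        if not is_hidden:
            continue
        for t, r in zip(t_patch, r_patch):
            total += (t - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask hides no patches")
    return total / count


# Usage: 16 patches, three quarters hidden.
mask = random_mask(16, 0.75, random.Random(0))

# Toy check: errors on visible patches (index 3 here) do not affect the loss.
target = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
recon = [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
loss = masked_mse(target, recon, [False, True, True, False])
```

This loss is exactly the number the survey declines to treat as a transfer receipt: it measures how well hidden page texture is predicted, not how well a curve is segmented.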

What this family establishes for us is a proof of substrate, not a proof of task. It shows that unlabelled documents carry enough structure to pretrain a useful representation, which is exactly the premise our warm-start depends on. What it does not show is that the representation transfers into a thin-curve segmentation mask, because layout analysis, table detection, and reading are region and sequence tasks, not the one-pixel-wide dense prediction our decoder has to produce. The precedent is encouraging and incomplete in the same breath.

The segmentation-foundation promise, and where it lands for us

The most seductive recent framing is that segmentation itself now has a foundation model. Segment Anything trained a promptable mask predictor on a very large mask corpus and offered it as a generic, zero-shot segmenter that a downstream user prompts rather than trains [7]. In parallel, the self-supervised backbone line argued that a single model, pretrained without labels, produces frozen features good enough for dense prediction across tasks, so that segmentation becomes a light head on a general backbone [8]. Both are real advances, and both are pretrained at a scale that is the whole point of the claim.

That scale is exactly where a small document-segmentation regime falls off the edge of the promise. A generic mask foundation model is trained to find object-like regions in natural scenes; a thin ink curve traced across a gridded, cream-coloured log is not an object it has ever been rewarded for finding, and a single grayscale channel is not the input its pretraining distribution contains. The frozen-backbone argument is stronger in principle, because a good general feature could in theory be probed by our decoder, but the same distribution gap applies: a backbone pretrained on internet photographs has learned very little about the texture of a scanned log, which is the structure our task turns on. The evidence below reads this out honestly. The instrument argues the one claim we are willing to stand behind on our own data, that an autoencoder warm-start on unlabelled logs transfers into mask accuracy, and it does not claim the foundation-model transfer we have no data to support.

How self-supervised pretraining on unlabelled documents turns into dense-prediction accuracy. Left: the CurveNet encoder-decoder as a stack, one grayscale channel down through five encoder residual blocks to a 128-dimensional bottleneck refined by two transformer attention layers, then five decoder stages back up to the curve mask. Drag the transfer-depth lever to set how far the reconstruction warm-start reaches into the encoder before the scarce mask labels take over; the warmed stages light teal. Right: mask-head accuracy against the scarce labelled budget it is fine-tuned on, with the two real budgets marked at 2,000 binary and 15,000 multiclass instances. The dashed teal curve is a cold start; the solid teal curve is the warm start; the one orange arrow is the accuracy the warm-start buys at the multiclass budget, and it grows as the transfer reaches the early texture-learning stages and shrinks toward the deeper task-specific ones. The 128-dimensional embedding, the five-plus-five encoder and decoder stages, the two attention layers, the single input channel, and the 2,000 and 15,000 label budgets are sourced from the engagement archive; the accuracy curves and the size of the lift are an illustrative transfer model, not a measured ablation.

The exhibit lays CurveNet out as a stack and lets the reader drag a transfer-depth front down through the five encoder residual blocks, setting how far a reconstruction warm-start reaches before the scarce mask labels take over, while the right panel plots mask-head accuracy against the labelled budget it is fine-tuned on, marking the two real budgets at 2,000 binary and 15,000 multiclass instances. The single orange element is the argument: the lift between the cold-start and warm-start curves at the multiclass budget, the accuracy the warm-start buys. The shape encodes the transferability result directly [1]. The lift arrives when the front reaches the early, generic encoder blocks and flattens as it pushes into the deeper, task-specific ones, which is precisely the layer-wise transfer boundary the first study measured, and the lift is largest where the labels are scarcest, which is the conditioning reading the medical-imaging study predicts for a small, out-of-domain target [3]. The 128-dimensional bottleneck, the five-plus-five stages, the two attention layers, the single input channel, and the two label budgets are sourced from the engagement; the accuracy curves and the size of the lift are an illustrative transfer model and are flagged as such on the canvas, because the honest position of this survey is that the direction of transfer is established by the literature while its exact magnitude on our metric is a number we did not run.

Method

This is a structured reading, and the procedure was kept narrow to keep it honest. We fixed the downstream task, dense prediction on documents, and organised the self-supervised and foundation-model literature by what each body of work actually offers that task. The transferability studies were read for the terms of transfer itself: how far a feature carries, whether a pretraining metric predicts downstream success, and what mechanism supplies the benefit on a small out-of-domain target [1] [2] [3]. The document-image self-supervised models were read as a proof of substrate, evidence that unlabelled documents pretrain a useful representation, while noting that their downstream tasks are not our per-pixel one [5] [6]. The segmentation-foundation and general-backbone work was read as the strongest and most scale-dependent claim, and placed against our regime rather than assumed to hold in it [7] [8]. The reconstruction pretext that underlies our own warm-start, and its domain-specific precedent in medical imaging, were read as the mechanism we can actually run on our data [4] [9]. For each we extracted the same three facts: what is pretrained, what evidence exists that it transfers into dense prediction, and whether that evidence survives the shift to a small document-segmentation regime.

The anchor throughout is one real architecture rather than an abstraction. CurveNet is a compact encoder-decoder: five encoder residual blocks carrying a single grayscale channel down to a 128-dimensional bottleneck, two transformer attention layers refining that bottleneck, and five decoder stages climbing back to the curve mask. It is fine-tuned on 2,000 binary and 15,000 multiclass labelled instances, the two scarce budgets the whole transfer question is in service of. Those figures are sourced; the survey re-measures nothing.
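For reference, the sourced figures can be collected in one small record. The sketch below is a bookkeeping aid only: the class name, field names, and stage naming are hypothetical, and it deliberately omits channel widths, kernel sizes, and anything else the text does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CurveNetSpec:
    """Sourced CurveNet figures; all field names are illustrative."""
    in_channels: int = 1          # single grayscale input channel
    encoder_blocks: int = 5       # encoder residual blocks
    bottleneck_dim: int = 128     # bottleneck embedding size
    attention_layers: int = 2     # transformer attention layers at the bottleneck
    decoder_stages: int = 5       # decoder stages back up to the curve mask
    binary_labels: int = 2_000    # labelled instances, binary task
    multiclass_labels: int = 15_000  # labelled instances, multiclass task

    def stages(self) -> List[str]:
        """List the stack top to bottom using a hypothetical naming scheme."""
        return (
            [f"encoder.{i}" for i in range(self.encoder_blocks)]
            + [f"bottleneck.attn.{i}" for i in range(self.attention_layers)]
            + [f"decoder.{i}" for i in range(self.decoder_stages)]
        )


spec = CurveNetSpec()
stack = spec.stages()
```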

Results

The reading resolves into three findings, each tied to a body of evidence rather than to intuition. First, transfer into dense prediction on documents is real but is a downstream claim, not a pretraining one. The clearest lesson from the transferability line is that a good pretext metric does not certify a good mask, so the warm-start earns its place only when the fine-tuned mask accuracy is measured and moves [2]. On our own data the direction of that move is well supported by the layer-wise transfer boundary and the conditioning mechanism [1] [3], which is what the instrument renders; the exact magnitude on the curve masks is a number we flag as illustrative rather than claim as measured.
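Measuring the lift rather than illustrating it would come down to a paired comparison: fine-tune cold and warm starts under the same seeds and label budget, score each mask, and report the per-seed difference. The sketch below shows the scoring side only. Intersection-over-union is an assumed choice of mask metric (the text says only "mask accuracy"), masks are flattened 0/1 sequences, and the function names are hypothetical.

```python
from statistics import fmean, stdev
from typing import Sequence, Tuple


def iou(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def paired_lift(cold_scores: Sequence[float],
                warm_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed (warm - cold) scores."""
    if len(cold_scores) != len(warm_scores) or len(cold_scores) < 2:
        raise ValueError("need matched scores from at least two seeds")
    diffs = [w - c for c, w in zip(cold_scores, warm_scores)]
    return fmean(diffs), stdev(diffs)


# Usage on a toy four-pixel mask: one pixel agrees out of three marked in either.
score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

A lift whose mean is small relative to its spread across seeds is not a lift one should report, which is why the seeds are paired rather than pooled.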

Second, the document-image self-supervised models prove the substrate but not the task. They establish that unlabelled documents carry enough structure to pretrain a useful representation [5] [6], which is the premise our warm-start rests on, but their downstream tasks are region and reading tasks, so they are a precedent for the input and not for the thin-curve output. The gap between substrate and task is the reason we still have to verify transfer on the mask ourselves.

Third, the foundation-model promise is real and lands off our regime. A generic mask model and a general frozen backbone are pretrained at a scale that is the substance of their claim [7] [8], and a scanned single-channel log with a one-pixel target sits outside the distribution that scale was built from. The domain-specific reconstruction precedent points at the move that does fit our data: pretrain on the substrate you actually have, as the medical-imaging work did on its own volumes, and transfer that into the small-target segmentation [9]. For us the substrate is renderable without limit, which is the one scale advantage a small document task can manufacture.

Discussion

Read together, the three findings compose into a single decision rule for a small document-segmentation project. The foundation-model route is attractive and, for this regime, mostly aspirational: the generic mask model and the general backbone are trained for a distribution our data does not belong to, and until a document-native or log-native model exists at that scale, borrowing one is a bet against the distribution gap rather than with it. The document-image self-supervised models are the encouraging middle: they prove our substrate is pretrainable, and they point at masked reconstruction on documents as the objective that fits. The transferability studies are the discipline that keeps the whole thing honest: transfer is a downstream number, the pretraining loss does not stand in for it, and on a small out-of-domain target the benefit is as much a warm-start conditioning effect as a reuse of learned semantics.

Where our own work sits is as a user of this literature, not a competitor in it. CurveNet's warm-start is the reconstruction pretext the document-image line validated on our kind of substrate [5], transferred into the scarce-label mask the transferability line tells us to measure directly [1] [2], drawing its benefit from the conditioning mechanism the medical-imaging study isolated for small out-of-domain targets [3]. The foundation-model framing is worth tracking precisely so we know what we are not yet able to use: the day a document-native segmentation foundation model is pretrained at scale on renderable pages, the calculus changes, and the survey's job is to have marked the line clearly enough that we notice when we cross it.

Limitations

This is a survey with a positional argument, and its limits are the honest boundary of that argument. We did not run a pretrain-then-finetune transfer ablation across the methods discussed, so the transfer-yield curve and the lift magnitude in the instrument are an illustrative model of the cited mechanisms rather than a measurement on the curve masks; the only sourced figures are the architecture, the single input channel, and the 2,000 and 15,000 label budgets. The direction of the transfer argument, that the lift is front-loaded onto the early encoder blocks and largest where labels are scarcest, is a prediction from the transferability and conditioning literature [1] [3], not a result we recorded per layer. Our reading of the foundation-model line is a distribution-gap argument rather than a run: we did not fine-tune a general mask model or probe a general backbone on the log task, and it is possible a domain-adapted version would transfer better than the gap suggests, which is a study rather than a foregone conclusion. The survey is period-bounded to its own quarter and to the models that had stabilised by then, so document-native or log-native segmentation foundation models that postdate this reading are out of frame by construction. A reader should take this as a map of what evidence supports transferring pretraining into document segmentation, and where that evidence runs out, not as a benchmark of any method against another.

What to carry from the survey

Two shapes of pretraining meet on this task and must be told apart: a self-supervised pretext run on your own unlabelled documents, whose value is a transferable representation, and a foundation model pretrained by someone else at large scale, offered as a backbone or a generic mask predictor. The survey is about the narrow slice where both touch dense prediction on documents.
Transfer is a downstream claim, not a pretraining one. The transferability literature shows a better pretraining metric does not guarantee better transfer, so a low reconstruction loss on masked log patches is not evidence of a better mask; you have to fine-tune and measure the mask itself.
The document-image self-supervised models (masked-image modelling on document images, OCR-free reading) prove the substrate but not the task: unlabelled documents carry enough structure to pretrain a useful representation, but their downstream tasks are layout, tables, and reading, not a one-pixel-wide curve mask.
The segmentation-foundation promise (a generic promptable mask model, a general frozen backbone for dense prediction) is real and lands off our regime, because it is pretrained on a natural-image distribution that a single-channel scanned log with a thin-curve target does not belong to. The domain-specific reconstruction precedent points at the move that fits: pretrain on the substrate you actually have.
On CurveNet (five encoder residual blocks, a 128-dim bottleneck with two attention layers, five decoder stages, one grayscale channel, fine-tuned on 2,000 binary and 15,000 multiclass instances) the warm-start argument is well supported in direction: the lift is front-loaded onto the early generic encoder blocks and largest where labels are scarcest, which is warm-start conditioning on a small out-of-domain target rather than borrowed semantics. The exact magnitude on the curve masks is flagged as illustrative, not measured.

The smallest habit this survey installs is a refusal to accept a pretraining loss as a transfer receipt. Before crediting any warm-start or any borrowed backbone with helping the mask, fine-tune it and read the downstream number, because on a small document-segmentation task the only honest evidence of transfer is the mask itself, and the only scale advantage the regime can manufacture is a substrate it can render without limit.

References

[1] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? NeurIPS (2014). Measures layer by layer how far a learned feature transfers and where forcing transfer begins to hurt. https://arxiv.org/abs/1411.1792

[2] Kornblith, S., Shlens, J., and Le, Q. V. Do Better ImageNet Models Transfer Better? CVPR (2019). Finds ImageNet accuracy and transfer accuracy strongly correlated across architectures, but sensitive to training choices and of little benefit on small fine-grained targets, so transfer still has to be measured on the target task. https://arxiv.org/abs/1805.08974

[3] Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding Transfer Learning for Medical Imaging. NeurIPS (2019). Finds that on small, out-of-domain targets much of the benefit is warm-start conditioning rather than reused high-level features. https://arxiv.org/abs/1902.07208

[4] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. CVPR (2022). Reconstructs heavily masked patches to pretrain a strong, transferable encoder, with an advantage that grows with scale. https://arxiv.org/abs/2111.06377

[5] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., and Wei, F. DiT: Self-supervised Pre-training for Document Image Transformer. ACM Multimedia (2022). Masked-image modelling on unlabelled document images, transferring to layout and structure tasks. https://arxiv.org/abs/2203.02378

[6] Kim, G., Hong, T., Yim, M., et al. OCR-free Document Understanding Transformer (Donut). ECCV (2022). Pretrains an encoder-decoder to read documents end to end without a separate OCR stage. https://arxiv.org/abs/2111.15664

[7] Kirillov, A., Mintun, E., Ravi, N., et al. Segment Anything. ICCV (2023). A promptable segmentation foundation model trained on a very large mask corpus, proposed as a generic zero-shot mask predictor. https://arxiv.org/abs/2304.02643

[8] Oquab, M., Darcet, T., Moutakanni, T., et al. DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024). A self-supervised model whose frozen features transfer to dense prediction, arguing for a general visual backbone. https://arxiv.org/abs/2304.07193

[9] Zhou, Z., Sodha, V., Rahman Siddiquee, M. M., et al. Models Genesis: Generic Autodidactic Models for 3D Medical Image Analysis. MICCAI (2019). Domain-specific reconstruction pretraining that transfers into segmentation on a small target domain. https://arxiv.org/abs/1908.06912