Autoencoder Pretraining Before Mask-Prediction Fine-Tuning

We were rich in scans and poor in labels, and that single asymmetry shaped everything we did next. The raster archive we were digitizing for the operator held thousands of scanned well-logs, and a scanned log is free to collect. What was not free was a label: a human tracing, pixel by pixel, exactly where each curve runs so the segmenter has a target to learn from. Annotation is slow, it is expensive, and on a thin trace it is genuinely hard, because a well-log curve is roughly one pixel wide against a field that is almost entirely background. Every annotated curve we could afford was a small fortune of attention, and we did not have enough of them. This is the account of how we stopped trying to label our way out of the problem and instead taught the encoder what a log looks like using the scans we already had, for free, before we spent a single label teaching it where the curves are.

The label was the scarce resource, not the scan

It helps to be precise about which part of the pipeline was starving. CurveNet, our segmentation network, is a symmetric encoder-decoder. The encoder downsamples the scanned image through five stride-2 stages into a compact bottleneck representation, two transformer attention layers refine that bottleneck, and the decoder upsamples it back to a per-pixel mask. To train the whole thing with supervision, every training image needs a hand-drawn mask. That is the expensive ingredient.
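To make the downsampling concrete, here is a minimal sketch of the arithmetic implied by five stride-2 stages. It assumes each stage halves both spatial dimensions and rounds up, as a stride-2 convolution with "same"-style padding would. The function name and the scan sizes are hypothetical and are not taken from CurveNet itself.

```python
import math


def bottleneck_shape(height: int, width: int, stages: int = 5) -> tuple[int, int]:
    """Spatial size after `stages` stride-2 downsampling steps.

    Assumption: every stage halves each dimension and rounds up.
    """
    for _ in range(stages):
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return height, width


# Hypothetical scan sizes, with widths in the thousands of pixels.
print(bottleneck_shape(512, 4096))  # (16, 128)
print(bottleneck_shape(512, 3000))  # (16, 94)
```

Five halvings amount to a 32x reduction per axis, so under these assumptions the bottleneck that the attention layers refine is roughly a thousand times smaller in area than the input scan.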

But notice that most of what the encoder has to learn is not specific to the masks at all. To segment a curve, the encoder first has to understand the visual grammar of a log: that a log is organized into vertical tracks, that grid lines run at regular intervals, that a curve is a dark continuous trace that wanders across a track, that paper grain and scanner noise and faded ink are texture rather than signal. None of that requires a single label. It is structure that is sitting in plain sight in every unlabeled scan we owned. The supervised objective was paying, with scarce labels, to teach the encoder things the unlabeled data could have taught it for free.

That observation is old, and we did not invent it. The idea that you can pretrain a network on a cheap, label-free objective and then fine-tune it on the scarce labeled task is the foundation that stacked denoising autoencoders were built on [1], and the empirical case that this pretraining genuinely helps optimization and generalization was made carefully more than a decade before our engagement [2]. What was new for us was the realization that our specific bottleneck, a label-starved segmenter sitting on top of an enormous unlabeled corpus, was exactly the situation that argument was made for.

Teaching CurveNet to redraw a log before teaching it to read one

The plan we settled on was a two-phase training schedule for the same encoder. In the first phase we threw the segmentation decoder away entirely and bolted a reconstruction decoder onto the encoder instead, turning CurveNet's encoder into the front half of an autoencoder. The objective was self-supervised in the purest sense: feed the encoder an unlabeled scan, ask the network to compress it to the bottleneck and then redraw the original image from that bottleneck, and penalize the difference between the redrawn log and the input. No human ever touches these images. The label is the image itself.

To make the reconstruction non-trivial, we corrupted the input before the encoder saw it and asked the network to restore the clean version, which is the denoising-autoencoder recipe rather than a plain identity mapping [1]. An autoencoder that is handed a perfect copy can cheat by learning to pass pixels straight through; one that has to repair masked or noised regions cannot, because to fill a hole where a curve used to be it has to have learned what curves look like and where they tend to go. That repair-the-hole framing is the same instinct that drives masked pretraining across modalities, from predicting blanked tokens in language [3] to reconstructing masked image patches in vision [4]. We were applying the family to grayscale raster logs, a narrow and unglamorous corner of it, but the mechanism is identical: prediction of the missing part forces the representation to be about content, not copying.
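The corrupt-then-restore objective can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the engagement's pipeline: the tile size, the paper-white fill value, the mean-squared-error loss, and the names `mask_patches` and `reconstruction_loss` are all hypothetical choices. The one property it shares with the recipe described above is that the loss is measured against the clean scan, not the corrupted one.

```python
import numpy as np


def mask_patches(image, mask_ratio, patch=16, fill=1.0, rng=None):
    """Blank out a random subset of patch x patch tiles.

    Returns the corrupted image and a boolean array marking hidden pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    rows, cols = -(-h // patch), -(-w // patch)  # ceiling division
    n_tiles = rows * cols
    n_hidden = int(round(mask_ratio * n_tiles))
    hidden_tiles = np.zeros(n_tiles, dtype=bool)
    hidden_tiles[rng.choice(n_tiles, size=n_hidden, replace=False)] = True
    tile_grid = hidden_tiles.reshape(rows, cols)
    # Expand the per-tile decision to per-pixel resolution, then crop.
    hidden = np.repeat(np.repeat(tile_grid, patch, axis=0), patch, axis=1)[:h, :w]
    corrupted = np.where(hidden, fill, image)
    return corrupted, hidden


def reconstruction_loss(reconstruction, clean):
    """Mean squared error against the clean, uncorrupted scan."""
    return float(np.mean((reconstruction - clean) ** 2))


# Synthetic stand-in for a scan: white paper with a one-pixel-wide dark trace.
rng = np.random.default_rng(0)
clean = np.ones((64, 256))
cols = np.arange(256)
rows = (32 + 20 * np.sin(cols / 20)).astype(int)
clean[rows, cols] = 0.0

corrupted, hidden = mask_patches(clean, mask_ratio=0.5, rng=rng)
# The encoder would see `corrupted`; the loss target stays `clean`.
print(hidden.mean(), reconstruction_loss(corrupted, clean))
```

Wherever a hidden tile covers part of the trace, the only way to drive the loss down is to redraw the missing segment, which is the pressure the paragraph above describes.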

The phase-two move was the warm start. We discarded the reconstruction decoder, kept the encoder weights the pretraining had shaped, attached the real segmentation decoder, and fine-tuned the whole thing on our scarce labeled masks. The encoder did not begin fine-tuning from random noise. It began from a state that already understood tracks, grid lines, traces, and grain. Fine-tuning no longer had to discover the visual grammar of a log from a handful of labeled examples; it only had to learn the comparatively small last step of turning that grammar into a curve mask.
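The hand-off between phases amounts to copying one subset of parameters and dropping another. The sketch below uses plain dictionaries in place of real weight tensors, and the parameter names and the `encoder.` prefix are hypothetical. A real framework's checkpoint format would differ, but the bookkeeping is the same.

```python
def warm_start(pretrained: dict, fresh: dict, prefix: str = "encoder.") -> dict:
    """Copy encoder entries from `pretrained` into a copy of `fresh`.

    Pretrained keys outside `prefix` (the reconstruction decoder) are
    dropped. Keys in `fresh` outside `prefix` (the segmentation decoder)
    keep their fresh initialization.
    """
    merged = dict(fresh)
    for name, value in pretrained.items():
        if name.startswith(prefix):
            if name not in fresh:
                raise KeyError(f"pretrained key {name!r} missing from target model")
            merged[name] = value
    return merged


# Hypothetical parameter names; strings stand in for weight tensors.
autoencoder = {
    "encoder.stage1.weight": "pretrained-A",
    "encoder.stage2.weight": "pretrained-B",
    "recon_decoder.out.weight": "discarded",
}
segmenter = {
    "encoder.stage1.weight": "random",
    "encoder.stage2.weight": "random",
    "seg_decoder.out.weight": "random",
}

state = warm_start(autoencoder, segmenter)
print(state)
```

Raising on a missing key is a deliberate choice in this sketch: a silent mismatch between the pretraining encoder and the fine-tuning encoder would leave part of the network cold-started without anyone noticing.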

The turns that made it actually work

The clean story above took several attempts to earn. The first one we want to record is the corruption rate. Our earliest pretraining masked almost nothing, and the autoencoder learned the cheat we feared: it reconstructed beautifully and transferred almost nothing, because near-identity copying never forced it to internalize structure. We had to make the corruption aggressive enough that reconstruction was genuinely hard before the encoder learned anything worth keeping. The representation only became useful once the pretext task stopped being easy.
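One way to see why light corruption invites the cheat is to score a "model" that does nothing at all and simply returns its corrupted input. The sketch below is illustrative only: it hides individual pixels rather than patches, fills them with mid-gray, and uses a synthetic image, none of which is claimed to match the engagement's setup. The names `hide_pixels` and `identity_copy_loss` are hypothetical.

```python
import numpy as np


def hide_pixels(image, ratio, fill=0.5, rng=None):
    """Replace a fixed fraction of pixels, chosen at random, with `fill`."""
    rng = np.random.default_rng() if rng is None else rng
    n_hidden = int(round(ratio * image.size))
    hidden = np.zeros(image.size, dtype=bool)
    hidden[rng.permutation(image.size)[:n_hidden]] = True
    hidden = hidden.reshape(image.shape)
    return np.where(hidden, fill, image), hidden


def identity_copy_loss(image, ratio, rng):
    """MSE of a do-nothing model that passes the corrupted input through."""
    corrupted, _ = hide_pixels(image, ratio, rng=rng)
    return float(np.mean((corrupted - image) ** 2))


rng = np.random.default_rng(0)
scan = np.ones((64, 256))
scan[32, :] = 0.0  # a flat, one-pixel-wide trace on white paper

losses = {r: identity_copy_loss(scan, r, rng) for r in (0.02, 0.25, 0.75)}
for ratio, loss in losses.items():
    print(f"corruption {ratio:.2f}: identity-copy loss {loss:.4f}")
```

In this toy setup the do-nothing loss grows in proportion to the corruption ratio, so at very low ratios copying is already close to optimal and the objective gives the encoder little reason to learn anything about curves.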

The second turn was the batch-size constraint, which haunted both phases. Our synthetic and scanned logs vary in size from one example to the next, with widths spanning thousands of pixels, and that variability forced us to train at a batch size of 1. A batch of one is a noisy, slow way to estimate a gradient, and it made every epoch of pretraining a patient affair. We accepted it, because the alternative of resizing every log to a common shape would have distorted the very curve geometry the encoder needed to learn. The constraint was real and we trained inside it rather than around it.

The third turn was learning what to do with the encoder during fine-tuning. The reflex inherited from frozen-feature transfer is to lock the pretrained encoder and train only the new decoder. We tried it and left performance on the table. Our pretraining corpus and our labeled task were close cousins but not identical, so the encoder needed to adapt, not stay frozen. Letting the encoder keep learning during fine-tuning, gently, so the warm start was a starting point and not a cage, was what let the segmenter actually reach for the ceiling.
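The text says the encoder kept learning "gently" without specifying how. One common way to do that, shown here purely as an assumption, is to give encoder parameters a smaller learning rate than the freshly initialized decoder. The function name, the `encoder.` prefix, and the numeric values below are hypothetical.

```python
def fine_tune_learning_rates(param_names, base_lr=1e-3, encoder_lr_scale=0.1,
                             freeze_encoder=False):
    """Assign a learning rate to each parameter by name prefix.

    freeze_encoder=True reproduces frozen-feature transfer (encoder lr 0).
    Otherwise the encoder adapts, but more slowly than the new decoder.
    """
    rates = {}
    for name in param_names:
        if name.startswith("encoder."):
            rates[name] = 0.0 if freeze_encoder else base_lr * encoder_lr_scale
        else:
            rates[name] = base_lr
    return rates


names = ["encoder.stage1.weight", "encoder.stage5.weight", "seg_decoder.out.weight"]
frozen = fine_tune_learning_rates(names, freeze_encoder=True)
gentle = fine_tune_learning_rates(names)
print(frozen)
print(gentle)
```

The two settings differ only in the encoder's rate, which makes it straightforward to compare a frozen encoder against a gently adapting one on the same labeled budget.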

Reading the convergence: where the warm start pays

The result we cared about was not a single final number but the shape of convergence, and specifically how that shape changes when labels are scarce. The two-track panel below is the exhibit we kept returning to. It plots per-class F1 over fifty epochs for a cold-started run, where the encoder begins from random initialization, against a warm-started run, where the same encoder begins from the autoencoder pretraining. Drag the annotated-curve budget lever to vary how many labeled curves the fine-tuning is allowed to see, and switch between the two curve classes to inspect each.

Figure: Two-track convergence panel contrasting a cold-start segmenter (random encoder init) against an autoencoder-warm-start run, where the same encoder is first pretrained as a reconstruction autoencoder on unlabeled logs and then fine-tuned to mask prediction. Both per-epoch F1 trajectories redraw over 50 epochs: the warm-start run reaches the cold-start plateau early (the orange epoch marker) and then keeps climbing toward the background-class 0.97 ceiling, with the largest lift at the tightest label budgets, which is the whole point of pretraining. The final F1 values (background 0.97, curve 1 0.37, curve 2 0.32 at the full 2,000-label budget), the 50 epochs, the batch size of 1, and the 2,000 and 15,000-instance stages at 110 and 550 minutes are sourced from the engagement archive; the per-epoch trajectory shape and the warm-start lift as a function of label budget are an illustrative learning model.

Two things in that panel carry the whole argument. The first is that the warm-started run reaches the cold-started run's eventual plateau early, well before the fifty epochs are spent, and then keeps climbing past it. Convergence is not just higher, it is sooner, which on a batch-size-1 schedule where every epoch is slow is a real saving in wall-clock training time. The second, and the one that justified the whole detour, is what happens as you tighten the label budget. At a generous budget the two runs end up close, because with enough labels even a cold encoder can learn the visual grammar from the supervised signal alone. As you starve the budget, the gap widens, and the warm start pulls decisively ahead, because the pretrained encoder already knows what a log looks like and the cold one is trying to learn that from a handful of masks at the same time as learning to segment. Pretraining buys the most exactly where labels are scarcest, which is precisely the regime we lived in.

The full-budget multiclass finals these trajectories settle toward are the real engagement numbers: background F1 of 0.97, curve 1 of 0.37, and curve 2 of 0.32. Read the curve-class numbers without flinching, because separating a thin, faded, overlapping trace out of a noisy scan is hard and those scores say so honestly. The background class sits near its ceiling because it is the easy class. The warm start did not make the hard problem easy. What it changed was how few labels the segmenter needed to climb toward those finals, and how quickly it got there.
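For readers who want to see why a thin-trace class scores so much lower than background, here is a minimal per-class F1 computation on a toy mask. It assumes pixel-wise F1 computed from integer label maps, and it scores a class with no predicted and no true pixels as 0.0. Both are conventions chosen for this sketch, and the engagement's exact evaluation protocol is not specified in the text.

```python
import numpy as np


def per_class_f1(pred, target, num_classes):
    """Pixel-wise F1 for each class label in 0..num_classes-1."""
    scores = []
    for c in range(num_classes):
        tp = int(np.sum((pred == c) & (target == c)))
        fp = int(np.sum((pred == c) & (target != c)))
        fn = int(np.sum((pred != c) & (target == c)))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Toy 8x8 mask: class 0 is background, class 1 is a one-pixel-wide trace.
target = np.zeros((8, 8), dtype=int)
target[3, :] = 1

# Prediction that is off by a single row for half of the trace.
pred = np.zeros((8, 8), dtype=int)
pred[3, :4] = 1
pred[4, 4:] = 1

background_f1, curve_f1 = per_class_f1(pred, target, num_classes=2)
print(f"background F1 {background_f1:.3f}, curve F1 {curve_f1:.3f}")
```

A one-row miss on half of the trace halves the curve score while barely moving the background score, because background has dozens of pixels to absorb the same four errors. That asymmetry is the mechanical reason a one-pixel-wide class is graded so harshly.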

What this bought the engagement, in plain terms

The honest accounting is that pretraining did not raise the ceiling of what was achievable with unlimited labels. It moved the curve that connects label budget to performance. For any fixed number of annotated curves we could afford, the warm-started segmenter converged higher and sooner than the cold one, and the advantage grew the tighter our budget got. On a project where annotation was the binding constraint and unlabeled scans were essentially free, that trade, spending cheap compute on a self-supervised pretext task to save expensive human labels, was one of the best deals available to us.

There is a quieter dividend worth naming. The pretrained encoder is a reusable asset. Having paid once to teach an encoder the visual grammar of well-logs from the unlabeled corpus, we could warm-start future segmentation variants from the same weights rather than from random noise each time. The label-free corpus we were sitting on turned out to be not just training data we had failed to annotate, but a standing head start for every supervised model that came after. The lesson we carried forward from this part of the build was specific and unsentimental: when your labels are the scarce input and your raw data is abundant, the first model worth training is the one that needs no labels at all, because it is the one that makes every labeled example you do spend go further.

References

[1] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11. https://www.jmlr.org/papers/v11/vincent10a.html
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11. https://www.jmlr.org/papers/v11/erhan10a.html
[3] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. https://arxiv.org/abs/1810.04805
[4] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. https://arxiv.org/abs/2111.06377
