Weak and Self-Supervision: Getting Labels Without an Army of Annotators

The blunt fact behind this note is a headcount. A curve segmenter learns from labelled examples, and if you tally the examples a person actually sat down and drew for our raster well-log work, the number is 2,000, all of them binary masks. Everything else the model trained on was manufactured: 15,000 procedurally synthesised multiclass instances, 20,000 synthetic 2-curve logs, and an encoder warm-started by reconstruction with no labels at all. We did not arrive there because manufactured labels are fashionable. We arrived there because the alternative was to hand-annotate a raster archive of 136,771 TIF files, which is not a budget line, it is a wall. This is an account of how to treat the ways you get a label as a supply chain ranked by yield, rather than a single annotation pot you spend down. None of the underlying methods are ours; what is ours is the specific stack we leaned on and the counterfactual it let us walk past.

The archive is the constraint, not the model

Start with the number that sets the whole problem, because it is the one that will not move. The relevant archive holds 136,771 raster TIF scans alongside 7,781 LAS files. A senior interpreter tracing curves off a scan is slow by the standards of anything you would want to do 136,771 times, and the output is a single digitised log, not a labelled training example in any tidy sense. Even at a rate you would be embarrassed to publish, the bill to hand-label the archive is a multiple of the entire engagement, spent before training a single model. That is the constraint, and the model architecture is downstream of it. You can pick the best encoder-decoder in the literature and it changes nothing if the labelled supply to feed it cannot be bought.

Framed that way, the interesting question stops being "which loss, which backbone" and becomes "where does a labelled example come from, and how many does each source yield per unit of human time." A human drawing masks yields examples one at a time and stops when the annotator goes home. A generator that emits an image and its exact mask together yields as many as you will run compute for, and the labels are correct by construction because the generator drew both. Those are different supply curves, and side by side the ranking is not close.

Weak supervision and synthesis answer different shapes of scarcity

It is worth being precise about the two families, because they get blurred and they solve different problems. Weak supervision, in the lineage Snorkel made concrete, is for the case where you have unlabelled data and can write noisy, cheap heuristics that vote on what the label probably is; a label model then reconciles the votes into training targets you never hand-drew [1]. It is the right tool when the label signal is latent in the data and expressible as rules. Our shape of scarcity is different. For a thin curve on a scanned log there is no cheap heuristic that reliably says "this pixel is curve one," which is exactly why the task needs a learned segmenter at all. What we can do instead is run the process forwards: procedurally generate a log with the curves placed by the generator, so the mask is not inferred, it is the thing we drew before rasterising. That is synthesis, and its warrant is domain randomisation, the argument that a generator randomised widely enough makes the real domain look like one more sample the network already saw [2].
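The weak-supervision half of that contrast can be sketched in a few lines. This is a minimal illustration, not Snorkel's API and not anything used in the engagement: the task (deciding whether a scan is a log page at all), the three labelling functions, and the metadata fields are all hypothetical, and a plain majority vote stands in for the learned label model described in [1].

```python
from collections import Counter

ABSTAIN, NOT_LOG, LOG = -1, 0, 1

# Hypothetical labelling functions: each looks at cheap metadata and votes,
# or abstains when it has no opinion. None of these come from the engagement.
def lf_tall_page(scan):
    # Well logs are long strips, so a very tall aspect ratio hints at a log.
    return LOG if scan["height"] / scan["width"] > 4 else ABSTAIN

def lf_name_mentions_log(scan):
    return LOG if "log" in scan["name"].lower() else ABSTAIN

def lf_tiny_file(scan):
    # A very short raster is more likely a cover sheet or a stamp.
    return NOT_LOG if scan["height"] < 500 else ABSTAIN

LABELLING_FUNCTIONS = [lf_tall_page, lf_name_mentions_log, lf_tiny_file]

def majority_vote(scan):
    """Reconcile votes by simple majority; ties and all-abstain stay unlabelled."""
    votes = [lf(scan) for lf in LABELLING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

scans = [
    {"name": "well_07_gamma_log.tif", "width": 800, "height": 9000},
    {"name": "cover.tif", "width": 600, "height": 400},
    {"name": "page_3.tif", "width": 1000, "height": 1200},
]
weak_labels = [majority_vote(s) for s in scans]
print(weak_labels)  # [1, 0, -1]
```

The third scan stays unlabelled because no rule fires on it. That is the per-pixel curve problem in miniature: when no cheap heuristic has an opinion, weak supervision has nothing to reconcile.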

So the two are not rivals for the same job. Weak supervision manufactures labels for real data you already hold; synthesis manufactures the data and its labels together. For an archive where the labels are genuinely absent and no rule recovers them, synthesis is the route that produces training supply at all, and it is the one we built on: 15,000 multiclass instances and 20,000 2-curve logs, masks handed back for free.
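To make "the generator drew both" concrete, here is a toy sketch of a 2-curve generator that returns a raster and its exact mask together. It is an assumption-laden illustration, not the generator used in the engagement: the function name, the sinusoidal curve shape, the image size, and every randomisation range are invented for the example.

```python
import math
import random

def synth_two_curve_log(height=64, width=48, seed=0):
    """Return (image, mask): a grayscale raster and its per-pixel class mask.

    image[y][x] is 0..255; mask[y][x] is 0 (background), 1 or 2 (curve id).
    """
    rng = random.Random(seed)
    paper = rng.randint(200, 250)          # randomised paper brightness
    ink = rng.randint(0, 80)               # randomised ink darkness
    grid_step = rng.choice([8, 12, 16])    # randomised gridline spacing

    image = [[paper] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Gridlines go into the image only: they are clutter, not a label.
    for y in range(0, height, grid_step):
        for x in range(width):
            image[y][x] = paper - 40

    # Each curve is a randomised sinusoid, one pixel per depth row.
    # Where the curves cross, the second curve's label wins.
    for curve_id in (1, 2):
        centre = rng.uniform(0.25, 0.75) * width
        amplitude = rng.uniform(2.0, width / 5)
        period = rng.uniform(15.0, 40.0)
        phase = rng.uniform(0.0, 2 * math.pi)
        for y in range(height):
            x = round(centre + amplitude * math.sin(2 * math.pi * y / period + phase))
            x = min(max(x, 0), width - 1)
            image[y][x] = ink              # rasterise the curve...
            mask[y][x] = curve_id          # ...and record the exact label

    # Scanner-style noise is added after the mask is fixed, so labels stay exact.
    for y in range(height):
        for x in range(width):
            noisy = image[y][x] + rng.randint(-10, 10)
            image[y][x] = min(max(noisy, 0), 255)

    return image, mask

image, mask = synth_two_curve_log(seed=1)
```

Calling the function with a different seed yields a different labelled example at no annotation cost. A production generator would draw connected strokes, headers, and far wider variation, but the ordering is the point: the mask is written at the moment the curve is placed, before any degradation touches the image.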

Reconstruction is a second free label source stacked on the first

Synthesis fills the supervised bucket. A second, quieter source costs no labels either, and it stacks underneath rather than competing. A scanned log is mostly structure the network can study without annotation: paper texture, gridlines, header boxes, the statistics of ink on a page. You can make the encoder learn that structure by asking it to reconstruct log images, an autoencoder pretext, then fine-tune the same backbone for mask prediction. The lineage is old and well credited, from denoising autoencoders learning transferable representations by rebuilding a corrupted input [3] to the masked-autoencoder statement that hiding most of an image and forcing the network to rebuild it yields a strong initialisation with no labels [4]. In our stack it is a warm start: the encoder arrives at the supervised fine-tune already knowing what a log looks like, so the scarce labelled examples buy curve-finding rather than teaching the network that logs have gridlines.
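The pretext task needs nothing but the images themselves, which a short sketch makes visible. The helper names, patch size, and fill value below are hypothetical. The 0.75 mask ratio and the choice to score only the hidden patches follow the masked-autoencoder recipe in [4]. Filling hidden patches with a constant is a simplification closer to the corruption step in [3]; the encoder in [4] skips masked patches rather than seeing a fill value.

```python
import random

def make_pretext_pair(image, patch=4, mask_ratio=0.75, seed=0, fill=0):
    """Return (corrupted, target, hidden) for a reconstruction pretext task.

    image is a list of rows of pixel values whose height and width are
    multiples of `patch`. The target is the untouched image; no label is used.
    """
    height, width = len(image), len(image[0])
    patches = [(py, px)
               for py in range(0, height, patch)
               for px in range(0, width, patch)]
    rng = random.Random(seed)
    hidden = set(rng.sample(patches, int(len(patches) * mask_ratio)))

    corrupted = [row[:] for row in image]      # copy so the target survives
    for py, px in hidden:
        for y in range(py, py + patch):
            for x in range(px, px + patch):
                corrupted[y][x] = fill
    return corrupted, image, hidden

def reconstruction_loss(predicted, target, hidden, patch=4):
    """Mean squared error computed over the hidden patches only."""
    total, count = 0.0, 0
    for py, px in hidden:
        for y in range(py, py + patch):
            for x in range(px, px + patch):
                total += (predicted[y][x] - target[y][x]) ** 2
                count += 1
    return total / count if count else 0.0

# A flat 8x8 "scan" is enough to show the shapes involved.
scan = [[100] * 8 for _ in range(8)]
corrupted, target, hidden = make_pretext_pair(scan)
```

An encoder-decoder trained to drive `reconstruction_loss` down on pairs like these has seen paper, gridlines, and ink statistics before it ever sees a mask. The supervised fine-tune then starts from that encoder rather than from random weights.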

The exhibit makes the stack legible as what it is, a ladder of label sources ranked by yield, with the counterfactual it replaces sitting next to it in orange.

Exhibit: where the training labels for the curve segmenter actually came from, ranked by how many examples each source yields and how little human annotation it costs. The ladder on the left stacks four sources: a 2,000-instance hand-labelled binary floor, the only rung a human drew; 15,000 procedurally synthesised multiclass instances and 20,000 synthetic 2-curve logs, both generated with the mask handed back for free; and autoencoder reconstruction pretraining, drawn as an open dashed rung because it yields not a count of labels but a free warm start applied to everything above it. The orange card on the right is the counterfactual the ladder replaces: the 136,771-TIF raster archive that is economically impossible to label by hand. Drag the price-per-label lever and the archive's manual-annotation bill climbs off any budget while the 35,000 manufactured examples cost nothing to draw. The example counts, the archive sizes, the reconstruction warm start, and the 80/20 split are sourced from the engagement archive; the per-label rate is an illustrative input used only to price the counterfactual, not a real quoted rate.

Read top to bottom, it is one argument. The hand-labelled binary floor is the only rung a human drew, 2,000 instances, and it is deliberately small. Above it sit the two synthetic rungs, 15,000 and 20,000, that cost compute rather than annotator time. Underneath everything is the reconstruction warm start, drawn open because its yield is a head start applied to every rung above it rather than a count of labels. To the right is the archive: 136,771 files whose hand-annotation bill climbs off any budget wherever you drag the price lever, while the cost of the manufactured supply does not move at all, because compute-priced supply does not scale with an annotation rate. The whole supply was then split 80/20 between training and validation.
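The lever's arithmetic is simple enough to write down. The counts below are the ones quoted in this note; the prices are arbitrary lever positions chosen for illustration, not quoted annotation rates, and the function names are invented for the sketch.

```python
ARCHIVE_TIFS = 136_771        # raster scans in the archive (from the text)
HAND_LABELLED = 2_000         # binary masks a person drew (from the text)
SYNTHETIC = 15_000 + 20_000   # multiclass instances + 2-curve logs (from the text)

def manual_bill(price_per_label, n_files=ARCHIVE_TIFS):
    """Counterfactual cost of hand-labelling every scan at a flat per-label price."""
    return n_files * price_per_label

def manufactured_annotation_cost(price_per_label, n_examples=SYNTHETIC):
    """Synthetic masks are drawn by the generator, so no annotator is paid.

    Compute cost is real, but it does not depend on the annotation price.
    """
    return 0.0

# How many times larger the archive is than the hand-labelled floor.
archive_to_floor = ARCHIVE_TIFS / HAND_LABELLED

# Illustrative lever positions only; none of these is a quoted annotation rate.
for price in (1.0, 5.0, 25.0):
    print(f"price/label {price:>5.2f}: archive bill {manual_bill(price):>12,.0f}, "
          f"manufactured {manufactured_annotation_cost(price):,.0f}")
```

Whatever unit the lever is priced in, the archive bill is linear in it and the manufactured column stays flat. The archive is also roughly 68 times the size of the hand-labelled floor, which is why the ranking does not depend on the exact rate.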

Manufactured does not mean free of obligations

The honest part of this argument is where it stops. Manufacturing labels removes the annotation bill; it does not remove the work, it relocates it. The generator now carries the assumptions a human annotator would otherwise carry implicitly, and every failure mode you forget to randomise into it is one the model meets cold in the field. A generator that never draws a folded, stained, fourth-generation photocopy has quietly ruled those scans out of scope, and the model inherits that decision without anyone signing off on it. The reconstruction warm start has its own catch: it teaches the network the structure of the images you pretrained on, so if those are cleaner than the field, the head-start is partly a head-start on the wrong distribution. Both routes move effort from labelling into building and auditing the generator, which is real engineering, not a saving that appears from nowhere.
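One way to keep that relocated effort visible is to write the generator's coverage down as data and audit it against what the field actually throws up. The sketch below is hypothetical throughout: the degradation modes, their ranges, and the field failure tags are invented to show the shape of the check, not taken from the engagement.

```python
import random

# Hypothetical degradation modes and sampling ranges; a real generator's
# list would come from looking at field scans, not from this sketch.
RANDOMISED_MODES = {
    "gaussian_noise_sigma": (0.0, 12.0),
    "skew_degrees": (-3.0, 3.0),
    "ink_fade": (0.0, 0.5),
    "gridline_contrast": (0.1, 0.9),
}

def sample_degradation(rng):
    """Draw one setting per mode, uniformly within its range."""
    return {mode: rng.uniform(lo, hi) for mode, (lo, hi) in RANDOMISED_MODES.items()}

# Maps a failure tag seen in the field to the generator mode that covers it.
# None means nobody has randomised it yet.
FIELD_TAG_TO_MODE = {
    "speckle": "gaussian_noise_sigma",
    "rotated_scan": "skew_degrees",
    "faded_ink": "ink_fade",
    "fold_crease": None,
    "coffee_stain": None,
}

def uncovered_failure_modes(field_tags):
    """Return field failure tags the generator never produces, sorted."""
    return sorted(
        tag for tag in set(field_tags)
        if FIELD_TAG_TO_MODE.get(tag) not in RANDOMISED_MODES
    )

settings = sample_degradation(random.Random(0))
gaps = uncovered_failure_modes(["speckle", "fold_crease", "coffee_stain", "speckle"])
print(gaps)  # ['coffee_stain', 'fold_crease']
```

Every tag in `gaps` is a scope decision the generator has made on the model's behalf. Listing them turns that into a decision someone can sign off on, rather than one inherited by default.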

What the stack does buy, unambiguously, is a working segmenter for an archive that hand-labelling could never reach. That claim is narrower and more defensible than "synthetic data is better." For this shape of scarcity, an absent-label archive too large to annotate, manufacturing the supply is not the clever option, it is the only option that produces training data, and the 2,000-instance human floor exists to anchor and check the manufactured supply rather than to carry the model alone.

The supply-chain habit

The habit this left us with is to ask, before touching architecture, where every labelled example is going to come from and what each source yields per unit of human time. Sometimes the answer is weak supervision, because the labels are latent in real data and expressible as rules [1]. For us it was synthesis, because the labels were absent and the generator could draw image and mask together under domain randomisation [2], with reconstruction pretraining as a no-label warm start beneath it [3] [4]. Ranked by yield, the human-drawn rung is the smallest on the ladder, and that is not a compromise forced by a tight budget. It is the correct shape for a segmenter that has to learn from an archive no army of annotators was ever going to label by hand.

Limitations

This is one engagement's supply chain, not a general recipe, and the numbers are counts, not a benchmark. The 2,000 hand-labelled instances, the 15,000 and 20,000 synthetic sets, and the 136,771-TIF and 7,781-LAS archive sizes are real archive figures; the per-label price in the exhibit is an illustrative rate used only to make the counterfactual bill tangible, not a quoted annotation cost, and the ranking argument does not depend on its exact value. Whether synthesis was the right call is a property of this scarcity shape, an absent-label archive with no cheap labelling heuristic, and it does not transfer as a rule to a task where weak supervision's assumptions hold or where a labelled public corpus already exists. We also do not claim the manufactured supply matched the field distribution; that is exactly the generator-coverage question this note flags as relocated effort rather than eliminated effort, and it is settled by held-out field scans, not by the size of the synthetic set. And nothing here speaks to final segmentation quality, which is governed by the loss, the architecture, and the validation split, and is a separate question from where the labels came from.

References

[1] Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment 11(3), 2017, pp. 269-282. Labelling functions plus a label model that reconciles their disagreements into training targets you never hand-drew. https://www.vldb.org/pvldb/vol11/p269-ratner.pdf

[2] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IEEE/RSJ IROS, 2017. Randomise a synthetic generator widely enough and the real domain reads as one more sample the network already trained on. https://arxiv.org/abs/1703.06907

[3] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, 2010, pp. 3371-3408. A network made to reconstruct a corrupted input learns a representation that transfers to a supervised task. https://www.jmlr.org/papers/v11/vincent10a.html

[4] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. IEEE/CVF CVPR, 2022. Mask most of an image, make the network rebuild it, and a strong downstream initialisation falls out with no labels. https://arxiv.org/abs/2111.06377
