The fastest way to ruin a synthetic dataset is to make it beautiful. We learned this the way most teams do: by building a procedural generator for raster well logs, rendering thousands of clean curves on crisp grids, training a segmentation network whose masks scored wonderfully on its own held-out renders, and then watching those same masks come apart the first time they met a real scan with a coffee ring and a crooked feed. The model had not learned to find a curve. It had learned the exact geometry of our renderer, which is a thing that exists nowhere outside our own code. This guide is about the fix, which is not a clever architecture but a discipline of deliberately wrecking your synthetic data before the network ever sees it, and doing the wrecking in the shapes a real scanner actually leaves behind.
The idea that you should train on degraded synthetic data rather than pristine synthetic data is older than this project and the credit belongs to the people who established it. Tobin and colleagues named domain randomization for exactly this purpose, arguing that if you vary the nuisance factors of a simulator widely enough, the real world becomes just another sample the network has already seen [1]. The scene-text community got there a little earlier in practice: Gupta and colleagues rendered text into natural images with realistic blending and lighting so a detector trained on it would transfer [2], and Jaderberg and colleagues had already shown that a synthetic word generator with enough rendering variation could train recognition models that worked on real photographs [3]. Our contribution is not the principle. It is a concrete, log-specific recipe for which degradations matter on a one-to-three-pixel ink trace, in what order to reach for them, and how to tell when you have added enough.
Why a clean label lies to you
Start with the failure, because it is the whole motivation. A clean synthetic log gives you a perfect label for free. The renderer knows exactly which pixels it painted as curve one, which as curve two, and which it left as background, so the mask is correct to the pixel with no annotation effort at all. That perfection is precisely the problem. The mask the network learns is fitted to a curve that is always exactly the same width, always sits on an unbroken ruling, never bleeds into its neighbour, and never rides a page that was fed in at an angle. On real scans, every one of those assumptions is false, and a mask that has internalised them as if they were laws of nature has no idea what to do when they break.
Concretely, a clean-only training run will report a curve-class IoU that looks genuinely good, and that number is a mirage. It is measured on held-out renders from the same flawless generator the model trained on, so it tells you the network can reproduce its own renderer, not that it can read a scan. The honest measurement is the one taken on real raster logs, and there the same model collapses, because the real scans carry artifacts the clean labels never taught it to ignore. The job of realistic noise is to close the gap between those two numbers, which it does by lowering the flattering one until it means something.
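To keep the two numbers comparable, it helps to compute them with one function and change only the evaluation set. What follows is a minimal sketch of per-class IoU over integer label maps, assuming the three-class encoding 0 = background, 1 = curve one, 2 = curve two. The function name and the encoding are our illustration here, not a description of any particular training code.

```python
import numpy as np

def per_class_iou(pred, target, num_classes=3):
    """IoU per class for two integer label maps of equal shape.

    Assumed encoding (illustrative): 0 background, 1 curve one, 2 curve two.
    A class absent from both maps yields NaN rather than a made-up score.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(float(inter) / float(union) if union else float("nan"))
    return ious
```

Run it once on held-out synthetic renders and once on hand-labelled real scans. The gap between the two curve-class entries is the quantity the rest of this guide is trying to shrink.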
The four degradations worth your time
There are infinitely many ways to corrupt an image and only a few that pay on a well log. After enough runs you converge on a short list, ranked roughly by how much robustness each one buys. The interactive studio below lets you switch each of them on and off, drive an overall realism intensity, and watch the synthetic render degrade on the left while the per-class label fidelity on the right settles toward what survives on a real scan. Reach for them in this order.
Page skew and rotation is first because it is the cheapest to add and the most expensive to omit. A scanner fed a page even a couple of degrees off square relocates every depth row at once, and a mask trained only on perfectly level renders will systematically mis-place the curve top to bottom. Rotating the whole synthetic field by a small random angle forces the network to find the curve by its shape rather than by its absolute position, which is the single biggest reason a clean-only model fails on real feeds.
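The one rule that matters when adding skew is that the image and its label must receive the identical transform, or the free perfect label stops being perfect. Below is a dependency-light sketch using nearest-neighbour inverse mapping in NumPy. The helper names, the 2-degree default, and the fill values are assumptions for illustration. A production generator would more likely rotate the image with a smoother interpolator while keeping nearest-neighbour for the mask.

```python
import numpy as np

def rotate_nearest(arr, angle_deg, fill):
    """Rotate a 2-D array about its centre with nearest-neighbour sampling."""
    h, w = arr.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.cos(theta) * dx + np.sin(theta) * dy + cx
    src_y = -np.sin(theta) * dx + np.cos(theta) * dy + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    inside = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full_like(arr, fill)
    out[inside] = arr[sy[inside], sx[inside]]
    return out

def skew_pair(image, mask, rng, max_deg=2.0):
    """Apply one random small rotation to both the render and its label."""
    angle = rng.uniform(-max_deg, max_deg)  # hypothetical default range
    skewed_image = rotate_nearest(image, angle, fill=255)  # white paper
    skewed_mask = rotate_nearest(mask, angle, fill=0)      # background class
    return skewed_image, skewed_mask, angle
```

Because both arrays are sampled through the same index map, every inked pixel in the output still sits exactly under its label, which is the property worth asserting in a unit test.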
Ink bleed and overprint is second and it is the one that protects the multiclass distinction. Two curves on a log cross, and where they cross on a real scan the ink fuses into a single dark blob that no amount of clean rendering ever produces. If the synthetic data never shows that fusion, the mask has no policy for it and will smear the two curve classes into each other at every crossing. Modelling bleed, by widening and softening the stroke where curves are near each other, teaches the mask to hold the classes apart under exactly the condition that breaks them.
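One way to sketch that bleed, under our own simplifying assumptions: find the region where both curve classes are within a few pixels of each other, widen the ink there, and leave the label untouched so the mask still names the two classes separately. The radii and ink level are hypothetical defaults. Real bleed is softer than this hard-edged blob, so a blur would normally follow.

```python
import numpy as np

def dilate(binary, radius):
    """Square-window binary dilation built from shifted ORs (no SciPy needed)."""
    h, w = binary.shape
    padded = np.pad(binary, radius, mode="constant", constant_values=False)
    out = np.zeros_like(binary)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def add_bleed(image, mask, near_radius=3, bleed_radius=1, ink_level=40):
    """Darken a widened stroke only where curve one and curve two are close.

    The label map is deliberately NOT modified: the appearance fuses,
    the classes stay distinct. All numeric defaults are illustrative.
    """
    c1, c2 = mask == 1, mask == 2
    near = dilate(c1, near_radius) & dilate(c2, near_radius)
    blob = dilate(c1 | c2, bleed_radius) & near
    out = image.copy()
    out[blob] = np.minimum(out[blob], ink_level)
    return out
```

Sizing `bleed_radius` to the stroke width keeps the fused region about as large as a real crossing and no larger.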
Grid ruling jitter is third. Real log grids are printed, photocopied, faxed, and rescanned until the rulings are faded, broken, and irregular. A network trained on a crisp continuous grid will happily anchor its predictions to that grid and then lose its footing the moment the ruling disappears for a few pixels. Breaking and fading the synthetic gridlines stops the model from leaning on a piece of furniture that is not reliably there.
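A minimal sketch of the breaking and fading, assuming a regular grid with a fixed pixel spacing: each ruling gets its own random grey level and loses random fixed-length segments. The function names, spacing, and probabilities are hypothetical. Positional irregularity of the rulings is not modelled here and would be a further step.

```python
import numpy as np

def broken_line(length, rng, gap_prob, gap_len):
    """Boolean keep-mask for one ruling, with random fixed-length gaps."""
    keep = np.ones(length, dtype=bool)
    for start in range(0, length, gap_len):
        if rng.random() < gap_prob:
            keep[start:start + gap_len] = False
    return keep

def draw_jittered_grid(height, width, spacing, rng,
                       gap_prob=0.15, gap_len=8, min_ink=120, max_ink=220):
    """White page with faded, broken rulings. Defaults are illustrative."""
    page = np.full((height, width), 255, dtype=np.uint8)
    for x in range(spacing, width, spacing):          # vertical rulings
        level = int(rng.integers(min_ink, max_ink + 1))  # higher = more faded
        keep = broken_line(height, rng, gap_prob, gap_len)
        page[keep, x] = np.minimum(page[keep, x], level)
    for y in range(spacing, height, spacing):         # horizontal rulings
        level = int(rng.integers(min_ink, max_ink + 1))
        keep = broken_line(width, rng, gap_prob, gap_len)
        page[y, keep] = np.minimum(page[y, keep], level)
    return page
```

The curves would be drawn over this page afterwards, so the grid never overwrites ink and never touches the label.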
Paper and scanner noise is last, not because it does not matter but because it saturates fastest. A little fibre speckle and sensor grain stops the mask from trusting that background is pure white, which matters, but the network absorbs that lesson quickly and extra grain past a low level buys almost nothing. It is the seasoning, not the meal.
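For completeness, a light-handed sketch of the seasoning: low-variance Gaussian grain plus a sparse scatter of darker specks. The sigma, speckle probability, and speck darkness range are assumed values chosen to stay small, not measured ones.

```python
import numpy as np

def add_paper_noise(image, rng, grain_sigma=4.0, speckle_prob=0.002):
    """Add sensor grain and sparse fibre speckle to a uint8 greyscale page.

    Defaults are deliberately low and purely illustrative.
    """
    noisy = image.astype(np.float64)
    noisy += rng.normal(0.0, grain_sigma, size=image.shape)   # sensor grain
    specks = rng.random(image.shape) < speckle_prob           # fibre speckle
    noisy[specks] -= rng.uniform(40.0, 120.0, size=int(specks.sum()))
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

This stage touches only pixel intensities, so the label map passes through unchanged.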
The studio is calibrated to the real generator envelope rather than to a toy. The logs it represents are rendered anywhere from 3,200 to 12,800 pixels wide and 480 to 640 pixels tall, always with exactly two curves per log against a three-class mask of background, curve one, and curve two. The per-class fidelity it reports lands on the measured anchors from the multiclass runs: a background IoU around 0.94, a curve-one IoU around 0.26, and a curve-two IoU around 0.21. And the orange line is the one that disciplines everything, the 0.51 peak IoU the curve classes top out at on real scans. The intermediate read-outs as you drag are an illustrative response built to make the trade legible; the anchors and the ceiling are the real numbers the labels have to live with.
Reading the ceiling without flinching
That 0.51 ceiling deserves a paragraph of its own, because the first instinct on seeing it is to think something has gone wrong. It has not. A one-to-three-pixel curve against a background that is the rest of the page is a brutally imbalanced segmentation target, and IoU is unforgiving of exactly the small boundary disagreements that are unavoidable when the object you are masking is two pixels wide. The background class sits near 0.94 because it is enormous and easy; the curve classes sit far lower because every pixel of slop costs a large fraction of a tiny union. This is the geometry of the problem, not a defect of the model, and a survey of how the field measures overlap on thin structures will tell you the same thing in more general terms.
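The arithmetic is easy to check on a toy. Take a band of a given width and a prediction that is identical except for a one-pixel shift: the intersection is width minus shift, the union is width plus shift. The sketch below is an idealised illustration with a hypothetical helper, not a measurement from our runs.

```python
import numpy as np

def shifted_band_iou(width, shift, height=512):
    """IoU between a 1-D band and the same band displaced by `shift` pixels."""
    target = np.zeros(height, dtype=bool)
    target[100:100 + width] = True
    pred = np.roll(target, shift)
    inter = np.logical_and(target, pred).sum()
    union = np.logical_or(target, pred).sum()
    return float(inter) / float(union)

thin = shifted_band_iou(width=2, shift=1)     # (2 - 1) / (2 + 1) = 1/3
thick = shifted_band_iou(width=200, shift=1)  # 199 / 201
```

A single pixel of slop costs the two-pixel curve two thirds of its score and costs the two-hundred-pixel region about one percent. That asymmetry, not model quality, is what separates the curve classes from the background.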
What realistic noise does is make the curve numbers you do get honest. A clean-only run might report a curve-one IoU well above the ceiling, and that number is fiction: it is the model memorising its own renderer, and it does not survive contact with a real scan. As you switch on the degradations in the studio, that flattering bar slides down past the orange line and the portion above the ceiling fades out, because that portion was never going to transfer. What is left below the line is the part you can actually bank, and the whole point of baking in noise is to make the training number and the deployment number tell the same story rather than two contradictory ones.
How much is enough, and the failure on the far side
A field guide owes you the part where the advice gets uncomfortable, so here it is: you can absolutely overdo this. Domain randomization is not a licence to corrupt without limit, and the broader augmentation literature has catalogued how aggressive transforms can erase the signal along with the nuisance [4]. Crank the realism intensity to the top with every degradation on, and the synthetic curve becomes so buried in skew, bleed, broken grid, and speckle that the mask cannot find a consistent target to learn at all. The training loss stops falling, the curve classes drift down toward noise, and you have traded a model that memorised clean geometry for one that learned nothing.
The procedure we would actually recommend is the unglamorous one. Add skew first and at a realistic magnitude, a few degrees, not a tumble. Add bleed second, sized to the stroke width so crossings fuse the way they really do and no more. Add grid jitter third, enough that the ruling is sometimes absent but still mostly readable. Add paper grain last and lightly. Then, and this is the step people skip, look at the synthetic images with your own eyes and ask whether they look like the real scans on your desk. If your synthetic logs are visibly cleaner than your real ones, you have not added enough; if they are visibly worse, you have gone too far. The render is the spec, and the eye is the gate.
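The ordering and the magnitude caps can be written down as a small configuration so that nobody turns a single dial past the point of realism. This is a sketch under stated assumptions: the class, the field names, and every numeric cap are hypothetical placeholders to be replaced by whatever your own scans justify.

```python
from dataclasses import dataclass

# Stages in the order recommended above.
ORDER = ("skew", "bleed", "grid", "grain")

@dataclass(frozen=True)
class DegradationConfig:
    max_skew_deg: float     # a few degrees, not a tumble
    bleed_radius_px: int    # sized to the stroke width, no more
    grid_gap_prob: float    # ruling sometimes absent, mostly readable
    grain_sigma: float      # light

def config_for(intensity, stroke_width_px=2):
    """Map a 0..1 realism intensity to capped magnitudes (illustrative caps)."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return DegradationConfig(
        max_skew_deg=3.0 * intensity,
        bleed_radius_px=min(stroke_width_px, round(stroke_width_px * intensity)),
        grid_gap_prob=0.3 * intensity,
        grain_sigma=6.0 * intensity,
    )
```

The caps do not replace the visual check. They only make sure that the top of the slider is still a page someone could plausibly have scanned.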
A short word on classification noise versus segmentation noise
One caveat for readers coming from image classification, where a lot of the augmentation folklore was written. The transforms that regularise a classifier are not all safe for a per-pixel mask. Cutout, for instance, which masks a random patch of the input and works well for whole-image labels [6], is dangerous on a thin curve because the patch it erases may be the only place the curve appears in that column, and your label still says the curve is there. Likewise, the work on compression and additive noise in classification [7] is a useful reminder that networks are sensitive to exactly these corruptions, but the lesson transfers to segmentation only when the corruption preserves the spatial truth of the mask. The degradations in this guide are chosen precisely because they deform the appearance of the curve while leaving its location recoverable, which is what a segmentation label needs. The U-Net family this kind of mask is built on [5] is forgiving of appearance variation and unforgiving of label-location lies, and realistic scan noise respects that line where blunt classifier augmentations do not.
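The cutout hazard is easy to make concrete. The sketch below erases a square patch from the image only, then counts the columns where the label still claims a curve but no ink remains under it. Both helpers and the ink threshold are hypothetical. In actual cutout the patch position is drawn at random, and it is fixed here only so the effect is reproducible.

```python
import numpy as np

def cutout_image_only(image, y0, x0, size, fill=255):
    """Erase a square patch from the image, leaving the label untouched."""
    out = image.copy()
    out[y0:y0 + size, x0:x0 + size] = fill
    return out

def orphaned_label_columns(image, mask, ink_threshold=128):
    """Count columns where the label marks a curve but no ink lies under it."""
    labelled = mask > 0
    inked = image < ink_threshold
    has_label = labelled.any(axis=0)
    has_ink_under_label = (labelled & inked).any(axis=0)
    return int((has_label & ~has_ink_under_label).sum())

# Toy log: one two-pixel curve running the full width of a white page.
mask = np.zeros((32, 64), dtype=np.uint8)
mask[10:12, :] = 1
image = np.where(mask > 0, 0, 255).astype(np.uint8)
cut = cutout_image_only(image, y0=5, x0=8, size=16)
orphans = orphaned_label_columns(cut, mask)  # 16 columns now contradict the label
```

Each orphaned column is a training example telling the network that blank paper is curve one. The four degradations above never create such columns, because they alter how the curve looks without removing it from where the label says it is.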
The synthetic data you should want is not the data that looks good in a figure. It is the data that looks like the worst scan you expect to meet, labelled with the certainty only a renderer can give you, degraded right up to the edge where the curve is still findable and not one step past it. Build the generator to make ugly logs on purpose, measure against the ceiling the real scans impose, and let the flattering numbers go.
Field notes
- A perfectly clean synthetic log gives a free perfect label and teaches the wrong thing: the mask fits the geometry of your renderer, which exists nowhere outside your code, and collapses on real scans. The flattering curve IoU it reports is measured on its own renders and does not transfer. The principle of training on degraded synthetic data is domain randomization's and the scene-text generators' before it, and the credit is theirs.
- Four degradations earn their place on a well log, in priority order: page skew and rotation (relocates every depth row, the costliest to omit), ink bleed and overprint (fuses curves at crossings, protects the multiclass distinction), grid ruling jitter (stops the mask anchoring on furniture that is not reliably there), and paper and scanner grain (matters but saturates fastest).
- The studio is calibrated to the real envelope: logs 3,200 to 12,800 pixels wide and 480 to 640 tall, exactly two curves per log against a three-class mask, with measured anchors of background IoU 0.94, curve-one 0.26, curve-two 0.21.
- The 0.51 curve IoU ceiling is the geometry of a one-to-three-pixel target, not a defect. Realistic noise does not raise it; it makes the number you report honest by pulling the flattering clean-only IoU down below the line, where what is left is what actually survives on a real scan.
- You can overdo it. Maxed degradation buries the curve so the mask cannot find a consistent target and learns nothing. Add skew, then bleed, then grid, then light grain, each at a realistic magnitude, and use your own eye against real scans as the gate: synthetic logs should look as bad as the worst scan you expect, no worse. Beware classifier augmentations like cutout that lie about where the mask should be.
References
[1] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS (2017). The naming of the train-on-degraded-synthetic principle this whole guide applies to raster logs. https://arxiv.org/abs/1703.06907
[2] A. Gupta, A. Vedaldi, A. Zisserman. Synthetic Data for Text Localisation in Natural Images. CVPR (2016). Renders text into images with realistic blending so a detector trained on it transfers, the closest prior analogue to rendering realistic curves. https://arxiv.org/abs/1604.06646
[3] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. NeurIPS Deep Learning Workshop (2014). An early demonstration that a synthetic generator with enough rendering variation trains models that work on real photographs. https://arxiv.org/abs/1406.2227
[4] C. Shorten, T. M. Khoshgoftaar. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data (2019). The broad catalogue of augmentation transforms and the warning that aggressive ones erase signal along with nuisance. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0
[5] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder family these curve masks are built on, forgiving of appearance variation and unforgiving of label-location lies. https://arxiv.org/abs/1505.04597
[6] T. DeVries, G. W. Taylor. Improved Regularization of Convolutional Neural Networks with Cutout (2017). A classifier augmentation that masks a random patch, cited here as the kind of transform that is unsafe on a thin-curve segmentation label. https://arxiv.org/abs/1708.04552
[7] S. G. Mueller, E. R. Davies. The effect of JPEG compression artefacts and additive noise on neural network image classification (2017). A reminder that networks are sensitive to exactly these corruptions, with the caveat that the lesson transfers to segmentation only when the mask truth is preserved. https://arxiv.org/abs/1711.10564