Synthetic Data for Scientific-Image Segmentation: What the Literature Got Right

Abstract

A scanned paper well log is a scientific document: ruled grid, depth track, two or three overprinted analogue curves, decades of fade and skew. Digitising it back into numbers is a segmentation problem, and like every segmentation problem on legacy scientific imagery it is starved of labels. Hand-tracing curves off a raster to build ground truth is exactly the work the model is meant to replace. We took the route the literature had been recommending for years: generate the training data procedurally, and never label a real scan at all. We built a synthetic-log renderer that emits a paper log together with its pixel-perfect curve masks for free, trained VeerNet (our encoder-decoder CNN with a transformer attention bottleneck) on those renders, and read curves off scanned rasters it had never seen. This note sets our result next to the public work on synthetic and procedurally generated training data for scientific and document image segmentation, credits the ideas we leaned on, and is honest about the one axis where sim-to-real held and the one where it frayed.

Background

The case for training segmentation networks on synthetic data is old enough to be boring, and the boring ideas are the ones that work. The encoder-decoder shape every modern segmenter inherits comes from U-Net (Ronneberger et al., 2015), and the U-Net paper made the synthetic-data argument explicit in its own way: with only a handful of annotated biomedical images, the authors leaned hard on deformation augmentation to manufacture the variation a small labelled set cannot supply. The lesson generalised. When real labels are scarce and the generative process behind the image is known, you can render the supervision instead of collecting it.

Scientific and document imagery is the cleanest place this works, because the documents are synthetic to begin with. A printed well log, a musical score, a circuit schematic, a page of text: each is a deterministic rendering of structured source data through a known drawing process. If you own the drawing process, you own an inexhaustible, perfectly labelled training set. The document-OCR community established this early, training recognisers on text rendered from corpora through randomised fonts, backgrounds, and degradations rather than on hand-annotated photographs of pages (Gupta et al., 2016). The recipe is always the same: render from source, apply a degradation model that imitates the real acquisition channel, and the masks come free because you drew the strokes yourself.

For well logs specifically the prior art is mostly classical. The strongest period-correct reference is a gridlines-elimination pipeline that digitises well-logging parameter graphs with morphological image processing, no learning at all (Yuan and Yang, 2019). It works on clean scans and degrades exactly where classical CV always degrades: overlapping curves, broken gridlines, faded ink, skew. That failure surface is precisely the gap a learned segmenter is supposed to close, and a learned segmenter needs labels the classical pipeline never required. Synthetic data is how you square that circle.

The public well-log datasets do not help here. The Texas Railroad Commission archive we worked against is enormous: 136,771 TIF raster images against only 7,781 already-digitised LAS files. The open Xeek/FORCE-2020 set of 118 Norwegian Sea wells (McDonald, 2021) ships clean LAS curves, not the scanned paper rasters we actually need to read. The labels we wanted, curve masks aligned to a degraded raster, simply do not exist at scale in public. So we made them.

Method

The renderer is the whole contribution, so it gets the detail. It is a procedural log generator: sample a curve geometry, draw it onto a synthetic paper grid at print resolution, apply a degradation model that imitates scanning, and emit the rendered raster alongside the exact pixel mask of every curve. Nothing is hand-traced. Two configurations matter. The binary setting renders a single curve against background; the multiclass setting renders a constant two curves per log against background, three output classes in total, because two overprinted analogue curves crossing each other is the hard case real logs throw at you and the one classical pipelines choke on.
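The four steps above (sample a geometry, draw it on a grid, degrade, emit the mask) can be sketched in a few lines. This is an illustrative sketch only, not the renderer behind the results: `render_log` is a hypothetical name, the grid spacing, smoothing window, fade range, and noise level are made-up values, and depth is assumed to run along the width axis.

```python
import numpy as np

def render_log(width=3200, height=480, n_curves=2, seed=0):
    """Return (raster, mask): a degraded grayscale log and its per-pixel class mask.

    mask is 0 for background and k for curve k (1-based).
    """
    rng = np.random.default_rng(seed)
    raster = np.full((height, width), 255.0)        # blank white paper
    mask = np.zeros((height, width), dtype=np.uint8)

    # Ruled grid: assumed 40-pixel spacing in light grey ink.
    raster[::40, :] = 180.0
    raster[:, ::40] = 180.0

    cols = np.arange(width)
    for k in range(1, n_curves + 1):
        # Curve geometry: a smoothed random walk rescaled into the track.
        walk = np.cumsum(rng.normal(size=width))
        walk = np.convolve(walk, np.ones(25) / 25, mode="same")
        span = walk.max() - walk.min()
        rows = ((walk - walk.min()) / span * (height - 1)).round().astype(int)
        raster[rows, cols] = 0.0                    # dark ink, one pixel per column
        mask[rows, cols] = k                        # later curves win at crossings

    # Degradation model: fade the ink contrast, then add scanner noise.
    fade = rng.uniform(0.6, 1.0)
    raster = 255.0 - fade * (255.0 - raster)
    raster += rng.normal(scale=8.0, size=raster.shape)
    return np.clip(raster, 0, 255).astype(np.uint8), mask

raster, mask = render_log(n_curves=2, seed=1)       # multiclass-style render
```

The point of the sketch is the last line of the loop: the mask is written by the same statement that draws the ink, so the label costs nothing beyond the render itself.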

We deliberately rendered logs at realistic, ragged scales rather than tidy fixed tiles. Synthetic widths spanned 3,200 to 12,800 pixels and heights 480 to 640 pixels, which forced a custom collate function so a batch could hold variably sized images. The binary set ran at batch size 1, memory-bound by the wide rasters; the multiclass set ran at batch size 16 through that collate path. The final synthetic corpora were 2,000 instances for binary segmentation and 15,000 procedurally generated curves for the multiclass v2 set, with an earlier 20,000-log two-curve dataset behind it. We held an 80 percent train split throughout.
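A collate step for variably sized rasters usually amounts to padding every sample to the largest height and width in the batch. The sketch below shows that idea in NumPy; it is an assumption about the general approach, since the text does not say which framework or padding strategy was used, and `pad_collate` with its pad values is hypothetical.

```python
import numpy as np

def pad_collate(samples, pad_value=255, mask_pad=0):
    """Stack variably sized (image, mask) pairs into two padded batch arrays.

    Images are padded with white paper, masks with the background class.
    The original sizes are returned so padding can be cropped off later.
    """
    max_h = max(img.shape[0] for img, _ in samples)
    max_w = max(img.shape[1] for img, _ in samples)
    images = np.full((len(samples), max_h, max_w), pad_value, dtype=np.uint8)
    masks = np.full((len(samples), max_h, max_w), mask_pad, dtype=np.uint8)
    sizes = []
    for i, (img, msk) in enumerate(samples):
        h, w = img.shape
        images[i, :h, :w] = img      # top-left aligned, padding to the right and below
        masks[i, :h, :w] = msk
        sizes.append((h, w))
    return images, masks, sizes

# Two renders at opposite ends of the stated size range.
batch = [
    (np.zeros((480, 3200), dtype=np.uint8), np.ones((480, 3200), dtype=np.uint8)),
    (np.zeros((640, 12800), dtype=np.uint8), np.ones((640, 12800), dtype=np.uint8)),
]
images, masks, sizes = pad_collate(batch)
```

Padding masks with the background class keeps the padded area from being scored as a missed curve, at the price of inflating an already dominant class.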

The model reading these renders is VeerNet, an encoder-decoder CNN with a transformer refinement stage on the bottleneck. Five residual encoder blocks at stride 2 take a single grayscale channel down to a 128-dimensional feature map; two transformer attention layers refine that bottleneck so the network can reason about a curve as one long coherent stroke rather than a pile of local edges, in the spirit of self-attention as a global mixing operator (Vaswani et al., 2017); five decoder upsampling stages take it back to a full-resolution mask. Channel-and-spatial attention follows the CBAM design, reduction ratio 8 and a 7-pixel spatial kernel (Woo et al., 2018).
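For a sense of scale, the downsampling arithmetic implied by five stride-2 blocks can be worked through directly. The sketch assumes each block halves both spatial dimensions (rounding up) and that every bottleneck position becomes one attention token; neither detail is stated in the text, and `bottleneck_shape` is a hypothetical helper rather than part of VeerNet.

```python
def bottleneck_shape(height, width, n_blocks=5, stride=2):
    """Spatial size after n_blocks downsampling stages, rounding up at each stage."""
    for _ in range(n_blocks):
        height = -(-height // stride)   # ceiling division
        width = -(-width // stride)
    return height, width

# Smallest and largest renders in the stated size range.
for h, w in [(480, 3200), (640, 12800)]:
    bh, bw = bottleneck_shape(h, w)
    print(f"{h}x{w} raster -> {bh}x{bw} bottleneck, {bh * bw} positions")
```

Under these assumptions the smallest render leaves 15 by 100 bottleneck positions and the largest 20 by 400, so the attention stage sees between 1,500 and 8,000 positions per image.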

Because the curve pixels are a vanishing fraction of any log, every loss we touched was a class-imbalance loss. We evaluated five: Dice, Focal, Lovasz-Softmax, Soft Cross-Entropy, and Tversky. This is the same imbalance-aware family the segmentation literature converged on, including Tversky for a tunable precision-recall trade-off (Salehi et al., 2017) and Lovasz-Softmax as a direct IoU surrogate (Berman et al., 2018). The loss comparison is its own story and lives elsewhere; here it matters only as confirmation that thin-structure segmentation on synthetic logs behaves the way thin-structure segmentation always behaves.
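To make the "tunable precision-recall trade-off" concrete, here is a minimal sketch of a Tversky loss for a single foreground class, following the standard definition TP / (TP + alpha * FP + beta * FN). The default weights are illustrative only; the text does not report the values used in the experiments.

```python
import numpy as np

def tversky_loss(prob, target, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index for soft predictions `prob` against a binary `target`.

    alpha weights false positives, beta weights false negatives.
    alpha = beta = 0.5 recovers the Dice loss.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(prob * target)
    fp = np.sum(prob * (1.0 - target))
    fn = np.sum((1.0 - prob) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# A prediction that misses half the curve pixels and adds no false positives.
target = np.array([1, 1, 0, 0])
prob = np.array([1, 0, 0, 0])
miss_heavy = tversky_loss(prob, target, alpha=0.3, beta=0.7)   # penalises the miss more
miss_light = tversky_loss(prob, target, alpha=0.7, beta=0.3)   # penalises the miss less
```

Raising beta above alpha makes a missed curve pixel cost more than a spurious one, which is the direction one would normally push for structures that occupy almost none of the image.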

Training was cheap, which is the quiet point of synthetic data. The binary model trained on 2,000 instances for 50 epochs in just under 2 hours (110 minutes). The multiclass model on 15,000 instances took just over 9 hours (550 minutes) for the same 50 epochs. No annotation campaign, no inter-annotator disagreement, no week of a petrophysicist tracing curves. The bottleneck moved from labelling to rendering, and rendering scales the way labelling never will.

The one mechanism worth making visual is how procedural augmentation buys robustness rather than just volume. Naive augmentation deepens an existing point in the data distribution; structured augmentation that mixes several corruption chains and ties the model's predictions across them with a consistency loss is what actually widens the support, the AugMix argument (Hendrycks et al., 2020). The instrument below shows that mechanism directly: several augmentation chains of varying severity applied to one render, mixed under random convex weights, with a consistency tie holding the predictions together no matter how the mix is reweighted.

[Interactive figure: AugMix (Hendrycks et al., 2020) in three moves. Several augmentation chains of varying severity are applied to the same sample; their outputs are mixed under random convex weights; a Jensen-Shannon consistency loss ties the model's prediction on the original to its prediction on the mix. The three chains shown (A/B/C), their severities, and the weight split are schematic.]

That mechanism is exactly why a renderer plus a consistency-style augmentation policy generalises to real scans, and naive copy-the-curve augmentation does not. The render gives you a clean sample; the corruption model and the consistency tie teach the network that the curve survives the scanning channel.
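The mixing and the consistency tie can be sketched as follows. This follows the general AugMix recipe (Dirichlet weights over chains, a Beta-distributed blend with the clean sample, a Jensen-Shannon term across predictions) rather than any policy confirmed for this work; the three corruption functions, `alpha`, and the example probability vectors are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical corruption chains on images scaled to [0, 1] (1 = white paper).
def fade(x):
    return 1.0 - 0.7 * (1.0 - x)                      # weaker ink contrast

def shift(x):
    return np.roll(x, 2, axis=1)                      # small registration slip

def blur(x):
    return (x + np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)) / 3.0

def augmix(image, chains, rng, alpha=1.0):
    """Mix several augmentation chains under random convex weights."""
    w = rng.dirichlet([alpha] * len(chains))          # non-negative, sums to 1
    mixed = sum(wi * chain(image) for wi, chain in zip(w, chains))
    m = rng.beta(alpha, alpha)                        # blend back with the clean render
    return m * image + (1.0 - m) * mixed

def js_divergence(dists, eps=1e-12):
    """Jensen-Shannon divergence between probability vectors (rows of dists)."""
    p = np.clip(np.asarray(dists, dtype=float), eps, 1.0)
    mean = p.mean(axis=0, keepdims=True)
    kl = np.sum(p * (np.log(p) - np.log(mean)), axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
render = np.ones((8, 16))
render[4, :] = 0.0                                    # a one-pixel-wide curve
mixed = augmix(render, [fade, shift, blur], rng)

# Stand-ins for the model's per-pixel class probabilities on each input.
p_clean = [0.90, 0.05, 0.05]
p_mixed = [0.70, 0.20, 0.10]
consistency = js_divergence([p_clean, p_mixed])       # added to the training loss
```

The consistency term is zero only when the predictions on the clean render and on the mix agree, so minimising it pushes the network to give the same answer regardless of how the corruption weights were drawn.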

Results

The headline is that sim-to-real held on the axis that pays the bills and frayed on the axis the literature already warned us about. A VeerNet trained only on procedurally generated logs reconstructs the curve as a numeric trace off real rasters at high fidelity. On the multiclass set the best per-curve coefficient of determination reached R-squared 0.9891 under Tversky loss, with the curve-value errors landing where a petrophysicist will accept them: mean absolute error around 0.0277 on the first curve and 0.1241 on the second. The reconstructed signal is good. The network learned the shape of a log curve from drawings and applied it to photographs.
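Scoring a reconstruction on R-squared and mean absolute error first requires turning a predicted mask back into a numeric trace. One simple way to do that is sketched below, taking the mean row of each curve's pixels per column. The text does not describe the actual read-out method or the units of the reported errors, so `mask_to_trace` and `r2_and_mae` are hypothetical helpers.

```python
import numpy as np

def mask_to_trace(mask, label):
    """Mean row index of `label` pixels in each column; NaN where the curve is absent."""
    rows = np.arange(mask.shape[0], dtype=float)[:, None]
    hit = (mask == label)
    count = hit.sum(axis=0)
    total = (hit * rows).sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

def r2_and_mae(pred, truth):
    """Coefficient of determination and mean absolute error over columns present in both."""
    ok = ~np.isnan(pred) & ~np.isnan(truth)
    p, t = pred[ok], truth[ok]
    ss_res = np.sum((t - p) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot), float(np.mean(np.abs(t - p)))

# Toy mask: curve 1 climbs one row per column, and is two pixels thick in column 0.
mask = np.zeros((10, 5), dtype=np.uint8)
mask[[2, 3, 4, 5, 6], np.arange(5)] = 1
mask[3, 0] = 1
trace = mask_to_trace(mask, label=1)
r2, mae = r2_and_mae(trace, np.array([2.0, 3.0, 4.0, 5.0, 6.0]))
```

Averaging over the pixels in a column means a curve that is predicted slightly too thick still reads back close to its centre line, which is part of why the value metrics tolerate mask errors that overlap metrics do not.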

The mask, by contrast, is where synthetic-only training shows its seam, and it shows it exactly where thin-structure segmentation always does. Intersection-over-union is excellent on the background class and poor on the curves themselves: under Dice loss the multiclass IoU was 0.94 on background, 0.26 on the first curve, and 0.21 on the second; F1 followed the same shape, 0.97 background against 0.37 and 0.32 on the curves. This is not a synthetic-data failure. It is the geometry of the problem. A curve one or two pixels wide has almost no area, so a single-pixel registration slip annihilates the overlap metric while barely touching the reconstructed value. The literature has said this for years, which is why IoU surrogates like Lovasz-Softmax exist at all. The gap between a strong R-squared and a weak curve-IoU is the signature of thin-structure segmentation, not of sim-to-real.
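The single-pixel slip argument is easy to check numerically. The sketch below builds a made-up one-pixel-wide curve, shifts the prediction down by exactly one row, and scores both the mask overlap and the reconstructed value. The curve shape, raster size, and normalisation to the track height are all illustrative assumptions, not data from the experiments.

```python
import numpy as np

height, width = 480, 3200
cols = np.arange(width)

# Hypothetical ground-truth curve: one pixel per column, a slow sinusoid.
true_rows = (240 + 150 * np.sin(cols / 200.0)).round().astype(int)
pred_rows = true_rows + 1                      # the prediction slips by one pixel

def rows_to_mask(rows):
    mask = np.zeros((height, width), dtype=bool)
    mask[rows, cols] = True
    return mask

truth, pred = rows_to_mask(true_rows), rows_to_mask(pred_rows)
iou = (truth & pred).sum() / (truth | pred).sum()

# Curve value normalised to [0, 1] across the track height.
t = true_rows / (height - 1)
p = pred_rows / (height - 1)
mae = np.abs(t - p).mean()
r2 = 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)

print(f"IoU={iou:.3f}  MAE={mae:.4f}  R2={r2:.4f}")
```

In this toy case the curve IoU is exactly zero, because no predicted pixel lands on a true pixel, while R-squared stays above 0.999 and the absolute error is one pixel out of 479. That is the same divergence between mask and value metrics described above, in its most extreme form.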

Metric	Value	Setting
Peak R-squared (curve reconstruction)	0.9891	Tversky, multiclass
Mean absolute error, curve 1	0.0277	Tversky, multiclass
Mean absolute error, curve 2	0.1241	Tversky, multiclass
IoU, background mask	0.94	Dice, multiclass
IoU, curve 1 mask	0.26	Dice, multiclass
IoU, curve 2 mask	0.21	Dice, multiclass
F1, background	0.97	Dice, multiclass
F1, curve 1 / curve 2	0.37 / 0.32	Dice, multiclass
Synthetic training instances (multiclass)	15,000	rendered, zero hand labels
Training time, multiclass, 50 epochs	550 min	single GPU

The economics are the result that travels. Zero real labels, a renderer that produces 15,000 perfectly masked logs on demand, and a model that reads curves off rasters it never saw in training. Against a public corpus where only about 5.7 percent of the Texas archive is already digitised (7,781 LAS against 136,771 TIF), the cost of closing that gap by hand-tracing is the cost synthetic data deletes outright.

Discussion

So what did the literature get right, and where did we add something? It got the central bet right: when the image is a rendering of known source data, render the supervision. U-Net's deformation-augmentation argument, the document-OCR render-from-source recipe, and the imbalance-loss family all transferred cleanly to scanned well logs without modification. We did not have to invent the idea that synthetic data works for scientific-document segmentation. We had to find where the sim-to-real bridge bends.

It bends at the mask, not the value, and that distinction is the contribution. A team reading this should pick the metric the downstream task is actually graded on. If the deliverable is a digitised LAS curve, grade on R-squared and curve-value error, where synthetic-only training is already strong, and treat curve-IoU as a diagnostic of registration rather than a release gate. If the deliverable is a pixel mask, expect thin-structure IoU to disappoint regardless of how good the data is, and reach for an IoU-surrogate loss before you reach for more renders. The most expensive mistake available here is to read the 0.21 curve-IoU as a verdict on synthetic data when it is a verdict on measuring area-overlap of one-pixel curves.

Limitations

Every number here comes from procedurally generated logs and a fixed rendering and degradation model. A render only covers the corruptions we wrote down. Real scans carry artefacts the renderer never imitates (coffee stains, torn margins, hand annotations across the curve), and those live outside the synthetic support no matter how many logs we generate. Synthetic data closes the labelling gap; it does not close the long tail of real-world degradation. The multiclass results are on a constant two-curve setting; logs with three or more overprinted curves are harder and untested at this fidelity.

Conclusion

The public literature on synthetic and procedurally generated training data for scientific and document image segmentation was right about the thing that mattered: own the rendering process and you own the labels. We confirmed that for raster well-log digitisation with a concrete sim-to-real result: VeerNet trained entirely on rendered logs, R-squared up to 0.9891 reading curves off real scans, 15,000 perfectly masked training instances, and not one hand-traced curve. We also found the seam: thin-structure curve-IoU stays low even when the reconstructed value is excellent, which is a property of segmenting one-pixel curves rather than a failure of synthetic data. The right reading is to grade on the axis the petrophysicist cares about, build the renderer first, and treat the public corpus's 5.7 percent digitisation rate as the size of the prize.

References

[1] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

[2] A. Vaswani et al. Attention Is All You Need. NeurIPS 2017. https://arxiv.org/abs/1706.03762

[3] S. S. M. Salehi et al. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI Workshop, MICCAI 2017. https://arxiv.org/abs/1706.05721

[4] M. Berman, A. R. Triki, M. B. Blaschko. The Lovasz-Softmax Loss. CVPR 2018. https://arxiv.org/abs/1705.08790

[5] S. Woo, J. Park, J.-Y. Lee, I. S. Kweon. CBAM: Convolutional Block Attention Module. ECCV 2018. https://arxiv.org/abs/1807.06521

[6] B. Yuan, Q. Yang. Digitization of Well-Logging Parameter Graphs Based on a Gridlines-Elimination Approach. J. Pet. Explor. Prod. Technol., 2019. http://www.jsoftware.us/show-409-JSW15423.html

[7] D. Hendrycks et al. AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty. ICLR 2020. https://arxiv.org/abs/1912.02781

[8] A. Gupta, A. Vedaldi, A. Zisserman. Synthetic Data for Text Localisation in Natural Images. CVPR 2016 (render-from-source training corpus). https://arxiv.org/abs/1604.06646

[9] A. McDonald. Using the missingno Python Library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science, 2021. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
