Most machine-learning projects begin with a dataset and design a model for it. Raster well-log digitisation begins with the opposite problem: there is no dataset, there is no realistic prospect of one, and the absence of one is the entire reason the project exists. The raw material is decades of scanned paper logs sitting in operator archives, every one of them an image with all the information of the original log and none of its queryability, and not one of them carries a label that says which pixels are the gamma-ray trace and which are the grid line behind it. A segmentation network needs that label for thousands of images. Nobody ever produced it, and producing it by hand at the required scale is precisely the manual bottleneck the whole effort is meant to remove. You cannot annotate your way out of an annotation problem.
So we did not annotate. We built an engine that manufactures the dataset, and this whitepaper is the design of that engine taken seriously as a system rather than as a preprocessing step. The idea is old and well-founded: if you generate the image, you already know everything about it, so the label is free, and if you randomise the generator's nuisance parameters widely enough the real world becomes, to the model, just one more sample from the synthetic distribution [1]. The contribution here is not the idea but the design of the specific generator that makes it work for printed curves on scanned paper, and the discipline that keeps a manufactured corpus honest. The downstream architecture, an encoder-decoder segmentation network in the U-Net lineage [3], is upstream context for this document and not its subject. The subject is the data factory.
Why synthesise the mask instead of annotating one
The reflexive plan is to collect real scans, pay annotators to trace the curves, and train on that. For this target it fails on two counts before cost even enters the conversation.
The first is resolution. The thing the network has to find is a curve trace that, on the source raster, is frequently a single pixel wide. A human annotating that trace clicks a polyline through it, and the polyline is an approximation of a one-pixel feature drawn by a hand that cannot reliably resolve one pixel. The label you buy is systematically fuzzier than the signal it is supposed to describe, and a sparse, hairline target punishes that fuzz harder than any natural-image task would. The second is consistency. Two annotators tracing the same degraded log produce visibly different curves, and a network trained on the union learns the disagreement as much as the curve.
A procedural generator inverts both problems. Because the geometry layer draws the curve, the mask is not an annotation of the image, it is the same parametric object the image was rendered from, recorded at exactly the pixels the renderer set. The label is pixel-exact by construction and perfectly consistent across the entire corpus, because the same code drew every one. You are not approximating a one-pixel feature after the fact, you are the authority that placed it. For a one-pixel-wide target that distinction is the difference between a usable mask and a smeared one, and it is the first reason the engine is worth building rather than buying.
The geometry layer: curves as parametric objects
The geometry layer is where a synthetic log starts. Each curve is a parametric trace laid down across a depth axis, with controlled smoothness, amplitude and crossing behaviour, drawn at the native resolution of the target image rather than resized into it. Drawing at native resolution is not an aesthetic choice. A well log encodes value as horizontal deflection against depth, so the shape of the trace is the signal, and any resampling step that smooths a one-pixel line into its neighbours destroys the very thing the network is trained to recover. The generator therefore renders the curve and rasterises its mask in the same pass at the same resolution, and the two are guaranteed to agree pixel for pixel because they come from one parametric source.
The number of curves on a log is itself a parameter, and it is the parameter that separates the two training regimes. The binary stage drew a single foreground trace against background. The final multiclass stage fixed two curves per log against background, three classes in total, because two crossing curves is where the genuinely hard problem lives: the network must keep two thin traces distinct through the regions where they overlap, and a generator that can place two curves with controlled crossing geometry can manufacture exactly the ambiguous cases a real archive would only supply by luck. The input is a single grayscale channel, matching the scanned-paper source, so the generator never invents colour information the real logs do not have.
The annotation layer: furniture the model must learn to ignore
A clean curve on an empty field is not a scanned log, and a network trained on clean curves learns nothing about the clutter that actually surrounds a real trace. The annotation layer exists to put that clutter back, deliberately. It stamps the printed furniture every real log carries: the grid lines and graticule, the depth ticks and scale numbers, the track separators that divide the log into columns, and the printed curve labels and header text. None of it is decoration. Every element is chosen because it is something the model would otherwise mistake for signal, and the only way to teach a network to ignore a grid line is to show it ten thousand grid lines that are never part of the answer.
The design rule for this layer is that it should be adversarial to the model rather than flattering to the eye. A grid line that runs parallel to a curve, a depth tick that clips the trace, a printed label that sits exactly where a curve passes behind it, these are the cases that break a naive model, so they are exactly the cases the annotation layer is tuned to produce often. Crucially, because the geometry layer already fixed the mask, every annotation the layer adds is by definition background. The grid line is drawn into the image and not into the mask, so the supervisory signal stays clean no matter how busy the render becomes. The image gets harder; the label stays exact. That invariant, that anything the annotation layer touches is guaranteed not to be in the mask, is what lets us make the synthetic images arbitrarily cluttered without ever corrupting the ground truth.
The domain-randomisation layer: knobs, not realism
The geometry and annotation layers can produce one convincing log. They cannot, on their own, produce a dataset that covers the real archive, because a real archive is not one log, it is a distribution of logs printed at different scales by different service companies across decades and scanned on different equipment. Covering that distribution is the job of the domain-randomisation layer, and the design insight that governs it is the one the synthetic-data literature arrived at the hard way: the goal is not to make any single synthetic image photorealistic, it is to make the distribution of synthetic images wide enough that the real distribution falls inside it [1][2]. Deliberately varied, even deliberately unrealistic, synthetic data that spans the nuisance parameters widely beats carefully photorealistic data that spans them narrowly, because breadth is what closes the gap between synthetic and real, not fidelity.
Concretely, the layer exposes a set of knobs, and the discipline is that every knob's range is pinned to a statistic of the real archive rather than chosen for convenience. Image width is randomised from 3,200 to 12,800 pixels and height from 480 to 640 pixels, because that is the dimensional span the real scanned logs actually occupy; the generator is not allowed to draw a log narrower or wider than something the archive contains. The number of curves, the single grayscale channel, and a tunable noise and realism level that controls scan speckle, ink density and degradation are the other knobs. The knob board below is that layer made interactive: dial each parameter and the synthetic log regenerates live, while a counter ticks toward the corpus the breadth manufactures.
The point the knob board makes physically is the point of the whole design. Widen the knobs and the generator covers more of the real distribution per render, which is why breadth, not a labelling budget, is the lever that sets how much of the problem the dataset represents. A generator dialled narrow produces a large pile of nearly-identical logs that teach the network one corner of the archive. A generator dialled wide produces a corpus that spans it. The volume follows almost for free, because every render is pre-labelled, so the only cost of another sample is the compute to draw it.
From knobs to corpus: 20,000 logs, zero hand labels
Run the wide-dialled generator and the dataset materialises at whatever scale the compute budget allows, every instance arriving with its pixel-exact mask attached. The engine manufactured a 20,000-log two-curve multiclass corpus this way, and the final version-two training set drew 15,000 curves from it. The number that matters most is the one that is not in those figures: the count of human-annotated logs underneath them is zero. There was no corpus to start from, and there is none now either, in the sense that no person ever labelled a single training image. The labels were manufactured alongside the pixels, by the same code, at the same resolution, across the same dimensional range the real archive spans.
That is the design payoff stated plainly. A segmentation network that could not have been trained at all, for want of a labelled dataset, is trained on a dataset that did not exist until the generator drew it. The sparse-foreground imbalance the two-curve logs create, two thin traces against a vast background, is the kind of imbalance a Tversky-style objective is built to handle [4], but the imbalance only becomes a tractable training problem because the generator can produce as many hard, crossing-curve examples as the optimiser needs. The data factory does not merely fill a gap left by missing labels; it gives the training process a controllable supply of exactly the cases that are hardest to learn.
Keeping a manufactured corpus honest
A generated dataset has a failure mode that a collected one does not: the network can learn the generator instead of the geoscience. If every synthetic curve is a little too smooth, if the noise is always the same flavour of speckle, if the grid lines always sit on the same pitch, the model will quietly overfit to those tells and then meet a real scan that has none of them. The design discipline that guards against this is the reason the randomisation knobs are bounded by archive statistics rather than by taste, and the reason the annotation layer is tuned to be adversarial rather than tidy. The defence against learning the generator is to make the generator's output as varied, within the real distribution's support, as the real distribution itself.
Three habits hold the line in practice. The first is that every knob range is justified by a measured property of the real archive, so the synthetic distribution cannot drift outside the real one's support, and the width and height bounds are the clearest example. The second is that realism is spent on the things the model can exploit, the clutter and degradation the annotation and noise layers introduce, rather than on cosmetic detail the model never sees. The third is validation against real scans, the held-out reality check that tells you whether the breadth you dialled actually covers the target or merely looks like it should. A synthetic corpus is only as good as its coverage of the real distribution, and coverage is something you verify, not something you assume because the renders look convincing.
What the engine is, and what comes after it
Treated as a system, the procedural log generator is three cooperating layers with one invariant. The geometry layer draws curves as parametric objects and hands back a pixel-exact mask for free. The annotation layer stamps the printed furniture a real scan carries and is guaranteed, by the invariant, never to touch the mask. The domain-randomisation layer sweeps every render across the real archive's dimensional range through knobs bounded by real statistics, and it is the breadth of that sweep, not any labelling budget, that decides how much of the problem the dataset covers. Out of those three layers came a 20,000-log multiclass corpus and a 15,000-curve final training set with no human annotation anywhere in the pipeline.
The levers still open are the obvious extensions of the same design. A richer geometry layer can synthesise more curve types and more deliberately pathological crossings, expanding the catalogue of hard cases the optimiser can be fed on demand. The annotation layer can grow to cover rarer furniture, the hand-written marginalia and stamps and stains that the long tail of the archive carries. And the randomisation knobs can be widened further as new archive statistics are measured, always inside the discipline that a knob's range is a measured property of reality and not a guess. The model is downstream of all of it, and it was only ever trainable because the engine underneath it manufactured, rather than waited for, the labelled dataset the field assumes you already have.
Get the full whitepaper
This page is the long-form summary. The complete whitepaper includes the full layer-by-layer generator specification, the parameter ranges and their archive justifications, the annotation catalogue, the synthetic-versus-real coverage validation, and the corpus-generation throughput numbers.