The Quiet Importance of Dataset Curation

There is a particular kind of luxury that quietly changes how you think about a machine-learning problem, and it arrives the moment your training data becomes free to make. Once a generator can render as many labelled examples as you are willing to wait for, the question that organised the whole project, "how do we get more data?", dissolves and is replaced by a stranger one: of all the data we could trivially make, which should we actually keep? That second question is dataset curation, and it is the part of the work that gets the least attention precisely because it sounds like housekeeping. This piece argues that on a synthetic-data problem it is not housekeeping at all. It is the main lever, and we will trace it through one concrete path, from a 2,000-instance binary set to a deliberately shaped 15,000-instance multiclass set, where the gains came from deciding what to keep rather than from any further change to the model.

We want to be careful about whose idea this is. The framing that the dataset, not the architecture, is where the next gain often hides was sharpened and popularised by Andrew Ng and collaborators, who argued for a move from model-centric to data-centric thinking [1]. The narrower claim we are leaning on here, that curation specifically can beat raw growth, has its own careful literature, and we credit it below rather than presenting it as a discovery. Our contribution is not a curation algorithm and not a theorem. It is one worked example of the discipline, with the real numbers attached, on a problem where the synthetic-data tap was wide open and the only interesting decision left was what to do with the flow.

Free data turns the question inside out

For most of the history of supervised learning, data has been the binding constraint, so the field built its intuitions around scarcity. You collect what you can, you label what you can afford, and you make the model work as hard as possible on the little you have. Synthetic generation inverts those economics. The problem we worked on was raster well-log digitisation: recovering the thin ink traces of logging curves from a scanned paper log as machine-readable signals. Real labelled scans are scarce and expensive, but a renderer that draws its own curves knows the masks exactly, so synthetic logs come pre-labelled and effectively unlimited. The constraint moved. It was no longer how much labelled data we could obtain; it was which of the effectively unlimited labelled examples we should put in front of the model.

That inversion matters because the instinct trained by scarcity ("when in doubt, add more") is actively misleading once data is free. More synthetic data is not automatically more information. If the generator keeps drawing logs that look like ones the model has already mastered, the extra examples add training time and storage and teach almost nothing. The synthetic-data survey literature makes this point with care: generated data substitutes for collected data only to the extent that it carries the variation that matters, and past that point additional volume is inert [2]. So the moment generation became free, the project's real difficulty migrated from acquisition to selection, and selection is curation by another name.

The literature on keeping less

The idea that you can keep less and lose nothing, or even gain, is not folklore; it has been measured. Work on example forgetting tracked which training examples a network learns once and never misjudges again versus which it repeatedly gets wrong across epochs, and found that a large fraction of the benchmark sets studied falls in the first group, learned early and thereafter redundant, while a smaller, harder minority carries most of the signal [4]. The practical reading is that a training set is not uniform in value, and a pruned set weighted toward the informative minority can match a full one. Separate measurements on image-classification datasets put a number on the redundancy problem directly, showing that at least a tenth of standard benchmark sets such as CIFAR-10 and ImageNet can be removed as semantically redundant with negligible cost [5].
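The forgetting statistic described above can be sketched in a few lines. This is a minimal illustration, not the code from [4] and not anything we ran on the well-log set: the per-epoch correctness records, the example ids, and the choice to group never-learned examples with the forgettable ones are all assumptions made for the sketch.

```python
from typing import Dict, List


def count_forgetting_events(correct_by_epoch: List[bool]) -> int:
    """Count epochs where an example goes from classified correctly to incorrectly."""
    return sum(
        1
        for previous, current in zip(correct_by_epoch, correct_by_epoch[1:])
        if previous and not current
    )


def split_by_forgetting(history: Dict[str, List[bool]]) -> Dict[str, List[str]]:
    """Partition example ids into unforgettable and forgettable groups.

    Assumption for this sketch: an example that is never learned is grouped
    with the forgettable ones, since it is clearly not redundant.
    """
    groups: Dict[str, List[str]] = {"unforgettable": [], "forgettable": []}
    for example_id, record in history.items():
        learned_at_least_once = any(record)
        if learned_at_least_once and count_forgetting_events(record) == 0:
            groups["unforgettable"].append(example_id)
        else:
            groups["forgettable"].append(example_id)
    return groups


# Hypothetical records: one entry per epoch, True when the example was correct.
history = {
    "log_a": [False, True, True, True],      # learned once, never forgotten
    "log_b": [True, False, True, False],     # forgotten twice
    "log_c": [False, False, False, False],   # never learned
}
groups = split_by_forgetting(history)
```

A pruning rule in this spirit would drop examples from the unforgettable group first, since they are the ones the model stopped learning from early.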

The strongest version of the claim is more recent and more pointed. Analysis of neural scaling laws showed that the usual power-law return on dataset size, where each doubling buys a smaller and smaller improvement, can be beaten outright by pruning: keep the right examples and drop the rest, and the error falls faster than raw size would predict, because you have spent your data budget on the examples that actually move the model [3]. That result reframes curation from a tidy-up into a first-class lever. It says the shape of the kept set, not merely its size, sets the ceiling. Everything we did on our problem is a small, concrete instance of that principle, and we make no claim to have improved on it.
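The diminishing return can be written in a generic form. The notation here (test error \varepsilon, dataset size N, constant a, exponent \nu) is assumed for illustration; it is not a formula quoted from [3] and carries no values from our problem.

```latex
\varepsilon(N) \approx a\,N^{-\nu}, \qquad a > 0,\ \nu > 0
\quad\Longrightarrow\quad
\frac{\varepsilon(2N)}{\varepsilon(N)} = 2^{-\nu},
\qquad
\varepsilon(N) - \varepsilon(2N) = a\,N^{-\nu}\left(1 - 2^{-\nu}\right).
```

Each doubling multiplies the error by the same factor 2^{-\nu}, so the absolute improvement it buys shrinks as N grows. The pruning result is that, given a good enough ranking of which examples to keep, error as a function of kept-set size can fall faster than this power law allows.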

One curation path, with the numbers attached

Here is the path, stated so the curation decisions are legible. The first phase of the work used a binary version of the task, foreground curve against background, trained on a 2,000-instance set. That was enough to show the backbone could learn the shape of a curve at all, but it was a deliberately narrow set, and its narrowness was the lesson. To take on the harder multiclass version, three classes covering the background and two distinct curves, we did not simply scale the same generator up and ship whatever came out. The generator produced a raw pool of twenty thousand synthetic logs, and the set we actually trained on was fifteen thousand. The gap between those two numbers is the curation, and it is the part of the story we most want to make visible.

Figure (interactive): A curation funnel that traces how a raw pool of generated logs is reshaped, not merely grown, into a kept training set. Two regimes share one funnel: the binary segmentation set of 2,000 instances, and the multiclass route where 20,000 synthetic logs were generated and curated down to 15,000 training instances across three classes, each log carrying the same two constant curves. The funnel runs from generated, through the logs that survive a deduplication pass, to the kept set, to the 80 percent train share against the 20 percent validation share. Switch regime and drag the dedup lever to sweep how aggressively near-identical logs are pruned; the kept count is anchored to the sourced figure while a held peak R-squared of 0.9891 stays put, which is the whole argument: accuracy held while the set was reshaped. The generated, kept, split, curve-count, class-count and R-squared figures are sourced from the engagement; the per-step dedup proportions are an illustrative pruning model and are flagged as such.

The funnel above is the path drawn as a sequence of decisions rather than a single number. The widest band is the raw generated pool. The next is what survives a deduplication pass that removes near-identical logs, the synthetic equivalent of the redundancy the image-dataset work measured [5]; there is no point spending training steps on the thousandth log that differs from the nine-hundred-and-ninety-ninth only in noise. Below that sits the kept set, fifteen thousand instances, and below that the eighty percent that became training data against the twenty percent held back for validation. Switch the regime between the binary 2,000 set and the multiclass 15,000 set, and drag the dedup lever, and the count at each stage moves, but the held accuracy read-out stays put. That is the argument the instrument is built to make: the set was reshaped on the way down, and the accuracy survived the reshaping.
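The two lower stages of the funnel, deduplication and the 80/20 split, can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the pass used in the engagement: representing a log as a flat list of curve samples, the quantised-signature rule, the `tolerance` value and the toy pool are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def signature(curve: Sequence[float], tolerance: float) -> Tuple[int, ...]:
    """Quantise samples so that logs differing only by small noise share a key."""
    return tuple(round(value / tolerance) for value in curve)


def deduplicate(pool: List[Sequence[float]], tolerance: float) -> List[Sequence[float]]:
    """Keep the first log seen for each quantised signature, in pool order."""
    seen = set()
    kept = []
    for curve in pool:
        key = signature(curve, tolerance)
        if key not in seen:
            seen.add(key)
            kept.append(curve)
    return kept


def train_val_split(items, train_share: float = 0.8, seed: int = 0):
    """Shuffle reproducibly, then cut into train and validation shares."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_share))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy pool: the second and fifth logs are noisy copies of others.
pool = [
    [0.10, 0.50, 0.90],
    [0.11, 0.51, 0.89],
    [0.90, 0.50, 0.10],
    [0.40, 0.40, 0.40],
    [0.41, 0.39, 0.40],
]
kept = deduplicate(pool, tolerance=0.1)

# The split arithmetic on a kept set of 15,000 instance ids.
train_ids, val_ids = train_val_split(range(15000))
```

Quantising is a blunt rule: two logs that sit on either side of a bin edge are treated as distinct even when they are close, so a real pass would more likely use a distance threshold. The split on 15,000 kept instances gives 12,000 for training and 3,000 for validation.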

A detail inside that funnel is easy to miss and was quietly decisive: composition. Every log in the final multiclass set carries the same two constant curves by construction, so the curation was not only about how many logs to keep but about fixing what each log should contain. A naive generator might have varied the number of curves per log, the easy thing to randomise, and produced a set that was diverse in a dimension we did not care about and thin in the dimensions we did. Deciding that every kept log would hold exactly two curves was a curation choice as real as the deduplication, and it shaped the set toward the multiclass distinction the model actually had to learn. Curation is selection in two directions at once: which examples survive, and what each surviving example is made of.
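The composition rule can be expressed as a check on each log's label mask. The sketch below assumes a per-pixel integer mask and hypothetical class ids (0 for background, 1 and 2 for the two curves); the `SyntheticLog` type and these ids are illustrative, not the renderer's actual data model.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical class ids for the three-class task.
BACKGROUND, CURVE_A, CURVE_B = 0, 1, 2
REQUIRED_CLASSES = {BACKGROUND, CURVE_A, CURVE_B}


@dataclass
class SyntheticLog:
    log_id: str
    mask: List[List[int]]  # per-pixel class labels, row by row


def classes_present(log: SyntheticLog) -> Set[int]:
    """Collect every class id that appears anywhere in the mask."""
    return {label for row in log.mask for label in row}


def has_fixed_composition(log: SyntheticLog) -> bool:
    """True only when the log holds background plus exactly the two curves."""
    return classes_present(log) == REQUIRED_CLASSES


def filter_by_composition(logs: List[SyntheticLog]) -> List[SyntheticLog]:
    """Drop logs whose content does not match the fixed composition."""
    return [log for log in logs if has_fixed_composition(log)]


logs = [
    SyntheticLog("two_curves", [[0, 1, 0, 2], [0, 1, 2, 0]]),
    SyntheticLog("one_curve", [[0, 1, 0, 0], [0, 1, 0, 0]]),
]
kept_logs = filter_by_composition(logs)
```

In practice it is cheaper to enforce the rule inside the generator than to filter afterwards; the filter form is shown only because it makes the constraint explicit.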

What held, and what we are not claiming

The reason we trust that the reshaping was sound is that accuracy held through it. Across the multiclass examples the model reached a peak coefficient of determination of 0.9891 on the reconstructed curve values, and it reached it on the curated fifteen-thousand-instance set rather than on the raw twenty-thousand pool. We are deliberately not claiming that the smaller set beat the larger one in a controlled head-to-head; that would require an ablation we did not run, and the honest statement is weaker and still useful. The honest statement is that a set that was pruned by a quarter and fixed to a deliberate composition reached the accuracy we needed, which means the discarded quarter was not carrying the result. On a free-data problem that is the operative fact, because storage and training time are not free even when labels are. The path from two thousand to fifteen thousand was not a growth curve; it was a selection.
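For readers who want the metric spelled out, the coefficient of determination follows the standard definition below. This is a generic sketch of that definition; how the reconstructed curve values were sampled and aggregated in the engagement is not described here, and the toy values are made up.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


# Made-up curve values, only to show the two reference points of the scale.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A perfect reconstruction scores 1 and a prediction no better than the mean scores 0, which is the scale on which 0.9891 should be read.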

We also want to be precise about register, because this is a blog crediting a field rather than a method paper. The principle that curation can beat raw size is established work, measured by others on classification and scaling-law benchmarks [3][4][5]; the encoder-decoder lineage our backbone sits in is the U-Net family, designed for dense labels from scarce data [6]; the data-centric framing that made any of this feel like the main event rather than a footnote is Ng's [1]. What is ours is narrow and specific: the renderer, the decision to deduplicate the synthetic pool, the decision to fix two curves per log, and the resulting fifteen-thousand-instance set that the VeerNet backbone trained on. None of those is a contribution to the science of curation. All of them are an instance of taking that science seriously on a real industrial problem.

A different reflex for the free-data era

The habit worth carrying out of this is small and runs against the grain of a decade of scarcity thinking. When data is expensive, the discipline is to extract every drop of value from what little you have. When data is free, the discipline flips: the value is no longer in making more, it is in refusing most of what you could make. A generator that never says no produces a set that is large, redundant, and shaped by whatever was easiest to vary rather than by what the model needs to learn. The skilled move is to treat generation and curation as one loop, where every example earns its place in the kept set or is dropped, and where the composition of each example is a design decision rather than an accident of the renderer. We did that once, on one problem, from two thousand to fifteen thousand, and the quiet result is that the curation, not a further model tweak, is what carried the work.
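Treating generation and curation as one loop can be sketched as an accept-or-drop pass over a candidate stream. The function name, the check signature and the toy integer candidates are hypothetical, chosen only to show the control flow; real checks would be things like a composition test and a novelty test against the kept set.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
Check = Callable[[T, List[T]], bool]


def curate(
    candidates: Iterable[T],
    checks: List[Check],
    target: int,
    max_draws: int = 100_000,
) -> List[T]:
    """Keep a candidate only if every check passes; stop at the target size.

    max_draws bounds the loop so an endless generator cannot spin forever
    when the checks reject almost everything.
    """
    kept: List[T] = []
    for draws, candidate in enumerate(candidates):
        if len(kept) >= target or draws >= max_draws:
            break
        if all(check(candidate, kept) for check in checks):
            kept.append(candidate)
    return kept


# Toy stand-ins: integers play the role of logs.
def is_even(candidate: int, kept: List[int]) -> bool:
    return candidate % 2 == 0          # stands in for a composition rule


def is_novel(candidate: int, kept: List[int]) -> bool:
    return candidate not in kept       # stands in for a deduplication rule


kept_set = curate([1, 2, 2, 3, 4, 4, 6, 8], [is_even, is_novel], target=3)
```

Each check sees the kept set so far, which is what lets the same loop say no to a candidate that would have been accepted earlier in the run.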

Key takeaways

When training data becomes free to generate, the binding constraint moves from acquisition to selection: the scarce decision is no longer how much data to make but which of the infinite examples to keep. That decision is dataset curation.
More synthetic data is not automatically more information. Generated data only helps to the extent it carries variation the model has not already mastered; past that point extra volume is inert, a point the synthetic-data survey literature makes plainly (Nikolenko).
Keeping less can cost nothing, or even help: example-forgetting work (Toneva et al.) shows much of a set is redundant while a minority carries the signal, semantic-redundancy studies (Birodkar et al.) quantify the removable fraction, and data-pruning analysis (Sorscher et al.) shows careful selection can beat the power-law return on raw size. We credit this lineage rather than claiming it.
One concrete path: a 2,000-instance binary set, then a multiclass route where 20,000 synthetic logs were generated and curated down to 15,000 kept instances, split 80/20 for train and validation. The gap between 20,000 and 15,000 is the curation, and it is the part of the work that mattered.
Curation is selection in two directions: which examples survive deduplication, and what each surviving example contains. Fixing every kept log to the same two constant curves shaped the set toward the multiclass distinction the model had to learn, and a peak R-squared of 0.9891 held on the curated set, not the raw pool.

References

[1] Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI (2021). The talk that reframed the next gain as a property of the dataset rather than the network. https://www.youtube.com/watch?v=06-AZXmwHjo

[2] Nikolenko, S. I. Synthetic Data for Deep Learning. Springer (2021). A survey of when generated data substitutes for collected data, and the diversity conditions under which it stops helping. https://arxiv.org/abs/1909.11512

[3] Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. Beyond neural scaling laws: beating power law scaling via data pruning. NeurIPS (2022). Shows that careful pruning of which examples to keep can break the power-law return on raw dataset size. https://arxiv.org/abs/2206.14486

[4] Toneva, M., Sordoni, A., Combes, R. T. des, Trischler, A., Bengio, Y., and Gordon, G. J. An Empirical Study of Example Forgetting during Deep Neural Network Learning. ICLR (2019). Finds that many training examples are redundant and removable, while a minority are repeatedly forgotten and carry the signal. https://arxiv.org/abs/1812.05159

[5] Birodkar, V., Mobahi, H., and Bengio, S. Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need. arXiv (2019). Measures how much of a training set is near-duplicate and removable. https://arxiv.org/abs/1901.11409

[6] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder lineage our backbone builds on, designed to learn dense labels from scarce data. https://arxiv.org/abs/1505.04597

