What Data-Centric AI Actually Means Outside the Hype

For most of the last decade the implicit recipe for a stuck model was to reach for a bigger or cleverer one: a deeper backbone, a fresh attention block, another week of architecture search. The data-centric-AI movement is the argument that the recipe has the order wrong, and that for a great many real problems the next gain in performance is sitting in the dataset rather than the network. This piece credits that argument, traces where it came from, and then adds one small, concrete data point to it from our own work on digitising raster well logs. The headline of that data point is simple: with our architecture held fixed, the dataset was the lever.

We want to be careful about authorship up front, because the idea is not ours. The framing was sharpened and popularised by Andrew Ng and collaborators, who put it most bluntly in a 2021 talk arguing for a move from model-centric to data-centric AI [1], and then made it falsifiable with a competition in which the model was frozen and entrants could only edit the training data [2]. That second move matters more than the slogan, because a competition where you are forbidden from touching the architecture is a clean experiment in exactly the claim being made. Our contribution here is not the idea and not a new method; it is one more run that happens to land on the same side of the question.

What the movement actually claims

It is worth stating the claim precisely, because the hype around it tends to inflate it into something it never said. Data-centric AI does not claim that architecture is irrelevant, that you should stop reading model papers, or that any dataset beats any network. It claims something narrower and more useful: that on many production problems the model has already saturated against the information present in a noisy or small dataset, so the highest-leverage engineering hour is spent improving the data rather than the network. Improving the data can mean fixing labels, balancing classes, removing leakage, or generating more of the cases the model is weakest on. The test of the claim is operational: hold the model fixed and improve only the data. If the metric still moves, the data was the binding constraint.
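The operational test can be sketched in a few lines of Python. This is a toy under stated assumptions, not the competition's protocol or our pipeline: a one-feature threshold classifier stands in for the frozen model, and the function names (`fit_threshold`, `accuracy`, `data_was_binding`), the `min_gain` cutoff, and every number in the datasets are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Example = Tuple[float, int]  # (feature, label); labels are 0 or 1


def fit_threshold(train: List[Example]) -> float:
    """The frozen 'model': the midpoint between the two class means."""
    neg = [x for x, y in train if y == 0]
    pos = [x for x, y in train if y == 1]
    return (mean(neg) + mean(pos)) / 2


def accuracy(threshold: float, data: List[Example]) -> float:
    """Fraction of examples whose label matches 'feature > threshold'."""
    hits = sum(1 for x, y in data if int(x > threshold) == y)
    return hits / len(data)


def data_was_binding(baseline: List[Example], improved: List[Example],
                     held_out: List[Example], min_gain: float = 0.01):
    """Same learner, same held-out set; only the training data differs."""
    before = accuracy(fit_threshold(baseline), held_out)
    after = accuracy(fit_threshold(improved), held_out)
    return (after - before) >= min_gain, before, after


# Hypothetical data: the true boundary sits near 5.
held_out = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
            (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
# The last two training labels are wrong in the noisy set...
noisy = [(1.0, 0), (3.0, 0), (7.0, 1), (9.0, 1), (8.0, 0), (9.0, 0)]
# ...and corrected in the cleaned set. Nothing else changes.
cleaned = [(1.0, 0), (3.0, 0), (7.0, 1), (9.0, 1), (8.0, 1), (9.0, 1)]

binding, before, after = data_was_binding(noisy, cleaned, held_out)
print(binding, before, after)
```

The learner and the held-out set are identical across both calls, so any gain can only come from the label fix. That is the whole experiment in miniature.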

The reason this lands is that the supporting evidence is not just rhetorical. Work on the data side of high-stakes systems documented how undervaluing data work produces what the authors call data cascades, compounding downstream failures that no amount of modelling rescues [3]. Separately, an audit of widely used benchmarks found pervasive label errors in their test sets, enough to reorder model rankings, which means some of the architecture progress the field celebrated was partly noise in the measuring stick [4]. Put those together and the data-centric position stops sounding like a slogan and starts sounding like a correction: if your labels are wrong and your benchmark is wrong, the leaderboard climb you are chasing may be an artefact.
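To make the measuring-stick point concrete, here is a toy illustration in Python of how a couple of wrong test labels can reorder two models. The labels and predictions are invented for the example and are not taken from the audit in [4]; `model_a`, `model_b`, and both label lists are hypothetical.

```python
from typing import Dict, List


def accuracy(pred: List[int], labels: List[int]) -> float:
    """Fraction of predictions that agree with the given labels."""
    return sum(p == y for p, y in zip(pred, labels)) / len(labels)


def ranking(preds: Dict[str, List[int]], labels: List[int]) -> List[str]:
    """Model names ordered best-first by accuracy against 'labels'."""
    return sorted(preds, key=lambda name: accuracy(preds[name], labels),
                  reverse=True)


# Ten hypothetical test items. The first two published labels are wrong.
corrected    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
as_published = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1]

preds = {
    # model_a gets the two mislabelled items right and is marked wrong for it
    "model_a": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    # model_b reproduces the labelling mistakes and is rewarded for it
    "model_b": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
}

print(ranking(preds, as_published))  # leaderboard as scored on noisy labels
print(ranking(preds, corrected))     # leaderboard after the labels are fixed
```

On the published labels `model_b` leads 0.9 to 0.6; on the corrected labels `model_a` leads 0.8 to 0.7. Neither model changed between the two leaderboards, only the test labels did.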

Our problem made the experiment unavoidable

We did not set out to test the data-centric thesis. We arrived at it because the problem forced the issue. The task was digitising raster well logs: taking a scanned image of a paper log and recovering the curves on it as machine-readable signals. The foreground we care about, an ink trace one to three pixels wide, is a sliver of the frame, and real labelled scans are scarce and expensive. That combination, a thin target and little labelled data, is precisely the regime where a bigger network stops helping, because there is not enough labelled signal for the extra capacity to learn from. The encoder-decoder lineage we built on, the U-Net family of symmetric networks with skip connections [5], was designed from the start to learn dense labels from scarce data, so the architecture was never the obvious place to find more headroom.

The place we did find headroom was the data, and specifically synthetic data: generated logs where we control the curves and therefore own perfect labels. The survey literature on synthetic data is clear-eyed that generated data substitutes for collected data only when the generator captures the variation that matters, and stops helping otherwise [6]. So this was not a free lunch; building a generator that produced logs varied enough to transfer was real work. But it was data work, and the bet was that data work would pay better than model work on this problem. That bet is the data-centric thesis applied to one task.
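The phrase "own perfect labels" is easiest to see in code. The sketch below is a deliberately tiny stand-in written for illustration, not the generator used in this work: it draws one random-walk curve down a blank frame and writes the same pixels into the image and the label mask, then adds speckle to the image only. The function name `synth_log`, its parameters, and the 64 by 64 frame are all assumptions.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]


def synth_log(height: int = 64, width: int = 64, trace_px: int = 2,
              seed: int = 0) -> Tuple[Grid, Grid]:
    """Return (image, mask) for one synthetic single-curve log.

    image: 0 = paper, 1 = ink (curve plus speckle noise)
    mask:  0 = background, 1 = curve, exact by construction
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    x = width // 2
    for row in range(height):  # depth runs down the rows
        # Random-walk the curve one pixel left, right, or straight on,
        # clamped so the full trace width stays inside the frame.
        x = min(max(x + rng.choice((-1, 0, 1)), 0), width - trace_px)
        for dx in range(trace_px):
            image[row][x + dx] = 1
            mask[row][x + dx] = 1

    # Scanner-style speckle goes into the image only, never the label.
    for _ in range(height * width // 100):
        image[rng.randrange(height)][rng.randrange(width)] = 1
    return image, mask


image, mask = synth_log()
foreground = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
print(f"foreground fraction: {foreground:.4f}")
```

With these defaults the curve covers 128 of 4,096 pixels, about 3% of the frame, which is the thin-target imbalance in miniature. A generator that transfers to real scans would also need the variation this toy omits, such as grid lines, overlapping curves, annotations, and scan artefacts.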

Holding the model fixed so the dataset is the only thing that moves

Here is the run, stated so the experiment is legible. We kept one backbone throughout, the encoder-decoder we call VeerNet: a 5-encoder, 5-decoder convolutional network, trained on the same 50-epoch budget. We did not swap that architecture between the two phases. What we changed was the dataset. The first phase trained on a 2,000-instance set for a binary version of the task, foreground curve against background. The second phase trained the same backbone on 15,000 curated synthetic instances for a harder multiclass version, three classes covering the background and two distinct curves. The architecture was a constant; the data was the variable.
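One way to keep a run like this honest is to write it down as configuration and let the code confirm what changed. The sketch below records the figures stated above; the class and field names (`ModelConfig`, `Phase`, `changed_fields`) are hypothetical and do not describe the actual training code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelConfig:
    name: str
    encoder_blocks: int
    decoder_blocks: int
    epochs: int


@dataclass(frozen=True)
class Phase:
    label: str
    model: ModelConfig
    train_instances: int
    num_classes: int  # a property of the task and its labels


# One backbone object shared by both phases, so it cannot drift.
VEERNET = ModelConfig("VeerNet", encoder_blocks=5, decoder_blocks=5, epochs=50)

PHASE_1 = Phase("binary", VEERNET, train_instances=2_000, num_classes=2)
PHASE_2 = Phase("multiclass", VEERNET, train_instances=15_000, num_classes=3)


def changed_fields(a: Phase, b: Phase) -> List[str]:
    """Names of the experiment settings that differ between two phases."""
    fields = ("model", "train_instances", "num_classes")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


scale = PHASE_2.train_instances / PHASE_1.train_instances
print(changed_fields(PHASE_1, PHASE_2), scale)
```

The check reports that only the dataset size and the class count differ, and that the training set grew by a factor of 7.5. It also makes the caveat visible: two things moved on the data side, not one.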

A two-pan balance that weighs the two ways to move the same raster-log digitisation task: change the model, or change the data. The model pan is frozen at the architecture VeerNet kept fixed across the run (a 5-encoder / 5-decoder backbone trained for 50 epochs); the data pan starts empty. Toggle the data moves we actually made onto it, the 2,000-instance binary baseline, the 15,000-instance curated synthetic set, and its 3 multiclass classes, and the beam tips toward data, because in this run it was the dataset and not the backbone that moved the work. The instance counts, class count, encoder and decoder depth, and epoch budget are sourced from the engagement archive; the beam-tilt is an illustrative reading of where the leverage sat, not a measured torque.

The instrument above is a balance, and the point of drawing it as a balance is to make the asymmetry between the two pans visible. The model pan is loaded once and never again: the 5-encoder, 5-decoder backbone and the 50-epoch budget, frozen. The data pan is where all the movement happens, and you can load it yourself with the 2,000-instance baseline, the 15,000-instance curated set, and the three multiclass classes, then watch the beam tip. What the picture is trying to argue, and what our run found, is that across the two phases the dataset was the side that carried the weight. The backbone that ran the harder multiclass task in phase two is the same backbone that ran the simpler binary task in phase one; it was not a better network that moved the work but a better dataset.

We should name the limits of this honestly, because a balance is a clean picture of a messier truth. The two phases were not the same task at two data sizes; the second was strictly harder, three classes against two, so the comparison is not a controlled A versus B on a single objective. The tilt in the instrument is an illustrative reading of where the leverage sat in this run, not a measured torque, and we have labelled it as such on the exhibit. What we can defend is the structural fact: the architecture was held constant while the dataset was rebuilt and scaled more than sevenfold, and it was that rebuild, not an architecture change, that let the same network take on the harder problem at all. That is one data point, and it points the same direction the movement does.

Why this is a data point and not a proof

It would be easy, and wrong, to inflate a single engagement into a law. One run on one problem does not prove that data beats architecture in general, and the data-centric literature itself is careful not to make that claim [1]. What a single run can do is one of two things: it can contradict a thesis, or it can add to the pile of evidence consistent with it. Ours does the second. It is most useful read alongside the cleaner experiments, namely the frozen-model competition where only data could change [2] and the benchmark audit that found the measuring stick itself was noisy [4], because those establish the general pattern and ours is a worked instance of it in an industrial setting with a genuinely thin foreground.

The reason we think the instance is worth publishing anyway is that data-centric AI is easiest to nod along with in the abstract and hardest to act on in practice. The abstract version, improve your data, is agreeable and useless. The concrete version is a discipline: decide which is your binding constraint before you spend the engineering hour, hold the thing you suspect is not the constraint fixed, and move only the thing you suspect is. On our problem, holding the backbone fixed and moving the dataset is what produced progress, and the value of saying so plainly is that it is a template anyone with a scarce-data, thin-target problem can copy.

The rule we would hand to the next team

If there is a portable lesson here, it is not a number but a habit. Before reaching for a bigger model, ask whether the model is actually saturated against the data you have. The way to answer that is to freeze the model and try to move the metric from the data side alone. If the metric moves, you have your answer and your roadmap; the data was the constraint and the data is where the work is. If it does not move, you have learned something equally useful and can go spend your architecture budget with a clear conscience. The data-centric movement gave the field permission to take that first question seriously [1], the surrounding evidence showed why the question is not rhetorical [3][4], and our raster-log run is one more case where the honest answer was: the lever was the dataset.

Key takeaways

The data-centric-AI movement, credited to Andrew Ng and collaborators, claims that on many production problems the next gain comes from improving the dataset rather than the architecture; it made the claim falsifiable with a competition where the model was frozen and only the data could change.
The thin-curve, scarce-label nature of raster well-log digitisation is exactly the regime where a bigger network stops helping, so the dataset, not the backbone, was the obvious place to look for headroom.
We held one 5-encoder / 5-decoder backbone and a 50-epoch budget fixed across both phases and changed only the data: a 2,000-instance binary set, then 15,000 curated synthetic instances for a 3-class multiclass task. The architecture was the constant; the dataset was the variable.
The same frozen backbone took on the strictly harder 3-class task only after the dataset was rebuilt and scaled more than sevenfold, so it was a better dataset, not a better network, that moved the work. This is one supporting data point, not a general proof.
The portable habit: before reaching for a bigger model, freeze the model and try to move the metric from the data side alone. If it moves, the data was your binding constraint and your roadmap.

References

[1] Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI (2021). The talk that named the shift toward improving the dataset rather than the model. https://www.youtube.com/watch?v=06-AZXmwHjo

[2] Ng, A., Laird, D., and He, L. Data-Centric AI Competition. DeepLearning.AI and Landing AI (2021). A benchmark with the model fixed, where only the training data could be edited. https://https-deeplearning-ai.github.io/data-centric-comp/

[3] Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., and Aroyo, L. Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI. CHI (2021). Field evidence that undervalued data work compounds into downstream failures. https://research.google/pubs/pub49953/

[4] Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets and Benchmarks (2021). Measured label noise in the benchmarks the field reports against. https://arxiv.org/abs/2103.14749

[5] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder that learns dense labels from scarce data. https://arxiv.org/abs/1505.04597

[6] Nikolenko, S. I. Synthetic Data for Deep Learning. Springer (2021). A survey of when generated data substitutes for collected data, and where it stops working. https://arxiv.org/abs/1909.11512
