The Data-Centric Turn in Computer Vision: A Survey of Dataset-First Methodology

Abstract

This survey examines the data-centric turn in computer vision, the methodological shift that treats systematic iteration on the dataset, rather than search over the space of architectures, as the highest-leverage path to moving a model on a real task. We organise the literature available through the survey quarter into three distinct threads and credit each to its authors: an explicit reframing of the engineering question, accompanied by a competition that froze the model so that only the data could change [1][2]; a body of field study documenting how undervalued data work compounds into downstream failure [3]; and a strand of benchmark auditing that found the test sets the field reports against carry enough label noise to reorder model rankings [4]. We argue that these threads converge on a single operational test: hold the model fixed and try to move the metric from the data side alone. We then locate a synthetic-data-first segmentation program on that map, drawing on one measured result from our own raster-log digitisation work, where a single fixed encoder-decoder configuration (128 feature channels, 5 encoder stages, 5 decoder stages, 2 attention layers) reached an R-squared of 0.9891 once the generated training set was scaled from 2,000 to 15,000 instances. The synthetic-first program is a recognisable instance of the surveyed methodology, not a new branch of it.

What the turn is, and what it is not

The phrase data-centric AI is recent, but the practice it names is older than the label, and a survey is most useful when it separates the claim from the hype that has accreted around it. The strong, indefensible version, that architecture no longer matters and any larger dataset beats any better network, is not what the literature asserts. The version the literature actually defends is narrower and more demanding: on a great many production problems, the model has already saturated against the information present in a small or noisy dataset, so the marginal engineering hour returns more when it is spent improving the data than when it is spent improving the network. The reframing was sharpened and made public in a 2021 talk that put the contrast between model-centric and data-centric work in plain terms [1]. What gave the reframing teeth was a competition built around it, in which the model was frozen and competitors were permitted to edit only the training data [2]. That design matters more than the slogan, because a contest that forbids touching the architecture is a clean experiment in exactly the claim being made: if the leaderboard still moves when nobody can change the network, the data was the binding constraint.

It is worth being precise about whose idea this is. The framing, the competition, and the early evangelism belong to Andrew Ng and collaborators, and the surrounding evidence belongs to the research groups cited throughout this survey. Our role in what follows is not to claim the methodology. It is to survey it accurately and then add a single worked instance from an industrial setting, so that a reader can see what the abstract argument looks like when it meets a real, scarce-label problem.

Three threads that converge on one test

Read closely, the data-centric literature is not one argument but three, arriving from different directions and meeting at the same operational prescription. Keeping them separate is what lets a survey avoid collapsing the movement into a mood.

The first thread is the explicit reframing already described, whose contribution is to change the default question an engineer asks when a model is stuck, from which architecture next to which property of the data is limiting [1][2]. The second thread is empirical and sociological. A field study of high-stakes deployed systems documented what its authors named data cascades, the compounding downstream failures that follow from data work being undervalued upstream, and it did so by interviewing the practitioners who lived through those failures rather than by theorising about them [3]. The lesson of that thread is that the cost of neglected data is not paid at training time, where it is invisible, but in production, where it is expensive and hard to attribute. The third thread is the audit of the measuring stick itself. An examination of the test sets behind several widely used benchmarks found pervasive label errors, and crucially showed that correcting them was enough to reorder the published ranking of models [4]. The implication is uncomfortable and important: some of the architecture progress the field celebrated may have been partly noise in the ruler, which means a portion of the model-centric leaderboard climb was an artefact of bad labels rather than a real gain in capability.
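The ranking effect described in the third thread can be made concrete with a toy example. The sketch below is not the audit's method or data [4]: the two models, their predictions, and the ten labels are all invented for illustration. Its only purpose is to show that two flipped test labels are enough to swap which model scores higher.

```python
# Toy illustration: label errors in a test set can reorder a model ranking.
# All labels and predictions here are invented; nothing is taken from [4].

def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def rank_models(model_predictions, labels):
    """Return model names sorted from best to worst accuracy, plus the scores."""
    scores = {name: accuracy(p, labels) for name, p in model_predictions.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Corrected labels, and a noisy copy with two errors (indices 2 and 7).
CORRECTED = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
NOISY     = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

MODEL_PREDICTIONS = {
    # model_a agrees with the noisy labels on the two mislabelled items.
    "model_a": [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    # model_b is right on the two mislabelled items and wrong on one clean item.
    "model_b": [0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
}

ranking_noisy, scores_noisy = rank_models(MODEL_PREDICTIONS, NOISY)
ranking_corrected, scores_corrected = rank_models(MODEL_PREDICTIONS, CORRECTED)

print("noisy test set:    ", ranking_noisy, scores_noisy)
print("corrected test set:", ranking_corrected, scores_corrected)
```

On the noisy labels model_a scores 0.9 against model_b's 0.7; on the corrected labels the scores swap. Neither model changed between the two evaluations, only the ruler did.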

One further piece of period-correct infrastructure belongs in the survey because it made the data a first-class, inspectable artefact rather than an unmarked input. A documentation standard for datasets proposed recording a dataset's composition, how it was collected, and what it was intended for, on the argument that you cannot reason about a model's behaviour without reasoning about the data that shaped it [5]. That standard is data-centric in the deepest sense: it treats the dataset as something to be designed, versioned, and held accountable, which is the precondition for iterating on it deliberately. Put the three threads and this infrastructure together and they do not merely coexist; they point at the same procedure. Freeze the model, change only the data, and read whether the metric moves. That procedure is the spine of the methodology this survey is describing, and it is the one we apply to our own work below.
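The freeze-and-vary procedure can be sketched as a small harness. Everything named here is hypothetical: `ModelConfig`, `fit_slope`, `frozen_model_test`, and the toy datasets are stand-ins chosen so the example runs on the standard library alone, not a description of the competition in [2] or of any surveyed system. The one-parameter model is deliberately trivial; the point is the shape of the experiment, in which the configuration is immutable and only the training set varies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[float, float]  # (input, label)

@dataclass(frozen=True)  # frozen=True makes the "architecture" immutable
class ModelConfig:
    ridge: float = 0.0  # hypothetical stand-in for architectural choices

def fit_slope(config: ModelConfig, data: List[Example]) -> float:
    """Least-squares slope through the origin, with optional ridge penalty."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return sxy / (sxx + config.ridge)

def mean_squared_error(slope: float, data: List[Example]) -> float:
    return sum((y - slope * x) ** 2 for x, y in data) / len(data)

def frozen_model_test(
    config: ModelConfig,
    train_sets: Dict[str, List[Example]],
    held_out: List[Example],
    train_fn: Callable = fit_slope,
    eval_fn: Callable = mean_squared_error,
) -> Dict[str, float]:
    """Train the same configuration on each dataset variant; score on one held-out set."""
    return {
        name: eval_fn(train_fn(config, data), held_out)
        for name, data in train_sets.items()
    }

# Toy data (invented): the true relation is y = 2x, and "raw" has one bad label.
CONFIG = ModelConfig()
HELD_OUT = [(4.0, 8.0), (5.0, 10.0)]
TRAIN_SETS = {
    "raw":     [(1.0, 2.0), (2.0, 4.0), (3.0, 0.0)],
    "cleaned": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
}

RESULTS = frozen_model_test(CONFIG, TRAIN_SETS, HELD_OUT)
print(RESULTS)  # lower error is better
```

Because the same `CONFIG` object is passed to every run and cannot be mutated, any difference between the two held-out errors is attributable to the data variant alone, which is the reading the procedure is designed to permit.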

Method of the survey, and the boundary of our own claim

This is a literature survey with one embedded data point, so we mark exactly where each part comes from. The survey portion selects work that was public on or before the survey quarter and that bears directly on the dataset-first proposition, and it characterises each contribution from its published description rather than by re-running it. We make no attempt at a census; the selection is a sample chosen to expose the three threads above and the synthetic-data subliterature the embedded instance sits in.

The embedded instance is a single configuration from our own raster well-log digitisation work, the task of recovering machine-readable curves from scanned paper logs. The relevant facts are these and only these: we held one encoder-decoder configuration fixed across the run, with 128 feature channels at the bottleneck, 5 encoder stages, 5 decoder stages, and 2 attention layers, and trained it under the same budget while we changed the dataset beneath it. The dataset moved from a 2,000-instance set to a 15,000-instance curated synthetic set, and the attained fit on the harder, scaled task reached an R-squared of 0.9891. Those numbers are sourced from the engagement archive and are the only measured quantities we assert. Everything we say about the comparative shape of model-centric versus data-centric returns is a reading of the literature plus this single instance, not a controlled sweep, and the instrument below flags that boundary on its own face.
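For readers who want the fixed quantities in one place, the sketch below records the four configuration values stated above in an immutable record and gives the standard coefficient of determination. Two assumptions are worth flagging: the field names are hypothetical, and the text does not specify how the reported R-squared was computed or over which outputs, so the usual definition (one minus residual sum of squares over total sum of squares) is assumed here.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class EncoderDecoderConfig:
    """The one configuration held fixed across the run (field names are hypothetical)."""
    feature_channels: int = 128
    encoder_stages: int = 5
    decoder_stages: int = 5
    attention_layers: int = 2

def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Standard coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot

FIXED_CONFIG = EncoderDecoderConfig()
print(FIXED_CONFIG)
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # a perfect fit
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean
```

Under this definition a perfect prediction scores 1.0 and always predicting the target mean scores 0.0.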

Why synthetic data is the relevant subliterature deserves a sentence, because it is where our instance lives. Generated data is attractive precisely when real labels are scarce and the target is thin, which is the raster-log regime, but the survey literature on synthetic data is clear-eyed that it substitutes for collected data only when the generator captures the variation that matters and otherwise stops transferring [6]. The technique that made synthetic-first training credible in vision was domain randomisation, deliberately varying the generated scenes so widely that the real distribution looks to the network like just another sample, which was shown to close enough of the reality gap to train transferable networks [7]. The synthetic-first program is therefore not a rejection of the data-centric thesis; it is the most aggressive form of it, in which the data work is not only curation but manufacture.
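The idea behind domain randomisation can be illustrated with a minimal generator. This is a sketch under loud assumptions: it is neither the pipeline of [7] nor the generator used in our own work, and every parameter name and range (`frequency`, `line_width`, `grid_step`, `speckle_prob`, the 64 by 32 raster) is invented for illustration. It draws a sinusoidal "curve" over a grid with random speckle and returns the image together with the pixel mask a segmentation network would be trained to predict.

```python
import math
import random

def sample_scene_params(rng):
    """Draw one randomised rendering configuration (all ranges are illustrative)."""
    return {
        "frequency": rng.uniform(0.5, 3.0),      # cycles of the curve over the log
        "amplitude": rng.uniform(0.2, 0.45),     # fraction of track width
        "line_width": rng.randint(1, 3),         # curve thickness in pixels
        "grid_step": rng.choice([4, 8, 16]),     # spacing of background grid lines
        "speckle_prob": rng.uniform(0.0, 0.05),  # per-pixel scan-noise probability
    }

def render_instance(rng, height=64, width=32):
    """Return (image, mask, params): a binary raster and its ground-truth curve mask."""
    p = sample_scene_params(rng)
    image = [[0] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    # Background grid: horizontal then vertical lines.
    for r in range(0, height, p["grid_step"]):
        for c in range(width):
            image[r][c] = 1
    for c in range(0, width, p["grid_step"]):
        for r in range(height):
            image[r][c] = 1
    # The curve: depth runs down the rows, the measured value across the columns.
    for r in range(height):
        value = 0.5 + p["amplitude"] * math.sin(2 * math.pi * p["frequency"] * r / height)
        centre = int(value * (width - 1))
        for c in range(centre, min(width, centre + p["line_width"])):
            image[r][c] = 1
            mask[r][c] = 1
    # Speckle noise only ever adds ink, so mask pixels stay set in the image.
    for r in range(height):
        for c in range(width):
            if rng.random() < p["speckle_prob"]:
                image[r][c] = 1
    return image, mask, p

def make_dataset(n_instances, seed):
    """Generate n_instances (image, mask, params) triples reproducibly from one seed."""
    rng = random.Random(seed)
    return [render_instance(rng) for _ in range(n_instances)]

dataset = make_dataset(5, seed=0)
print(len(dataset), dataset[0][2])
```

Scaling a generated set is then a matter of changing `n_instances`, while widening or narrowing the sampled ranges is how such a generator would be tuned to cover the variation that matters.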

Results: where the synthetic-first program lands on the map

The result of placing our instance on the surveyed map is that it falls squarely inside the data-centric position, and it does so in the cleanest possible way, because the architecture was a genuine constant. The encoder-decoder that ran the small-data phase is the same encoder-decoder, channel for channel and stage for stage, that ran the scaled-data phase. No depth was added, no attention block was introduced, no backbone was swapped. The only thing that changed between a modest attained fit and an R-squared of 0.9891 was the size and composition of the generated training set. By the operational test the three threads converge on, the verdict is unambiguous: with the model frozen, the metric moved from the data side, so the data was the binding constraint on this task.

The instrument below is built to argue exactly that and to let a reader test it rather than take it on faith. It poses the methodology as a single lever that splits one fixed engineering budget between the two arms the literature contrasts: model-centric effort, which would search for a better architecture, and data-centric effort, which scales the dataset. The architecture arm is drawn frozen, a flat ceiling, because in our run it never changed and so could not climb. The data arm is drawn as the saturating curve our scaling actually traced, anchored at the two measured endpoints, the 2,000-instance floor and the 15,000-instance peak at R-squared 0.9891. As the lever is pushed from the model-centric end toward the data-centric end, the operating point climbs the data curve while the model ceiling stays flat, which is the whole methodological claim rendered as a mechanism.

A single effort lever splits one fixed engineering budget between the two methodologies the data-centric literature contrasts: model-centric effort, which searches for a better architecture, and data-centric effort, which rebuilds and scales the dataset. Drag the lever from the model-centric end toward the data-centric end. The architecture arm stays frozen at the one configuration we held fixed across the run (128 feature channels, 5 encoder stages, 5 decoder stages, 2 attention layers), so effort poured onto it cannot lift the metric above a flat frozen-model ceiling; the data arm scales the synthetic training set from 2,000 to 15,000 instances and the attained fit climbs a saturating curve that tops out at the R-squared of 0.9891 we measured. The feature depth, stage counts, attention-layer count, instance scale, and peak R-squared are sourced from the engagement archive; the flat model line and the shape of the climb between the two measured endpoints are an illustrative reading, not a measured sweep.

What the picture is careful not to overstate is also part of the result. The flat model line is an illustrative ceiling, not a measured architecture sweep; we did not run a battery of alternative networks to prove the architecture arm is barren, we observed that our one frozen network could not have moved because it did not change. The climb between the two endpoints is shaped to show diminishing returns, which is the expected behaviour of added data once a model approaches saturation, but the only two points on it we measured are its ends. The defensible content of the exhibit is structural and it matches the surveyed methodology precisely: a fixed model, a dataset scaled more than sevenfold, and a metric that moved with the data. That is one instance of the data-centric test passing, drawn at the synthetic-first extreme of the data work.
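The distinction between the two measured endpoints and the illustrative shape between them can be written down directly. In the sketch below the instance counts and the peak R-squared come from the text; the logarithmic interpolant is an assumption chosen only because it is concave, and it is not claimed to be the curve the instrument draws. The fit at the 2,000-instance floor is not reported in this survey, so `floor_fit` is left as a required argument rather than filled in.

```python
import math

FLOOR_INSTANCES = 2_000   # measured endpoint (from the text)
PEAK_INSTANCES = 15_000   # measured endpoint (from the text)
PEAK_R_SQUARED = 0.9891   # measured at the peak (from the text)

SCALE_FACTOR = PEAK_INSTANCES / FLOOR_INSTANCES  # 7.5, i.e. more than sevenfold

def gain_fraction(n_instances):
    """Share of the floor-to-peak gain realised at n_instances.

    ASSUMPTION: a logarithmic shape, used only to illustrate diminishing
    returns. Only the two endpoints (0.0 and 1.0) correspond to measurements.
    """
    if not FLOOR_INSTANCES <= n_instances <= PEAK_INSTANCES:
        raise ValueError("only defined between the two measured endpoints")
    return math.log(n_instances / FLOOR_INSTANCES) / math.log(SCALE_FACTOR)

def illustrative_fit(n_instances, floor_fit):
    """Interpolated fit; floor_fit must be supplied by the caller (not reported here)."""
    f = gain_fraction(n_instances)
    return (1.0 - f) * floor_fit + f * PEAK_R_SQUARED

midpoint = (FLOOR_INSTANCES + PEAK_INSTANCES) // 2
print(SCALE_FACTOR)
print(gain_fraction(FLOOR_INSTANCES), gain_fraction(midpoint), gain_fraction(PEAK_INSTANCES))
```

Under the assumed shape, more than half of the gain is realised by the midpoint of the instance range, which is what diminishing returns means here; a different concave interpolant would give a different midpoint value and the same two ends.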

Discussion: a worked instance, not a proof, and where it fits

A survey owes the reader a clear statement of what a single instance can and cannot do for a methodology. It cannot prove the data-centric thesis in general; one run on one task, however clean, is an existence proof that the test can pass, not a theorem that it always will. What it can do is sit alongside the cleaner, larger experiments and corroborate them from an industrial direction the academic literature less often reaches. The frozen-model competition established the pattern under controlled conditions [2]; the benchmark audit established that the measuring stick was noisy enough to mislead model-centric work [4]; the field study established that the cost of neglected data is real and downstream [3]. Our instance adds the observation that on a thin-target, scarce-label task in oil and gas, the same procedure those works recommend produced the gain, and that the gain came specifically from manufacturing data rather than from curating an existing collection.

The placement also clarifies what is genuinely ours and what is borrowed, which matters for honesty in a survey written by people who built one of the systems described. The methodology is not ours; the three threads and the synthetic-data subliterature are credited above to their authors. The encoder-decoder family we used is itself prior art, the symmetric skip-connected network designed for dense labels from scarce data [8]. What we assert authorship of is the VeerNet system that wraps these ideas for raster-log digitisation, the synthetic generator that produced the scaled dataset, and the specific decision to hold the network fixed and spend the budget on data. The contribution of the instance to this survey is therefore modest and exact: it is a data point at the synthetic-first corner of the surveyed map, demonstrating in an industrial setting the same thing the cleaner experiments demonstrate in controlled ones.

There is a practical reading for a team facing a similar problem, and it is the part of the discussion most likely to be useful. The data-centric turn is easy to agree with in the abstract and hard to act on, because abstract agreement costs nothing while the discipline costs an architecture budget you have to refuse to spend. The procedure the survey distils makes the discipline concrete: before reaching for a bigger network, decide which side you suspect is the binding constraint, freeze the side you suspect is not, and move only the side you suspect is. If the metric moves, you have both your answer and your roadmap; if it does not, you have learned that the constraint is elsewhere and can spend the architecture budget without guilt.

Limitations

This survey carries the limitations of its dual form, and naming them is part of meeting the bar it sets for others. As a survey, its selection is a sample rather than a census, restricted to work public on or before the survey quarter and chosen to expose three threads, which means it omits both later contributions and adjacent strands such as active learning and programmatic weak supervision that a fuller treatment would include. As a piece carrying one embedded instance, its empirical content is exactly one configuration from one task: a single fixed encoder-decoder, a single dataset scaling from 2,000 to 15,000 instances, and a single attained R-squared of 0.9891. That instance is not a controlled comparison of methodologies. We did not run a matched model-centric arm in which the architecture was varied under a fixed dataset, so the survey cannot quantify how much a model-centric program would have gained; it can only report that our frozen model could not have gained, because it did not change. The instrument's flat model ceiling and the shape of its data-centric climb are illustrative readings anchored at two measured endpoints, not a measured sweep, and should be read as a mechanism that renders the claim rather than as data in their own right. Finally, the result is domain-specific: a thin-target, scarce-label, single-channel task is the regime where the data-centric test is most likely to pass at the synthetic extreme, and a reader should not generalise an R-squared of 0.9891 obtained here to tasks with abundant labels and large targets, where the binding constraint may well sit on the model side.

What this survey establishes

The data-centric turn is best read as three distinct threads that converge on one procedure: an explicit reframing with a frozen-model competition, field studies of how undervalued data work compounds into downstream failure, and audits showing benchmark label noise can reorder model rankings. Each is credited to its authors.
The operational test the threads share is sharp and falsifiable: freeze the model, change only the data, and read whether the metric moves. A gain under a frozen model means the data was the binding constraint.
A synthetic-data-first segmentation program is the most aggressive form of the methodology, where data work is manufacture, not just curation. It transfers only when the generator captures the variation that matters, the condition the synthetic-data literature is explicit about.
Our one embedded instance lands inside the data-centric position cleanly: a single fixed encoder-decoder (128 feature channels, 5 encoder and 5 decoder stages, 2 attention layers) reached an R-squared of 0.9891 once the generated set scaled from 2,000 to 15,000 instances. The architecture was constant; the dataset was the variable.
This is a worked instance, not a proof. It corroborates the cleaner controlled experiments from an industrial direction and is domain-specific to thin-target, scarce-label imagery; it should not be generalised to label-rich, large-target tasks where the model may still be the constraint.

References

[1] Ng, A. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI (2021). The talk that named the shift and argued the next gains on production problems come from improving the dataset rather than the model. https://www.youtube.com/watch?v=06-AZXmwHjo

[2] Ng, A., Laird, D., and He, L. Data-Centric AI Competition. DeepLearning.AI and Landing AI (2021). A benchmark in which the model was frozen and entrants could only edit the training data, turning the slogan into a falsifiable experiment. https://https-deeplearning-ai.github.io/data-centric-comp/

[3] Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., and Aroyo, L. Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI. CHI (2021). Field evidence that undervalued data work compounds into downstream failures across deployed systems. https://research.google/pubs/pub49953/

[4] Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets and Benchmarks (2021). Measured label noise in widely used benchmark test sets, enough to reorder published model rankings. https://arxiv.org/abs/2103.14749

[5] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for Datasets. Communications of the ACM (2021). A standard for documenting a dataset's composition, collection, and intended use, making the data a first-class, inspectable artefact. https://arxiv.org/abs/1803.09010

[6] Nikolenko, S. I. Synthetic Data for Deep Learning. Springer (2021). A survey of when generated data substitutes for collected data, and the domain-gap conditions under which it stops transferring. https://arxiv.org/abs/1909.11512

[7] Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., and Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops (2018). Demonstrated that deliberately randomised synthetic data can train networks that transfer to real images. https://arxiv.org/abs/1804.06516

[8] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder with skip connections, designed to learn dense labels from scarce data. https://arxiv.org/abs/1505.04597
