Dataset Distillation and Coreset Selection: Training on Less Without Losing Accuracy

Abstract

Two research programmes attack the same waste from opposite ends. Coreset selection keeps a chosen subset of the real training instances and throws the rest away, betting that most of a large set is redundant and that the informative minority can be found. Dataset distillation keeps none of the originals and instead synthesises a small artificial set that trains a model to nearly the accuracy the full set would have given, betting that the signal a network needs can be compressed into far fewer, denser examples than were ever collected. This survey reads both programmes in their published form, separates the selection rules the coreset literature has converged on from the gradient-matching and kernel objectives that make distillation work, and uses the two of them to frame a measurement question our own pipeline already poses. Our raster well-log digitiser, VeerNet, reaches a peak single-curve goodness-of-fit of R-squared 0.9891 on a 15,000-instance multiclass segmentation set trained for 550 minutes at batch size 16, while a related binary task reaches usable accuracy on only 2,000 instances in 110 minutes. The gap between those two set sizes is exactly the territory coreset selection is about. The central finding of the surveyed literature is that a well-chosen coreset preserves accuracy on a plateau while cutting wall-clock time, so the honest survey question is not whether to shrink a 15,000-log set but how far it can shrink before the R-squared degrades, and the answer is a curve with a knee rather than a single number. This is a survey of the published field; the engagement numbers are ours and anchor the argument; they are not a new benchmark.

Why a smaller set is a real question here

The reflex in deep learning is to collect more data, and for most of its history that reflex was right. But our own numbers make the opposite question concrete. Training the 15,000-instance multiclass set costs 550 minutes of wall clock for fifty epochs; the 2,000-instance binary set costs 110 minutes. Those two figures are sourced, and read side by side they say something the scaling reflex hides: a task that is genuinely simpler learns on roughly an eighth of the instances in a fifth of the time, and the larger set is larger partly because it is more redundant, not only because it is harder. Every synthetic log we render carries a pixel-perfect label for free, so we are not labour-bound the way a hand-annotated project would be, but we are still compute-bound, and 550 minutes per run is a real tax on the iteration speed of a small team on one GPU.
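The two ratios quoted above can be checked directly from the sourced anchors. The sketch below is only arithmetic on the four sourced figures; the variable and function names are ours, and the minutes-per-instance comparison is an illustration rather than a claim about why the two runs differ.

```python
# Sourced anchors from the text: (training instances, training minutes).
MULTICLASS = (15_000, 550)
BINARY = (2_000, 110)


def ratios(small, large):
    """Return (instance ratio, time ratio) of the small run to the large run."""
    return small[0] / large[0], small[1] / large[1]


instance_ratio, time_ratio = ratios(BINARY, MULTICLASS)
print(f"instances: {instance_ratio:.3f}, minutes: {time_ratio:.3f}")

# Minutes per instance for each run. The two tasks differ, so this does not
# put them on a common scale; it only shows the anchors bracket a sweep.
minutes_per_instance = {
    "multiclass": MULTICLASS[1] / MULTICLASS[0],
    "binary": BINARY[1] / BINARY[0],
}
print(minutes_per_instance)
```

The instance ratio comes out at about 13 percent, close to the eighth quoted above, and the time ratio at exactly a fifth.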

That is the setting in which both surveyed programmes stop being academic. If most of the 15,000 logs are near-duplicates of each other in feature space, then a selection rule that keeps the informative fraction should train a model that fits nearly as well in a fraction of the 550 minutes. And if the informative content can be compressed harder still, a distilled set smaller than either anchor might carry it. The question is not whether shrinking is possible. It is where the accuracy we already measured, the R-squared of 0.9891, starts to give way.

Background: two ways to keep less

Coreset selection began as a geometry problem. Sener and Savarese framed the choice of a training subset as a k-center covering problem: pick the examples whose feature-space neighbourhoods cover the rest of the set, so that no unpicked example is far from a picked one [1]. The intuition is that a model trained on a good cover sees the whole distribution through representatives, and the redundant interior of each cluster adds little. That geometric statement is still the cleanest way to say what a coreset is for.
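The covering idea can be sketched with the greedy farthest-first heuristic, a standard approximation to the k-center problem. This is a minimal illustration, not necessarily the exact procedure used in [1]; the function name, the toy two-dimensional points, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math


def k_center_greedy(features, k, first=0):
    """Pick k indices by farthest-first traversal (greedy k-center heuristic).

    Returns the chosen indices and the covering radius, i.e. the largest
    distance from any point to its nearest chosen centre.
    """
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of points")
    selected = [first]
    # Distance from every point to its nearest selected centre so far.
    nearest = [math.dist(f, features[first]) for f in features]
    while len(selected) < k:
        # The next centre is the point that is currently worst covered.
        idx = max(range(len(features)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [
            min(d, math.dist(f, features[idx]))
            for d, f in zip(nearest, features)
        ]
    return selected, max(nearest)


# Toy feature vectors (hypothetical): two tight clusters and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen, radius = k_center_greedy(points, k=3)
print(chosen, radius)
```

With three centres the heuristic takes one representative per cluster plus the outlier, and the redundant cluster interiors end up within a small radius of a chosen point, which is the sense in which the interior "adds little".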

The next wave replaced geometry with training dynamics, on the observation that a network itself reveals which examples matter. Toneva and colleagues found that during training some examples are learned and then forgotten many times while others, once learned, are never forgotten, and that the never-forgotten examples can be removed with almost no accuracy cost because the network was not relying on them to define the decision boundary [2]. Paul and colleagues turned that observation into a cheap score: the gradient norm or the error of an example early in training predicts how much it will matter, so the redundant bulk can be pruned after only a few epochs rather than a full run [5]. Coleman and colleagues made the scoring itself affordable by computing it with a small proxy model whose example rankings transfer to the large one, so selection does not cost as much as the training it is meant to save [3]. Killamsetty and colleagues closed the loop by selecting the subset whose gradients best serve validation-set generalisation directly, a bilevel objective that ties the coreset to the thing we actually care about [4].
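The forgetting statistic in [2] can be sketched as a count of correct-to-incorrect transitions per example across epochs. The code below is a minimal illustration under assumptions: the function names and the toy history are ours, and for a segmentation task "correct" would itself need a definition (for instance a per-log fit threshold), which the source does not specify.

```python
def forgetting_stats(correct_history):
    """Count forgetting events per example.

    correct_history[e][i] is True if example i was classified correctly
    when evaluated at epoch e. A forgetting event is a transition from
    correct at one epoch to incorrect at the next.
    """
    n = len(correct_history[0])
    forgets = [0] * n
    learned = list(correct_history[0])  # was the example ever correct?
    for prev, curr in zip(correct_history, correct_history[1:]):
        for i in range(n):
            if prev[i] and not curr[i]:
                forgets[i] += 1
            learned[i] = learned[i] or curr[i]
    return forgets, learned


def unforgettable(correct_history):
    """Indices that were learned at some point and never forgotten."""
    forgets, learned = forgetting_stats(correct_history)
    return [i for i in range(len(forgets)) if learned[i] and forgets[i] == 0]


# Toy history (hypothetical): 4 epochs by 4 examples.
history = [
    [True, False, False, True],
    [True, True, False, False],
    [True, False, False, True],
    [True, True, False, True],
]
forgets, learned = forgetting_stats(history)
prunable = unforgettable(history)
print(forgets, learned, prunable)
```

Note that an example that is never learned has zero forgetting events but is not unforgettable, which is why the sketch tracks the two properties separately.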

Dataset distillation is the more radical programme, and it keeps none of the real examples. Wang and colleagues introduced it as the surprising claim that a handful of synthetic images, optimised rather than sampled, can train a network to a large fraction of full accuracy [7]. The founding version was expensive to optimise, and the method became practical when Zhao and colleagues reframed the objective as gradient matching: learn a small synthetic set whose training gradients track those the full set would produce, so a model trained on the synthetic set follows nearly the same optimisation path [8]. Nguyen and colleagues gave the idea a tractable closed form through kernel ridge regression on the neural tangent kernel, distilling a set by solving a regression rather than an unrolled inner loop [9]. The through-line is that distillation manufactures density: each synthetic example is worth many real ones because it was built to be.
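The two objectives described above can be written out as follows. The notation is ours and is a sketch of the general form rather than a transcription of either paper: S is the synthetic set, T the full set, D a distance between gradients, N the number of matched training steps, and K a kernel matrix between the subscripted sets.

```latex
% Gradient matching [8]: learn S so that its training gradients track
% those of the full set T along the optimisation path \theta_0, \theta_1, ...
\[
\min_{\mathcal{S}} \;
\mathbb{E}_{\theta_0 \sim P_{\theta_0}}
\left[ \sum_{t=0}^{N-1}
  D\!\left( \nabla_\theta \mathcal{L}^{\mathcal{S}}(\theta_t),\;
            \nabla_\theta \mathcal{L}^{\mathcal{T}}(\theta_t) \right)
\right]
\]

% Kernel ridge regression [9]: fit on the synthetic set (X_s, y_s),
% score on the real targets (X_t, y_t), and optimise the synthetic set itself.
\[
\min_{X_s,\, y_s} \;
\frac{1}{2}
\left\lVert\, y_t - K_{X_t X_s}
  \left( K_{X_s X_s} + \lambda I \right)^{-1} y_s \,\right\rVert_2^2
\]
```

The second form has no unrolled inner training loop: the regression solution is available in closed form, so the only optimisation is over the synthetic examples and labels.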

By this survey's date, the case for both programmes had been sharpened by a single empirical result worth stating on its own. Sorscher and colleagues showed that with a good pruning metric the accuracy-versus-data-size relationship need not follow the usual power law at all, and can be pushed toward an exponential, meaning far fewer well-chosen examples hold the accuracy that many random ones do [6]. That is the strongest version of the claim this whole survey rests on: the shape of the retention curve is not fixed by nature; it is set by how well you choose what to keep.

The selection rules, side by side

Read together, the coreset methods are variations on one question asked four ways. The geometric rule keeps a feature-space cover [1]. The dynamics rule keeps the examples the network struggles with and forgets, on the logic that those are the ones defining the boundary [2] [5]. The proxy rule keeps whatever a cheap stand-in model says the expensive model will need [3]. The bilevel rule keeps the subset that most improves held-out generalisation directly [4]. They disagree on the score but agree on the premise: a large set has a small informative core and a large redundant remainder, and the redundant remainder can go.
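Whatever the score, the mechanics of these rules are a scoring pass followed by a cut. The sketch below uses an error-norm score in the spirit of [5], the L2 distance between the predicted class probabilities and the one-hot label, and keeps the highest-scoring fraction. It is a simplified illustration: the function names and toy predictions are ours, and any averaging of the score over several early-training runs is omitted.

```python
import math


def error_l2_norm(probs, label):
    """L2 norm of (predicted probabilities - one-hot label) for one example."""
    return math.sqrt(sum(
        (p - (1.0 if c == label else 0.0)) ** 2
        for c, p in enumerate(probs)
    ))


def keep_top_fraction(scores, fraction):
    """Return sorted indices of the highest-scoring fraction of examples."""
    k = max(1, math.ceil(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy early-training predictions (hypothetical): (class probabilities, label).
predictions = [
    ([0.90, 0.05, 0.05], 0),  # confidently right: low score
    ([0.40, 0.35, 0.25], 0),  # barely right: high score
    ([0.10, 0.80, 0.10], 1),  # confidently right: low score
    ([0.30, 0.30, 0.40], 1),  # wrong: highest score
]
scores = [error_l2_norm(p, y) for p, y in predictions]
kept = keep_top_fraction(scores, fraction=0.5)
print(scores, kept)
```

Swapping the score for a forgetting count, a proxy model's uncertainty, or a validation-gradient term changes the first function only; the cut stays the same.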

For a raster-log task the disagreement matters in a specific way. Our logs vary along axes a human can name: the curve shapes, the grid and scan-noise conditions, the width of the scanned track. A feature-space cover risks keeping many logs that look different in pixels but pose the same learning problem, while a dynamics-based score risks over-weighting the genuinely hard scans, which on a digitiser are often the degraded ones the operator least needs digitised well. The published rules do not settle which axis of variety is worth covering on a well log; they settle only that covering the right variety, whatever it is, is what preserves accuracy. That is the gap between the survey and a result on our own data, and naming it honestly is the point of reading the literature rather than assuming it.

Where the curve bends

The retention curve is the object the whole survey is really about, and its shape carries the argument. A good selection rule keeps accuracy nearly flat as the set shrinks from the full size down through the redundant region, because the examples being dropped were not doing much. Then, once the coreset is small enough that it stops covering some part of the distribution, accuracy falls, and it falls faster for a bad selection rule than a good one. The Sorscher result is precisely the statement that a better rule pushes the bend later and makes the plateau longer [6]. The practical question a project has is therefore not a yes or no but a location: where is the knee for my task, my metric, and my selection rule.
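Locating the knee from a sweep is a small computation once the sweep exists. The sketch below takes the knee to be the smallest coreset whose fit stays within a tolerance of the full-set fit, walking down from the full set. The tolerance of 0.005 is an arbitrary assumption, and the sweep values are made-up placeholders to exercise the code, not measurements; only the 0.9891 at 15,000 instances is sourced.

```python
def find_knee(sweep, tolerance=0.005):
    """Smallest coreset size still on the plateau.

    sweep is a list of (coreset_size, r_squared) pairs. The plateau is the
    contiguous run of sizes, starting at the largest, whose R-squared is
    within `tolerance` of the value measured at the largest size.
    """
    ordered = sorted(sweep, reverse=True)  # largest coreset first
    reference = ordered[0][1]
    knee = ordered[0][0]
    for size, r2 in ordered:
        if reference - r2 > tolerance:
            break  # past the knee; smaller sizes are ignored
        knee = size
    return knee


# Placeholder sweep: only the first pair is sourced, the rest are invented.
placeholder_sweep = [
    (15_000, 0.9891),
    (12_000, 0.9889),
    (9_000, 0.9880),
    (6_000, 0.9860),
    (4_000, 0.9700),
    (2_000, 0.9200),
]
knee_size = find_knee(placeholder_sweep)
print(knee_size)
```

Running the same function on the sweep for a random-subsampling baseline and for a selector would give two knee locations, and the gap between them is the quantity the pruning literature argues a better rule enlarges.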

Exhibit: How far a 15,000-log training set can shrink before the served model stops holding its fit. Modelled R-squared against retained coreset size, from 15,000 down toward the 2,000-instance binary anchor, for a coreset selector (teal) and uniform random subsampling (faint ghost line), with the knee marked in orange. The retention curves and the knee location are an illustrative model, flagged on the canvas; only the anchors are measured.

The exhibit above frames that question against our two sourced anchors. It sweeps a coreset from the full 15,000-instance set down toward the 2,000-instance binary anchor and reads off the training budget that each coreset would cost against the sourced 550 and 110 minute figures, alongside a modelled R-squared that holds near the sourced peak of 0.9891 across a plateau and then bends down past a knee. A coreset selector, drawn in teal, holds the plateau longer than uniform random subsampling, the faint ghost line, because random keeps redundant logs and drops rare ones in equal measure while a selector keeps the rare ones [1] [6]. The single orange element is the knee marker, the smallest coreset still on the plateau, which is the operating point the survey question resolves to. The retention curves and the knee location are an illustrative model and are flagged as such on the canvas; only the two set-size anchors, the two training budgets, the 0.9891 peak, the 80/20 split, and the batch size of 16 are sourced. The exhibit is there to make the shape of the trade legible, not to assert a knee we measured, because we did not sweep the coreset on this run.
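How a training budget might be read off for an intermediate coreset size can be sketched in two ways. The source does not say which rule the exhibit uses, so both functions below are assumptions: one scales the 550-minute run in proportion to set size, which presumes the same task, the same epoch count, and time linear in instance count; the other interpolates linearly between the two sourced anchors, which mixes two different tasks.

```python
# Sourced anchors: (training instances, training minutes).
FULL = (15_000, 550.0)
SMALL = (2_000, 110.0)


def proportional_minutes(n, full=FULL):
    """Budget if wall-clock time scales linearly with set size (assumed)."""
    size, minutes = full
    return minutes * n / size


def interpolated_minutes(n, small=SMALL, full=FULL):
    """Budget by straight-line interpolation between the anchors (assumed)."""
    (n0, m0), (n1, m1) = small, full
    if not n0 <= n <= n1:
        raise ValueError("n is outside the range bracketed by the anchors")
    return m0 + (n - n0) * (m1 - m0) / (n1 - n0)


for n in (15_000, 8_500, 2_000):
    print(n, round(proportional_minutes(n), 1), round(interpolated_minutes(n), 1))
```

The two rules agree at the full set and diverge as the coreset shrinks, which is one more reason to treat any budget read off between the anchors as a model rather than a measurement.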

The honest reading of our own numbers through this lens is a hypothesis, not a result, and it is worth stating as one. The 550-minute run and the 110-minute run bracket the sweep: the fact that a simpler binary task learns on 2,000 instances tells us the multiclass set almost certainly carries redundancy, but it does not tell us that 2,000 well-chosen multiclass logs would reach the same 0.9891, because the multiclass task must cover two curves and a background rather than one boundary. The survey's contribution to our own work is to convert a vague sense that the set is oversized into the specific, testable claim that there is a knee somewhere between the two anchors, and that finding it is a measurement we could run.

Selection or distillation, on this task

The two programmes are not equally ready for a raster-log pipeline, and the survey is the right place to say which. Coreset selection keeps real logs, so a selected subset inherits every property of the originals: the exact synthetic labels, the real width distribution, the genuine defect families. It asks only for a scoring pass and a threshold, and the proxy trick makes even that cheap [3] [5]. Dataset distillation keeps no real logs, and on a segmentation task with pixel-perfect masks that is a heavier thing to give up: a distilled synthetic log would need a distilled synthetic mask, and the published distillation work is overwhelmingly demonstrated on small classification images, not on wide, high-resolution dense-prediction inputs [7] [8] [9]. For our task the survey's reading is that coreset selection is the near-term lever and distillation is the research bet, because selection composes with a mask-carrying synthetic pipeline while distillation would have to reinvent the label side of it.

There is also a cost asymmetry the literature is clear about and a project should not forget. Selection scores examples we already have and then trains once on the survivors; its saving is real and immediate [3] [5]. Distillation runs an optimisation to build the synthetic set before any downstream training, and that up-front cost is only worth paying when the tiny set is reused many times [7] [8]. For a digitiser that retrains occasionally as the archive grows, the amortisation case for distillation is weaker than it looks, which is another reason the survey lands on selection first.
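The amortisation argument reduces to a break-even count: how many downstream runs it takes for the per-run saving to repay the up-front distillation cost. The sketch below is arithmetic only. The function name is ours, and in the example only the 550-minute full run is sourced; the distillation cost and the distilled-set run time are hypothetical placeholders.

```python
import math


def break_even_runs(upfront_minutes, full_run_minutes, small_run_minutes):
    """Number of downstream runs needed to repay an up-front cost.

    Returns None when the smaller set saves nothing per run.
    """
    saving = full_run_minutes - small_run_minutes
    if saving <= 0:
        return None
    return math.ceil(upfront_minutes / saving)


# Hypothetical: 2,000 minutes to distil, 50 minutes per run on the tiny set.
runs_needed = break_even_runs(
    upfront_minutes=2_000, full_run_minutes=550, small_run_minutes=50
)
print(runs_needed)
```

The same function covers selection by setting the up-front cost to the scoring pass, which is why a cheap proxy score shortens the break-even so sharply.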

Discussion

The unifying claim across both programmes is that the accuracy of a trained model is a function of the information in its training set, not of the raw count of examples, and that most large sets carry far less information than their count suggests. Coreset selection exploits this by finding the informative subset among the real examples [1] [2] [3] [4] [5]; dataset distillation exploits it by manufacturing informative examples that never existed [7] [8] [9]; and the data-pruning scaling result is the proof that doing either well changes the shape of the accuracy curve rather than just sliding along it [6]. For a compute-bound project the consequence is a shift in the default question from how much more data can we collect to how little of what we have do we actually need.

Where our own work sits is at the near edge of this, holding a measured accuracy and two set-size anchors but not yet the sweep between them. The survey does not tell us where VeerNet's knee is; no survey could, because the knee depends on our task, our selection rule, and our metric. What it tells us is that the knee exists, that a good selection rule pushes it later than random subsampling would, and that the 550-minute training budget is a quantity we could plausibly cut without surrendering the 0.9891, if we were willing to run the coreset sweep the literature describes. That is the difference between a slogan about training on less and an experiment we could actually schedule.

Limitations

This is a survey and carries a survey's limits. It synthesises the published coreset-selection and dataset-distillation literature and does not re-implement or re-measure any method it discusses; the numbers it quotes as ground truth, the 2,000 and 15,000 instance counts, the 110 and 550 minute training budgets, the peak R-squared of 0.9891, the 80/20 split, and the batch size of 16, are the real metrics of one engagement, one architecture, and one training regime, used as a measurement anchor rather than as a fresh benchmark of any surveyed method. We did not run a coreset sweep or a distillation on the reference task, so the survey makes no measured claim about where VeerNet's retention knee falls or how much of the 550-minute budget a coreset could recover; the plateau-then-knee shape it argues is a prediction from the cited data-pruning results, not a curve we recorded. The interactive exhibit's retention curves for both selection strategies and the location of its knee are an illustrative model and are flagged as such on the canvas; only the two set-size anchors, the two budgets, the 0.9891 peak, the split, and the batch size are sourced. The survey also scopes itself to the selection and distillation lines the field treats as canonical up to its date and stops there, so later refinements of trajectory-matching distillation and of coreset theory that the field has since produced are out of frame. A reader should take this as a map of when to reach for selection over distillation on a compute-bound dense-prediction task, and as an argument that the retention curve is the right object to measure, not as a substitute for running the sweep on their own data and metric.

References

[1] Sener, O., and Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. ICLR (2018). Frames subset selection as a k-center covering problem over feature space, the geometric root of modern coreset selection. https://arxiv.org/abs/1708.00489

[2] Toneva, M., Sordoni, A., Tachet des Combes, R., Trischler, A., Bengio, Y., and Gordon, G. J. An Empirical Study of Example Forgetting during Deep Neural Network Learning. ICLR (2019). Shows some examples are forgotten repeatedly and others never, and that the never-forgotten ones can be dropped with little accuracy loss. https://arxiv.org/abs/1812.05159

[3] Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. Selection via Proxy: Efficient Data Selection for Deep Learning. ICLR (2020). Uses a small proxy model to score which examples the large model needs, making selection affordable. https://arxiv.org/abs/1906.11829

[4] Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., and Iyer, R. GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning. AAAI (2021). Selects the subset whose gradients best serve validation-set generalisation, a bilevel view of coreset selection. https://arxiv.org/abs/2012.10630

[5] Paul, M., Ganguli, S., and Dziugaite, G. K. Deep Learning on a Data Diet: Finding Important Examples Early in Training. NeurIPS (2021). Scores each example by its early-training gradient norm or error to prune the redundant bulk without a full run. https://arxiv.org/abs/2107.07075

[6] Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. S. Beyond neural scaling laws: beating power law scaling via data pruning. NeurIPS (2022). Shows a good pruning metric can turn a power-law data-scaling curve into an exponential one, keeping accuracy on far fewer examples. https://arxiv.org/abs/2206.14486

[7] Wang, T., Zhu, J.-Y., Torralba, A., and Efros, A. A. Dataset Distillation. arXiv (2018). Introduces synthesising a tiny set of images that trains a model to near-full accuracy. https://arxiv.org/abs/1811.10959

[8] Zhao, B., Mopuri, K. R., and Bilen, H. Dataset Condensation with Gradient Matching. ICLR (2021). Learns a small synthetic set whose training gradients match those of the full set, making distillation competitive. https://arxiv.org/abs/2006.05929

[9] Nguyen, T., Chen, Z., and Lee, J. Dataset Meta-Learning from Kernel Ridge-Regression. ICLR (2021). Distils a dataset in closed form through the neural tangent kernel, giving distillation a tractable objective. https://arxiv.org/abs/2011.00050
