Active Learning on a Budget: How Teams Decide What to Label Next

Every team that trains a model on something more expensive than scraped text eventually hits the same wall: labels cost money, the budget is finite, and the only real decision left is which examples to spend that budget on. Active learning is the sub-field that has been answering that question carefully for thirty years, and the answer it has built is not a single trick but a map. The map has a name for its territories, a literature for each, and a benchmark culture that has compared them to exhaustion. What we want to do in this piece is not add a new continent to that map. It is to point at the exact coordinate where one small-data, image-heavy problem actually sits, credit the cartographers who drew the surrounding terrain, and be honest that our contribution is a budget choice, not a new acquisition function.

The problem that forces the choice is raster well-log digitisation: pulling the thin ink traces of logging curves off a scanned paper log and turning them back into digital signals. The supervision a segmentation model needs is pixel-accurate masks, and a pixel-accurate mask of a one-to-three-pixel curve threading down a tall scanned image is slow, fiddly, expert work. A petrophysicist can label a few, not a few thousand. So the budget is genuinely scarce, the examples are genuinely expensive, and the active-learning question is not academic. It is the whole project.

The map: acquisition functions, briefly and with credit

Pool-based active learning assumes you have a large pool of unlabeled examples and a small budget of labels to request. An acquisition function scores every example in the pool by how informative labeling it would be, and you query the top of that ranking. The canonical reference for the whole family is the survey by Settles, which catalogues the strategies and the theory behind them [1]; almost everything below is in there, and we lean on it rather than re-deriving it.

The oldest and still most-used family is uncertainty sampling, introduced by Lewis and Gale for text classifiers [2]: label the examples the current model is least confident about, on the theory that a confident prediction teaches the model little and an uncertain one sits near a decision boundary where a label moves the most. Uncertainty has three classic flavours, and they are exactly the presets we expose in the instrument below. Least-confident scores an example by one minus the probability of its top predicted class. Margin sampling, from Scheffer and colleagues, looks at the gap between the top two class probabilities, so the smaller the gap, the higher the example ranks [3]. Entropy sampling scores it by the Shannon entropy of the full predictive distribution [4]; it uses every class probability rather than only the top one or two, which makes it the natural choice when more than two classes can be genuinely in contention. For a binary classifier the three produce the same ranking; with more classes they agree on the easy cases and disagree on the interesting ones, which is why all three survive.
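The three scores just described are short enough to write out. The following is a minimal sketch using only the Python standard library; the two probability vectors are hypothetical, chosen so that the rules disagree about which example to query.

```python
import math

def least_confident(probs):
    """One minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two largest class probabilities; smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats over the full distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical three-class predictions for two pool examples.
a = [0.50, 0.49, 0.01]  # near-tie between two classes, third class ruled out
b = [0.40, 0.30, 0.30]  # lower top probability, mass spread over all three

for name, probs in (("a", a), ("b", b)):
    print(name, least_confident(probs), margin(probs), entropy(probs))
```

On these two vectors margin sampling would query `a` (its top-two gap is smaller), while least-confident and entropy would both query `b`. That is the kind of disagreement the paragraph above refers to.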

The map did not stop with those classic rules. Gal and colleagues showed how to carry these acquisition functions into deep networks by estimating predictive uncertainty with Monte-Carlo dropout, which made information-theoretic acquisition like BALD practical for image models that do not hand you a clean posterior [5]. And Sener and Savarese pointed out the failure mode of pure uncertainty in the batch regime: taking the top of an uncertainty ranking as a batch tends to return near-duplicates, so they reframed the query as a core-set problem that explicitly rewards diversity [6]. The honest summary of the map is that uncertainty tells you where the model is weak, diversity stops you from labeling the same weakness ten times, and the survey literature has spent two decades trading these off. None of that is ours.
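To make the Monte-Carlo dropout idea concrete, here is a minimal sketch of a BALD-style score computed from several stochastic forward passes: the entropy of the averaged prediction minus the average entropy of the individual predictions. The function name `bald_score` and the sample values are hypothetical, and the passes are hard-coded lists rather than outputs of a real network.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bald_score(mc_probs):
    """Mutual-information estimate from T stochastic forward passes.

    mc_probs is a list of T class-probability vectors for one example,
    e.g. one per dropout mask.
    """
    t = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(sample[c] for sample in mc_probs) / t for c in range(n_classes)]
    predictive_entropy = entropy(mean_probs)                        # total uncertainty
    expected_entropy = sum(entropy(sample) for sample in mc_probs) / t  # per-pass uncertainty
    return predictive_entropy - expected_entropy

# Hypothetical passes: confident but contradictory, versus consistently ambiguous.
disagree = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print(bald_score(disagree), bald_score(agree))
```

Both examples have the same averaged prediction, so plain entropy cannot tell them apart. The score is high only for the first, where the passes disagree with one another, which is the distinction that matters when deciding whether a label would change the model.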

Where our problem sits on the map

Here is the coordinate we actually occupy, stated plainly so it is not mistaken for a contribution it is not. We did not have a pool of unlabeled real scans large enough to run textbook pool-based acquisition over, and we could not afford to hand-label real scans at the volume a deep segmentation model wants. We had the opposite shape of problem: a generator that could synthesise labeled logs almost for free, and an expert whose time could label only a handful of real scans. So the budget question reshaped itself. It was not only which real examples to query, but how to split a fixed annotation budget between two utterly different sources of supervision: cheap, abundant synthetic labels that come pre-masked by construction, and scarce, expensive human labels on real scans.

The concrete numbers fix the scale. The synthetic generator produced a pool of twenty thousand logs, each one labeled for free because we drew the curves and therefore knew the masks exactly. Against that, the real-scan budget for held-out validation was eight scans, hand-checked because synthetic data validates nothing about the real distribution it was meant to imitate. Training used an eighty percent train and twenty percent validation split throughout. That is the budget: twenty thousand synthetic logs to learn the shape of the task, eight real scans to keep us honest about whether it transferred.
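The arithmetic behind those counts is small enough to check directly. This sketch assumes the 80/20 split is applied to the synthetic pool, which the text implies but does not state outright; the constant names are illustrative.

```python
SYNTHETIC_POOL = 20_000      # synthetic logs, labeled by construction
REAL_VALIDATION_SCANS = 8    # hand-checked real scans
TRAIN_PERCENT = 80           # assumed to apply to the synthetic pool

n_train = SYNTHETIC_POOL * TRAIN_PERCENT // 100
n_val = SYNTHETIC_POOL - n_train
synthetic_per_real = SYNTHETIC_POOL // REAL_VALIDATION_SCANS

print(n_train, n_val, synthetic_per_real)  # 16000 4000 2500
```

Under that assumption the model trains on 16,000 synthetic logs and validates on 4,000, with 2,500 synthetic logs in the pool for every real scan.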

An interactive budget-allocation map for label spend. A single fixed annotation budget is split between cheap synthetic log generation and scarce human labeling of the hardest residual scans, and the teal curve plots the expected effective label yield as that split moves; the faint dashed teal line shows what synthetic generation alone contributes, so the gap between the two is the value the human labels add. The three preset chips credit the classic pool-based active-learning acquisition functions (least-confident, margin, and entropy sampling) surveyed by Settles: each preset moves the operating point to where that acquisition rule would spend the marginal human label, and the slider then sweeps the entire split. The pool sizes are sourced from the engagement archive: 20,000 synthetic logs, 8 real validation scans, and an 80/20 train/validation split. The yield-curve shape and the per-strategy operating points are illustrative; the three counts are not.

The allocator above is the argument in one picture. Sweep the budget toward all-synthetic and you get enormous label volume for almost no cost, but every label is drawn from a distribution you invented, so the yield curve is high in raw count and shallow in real-world information. Sweep it toward all-human and each label is gold, drawn from the true distribution, but you can afford so few that the model starves for volume. The effective-yield curve rises off the synthetic floor, peaks where a modest slice of the budget goes to human labels on the hardest residual scans, and falls again once human spend cannibalises the synthetic volume the model needs to learn the task at all. The peak is the whole point: it is not at either extreme.
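The qualitative shape described above, a rise off the synthetic floor, an interior peak, then a decline, falls out of any model in which both sources saturate. The sketch below is a toy model only: the functional form and every constant are invented for illustration and are not fitted to the engagement or to the plotted curve.

```python
import math

# Toy constants, chosen only so the curve has an interior peak; not fitted to anything.
SYNTH_CEILING = 1.0   # most the synthetic labels can ever contribute
SYNTH_RATE = 5.0      # how quickly synthetic volume saturates
HUMAN_CEILING = 0.5   # most the human labels can add on top
HUMAN_RATE = 20.0     # how quickly the first human labels pay off

def toy_yield(human_share):
    """Effective yield when `human_share` of the budget (0..1) goes to human labels."""
    synthetic = SYNTH_CEILING * (1.0 - math.exp(-SYNTH_RATE * (1.0 - human_share)))
    human = HUMAN_CEILING * (1.0 - math.exp(-HUMAN_RATE * human_share))
    return synthetic + human

# Sweep the split in 1% steps and pick the best share by grid search.
shares = [i / 100 for i in range(101)]
best_share = max(shares, key=toy_yield)

print(f"peak at human share {best_share:.2f}, yield {toy_yield(best_share):.3f}")
print(f"all-synthetic {toy_yield(0.0):.3f}, all-human {toy_yield(1.0):.3f}")
```

With these made-up constants the peak lands at a modest human share, and both extremes score lower. Changing the rates moves the peak, which is the point: where it sits is an empirical question for each project.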

Why synthetic-first, then the hardest eight

The pragmatic policy we adopted reads straight off that curve, and it maps cleanly onto the acquisition-function logic we credited above rather than replacing it. Generate the cheap synthetic labels first, in bulk, to teach the model the structure of the task: what a curve looks like, how it threads through a track, what the background is. Then treat the scarce human budget the way uncertainty sampling says to: spend it only where the model trained on synthetic data is most likely to be wrong on real input, which in practice meant the handful of real scans whose appearance the synthetic generator approximated least well. The acquisition function did not pick eight examples out of a million-example pool; it picked the hardest residual scans out of a tiny one. The principle is identical. The scale is not.
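One way to turn "most likely to be wrong on real input" into a ranking is to score each real scan by the mean per-pixel entropy of the synthetic-trained model's curve-versus-background probability map. This is an assumed scoring rule for illustration, not necessarily the one used in the engagement; the function names, scan ids, and tiny 2x2 probability maps are hypothetical.

```python
import math

def pixel_entropy(p):
    """Binary entropy (nats) of a curve-vs-background probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def scan_uncertainty(prob_map):
    """Mean pixel entropy over a 2-D map of predicted curve probabilities."""
    pixels = [p for row in prob_map for p in row]
    return sum(pixel_entropy(p) for p in pixels) / len(pixels)

def rank_hardest(prob_maps, budget):
    """Ids of the `budget` scans with the highest mean pixel entropy."""
    ranked = sorted(prob_maps, key=lambda sid: scan_uncertainty(prob_maps[sid]), reverse=True)
    return ranked[:budget]

# Hypothetical predictions on three real scans.
prob_maps = {
    "scan_a": [[0.02, 0.98], [0.01, 0.99]],  # confident everywhere
    "scan_b": [[0.45, 0.55], [0.50, 0.60]],  # unsure everywhere
    "scan_c": [[0.10, 0.90], [0.20, 0.80]],  # somewhat unsure
}

print(rank_hardest(prob_maps, budget=2))  # ['scan_b', 'scan_c']
```

Because the curves are only a few pixels wide, a mean over the whole image is dominated by background; in practice one might restrict the average to pixels near predicted curves, which this sketch does not do.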

This is also where the deep active-learning literature earns its keep as a warning rather than a recipe. Gal and colleagues' point about needing real uncertainty estimates is exactly why we could not just label any eight real scans and call it active learning [5]: the model has to tell you which scans it is uncertain on, or you are spending the scarce budget at random. And Sener and Savarese's diversity point is why we did not let uncertainty alone choose [6]: eight near-identical hard scans would have been a worse use of eight expensive labels than eight that span the failure modes. The budget is so small that the batch-diversity concern, which is a refinement in the textbook setting, becomes first-order. With eight labels you cannot afford to waste even one on a near-duplicate.
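The combination of uncertainty and diversity can be sketched as a farthest-first selection seeded by the most uncertain scan. This is a simplified hybrid in the spirit of the greedy k-center heuristic, not the full core-set method of [6]; the feature vectors, uncertainty values, and the name `select_diverse` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def select_diverse(features, uncertainty, k):
    """Pick k items: the most uncertain first, then farthest-first in feature space."""
    ids = list(features)
    selected = [max(ids, key=lambda i: uncertainty[i])]
    while len(selected) < min(k, len(ids)):
        remaining = [i for i in ids if i not in selected]
        # Choose the item whose nearest already-selected neighbour is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(features[i], features[s]) for s in selected),
        )
        selected.append(nxt)
    return selected

# Hypothetical 2-D embeddings of four hard scans; "a_dup" is a near-duplicate of "a".
features = {"a": (0.0, 0.0), "a_dup": (0.1, 0.0), "b": (5.0, 5.0), "c": (0.0, 6.0)}
uncertainty = {"a": 0.90, "a_dup": 0.89, "b": 0.50, "c": 0.60}

print(select_diverse(features, uncertainty, k=3))  # ['a', 'b', 'c']
```

Ranking by uncertainty alone would have taken `a`, `a_dup`, and `c`, spending one of three labels on a near-duplicate. The farthest-first step skips `a_dup` because it sits almost on top of a scan already chosen.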

What we claim, and only this, is the budget point. The acquisition functions are Lewis and Gale's, Scheffer's, Shannon's, refined for deep nets by Gal and by Sener and Savarese. The decision to spend almost the entire budget on free synthetic supervision and reserve the human labels for the hardest real residual is the engineering call we made, and it is one labeled point on a map other people drew. Calling it a novel method would be a category error. Calling it the right point for a twenty-thousand-synthetic, eight-real problem is, we think, defensible.

The lesson that generalises past well logs

Strip out the geology and the transferable idea is about the shape of the budget question. The textbook active-learning question assumes one currency: you have unlabeled examples and you spend labels on them. Real projects often have two currencies, a cheap synthetic one and an expensive real one, and the interesting decision is the exchange rate between them, not just the ranking within the expensive pool. The acquisition-function map still applies inside the expensive pool, exactly as the survey describes it [1]. But the prior question, how much of the budget should ever reach that pool at all, is one the classic framing does not pose, because it assumes all your labels are the same kind of expensive.

So the discipline we would offer to anyone with a cheap-synthetic, scarce-real shape of problem is two-step and unglamorous. First, find the peak of the effective-yield curve before you label anything: roughly how much synthetic volume does the model need before more synthetic data stops teaching it, and how many real labels can you actually afford. Second, inside the real-label budget, use a real acquisition function with an honest uncertainty estimate and a nod to diversity, because at small budgets the diversity concern is not optional. Neither step is novel. Both are on the map. Knowing which coordinate is yours is the part you have to do for yourself.
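The first step, finding where more synthetic volume stops teaching the model, can be approximated from a learning curve measured at a few pool sizes. A minimal sketch follows; the function name, the threshold, and the scores are placeholders invented for illustration, not measurements from the engagement.

```python
def saturation_point(curve, min_gain):
    """Return the first volume after which the next step adds less than `min_gain`.

    curve is a list of (n_synthetic, validation_score) pairs sorted by n_synthetic.
    """
    for (n, score), (_, next_score) in zip(curve, curve[1:]):
        if next_score - score < min_gain:
            return n
    return curve[-1][0]  # never flattened within the measured range

# Placeholder learning curve: validation score at increasing synthetic volume.
curve = [(1_000, 0.60), (5_000, 0.72), (10_000, 0.75), (20_000, 0.755)]

print(saturation_point(curve, min_gain=0.01))  # 10000
```

The gain is measured per step rather than per added example, so the answer depends on how the pool sizes are spaced; a real analysis would want evenly or logarithmically spaced volumes and repeated runs.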

Key takeaways

Active learning is a well-charted map of acquisition functions, not a single trick: uncertainty sampling (Lewis and Gale), margin sampling (Scheffer et al.), entropy sampling (built on Shannon), all catalogued by Settles' survey, with deep-net refinements from Gal et al. (real uncertainty via MC-dropout) and Sener and Savarese (batch diversity). We credit that lineage and add no new acquisition function.
The raster-log-digitisation problem has a two-currency budget the textbook framing does not pose: cheap, pre-masked synthetic labels versus scarce, expensive human labels on real scans. The first decision is the exchange rate between them, not just the ranking inside the expensive pool.
Concrete scale from the engagement: a pool of 20,000 synthetic logs, 8 real validation scans, and an 80/20 train/validation split. The effective-yield curve peaks at a modest human-label share, not at either all-synthetic or all-human.
Our one claimed contribution is the budget point, not a method: generate cheap synthetic labels first to teach the task structure, then spend the scarce human budget only on the hardest residual real scans, exactly where an uncertainty acquisition function says the synthetic-trained model is most likely to be wrong.
At an 8-label budget the diversity concern that is a refinement in the textbook setting becomes first-order: you cannot afford to spend even one expensive label on a near-duplicate, so an honest uncertainty estimate plus a nod to diversity is mandatory, not optional.

References

[1] Settles, B. Active Learning Literature Survey. University of Wisconsin-Madison, Computer Sciences Technical Report 1648 (2009). The standard catalogue of pool-based acquisition functions and query strategies this piece draws from. https://minds.wisconsin.edu/handle/1793/60660

[2] Lewis, D. D., and Gale, W. A. A sequential algorithm for training text classifiers. SIGIR (1994). The original uncertainty-sampling acquisition rule. https://arxiv.org/abs/cmp-lg/9407020

[3] Scheffer, T., Decomain, C., and Wrobel, S. Active hidden Markov models for information extraction. IDA (2001). Margin sampling, the smallest-top-two-gap acquisition rule. https://link.springer.com/chapter/10.1007/3-540-44816-0_31

[4] Shannon, C. E. A Mathematical Theory of Communication. Bell System Technical Journal, 27 (1948). The entropy measure that entropy sampling maximises over the predictive label distribution. https://ieeexplore.ieee.org/document/6773024

[5] Gal, Y., Islam, R., and Ghahramani, Z. Deep Bayesian Active Learning with Image Data. ICML (2017). Acquisition functions carried into deep networks via Monte-Carlo dropout uncertainty. https://arxiv.org/abs/1703.02910

[6] Sener, O., and Savarese, S. Active Learning for Convolutional Neural Networks: A Core-Set Approach. ICLR (2018). A diversity-based, batch-aware alternative to pure uncertainty acquisition. https://arxiv.org/abs/1708.00489
