Active Learning to Spend Your Annotation Budget Wisely

A labelling budget is money, and most of ours was being spent re-confirming a background the model already scored 0.94 IoU on while the two curve classes sat at 0.26 and 0.21. This note is about treating model uncertainty as the spending signal: reading which classes the model is least confident about, and routing the next label there rather than spreading effort evenly. It is deliberately not a reprise of the VeerNet whitepaper's dataset-growth story. Growing the corpus from 2,000 to 15,000 synthetic instances was one lever; this is a different one, about where inside a fixed budget the marginal label lands. We tie the priority to calibration rather than raw count, because a model that is badly calibrated will point you at the wrong class, and we are honest that the return curve we sketch is a labelling-return argument, not a logged active-learning experiment. The per-class scores, the instance counts, and the class_weight of 42 are the real archive numbers; VeerNet, the digitiser whose confidence we read, is ours.

By Tannistha Maiti · 9 min read
EarthScan insight

A labelling budget is money, and for a while we were spending most of ours on the one class that did not need it. The curve-segmentation runs behind VeerNet, the encoder-decoder EarthScan uses to lift well-log curves off scanned paper, produce a three-class prediction: the paper background and the two plotted curves. On the multiclass Dice-loss run the background reaches 0.94 IoU and 0.97 F1, which is to say the model has essentially solved it. The two curve classes are a different story. Curve-1 sits at 0.26 IoU and 0.37 F1, curve-2 at 0.21 IoU and 0.32 F1. If you are about to fund another round of annotation, the question that actually matters is not how many more logs to label. It is which pixels inside those logs are worth a human's attention, and the honest answer is: almost none of the background, and nearly all of the curves.
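For readers who want the per-class reading spelled out, here is a minimal sketch of how IoU and F1 can be computed per class from integer label maps, next to a plain pixel-accuracy summary. The function name, the class ids (0 for background, 1 and 2 for the curves), and the toy arrays are assumptions made for illustration; this is not the VeerNet evaluation code and the toy numbers are not archive numbers.

```python
import numpy as np

def per_class_iou_f1(y_true, y_pred, num_classes):
    """Return {class_id: (iou, f1)} computed from integer label maps."""
    scores = {}
    for c in range(num_classes):
        t = (y_true == c)
        p = (y_pred == c)
        tp = np.logical_and(t, p).sum()    # pixels correctly assigned to class c
        fp = np.logical_and(~t, p).sum()   # pixels wrongly assigned to class c
        fn = np.logical_and(t, ~p).sum()   # class-c pixels the model missed
        union = tp + fp + fn
        iou = tp / union if union else float("nan")
        f1 = 2 * tp / (2 * tp + fp + fn) if union else float("nan")
        scores[c] = (float(iou), float(f1))
    return scores

# Toy strip of 100 pixels (hypothetical): 94 background, 3 per curve class.
y_true = np.array([0] * 94 + [1] * 3 + [2] * 3)
# The model gets all background right but finds only one pixel of each curve.
y_pred = np.array([0] * 94 + [1, 0, 0] + [2, 0, 0])

scores = per_class_iou_f1(y_true, y_pred, num_classes=3)
pixel_accuracy = float((y_true == y_pred).mean())

for c, (iou, f1) in scores.items():
    print(f"class {c}: IoU {iou:.2f}, F1 {f1:.2f}")
print(f"pixel accuracy: {pixel_accuracy:.2f}")
```

On this toy strip the single accuracy figure looks healthy while both curve classes score poorly, which is the pattern the per-class scores above are meant to expose.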

This note is about making that answer mechanical. The idea is old and well documented, and none of the machinery is ours: let the model tell you where it is uncertain, and route the next label there. Settles gave the field its reference survey of exactly this, the family of query strategies in which a learner chooses its own annotations and, on a good day, reaches a target accuracy with a fraction of the labels a random draw would need [1]. What we add is one worked application to a segmentation run whose per-class scores make the case almost too cleanly, and one caveat, about calibration, that decides whether the whole scheme helps or quietly misleads.

The budget is not the corpus

It is worth being precise about which lever this is, because there is an adjacent one it is easy to confuse it with. The VeerNet corpus grew from 2,000 synthetic instances in the binary setting to 15,000 in the multiclass setting, and that growth mattered. But dataset size and annotation priority are different dials. Growing the corpus asks how many total instances the model trains on. Annotation priority asks, given that you are going to pay for the next N labels either way, where those N labels land. You can double a corpus and still spend the new budget re-confirming background, in which case the per-class gap has little reason to move; hold the corpus fixed and re-route the same budget to the low-confidence classes, and the gap should close faster. This note is only about the second dial.

The reason the second dial has so much room to work here is imbalance. The plotted curves are thin: two constant curves per log, a few pixels of ink against a page of background. The run already carries a BCE class_weight of 42 to stop the loss from being dominated by the easy majority class, which is the standard move for this shape of problem. It is also the reason overlap-based losses and per-class overlap metrics exist at all: a pixel-accuracy summary is dominated by the background, so it would look healthy and hide the entire curve failure [4]. Class weighting rebalances the loss. It does not rebalance where a human spends their labelling hours. A model can be weighted to care about curves and still be handed a fresh batch of annotations that is mostly background, because whoever assembled the batch sampled uniformly. Uncertainty routing is the fix for that gap between what the loss is told to care about and what the annotation queue actually contains.
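To make the effect of a class weight concrete, here is a small sketch of a binary cross-entropy in which the positive (curve) term is scaled by 42. The assumption that the weight multiplies the positive term, the function name `weighted_bce`, and the toy arrays are ours for illustration; the source states only that the run carries a BCE class_weight of 42, not how the training code applies it.

```python
import numpy as np

def weighted_bce(probs, targets, pos_weight=42.0, eps=1e-7):
    """Mean binary cross-entropy with the positive (curve) term scaled by pos_weight.

    Assumption: the class weight multiplies the loss on curve pixels only.
    """
    p = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    loss = -(pos_weight * targets * np.log(p)
             + (1.0 - targets) * np.log(1.0 - p))
    return float(loss.mean())

# Toy strip (hypothetical): one curve pixel among nine background pixels.
targets = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)
# A model that says "background" everywhere, giving the curve class 0.1 probability.
probs = np.full(10, 0.1)

unweighted = weighted_bce(probs, targets, pos_weight=1.0)
weighted = weighted_bce(probs, targets, pos_weight=42.0)
print(f"unweighted loss {unweighted:.3f}, weighted loss {weighted:.3f}")
```

The weight makes the missed curve pixel dominate the loss, but nothing in this calculation changes which pixels a reviewer is shown next, which is the gap the paragraph above describes.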

Uncertainty is a spending signal, if it is honest

The strategy, stated plainly, is uncertainty sampling: score every candidate region by how unsure the model is, and prioritise the least-confident ones for human labelling [1]. On this run the model's own scores already sort the classes for you. It is confident and correct on background. It is neither on the curves. So the marginal label is worth far more when it lands on a curve pixel the model is currently guessing at than on a background pixel it would have gotten right for free. The exhibit below makes that concrete: it splits a fixed annotation budget between spending at random and spending steered by uncertainty, and projects where curve-2 ends up against the 0.94 background ceiling it is chasing.
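A minimal sketch of that scoring step, assuming the model exposes a per-pixel softmax of shape (H, W, C) and that review is handed out in square tiles. The least-confidence score (one minus the top class probability) is one of the standard uncertainty-sampling measures covered in [1]; the function names, tile size, and toy probabilities are hypothetical, and the source does not say which measure or region size was used in practice.

```python
import numpy as np

def least_confidence(probs):
    """Per-pixel uncertainty: 1 minus the highest class probability."""
    return 1.0 - probs.max(axis=-1)

def rank_tiles(probs, tile):
    """Return [(mean_uncertainty, row, col), ...] sorted most-uncertain first."""
    unc = least_confidence(probs)
    h, w = unc.shape
    ranked = []
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            ranked.append((float(unc[r:r + tile, c:c + tile].mean()), r, c))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked

# Toy 4x4 prediction with 3 classes (hypothetical values).
probs = np.zeros((4, 4, 3))
probs[..., 0] = 0.98                   # confident background almost everywhere
probs[..., 1] = 0.01
probs[..., 2] = 0.01
probs[2:, 2:, :] = [0.40, 0.35, 0.25]  # one tile where the model is guessing

queue = rank_tiles(probs, tile=2)
for score, r, c in queue:
    print(f"tile at ({r}, {c}): mean uncertainty {score:.2f}")
```

The tile where the model is guessing rises to the top of the review queue, and the confident background tiles fall to the bottom.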

[Interactive exhibit: Annotation budget, uncertainty as a spending signal. Route labels to the classes the model is least sure of, not evenly. Panel 1, measured start, per-class score under Dice loss: background IoU 0.94 / F1 0.97 (model already sure); curve-1 IoU 0.26 / F1 0.37 (low confidence); curve-2 IoU 0.21 / F1 0.32 (low confidence); gap from curve-2 to background 0.73 IoU. Panel 2, why the foreground is scarce: 2,000 binary instances, 15,000 multiclass instances, BCE class_weight 42, 2 thin curves. Panel 3, steered spend closes the gap faster than random: projected curve-2 IoU against percent of budget steered by uncertainty, with the background ceiling at 0.94. Lever at 60% steered: projected curve-2 IoU 0.77 steered, 0.33 random, 76% of the gap closed. Sourced: IoU, F1, instance counts, class_weight; the return curve is illustrative.]
A labelling budget read as a spending decision. The measured start is the three-class Dice-loss run: background scores 0.94 IoU and 0.97 F1 and the model is already sure of it, while curve-1 sits at 0.26 IoU / 0.37 F1 and curve-2 at 0.21 IoU / 0.32 F1, the classes the model is least confident about. The scarcity that drives this is real: 2,000 binary against 15,000 multiclass instances and a BCE class_weight of 42 already carried to counter thin foreground. The lever splits a fixed annotation budget between two policies, spread at random or steered by uncertainty, and the right panel projects where curve-2 lands: a shallow line when labels land uniformly (mostly re-confirming the background) and a steeper line when they are routed to the low-confidence class, both chasing the 0.94 background ceiling. The orange marker is the only element that argues: the curve-2 gap-closure point climbing toward the ceiling faster the more of the budget uncertainty is allowed to steer. The start IoU and F1, the instance counts, and the class weight are sourced from the engagement archive; the gap-closure trajectory between those measured endpoints is an illustrative labelling-return sketch, not a logged active-learning curve.

Drag the lever from 0%, where every label is spread uniformly, toward 100%, where every label is steered to the class the model is least sure of, and the projected curve-2 score climbs toward the ceiling far faster on the steered line than the random one. That is the entire argument in one motion. The gap the budget is closing is the 0.73 IoU between curve-2's 0.21 and background's 0.94, and the claim is not that uncertainty routing closes all of it, but that it closes it faster per dollar than labelling blind.
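The gap arithmetic is easy to check by hand. The sketch below reproduces it from the measured endpoints and the exhibit's illustrative projection of 0.77 at 60% steered. The variable names are ours, and the projected value is the exhibit's sketch, not a measurement.

```python
background_iou = 0.94        # measured, Dice-loss run
curve2_iou = 0.21            # measured, Dice-loss run
projected_curve2_iou = 0.77  # illustrative exhibit value, not a measurement

# The gap the budget is trying to close.
gap = background_iou - curve2_iou

# Share of that gap covered if curve-2 reached the projected value.
gap_closed = (projected_curve2_iou - curve2_iou) / gap

print(f"gap {gap:.2f} IoU, share closed {gap_closed:.3f}")
```

The share comes out at roughly three quarters of the gap. That is consistent with the 76% the exhibit displays once you allow for the projected IoU itself being rounded to two decimals.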

There is a load-bearing condition on all of this, and it is the reason the article's spine is calibration rather than raw count. Uncertainty sampling is only as good as the uncertainty estimate. If the model's confidence is miscalibrated, its "least confident" pixels are not actually the ones it is most likely to be wrong on, and you will spend the steered budget on the wrong regions with more conviction than random would have. This is not a hypothetical worry. Guo and colleagues showed that modern high-capacity networks are systematically overconfident: their raw softmax scores are numerically much higher than their actual accuracy warrants, so a naive confidence read is a distorted signal until it is recalibrated [2]. A segmentation head is not exempt. If curve-2's softmax looks confident on pixels it in fact gets wrong, uncertainty sampling on that raw score will skip exactly the pixels a human most needs to correct.
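One way to put a number on that distortion is the expected calibration error used in [2]: bin predictions by confidence, then take the size-weighted average gap between accuracy and mean confidence in each bin. The sketch below is a plain NumPy version under our own assumptions (ten equal-width bins, toy inputs, hypothetical names). No calibration analysis was run on the VeerNet predictions, so the numbers here are purely illustrative.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Size-weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            accuracy = correct[in_bin].mean()
            mean_conf = confidence[in_bin].mean()
            # in_bin.mean() is the fraction of all samples that fall in this bin
            ece += in_bin.mean() * abs(accuracy - mean_conf)
    return float(ece)

# Toy overconfident model (hypothetical): 0.95 confidence on every pixel,
# but only 6 of 10 predictions are right.
confidence = np.full(10, 0.95)
correct = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

ece = expected_calibration_error(confidence, correct)
print(f"ECE: {ece:.2f}")
```

A large value on held-out labelled pixels would be a warning that the raw confidence should be recalibrated before it is used to steer a budget.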

So the honest version of the strategy has two steps, not one. First, get an uncertainty estimate you can trust, which in practice means recalibrating the softmax or using an estimate designed to be more faithful than a point softmax. Gal, Islam, and Ghahramani did the second: they used a Bayesian epistemic-uncertainty estimate as the acquisition signal and showed it beats both random and raw-softmax acquisition on high-dimensional image data, the regime a raster log lives in [3]. Second, and only then, route the budget by that estimate. Skip the first step and the second step's confidence is borrowed against a number that has not earned it.
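For a sense of what a more faithful estimate can look like, here is a sketch of the BALD score, one of the acquisition functions examined in [3]. It is computed from class probabilities collected over several stochastic forward passes, for example with dropout left on at inference. The array layout (T, N, C), the function names, and the toy values are assumptions for illustration; this was not run on VeerNet.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along the class axis."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def bald(mc_probs):
    """BALD score per candidate.

    mc_probs has shape (T, N, C): T stochastic passes, N candidates, C classes.
    Score = entropy of the mean prediction minus the mean per-pass entropy.
    """
    mean_p = mc_probs.mean(axis=0)
    predictive_entropy = entropy(mean_p)
    expected_entropy = entropy(mc_probs).mean(axis=0)
    return predictive_entropy - expected_entropy

# Two candidate pixels, two passes, two classes (hypothetical values).
# Pixel 0: the passes disagree confidently, so the model itself is unsure.
# Pixel 1: every pass says 50/50, so the input is ambiguous but the passes agree.
mc_probs = np.array([
    [[0.95, 0.05], [0.5, 0.5]],
    [[0.05, 0.95], [0.5, 0.5]],
])

scores = bald(mc_probs)
print(scores)
```

Both pixels have the same averaged prediction, so a point softmax could not tell them apart. The score separates them because it rewards disagreement between passes rather than flatness of the average.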

What we actually did, and did not do

The intervention was modest and I want to be exact about its size rather than dress it up. We read the per-class scores, saw that background was solved and the curves were not, and stopped treating a new annotation batch as something to sample uniformly. When we had human review time, it went to the curve classes and to the regions where the model's per-pixel confidence was lowest, and we checked whether that confidence was believable rather than trusting it blind. That is uncertainty sampling in its plainest form, applied by hand rather than through a formal acquisition loop, and it is enough to move a per-class gap that uniform labelling leaves stuck.
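We did this by hand, but the same routing can be written down in a few lines. The sketch below splits a fixed budget between a steered share, taken from the most uncertain candidates, and a remainder drawn uniformly at random from what is left. The function name, the fixed seed, and the toy uncertainty scores are all hypothetical; it is an illustration of the idea, not a tool we ran.

```python
import numpy as np

def build_batch(uncertainty, budget, steered_share, seed=0):
    """Split a fixed label budget between steered and random picks.

    Returns two index arrays: the steered picks (highest uncertainty first)
    and the random picks drawn from the remaining candidates.
    Assumes budget does not exceed the number of candidates.
    """
    rng = np.random.default_rng(seed)
    n_steered = int(round(budget * steered_share))
    order = np.argsort(-uncertainty)  # most uncertain first
    steered = order[:n_steered]
    remaining = order[n_steered:]
    random_part = rng.choice(remaining, size=budget - n_steered, replace=False)
    return steered, random_part

# Ten candidate regions with made-up uncertainty scores.
uncertainty = np.array([0.02, 0.60, 0.01, 0.55, 0.03, 0.70, 0.02, 0.04, 0.50, 0.01])

steered, random_part = build_batch(uncertainty, budget=5, steered_share=0.6)
print("steered picks:", steered.tolist())
print("random picks:", random_part.tolist())
```

Keeping a random remainder is a design choice rather than a requirement. It leaves some of the budget outside the model's own judgement, which is a cheap hedge when the confidence signal has not been formally calibrated.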

What we did not do is run a controlled active-learning experiment with a logged acquisition curve, and the exhibit is careful to say so. The per-class IoU and F1, the instance counts, and the class weight are measured archive numbers. The return curve the lever draws between the measured start at 0.21 and the measured ceiling at 0.94 is a labelling-return sketch, an argument about shape, not a plot of realised accuracy against labels spent. That the shape is steeper for steered spend than for random is what the literature supports [1] [3]; its exact curvature on this dataset we did not measure and do not claim.

Limitations

The central caveat is the one the middle section is built around: this whole approach inherits the trustworthiness of the uncertainty signal, and an overconfident model will route the budget confidently to the wrong place [2]. We reduced that risk by sanity-checking the model's low-confidence regions against human judgement rather than trusting the softmax as a calibrated probability, but we did not run a formal calibration analysis, so I cannot report a reliability diagram or an expected-calibration-error number for these runs. Second, the gap-closure trajectory in the exhibit is illustrative geometry between two measured endpoints, not a logged learning curve; the endpoints are real, the path between them is a sketch of the expected direction and should be read as such. Third, the per-class scores are from one loss setting on one synthetic corpus, and a different operator's logs, a different curve count, or a different loss would give different starting gaps and possibly a different priority order. Fourth, uncertainty routing optimises where the next label lands; it says nothing about whether the labels are correct, whether the synthetic curves resemble field paper, or whether a higher curve-2 IoU actually yields a usable digitised curve, which remain the questions that decide whether the model is any good. Active learning spends a budget well. It does not tell you the budget was worth spending.

Where the label should land

The habit this left us with is to stop treating an annotation budget as something to spread evenly and start treating it as something to aim. The model already knows where it is weak, if you are willing to read its confidence honestly and recalibrate it where it is not, and on a run where background is solved at 0.94 and the curves are stuck at 0.26 and 0.21, that reading points in exactly one direction. Spend the next label where the model is least sure and most likely wrong, not where it is already right. The corpus can keep growing on its own schedule. The budget, separately, should go to the curves.

References

[1] Settles, B. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009). The reference survey of query strategies, including uncertainty sampling, and the case that a learner choosing its own labels can hit a target accuracy with far fewer annotations than random sampling. https://minds.wisconsin.edu/handle/1793/60660

[2] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Calibration of Modern Neural Networks. ICML 2017. Modern high-capacity networks are systematically overconfident, so raw softmax scores are a distorted confidence signal until recalibrated. https://proceedings.mlr.press/v70/guo17a.html

[3] Gal, Y., Islam, R., and Ghahramani, Z. Deep Bayesian Active Learning with Image Data. ICML 2017. A Bayesian epistemic-uncertainty acquisition signal that beats random and raw-softmax acquisition on high-dimensional image data. https://proceedings.mlr.press/v70/gal17a.html

[4] Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., and Cardoso, M. J. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. DLMIA 2017, LNCS 10553. Why overlap-based losses and per-class overlap metrics expose a minority-class failure that a pixel-accuracy summary hides. https://link.springer.com/chapter/10.1007/978-3-319-67558-9_28

© 2026 Copyright. Earthscan