A burn rate is the most-read and least-understood number a research team produces. It arrives on a status slide as one figure (ours was 88,400 EUR for November 2021) and it invites exactly one reaction: is that a lot? But the same 88,400 can describe a month that quietly built something we would keep and a month that only bought us more calendar. The two are not distinguishable from the total, only from the mix. This note is about the mix inside one month of a research team's spend: what the euros actually went to, and why the composition, not the headline, is the thing a sponsor should read.
This is deliberately narrow. It is not the month-by-month burn ledger that tracks the total rising across the engagement, and it is not the delivery post-mortem about running a sixteen-week build instead of a thirty-two-week one. It is one month, opened up, to see what proportion of the money is people, what is rented compute, what is data, and what is the software the work runs on. The proportions are the argument.
The month is not a number, it is a composition
Start with the shape, because the shape is the whole point. A small applied-research team's month is dominated, overwhelmingly, by headcount. For the accelerated track this was six people, and six salaried researchers and engineers is not a line item you can move without moving the team. Everything else, the rented GPUs, the annotation effort, the tooling seats and cloud plumbing, sits in a thin band on top of that dominant slice. There is no portfolio of comparable costs to weigh; there is one enormous cost and a handful of small ones, and reading the month starts with accepting that asymmetry rather than averaging it away.
The rented-compute line makes the asymmetry concrete. A GPU on this engagement rented for somewhere between 750 and 1,800 EUR per card per month depending on tier, so even a small fleet run flat out is a rounding item against six people's fully loaded cost. That is the opposite of the frontier-training story, where compute is the headline and grew by orders of magnitude over a few years [3]. On an applied team doing curve digitisation, the useful instinct is the reverse: assume the GPU bill is small, confirm it, and move on. That leaves the interesting part, which is not the dominant slice at all. It is the thin remainder.
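To put rough numbers on that instinct, the sketch below bounds a month of GPU rent using the 750 to 1,800 EUR per-card band and the 88,400 EUR total from this note. The fleet size of four cards is an assumption made purely for illustration; it is not an archive figure.

```python
# Bound the monthly GPU rent as a share of the month's total spend.
# MONTH_TOTAL_EUR and the per-card band come from this note;
# FLEET_SIZE is a hypothetical value chosen only for illustration.
MONTH_TOTAL_EUR = 88_400
RENT_PER_CARD_EUR = (750, 1_800)  # low tier, high tier, per card per month
FLEET_SIZE = 4  # assumption: not an archive figure


def gpu_share_bounds(fleet_size: int, total_eur: float = MONTH_TOTAL_EUR) -> tuple[float, float]:
    """Return (low, high) GPU rent as a fraction of the month's total."""
    low, high = (rate * fleet_size / total_eur for rate in RENT_PER_CARD_EUR)
    return low, high


low, high = gpu_share_bounds(FLEET_SIZE)
print(f"{FLEET_SIZE} cards: {low:.1%} to {high:.1%} of the month")
```

Under that assumption, even the top tier keeps the fleet under a tenth of the month's spend. The bounds scale linearly with fleet size, so it is easy to check how many cards it would take before compute stopped being a thin slice.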
The thin slices are where the leverage is
Here is the move that reading a burn rate teaches. Because headcount is so dominant, it is also the least informative slice: it will be large no matter what kind of month you had, whether the six people built something durable or spent four weeks fighting a data pipeline. The slices that tell you which month you had are the small ones. Labelling and tooling, together a modest share of the 88,400, are the part of the mix that discriminates.
Labelling is the clearest case. In euros it is a small line; in effect it is load-bearing, because a segmentation model cannot exceed the quality of the masks it learns from, and label errors propagate straight through to the metric the sponsor eventually sees. The literature is blunt: labelled test sets carry pervasive errors that destabilise the benchmarks built on them, so the annotation line is not a commodity you buy the cheapest version of [2]. A month that spent a little more getting the masks right bought durable capability, even though on the top-line it looks identical to a month that under-spent and will pay for it later in a model that will not clear the acceptance bar. You cannot see that decision in the 88,400, only in the labelling slice.
Tooling is the case teams most often mistake for overhead. The training-loop framework, the experiment tracking, and the cloud plumbing that lets six people share GPUs without colliding all read like a tax on the real work. They are not. The durable cost of a machine-learning system lives in exactly this surrounding machinery rather than in the model code, which is a small fraction of the whole [1]. A tooling slice that looks a touch heavy is often the month the team stopped losing hours to manual glue, and that saving shows up not as a smaller burn but as more capability bought per euro of the dominant headcount slice. Underspend on tooling and the headcount slice gets less efficient without the total changing at all.
Two ways to read the same euros
The exhibit above reads the identical 88,400 two ways, because a sponsor and an engineer ask different questions of the same number.
The runway reading is the cash-calendar one: headcount against everything else. It is the honest way to answer "how long does this last," because the dominant slice draws down the account and another month is mostly another month of six people. It is correct, and it is also where most conversations stop, which is the problem, because it treats the small slices as noise.
The capability reading regroups the same euros by a different question: what compounds past the engagement, and what evaporates when the invoice clears. People carry the know-how forward and labelled data outlives any single model, so both are durable even though one is the largest line and the other one of the smallest; GPU rent and tooling seats are gone the moment the billing period ends. Regrouping this way changes no total by a cent, only what the total means. A month whose durable share is high bought capability that survives; a month whose durable share is thin bought mostly consumption, even when the two print the identical 88,400.
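The two readings can be sketched as two groupings of the same line items. In the sketch below only the 88,400 EUR total comes from this note; the four-way split is hypothetical, chosen to sum to that total and to keep the shape described here (one dominant slice and a thin remainder). The names `MONTH_EUR`, `RUNWAY_GROUPS`, `CAPABILITY_GROUPS`, and `regroup` are likewise illustrative.

```python
# Regroup one month's spend two ways. The total (88,400 EUR) is from this
# note; the four-way split below is HYPOTHETICAL, chosen only to sum to it.
MONTH_EUR = {
    "headcount": 72_000,  # illustrative
    "labelling": 8_000,   # illustrative
    "gpu_rent": 4_800,    # illustrative
    "tooling": 3_600,     # illustrative
}

# Runway reading: headcount against everything else.
RUNWAY_GROUPS = {
    "headcount": ["headcount"],
    "everything_else": ["labelling", "gpu_rent", "tooling"],
}
# Capability reading: what outlives the engagement against what is consumed.
CAPABILITY_GROUPS = {
    "durable": ["headcount", "labelling"],
    "consumed": ["gpu_rent", "tooling"],
}


def regroup(spend: dict[str, int], groups: dict[str, list[str]]) -> dict[str, int]:
    """Sum line items into named groups; each item must appear exactly once."""
    used = [item for items in groups.values() for item in items]
    assert sorted(used) == sorted(spend), "groups must partition the line items"
    return {name: sum(spend[i] for i in items) for name, items in groups.items()}


total = sum(MONTH_EUR.values())
runway = regroup(MONTH_EUR, RUNWAY_GROUPS)
capability = regroup(MONTH_EUR, CAPABILITY_GROUPS)

for label, reading in (("runway", runway), ("capability", capability)):
    shares = {name: f"{eur / total:.1%}" for name, eur in reading.items()}
    print(label, reading, shares)
```

Both readings sum to the same total by construction, which is the point: the grouping changes what the figure means, not what it is.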
What the neighbours confirm
Put the month beside its neighbours and the argument sharpens. October came in as two biweekly draws of 38,820 and 51,220 EUR, 90,040 EUR together. November was the 88,400 we have been dissecting. December ran to 111,220. The top-line holds roughly level from October to November and then climbs by about a quarter in December, and the reflex is to read that movement as the story. It is not. The mix inside each month is roughly the same shape, the same six-person slice with the same thin remainder on top, so the rise is mostly the team spending more weeks doing what it was already doing. The number moved; the mix, and the decision it encodes, held steady. That is the discipline a burn rate is supposed to install and usually fails to: read the composition first, the total second. A rising total with a healthy durable share is a team getting more done; a rising total with a thinning durable share is a team burning faster without building faster, and only the mix can tell those apart.
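A minimal sketch of that reading order follows. The November and December totals are the figures from this note; the durable shares and the two-point tolerance are assumptions for illustration, and `Month` and `read_change` are hypothetical names rather than part of any ledger tooling.

```python
# Compare two months by composition first, total second.
# Totals are from this note; the durable shares are HYPOTHETICAL.
from dataclasses import dataclass


@dataclass(frozen=True)
class Month:
    name: str
    total_eur: int
    durable_share: float  # fraction of the total in people and labelled data


def read_change(prev: Month, curr: Month, tolerance: float = 0.02) -> str:
    """Describe a month-over-month move, looking at the mix before the total.

    `tolerance` (an assumed two points) is how far the durable share may
    drift before the mix counts as having changed.
    """
    share_delta = curr.durable_share - prev.durable_share
    if share_delta < -tolerance:
        mix = "durable share thinning"
    elif share_delta > tolerance:
        mix = "durable share growing"
    else:
        mix = "durable share steady"
    direction = "rising" if curr.total_eur > prev.total_eur else "flat or falling"
    return f"{mix}, total {direction}"


november = Month("November", 88_400, durable_share=0.90)          # share assumed
december_same_mix = Month("December", 111_220, durable_share=0.90)  # share assumed
december_thin_mix = Month("December", 111_220, durable_share=0.80)  # counterfactual

print(read_change(november, december_same_mix))
print(read_change(november, december_thin_mix))
```

The two December rows carry the same total and differ only in the assumed mix, so the function returns different readings for an identical rise in spend.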
Limitations
The caveat is about which numbers are load-bearing and which are illustrative. The total of 88,400 EUR, the six accelerated-track people, the 750 to 1,800 EUR per-card GPU band, and the neighbouring months (October at 38,820 plus 51,220 EUR, December at 111,220 EUR) are archive figures. The split of the 88,400 into headcount, labelling, GPU, and tooling shares respects those anchors, but its specific percentages are illustrative: they show the shape of the mix rather than an audited breakdown. The argument is about proportion, not precision, and nothing turns on whether labelling was eight percent or eleven. The claim also does not transfer as a constant. A different research-to-engineering ratio, a compute-heavy training regime, or a team paying external annotators at scale could see compute or labelling stop being thin slices at all. What transfers is the method (open the month and read the composition), not our particular percentages.
The number you read is not the number that matters
The habit this left us with is to distrust a burn rate stated as a single figure, including our own. The total is the number everyone asks for and the least useful one to answer with, because two months that built entirely different amounts of durable capability can print the same 88,400. What matters is underneath it: how much was the headcount slice doing work that compounds, how much was labelling buying data that outlives the model, how much was tooling making the expensive people efficient, and how much was rented consumption that vanished when the billing period closed. Read those, and the top-line stops being a verdict and becomes what it always was, a sum. The verdict was in the mix the whole time.
References
[1] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28 (NeurIPS 2015). The observation that model code is a small fraction of a real ML system and that the surrounding data and tooling are where sustained cost lives. https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[2] Northcutt, C. G., Athalye, A., and Mueller, J. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS 2021 Datasets and Benchmarks Track. Evidence that labelled data is a fragile, expensive asset whose quality bounds what a model can reach. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/f2217062e9a397a1dca429e7d70bc6ca-Abstract-round1.html
[3] Amodei, D., and Hernandez, D. AI and Compute. OpenAI (May 2018). The account of how fast frontier-training compute demand grew, useful here by contrast: on a small applied team the compute line is a thin rented slice, not the dominant cost. https://openai.com/research/ai-and-compute