

MIG-Partitioning an A100 to Run 60 Experiments at Once

A technique note on squeezing a research bench out of a single card. During an energy-price shock that pushed our compute spend past budget, Multi-Instance GPU partitioning let one A100 fleet carry 60 to 90 concurrent training runs at 80 to 90 percent sustained utilisation, spread across up to 28 hardware-isolated instances. This note covers the partition-sizing rule we settled on (fine slices for a swarm of small-data runs, coarse slices for the few large ones) and how we read that tradeoff off one grid.

by Narendra Patwardhan · 7 min read
EarthScan insight

Phase 2 of our subsurface engagement with a mid-sized Middle East carbonate operator ran into a wall that had nothing to do with the models. The supervised and unsupervised tracks split into two parallel pipelines, each spawning its own sweep of runs, and the compute bill for that phase came in at 2.2 times what we had budgeted. The models were fine. The way we were using the hardware was not. A single training run on a raster well-log segmenter takes 6 to 18 hours, and we were queuing dozens of them one after another behind a single card, while an energy-price spike made every idle GPU-hour more expensive than the last.

The fix was not a bigger machine. It was slicing the machine we already had. Multi-Instance GPU partitioning splits one A100 into several hardware-isolated instances, each with its own dedicated compute and memory, so a swarm of small runs can share the card at once instead of taking turns. A single A100 supports at most seven such instances, so the ceiling of 28 MIG containers is a fleet-level figure, consistent with four cards at seven slices each. At peak we held 60 to 90 concurrent training runs against those containers at 80 to 90 percent sustained utilisation. That is a research bench built from hardware we already had, and it is the practical answer to a compute crunch: you do not buy your way out, you partition your way out.

Why partitioning beats queueing for a run swarm

The economics of a research phase are not the economics of a production service. In production you serve one trained model over and over, and throughput is a serving property. In research you are running an experiment sweep, dozens of variants of the same job differing by a hyperparameter, a data split, an augmentation setting. Each is small in its own right, most of them do not saturate a full A100, and they all want to run now so the sweep finishes this week rather than next month.

Run them sequentially and the card sits at a fraction of its capacity for most of every run, because a small-data job cannot fill the memory or the compute of the whole device. Run them naively in parallel on one undivided GPU and they contend for the same memory allocator, and one job that overruns takes the others down with it. MIG is the middle path. It carves the card into instances at the hardware level, so each run gets a fixed, isolated slice, and a crash in one slice cannot stall the rest. The sum of the slices packs the device far closer to full than any single small run could, which is where the 80 to 90 percent sustained number comes from.
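As a minimal sketch of what "each run gets a fixed, isolated slice" looks like at launch time: a CUDA process can be restricted to one MIG instance by setting CUDA_VISIBLE_DEVICES to that instance's identifier, as reported by nvidia-smi -L. The helper names below are hypothetical and the identifier in the usage comment is a placeholder; this is not the launcher used on the engagement.

```python
import os
import subprocess


def env_for_slice(mig_uuid, base_env=None):
    """Return a copy of the environment that exposes only one MIG instance.

    mig_uuid is the instance identifier as printed by `nvidia-smi -L`.
    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime in the child process will see this slice and nothing else.
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_on_slice(command, mig_uuid):
    """Start one training run (a list of argv strings) pinned to one slice."""
    return subprocess.Popen(command, env=env_for_slice(mig_uuid))


# Usage (placeholder identifier and hypothetical script name):
#   proc = launch_on_slice(["python", "train.py"], "MIG-<uuid from nvidia-smi -L>")
#   proc.wait()
```

Because the pinning lives in the child's environment, a run that dies takes only its own process and slice with it; the parent can relaunch onto the same identifier without touching the others.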

The lever you actually pull is partition grain. Fine slices give you many small instances and let a large swarm of small-data runs share the card. Coarse slices give you a few large instances for the handful of runs that genuinely need the whole device. The grid below reads that tradeoff directly.

[Interactive figure: One A100, partitioned: concurrent runs on a shared card. At the partition grain shown, 21 of 28 MIG instances carry a run, for 82 concurrent runs (3.9 runs per instance) at 87% sustained GPU utilisation. The lever moves from coarse grain (a few large runs) to fine grain (many small-data runs) across 4 to 28 instances.]
One A100 fleet, partitioned into up to 28 isolated Multi-Instance GPU containers, read as a research bench. Drag the partition-grain lever from a few large runs (coarse slices) to many small-data runs (fine slices): the grid fills with active instances, the concurrency band shows the run count that grain buys inside the sourced 60-90 peak band, and the utilization bar holds inside the sourced 80-90% sustained range. The orange marker is the only element that argues: concurrent runs sliding across the band as slices get finer, which is how a single card absorbed a swarm of 6-to-18-hour runs after Phase-2 compute overran its budget 2.2x. The 28-container ceiling, the 60-90 concurrent-run band, the 80-90% sustained utilization, the 6-18 hour run length, the 5-7x growth over prior phases, the DGX A100 envelope (4-8x A100, up to 640 GB GPU memory, 2.5-5 petaFLOPS AI), and the 2.2x Phase-2 overrun are sourced from the engagement archive; the per-grain mapping between instance count and run count is a monotone interpolation between those sourced endpoints, not a measured curve.
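The coarse-to-fine range of the lever can also be read as simple arithmetic. The sketch below assumes 80 GB A100 cards (which is what 640 GB across eight GPUs implies) and uses the per-GPU instance limits of the standard A100 MIG profiles; the GPU count of four is an assumption chosen because it reproduces the 28-instance ceiling, not a figure stated in the phase record.

```python
# Maximum number of instances of each profile on one A100 80GB,
# ordered from coarse to fine.
MAX_INSTANCES_PER_GPU = {
    "7g.80gb": 1,  # the whole card as one instance
    "4g.40gb": 1,
    "3g.40gb": 2,
    "2g.20gb": 3,
    "1g.10gb": 7,  # the finest grain
}


def fleet_instances(profile, gpu_count):
    """Instances available if every GPU is cut uniformly into one profile."""
    return MAX_INSTANCES_PER_GPU[profile] * gpu_count


if __name__ == "__main__":
    for name in MAX_INSTANCES_PER_GPU:
        # gpu_count=4 is an assumption, see the note above.
        print(name, fleet_instances(name, gpu_count=4))
```

Under that assumption the coarsest uniform cut gives 4 instances and the finest gives 28, which matches the two ends of the lever. Real fleets usually mix profiles per card, so the uniform cut is a bound rather than a layout.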

Reading the grid: grain in, concurrency out

Drag the partition-grain lever and three numbers move together. Finer grain lights more of the 28 instance cells, pushes the concurrent-run count up toward the top of the 60-to-90 band, and holds sustained utilisation inside the 80-to-90 percent range. Coarse grain does the opposite: fewer, larger instances, fewer concurrent runs, the card handed to a few big jobs. The concurrency line stays inside the sourced band across the whole sweep, because the band is the envelope we actually observed, not an extrapolation. The one element that argues is the orange marker sliding across that band as the slices get finer, which is the whole claim in one motion: grain buys concurrency.

The mapping between a given instance count and a given run count in the middle of the sweep is an interpolation between the two endpoints we measured, so treat the interior of the line as a guide rather than a data point. The endpoints and the band are real. The 28-container ceiling, the 60-to-90 peak, the 80-to-90 percent sustained utilisation, and the 6-to-18-hour run length are all from the phase record.
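For readers who want the interior of the line made explicit, here is a minimal sketch of a monotone interpolation between the sourced endpoints. It assumes a straight line, and it assumes the low end of the band (60 runs, 80 percent) sits at 4 instances and the high end (90 runs, 90 percent) at 28; the instrument's actual curve is not specified in this note and evidently differs slightly, since it shows 82 runs at 21 instances.

```python
def interpolate(instances, low, high, lo_inst=4, hi_inst=28):
    """Linear interpolation between two endpoints, clamped to the band.

    low and high are the values at lo_inst and hi_inst instances.
    The endpoint placement (4 and 28 instances) is an assumption.
    """
    clamped = min(max(instances, lo_inst), hi_inst)
    fraction = (clamped - lo_inst) / (hi_inst - lo_inst)
    return low + fraction * (high - low)


def concurrent_runs(instances):
    # 60 to 90 is the sourced peak concurrency band.
    return interpolate(instances, 60.0, 90.0)


def sustained_utilisation(instances):
    # 80 to 90 percent is the sourced sustained utilisation band.
    return interpolate(instances, 80.0, 90.0)


if __name__ == "__main__":
    for n in (4, 10, 16, 21, 28):
        print(n, round(concurrent_runs(n), 1), round(sustained_utilisation(n), 1))
```

The clamp is the part that matters: whatever shape the interior takes, the output never leaves the observed band, which is the same guarantee the grid makes.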

How we sized partitions in practice

The decision rule was almost entirely a function of one question: how much data does this run touch. Most of our experiment sweeps were small-data by construction, ablations over a few wells, augmentation-ratio scans, loss-function comparisons, each fitting comfortably inside a fraction of the A100's memory. Those went onto fine slices, many to a card, and they are what let the concurrent-run count grow 5 to 7 times over what the earlier phases sustained. A handful of jobs did need the whole device: a full multi-well training pass on the largest dataset, or a run whose batch geometry pushed memory to the edge. Those got a coarse slice or the undivided card, and we scheduled them when the swarm was thin.

Two rules kept it honest. First, size the slice to the run's memory footprint, not to its wishful priority. A run that fits in a small instance does not get a large one just because someone wants it done sooner. Second, keep the isolation strict. The reason we could pack the card so hard is that a MIG instance is a hardware boundary, not a scheduler hint. When a run crashed, and long runs crash, it took down its own slice and nothing else, so the other 20-odd runs kept going. On an undivided card a single bad allocation would have stalled the whole sweep.
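The first rule can be sketched as a lookup: given a run's peak memory footprint, pick the smallest slice that holds it. The sketch assumes 80 GB A100 cards and their standard MIG profile names; the 10 percent headroom and the function name are illustrative choices, not values from the phase record. Usable memory inside an instance is somewhat below the nominal figure, which is one reason to leave headroom at all.

```python
# Nominal memory per profile on an A100 80GB, smallest first.
# 4g.40gb is omitted because it offers the same memory as 3g.40gb.
PROFILES_GB = [
    ("1g.10gb", 10),
    ("2g.20gb", 20),
    ("3g.40gb", 40),
    ("7g.80gb", 80),
]


def smallest_profile(footprint_gb, headroom=0.10):
    """Return the smallest profile whose memory covers footprint plus headroom.

    footprint_gb is the run's measured peak memory use.
    headroom is a safety margin; 10 percent is an illustrative default.
    """
    needed = footprint_gb * (1.0 + headroom)
    for name, memory_gb in PROFILES_GB:
        if needed <= memory_gb:
            return name
    raise ValueError(f"{footprint_gb} GB does not fit on a single A100 80GB")


if __name__ == "__main__":
    for footprint in (6, 18.5, 35, 60):
        print(footprint, "->", smallest_profile(footprint))
```

Note that priority never appears as an argument. A run that measures 6 GB gets the smallest slice however urgent it is, which is the rule as stated above.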

The batch-size mechanics underneath these runs, the variable-dimension collate and the checkpoint-resume that survives an unreliable box, we treated separately; that reliability story is told in our note on batch scheduling around a GPU that keeps going down, and this piece does not re-derive it. Here the unit of the argument is the whole card, not the batch inside one run.

The envelope, and why efficiency stopped being optional

The card the swarm ran on sat inside a DGX A100: 4 to 8 A100 GPUs, up to 640 GB of total GPU memory, and 2.5 to 5 petaFLOPS of AI compute. That is a lot of hardware, and the temptation with a lot of hardware is to run jobs one at a time and let the machine coast. The 2.2x Phase-2 overrun is what took that option off the table. When the energy shock made every GPU-hour more expensive and the budget was already spent, the only free capacity left was the capacity we were losing to idle time. Partitioning recovered it. Running a card at 80 to 90 percent sustained utilisation instead of a small fraction of that is not a marginal gain when the machine runs around the clock for weeks; it is the difference between finishing the phase inside a revised budget and asking for more.

None of this required a new model, a new dataset, or a new box. It required treating one A100 as a bench of many small machines rather than one large one, and matching the slice to the run. The compute crunch was real, and the answer to it was already sitting in the rack, undivided.

Limitations

The concurrency band and the utilisation range are observed peak figures from one engagement phase on one A100 fleet, not a controlled benchmark, and they reflect a specific workload of mostly small-data subsurface training runs; a different mix of larger jobs would shift both numbers down. The per-grain mapping in the instrument between instance count and concurrent-run count is a monotone interpolation between the two sourced endpoints, not a measured curve, and should be read as a guide to the tradeoff rather than a prediction. MIG partition geometries on the A100 are fixed to a small set of supported profiles, so real slice sizes are quantised rather than continuous, and the smooth lever is a simplification of that. Finally, the 2.2x overrun is a phase-level compute-hour figure and does not, on its own, isolate how much of the recovery came from partitioning versus scheduling and other efficiency work done in the same window.


© 2026 Copyright. Earthscan