MIG-Partitioning an A100 to Run 60 Experiments at Once

Phase 2 of our subsurface engagement with a mid-sized Middle East carbonate operator ran into a wall that had nothing to do with the models. The supervised and unsupervised tracks split into two parallel pipelines, each spawning its own sweep of runs, and the compute bill for that phase came in at 2.2 times what we had budgeted. The models were fine. The way we were using the hardware was not. A single training run on a raster well-log segmenter takes 6 to 18 hours, and we were queuing dozens of them one after another behind a single card, while an energy-price spike made every idle GPU-hour more expensive than the last.

The fix was not a bigger machine. It was slicing the machine we already had. Multi-Instance GPU (MIG) partitioning splits one A100 into as many as seven hardware-isolated instances, each with its own dedicated compute and memory, so a swarm of small runs can share the card at once instead of taking turns. Across the A100 fleet we could stand up as many as 28 of these MIG containers, and at peak we held 60 to 90 concurrent training runs against them at 80 to 90 percent sustained utilisation. That is a research bench built from cards already in the rack, and it is the practical answer to a compute crunch: you do not buy your way out, you partition your way out.

Why partitioning beats queueing for a run swarm

The economics of a research phase are not the economics of a production service. In production you serve one trained model over and over, and throughput is a serving property. In research you are running an experiment sweep, dozens of variants of the same job differing by a hyperparameter, a data split, an augmentation setting. Each is small in its own right, most of them do not saturate a full A100, and they all want to run now so the sweep finishes this week rather than next month.

Run them sequentially and the card sits at a fraction of its capacity for most of every run, because a small-data job cannot fill the memory or the compute of the whole device. Run them naively in parallel on one undivided GPU and they contend for the same memory allocator, and one job that overruns takes the others down with it. MIG is the middle path. It carves the card into instances at the hardware level, so each run gets a fixed, isolated slice, and a crash in one slice cannot stall the rest. The sum of the slices packs the device far closer to full than any single small run could, which is where the 80 to 90 percent sustained number comes from.
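The queueing cost can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only: the 60 runs and the 6-hour run length come from the figures above, seven slots is the A100's maximum MIG instance count, and the `slowdown` factor (how much longer a run takes on a slice than on the whole card) is a hypothetical value, not something we measured.

```python
import math


def sweep_wall_clock_hours(n_runs, hours_per_run, slots=1, slowdown=1.0):
    """Wall-clock hours to finish n_runs when `slots` runs execute at once.

    `slowdown` is a hypothetical multiplier (>= 1.0) for how much longer
    a run takes on a slice than on the undivided card.
    """
    waves = math.ceil(n_runs / slots)  # batches of runs executed back to back
    return waves * hours_per_run * slowdown


# 60 runs at the short end of the 6-to-18-hour range, queued behind one card.
sequential = sweep_wall_clock_hours(60, 6)

# The same sweep on one card cut into seven slices, with an assumed
# 1.5x per-run slowdown on the smaller slice.
partitioned = sweep_wall_clock_hours(60, 6, slots=7, slowdown=1.5)

print(sequential, partitioned)
```

Even with a pessimistic slowdown, the partitioned sweep finishes in a fraction of the sequential wall-clock time, because the card is no longer idling behind one small job.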

The lever you actually pull is partition grain. Fine slices give you many small instances and let a large swarm of small-data runs share the card. Coarse slices give you a few large instances for the handful of runs that genuinely need the whole device. The grid below reads that tradeoff directly.

One A100 fleet, partitioned into up to 28 isolated Multi-Instance GPU containers, read as a research bench. Drag the partition-grain lever from a few large runs (coarse slices) to many small-data runs (fine slices): the grid fills with active instances, the concurrency band shows the run count that grain buys inside the sourced 60-90 peak band, and the utilization bar holds inside the sourced 80-90% sustained range. The orange marker is the only element that argues: concurrent runs sliding across the band as slices get finer, which is how a single card absorbed a swarm of 6-to-18-hour runs after Phase-2 compute overran its budget 2.2x. The 28-container ceiling, the 60-90 concurrent-run band, the 80-90% sustained utilization, the 6-18 hour run length, the 5-7x growth over prior phases, the DGX A100 envelope (4-8x A100, up to 640 GB GPU memory, 2.5-5 petaFLOPS AI), and the 2.2x Phase-2 overrun are sourced from the engagement archive; the per-grain mapping between instance count and run count is a monotone interpolation between those sourced endpoints, not a measured curve.

Reading the grid: grain in, concurrency out

Drag the partition-grain lever and three numbers move together. Finer grain lights more of the 28 instance cells, pushes the concurrent-run count up toward the top of the 60-to-90 band, and holds sustained utilisation inside the 80-to-90 percent range. Coarse grain does the opposite: fewer, larger instances, fewer concurrent runs, the card handed to a few big jobs. The concurrency line stays inside the sourced band across the whole sweep, because the band is the envelope we actually observed, not an extrapolation. The one element that argues is the orange marker sliding across that band as the slices get finer, which is the whole claim in one motion: grain buys concurrency.
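On real hardware the lever is a choice of MIG profile. As a rough sketch, the function below only builds the `nvidia-smi` command lines for a fine and a coarse layout and executes nothing. The profile names assume a 40 GB A100 (an assumption; check `nvidia-smi mig -lgip` on the actual card), the GPU index is hypothetical, and enabling MIG mode normally needs administrator rights and an idle GPU.

```python
def mig_setup_commands(gpu_index, profiles):
    """Return nvidia-smi argv lists that enable MIG mode on one GPU and
    create one GPU instance (plus its compute instance) per profile.

    Nothing is executed; the caller decides whether to run the commands.
    """
    gpu = str(gpu_index)
    return [
        # Switch the GPU into MIG mode.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # Create the GPU instances; -C also creates the default compute instances.
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"],
    ]


# Fine grain: seven small slices for a swarm of small-data runs.
fine = mig_setup_commands(0, ["1g.5gb"] * 7)

# Coarse grain: two large slices for jobs that need more memory.
coarse = mig_setup_commands(0, ["3g.20gb"] * 2)

for argv in fine + coarse:
    print(" ".join(argv))
```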

The mapping between a given instance count and a given run count in the middle of the sweep is an interpolation between the two endpoints we measured, so treat the interior of the line as a guide rather than a data point. The endpoints and the band are real. The 28-container ceiling, the 60-to-90 peak, the 80-to-90 percent sustained utilisation, and the 6-to-18-hour run length are all from the phase record.
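For readers who want the interior of the line spelled out, one possible monotone interpolation is a straight line between the endpoints. This is a sketch under assumptions: the 28-instance ceiling and the 60-to-90 band are from the phase record, but the coarse-end instance count (`min_instances=4`) is a hypothetical placeholder, and linear is only one of many monotone shapes the instrument could use.

```python
def interpolate_runs(instances, min_instances=4, max_instances=28,
                     min_runs=60, max_runs=90):
    """Linear, and therefore monotone, interpolation of concurrent runs
    against active MIG instances.

    min_instances is a hypothetical coarse-grain endpoint; the other
    defaults are the sourced ceiling and band.
    """
    # Clamp so the result never leaves the observed band.
    clamped = max(min_instances, min(instances, max_instances))
    fraction = (clamped - min_instances) / (max_instances - min_instances)
    return min_runs + fraction * (max_runs - min_runs)


for n in (4, 16, 28):
    print(n, interpolate_runs(n))
```

Clamping keeps the output inside the observed band at both ends, which matches how the concurrency line is drawn; it says nothing about what a finer or coarser layout than we ran would actually sustain.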

How we sized partitions in practice

The decision rule was almost entirely a function of one question: how much data does this run touch. Most of our experiment sweeps were small-data by construction, ablations over a few wells, augmentation-ratio scans, loss-function comparisons, each fitting comfortably inside a fraction of the A100's memory. Those went onto fine slices, many to a card, and they are what let the concurrent-run count grow 5 to 7 times over what the earlier phases sustained. A handful of jobs did need the whole device: a full multi-well training pass on the largest dataset, or a run whose batch geometry pushed memory to the edge. Those got a coarse slice or the undivided card, and we scheduled them when the swarm was thin.
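The sizing question can be sketched as a first-fit lookup over the supported profiles. The table below assumes a 40 GB A100 and uses nominal memory sizes; the 20 percent `headroom` factor and the function name are illustrative choices, not the rule we ran in production, and usable memory per instance is somewhat below the nominal figure.

```python
# (profile name, nominal memory in GB, max instances per GPU), smallest first.
# Assumes a 40 GB A100; verify against `nvidia-smi mig -lgip` on real hardware.
A100_40GB_PROFILES = [
    ("1g.5gb", 5, 7),
    ("2g.10gb", 10, 3),
    ("3g.20gb", 20, 2),
    ("4g.20gb", 20, 1),
    ("7g.40gb", 40, 1),
]


def pick_profile(footprint_gb, headroom=1.2, profiles=A100_40GB_PROFILES):
    """Return the smallest profile whose memory covers the run's footprint
    plus headroom, or None if the run needs more than one sliced card offers.
    """
    needed = footprint_gb * headroom
    for name, memory_gb, _max_instances in profiles:
        if memory_gb >= needed:
            return name
    return None


print(pick_profile(3))   # small ablation
print(pick_profile(15))  # mid-sized run
print(pick_profile(30))  # near-full-card job
```

A `None` result is the signal to schedule the job on an undivided card when the swarm is thin, rather than squeezing it into a slice it will overrun.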

Two rules kept it honest. First, size the slice to the run's memory footprint, not to its wishful priority. A run that fits in a small instance does not get a large one just because someone wants it done sooner. Second, keep the isolation strict. The reason we could pack the card so hard is that a MIG instance is a hardware boundary, not a scheduler hint. When a run crashed, and long runs crash, it took down its own slice and nothing else, so the other 20-odd runs kept going. On an undivided card a single bad allocation would have stalled the whole sweep.
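At the process level, pinning a run to its slice comes down to setting `CUDA_VISIBLE_DEVICES` to the MIG device identifier that `nvidia-smi -L` reports. The launcher below is a minimal sketch of that pattern, not our scheduler: the UUIDs are placeholders, the commands stand in for real training scripts, and the hardware isolation itself comes from MIG, not from this code.

```python
import os
import subprocess
import sys


def env_for_slice(mig_uuid, base_env=None):
    """Copy the environment and restrict CUDA to a single MIG device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def run_swarm(jobs):
    """Launch every (mig_uuid, argv) job at once and wait for all of them.

    Returns {mig_uuid: exit_code}. A non-zero exit in one job does not
    interrupt the others.
    """
    procs = {
        mig_uuid: subprocess.Popen(argv, env=env_for_slice(mig_uuid))
        for mig_uuid, argv in jobs
    }
    return {mig_uuid: proc.wait() for mig_uuid, proc in procs.items()}


if __name__ == "__main__":
    # Placeholder UUIDs and stand-in commands; one simulates a crashed run.
    demo_jobs = [
        ("MIG-aaaa", [sys.executable, "-c", "raise SystemExit(1)"]),
        ("MIG-bbbb", [sys.executable, "-c", "print('run finished')"]),
    ]
    print(run_swarm(demo_jobs))
```

The returned exit codes make it straightforward to resubmit only the slices whose runs failed while the rest of the sweep carries on.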

We treated the batch-size mechanics underneath these runs separately: the variable-dimension collate and the checkpoint-resume that survives an unreliable box. That reliability story is told in our note on batch scheduling around a GPU that keeps going down, and this piece does not re-derive it. Here the unit of the argument is the whole card, not the batch inside one run.

The envelope, and why efficiency stopped being optional

The card the swarm ran on sat inside a DGX A100: 4 to 8 A100 GPUs, up to 640 GB of total GPU memory, and 2.5 to 5 petaFLOPS of AI compute. That is a lot of hardware, and the temptation with a lot of hardware is to run jobs one at a time and let the machine coast. The 2.2x Phase-2 overrun is what took that option off the table. When the energy shock made every GPU-hour more expensive and the budget was already spent, the only free capacity left was the capacity we were losing to idle time. Partitioning recovered it. Running a card at 80 to 90 percent sustained instead of a small fraction is not a marginal gain when you are running the machine around the clock for weeks; it is the difference between finishing the phase inside a revised budget and asking for more.

None of this required a new model, a new dataset, or a new box. It required treating one A100 as a bench of many small machines rather than one large one, and matching the slice to the run. The compute crunch was real, and the answer to it was already sitting in the rack, undivided.

Limitations

The concurrency band and the utilisation range are observed peak figures from one engagement phase on one A100 fleet, not a controlled benchmark, and they reflect a specific workload of mostly small-data subsurface training runs; a different mix of larger jobs would shift both numbers down. The per-grain mapping in the instrument between instance count and concurrent-run count is a monotone interpolation between the two sourced endpoints, not a measured curve, and should be read as a guide to the tradeoff rather than a prediction. MIG partition geometries on the A100 are fixed to a small set of supported profiles, so real slice sizes are quantised rather than continuous, and the smooth lever is a simplification of that. Finally, the 2.2x overrun is a phase-level compute-hour figure and does not, on its own, isolate how much of the recovery came from partitioning versus scheduling and other efficiency work done in the same window.
