MLOps for Geoscientists: Borrowing Practices From Software, Carefully

MLOps reached geoscience secondhand, as a set of habits that worked for teams shipping web services and were passed along with the assumption that they would work for us too. Most of them do. Version your data, version your models, make a run reproducible, keep an iteration loop tight enough that you actually learn between experiments: none of that is controversial, and a subsurface team that skips it is choosing to relearn expensive lessons the software industry already paid for. But a raster well-log digitisation model is not the workload those habits were tuned on. It trains on grayscale rasters that range from 3,200 to 12,800 pixels wide, and a single fifty-epoch pass costs 550 minutes on the 15,000-log multiclass set and 110 minutes on the 2,000-log binary set, on a GPU rented at 750 to 1,800 EUR a month. Under those numbers, some borrowed practices fit without alteration and one important one does not. This note sorts the toolkit honestly, so a geoscience team knows which parts of the playbook to take on faith and which to re-cut first.

This is not the story of how we built the model, which we have written up separately. It is one level up: given a working model, which of the surrounding engineering practices recommended by the ML systems literature should a geoscience team adopt unchanged, and which need modification? We lean on the systems-and-process end of that literature: Sculley and colleagues on the machinery around a model and how little of a real ML system is the model itself [1], Amershi and colleagues on where ML departs from prior software engineering by treating data as a first-class dependency [2], and Breck and colleagues on a scorable rubric that separates universally worthwhile practices from ones that only matter with round-the-clock traffic [3]. What we add is the geoscience calibration: which items transfer to a 550-minute-run reality and which do not.

The two practices that transfer without a modification

Start with the good news, because it is the larger part. Dataset and checkpoint versioning transfers intact, and the reasons software teams adopted it are sharper here. When a training set is synthetic, generated by a curve simulator whose knobs we tune between versions, the exact contents of the 15,000-log set that produced a given checkpoint are not something memory should be trusted with. Hashing the dataset and pinning the checkpoint to that hash is the same discipline a web team uses to tie a model to its training snapshot, and nothing about our image dimensions makes it harder: a 12,800-pixel raster hashes as cleanly as a 224-pixel thumbnail. This is the data-dependency management that Sculley and colleagues warned is left unmanaged at a team's peril [1] and that Amershi and colleagues found teams treat as seriously as code [2], and it drops into a geoscience pipeline with no adjustment.
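The hash-and-pin discipline described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the tooling used on the engagement: the function names `hash_dataset` and `pin_checkpoint`, the manifest layout, and the `.manifest.json` suffix are all hypothetical choices made for the example.

```python
import hashlib
import json
from pathlib import Path


def _feed_file(digest, path: Path) -> None:
    """Stream one file into the digest in 1 MiB chunks."""
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)


def hash_dataset(root: Path) -> str:
    """Return one SHA-256 digest covering every file under root.

    Files are visited in a sorted, platform-independent order and each
    relative path is mixed in, so an added, removed, renamed, or edited
    raster changes the digest.
    """
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        _feed_file(digest, path)
        digest.update(b"\0")
    return digest.hexdigest()


def pin_checkpoint(checkpoint: Path, dataset_root: Path) -> Path:
    """Write a JSON manifest beside the checkpoint recording both hashes."""
    ckpt_digest = hashlib.sha256()
    _feed_file(ckpt_digest, checkpoint)
    manifest = {
        "checkpoint": checkpoint.name,
        "checkpoint_sha256": ckpt_digest.hexdigest(),
        "dataset_sha256": hash_dataset(dataset_root),
    }
    out = checkpoint.with_suffix(".manifest.json")  # hypothetical naming
    out.write_text(json.dumps(manifest, indent=2))
    return out
```

Because the digest is computed over raw bytes, the cost scales with file size but the procedure does not change with raster width, which is the sense in which a wide raster hashes as cleanly as a thumbnail. A dedicated data-versioning tool would do the same job with more bookkeeping; the sketch only shows the core idea.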

Reproducible training transfers just as cleanly, and here the fit is almost embarrassingly good. A reproducible run means a fixed seed, a pinned 80/20 train and validation split, a fixed 50-epoch schedule, and enough logged configuration to recreate it. Because our runs are long and expensive, the payoff is higher than for a team that can rerun a cheap job ten times before lunch: when a 550-minute run surprises us, we want to know it was the change we made and not a shuffled split. The ML Test Score rubric lists reproducibility among the practices any serious team should score regardless of scale [3], and we adopt it as written, because the constraint that makes our runs painful is the same one that makes reproducibility pay.
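As a rough sketch of what pinning the seed, the 80/20 split, and the 50-epoch schedule can look like, the example below fixes all three in one frozen configuration object and derives the split from it. The class and function names are hypothetical, the seed value is an arbitrary placeholder, and the identifiers are made up; only the 80/20 ratio and the 50 epochs come from the text.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    seed: int = 1234              # placeholder; any fixed integer works
    train_fraction: float = 0.8   # the 80/20 split from the text
    epochs: int = 50              # the fixed schedule from the text


def split_ids(log_ids, config):
    """Deterministically split log identifiers into train and validation."""
    ordered = sorted(log_ids)           # remove dependence on input order
    rng = random.Random(config.seed)    # local RNG; global state untouched
    rng.shuffle(ordered)
    cut = round(len(ordered) * config.train_fraction)
    return ordered[:cut], ordered[cut:]


def describe_run(config, dataset_sha256):
    """Serialise everything needed to recreate the run as one JSON line."""
    record = {**asdict(config), "dataset_sha256": dataset_sha256}
    return json.dumps(record, sort_keys=True)


config = RunConfig()
train_ids, val_ids = split_ids([f"log-{i:05d}" for i in range(15000)], config)
```

Sorting before shuffling means the split depends only on the set of identifiers and the seed, not on the order a directory listing happened to return. A training framework has its own random state as well (in PyTorch, `torch.manual_seed` is the usual starting point), and that would need seeding separately; the sketch covers only the split and the logged configuration.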

The practice that needs re-cutting

Then there is the one that does not transfer cleanly, and it is worth naming because software teams take it most for granted: the budgeted iteration loop. In a web-scale setting the loop is cheap. You change something, retrain, look, change again, fast enough to run dozens of iterations against a hypothesis in a day, and a great deal of MLOps tooling assumes that cadence. But the cadence is a function of run cost, and ours is not web-scale. A single multiclass pass is 550 minutes, a little over nine hours. On a GPU rented by the month, a run that long does not just slow the loop; past a certain iteration count it crowds the loop out entirely, because the month does not contain enough box-time for both the number of experiments the habit wants and the length each one takes.

The modification is not to abandon iteration; it is to budget it against run cost instead of wall-clock impatience. At 110 minutes, on the binary set, the loop behaves almost the way the playbook expects and a generous iteration count is affordable. At 550 minutes it is not, and pretending otherwise produces a plan that quietly runs out of month. The discipline is to decide how many real iterations a rented month affords for the workload in front of you and spend that budget on the changes most likely to move the model. What makes this bite is that we cannot fully buy the speed back with a bigger batch. On the multiclass set a custom collate function lets the run pack a batch of 16 across the variable widths, and even so a fifty-epoch pass is 550 minutes; the binary run, at a batch of 1 forced by the same variable image sizes, is the workload where the loop still fits the month.
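The budgeting step is simple arithmetic, and writing it down makes the gap between the two workloads concrete. The sketch below takes the 550-minute and 110-minute run costs from the text; the 30-day month, the single GPU, and the `utilisation` parameter are assumptions introduced for illustration, as are the function names.

```python
MINUTES_PER_DAY = 24 * 60

# Fifty-epoch run costs in minutes, as quoted in the text.
RUN_MINUTES = {"multiclass": 550, "binary": 110}


def runs_per_month(run_minutes, days=30, utilisation=1.0):
    """Whole fifty-epoch passes that fit in one rented month on one GPU.

    `days` and `utilisation` are illustrative assumptions: utilisation is
    the fraction of the month the GPU actually spends training.
    """
    available = days * MINUTES_PER_DAY * utilisation
    return int(available // run_minutes)


def fits(workload, iterations_wanted, days=30, utilisation=1.0):
    """True if the requested iteration count fits the month's budget."""
    budget = runs_per_month(RUN_MINUTES[workload], days, utilisation)
    return iterations_wanted <= budget


print(runs_per_month(RUN_MINUTES["multiclass"]))  # 78
print(runs_per_month(RUN_MINUTES["binary"]))      # 392
```

A 30-day month holds 43,200 minutes, so the ceiling is 78 multiclass passes against 392 binary ones if the GPU never idles. Those are upper bounds: evaluation, data preparation, failed runs, and time spent reading results all come out of the same month, and at an assumed 50 percent utilisation the multiclass ceiling drops to 39.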

The exhibit below is that sort made tangible. Each borrowed practice sits on a transfer axis, high meaning it copies cleanly. Versioning and reproducibility sit high and stay there no matter how you move the controls. The budgeted iteration loop is the one that moves: pick the workload, drag the iterations you want to ask of a rented month, and watch the loop's transfer score cross the clean-copy line far sooner on the 550-minute multiclass workload than on the 110-minute binary one. The two constraints, the 3,200-to-12,800-pixel variable widths and the batch of 1 they force on the binary run, sit pinned low as the reasons the copy breaks.

A selective-import reader for a geoscience team deciding which software MLOps practices to adopt unchanged and which to re-cut for subsurface training. Each practice sits on a transfer axis where high means it copies cleanly from a web-scale pipeline. Dataset and checkpoint versioning and reproducible training pin high regardless of the levers, because nothing about the batch cap or a 3,200-to-12,800-pixel input makes a hashed dataset or a fixed-seed, fixed-split, fixed-epoch rerun any less clean to copy. The two constraints, the image-size-capped batch and the variable widths, pin low as the reasons a naive copy stalls. Lever A toggles the workload and so sets the run cost: either multiclass (15,000 logs, a batch of 16 packed by a custom collate function, 550 minutes) or binary (2,000 logs, a batch of 1 forced by the variable image sizes, 110 minutes). Lever B drags the iterations asked per rented month, the one genuinely contested practice: a budgeted iteration loop borrowed from software copies cleanly only while a rented month still fits the iterations asked of it, and a single 550-minute run crowds it out faster than a 110-minute one. The orange marker is the only element that argues, riding that contested loop as the lever moves it across the clean-copy line. The run costs, batch sizes, width range, GPU tiers, split, and epoch count are sourced from the engagement archive; the transfer scores and the iterations-per-month band are illustrative judgement, and the verdict is an adoption heuristic, not a benchmark.

Why the constraints do the sorting

It is worth tracing why image dimensions decide which practices transfer. Variable widths from 3,200 to 12,800 pixels mean a training batch cannot be a clean stack of identically shaped tensors without extra work: either you pad to the widest case, which wastes memory on the narrow logs, or you write a custom collate function that packs uneven rasters together. On the multiclass set we did the latter and ran a batch of 16, and even with the batch filled a fifty-epoch pass is 550 minutes, because the set is 15,000 logs of large rasters. On the binary set the same variable sizes were handled at a batch of 1, one raster a step, and a fifty-epoch pass there is 110 minutes on 2,000 logs. Either way the run cost is downstream of the image geometry and the dataset scale, and the iteration budget is downstream of the run cost, so a physical fact about scanned well logs propagates all the way up into a process decision about how many experiments a month can hold. Amershi and colleagues describe data as the dependency that makes ML engineering different from ordinary software engineering [2]; this is a concrete instance. Versioning and reproducibility act on a run's inputs, configuration, and outputs, and are indifferent to how long the run took, so they transfer. The iteration loop's whole value depends on runs being cheap, so it does not. Sorting the toolkit comes down to asking, per practice, whether it is sensitive to run cost.
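To make the padding trade-off concrete, here is a small sketch of one common way a collate function handles uneven widths: pad each raster only to the widest member of its own batch and return the true widths alongside. It is not the collate function used on the engagement, whose internals are not described here. Rasters are plain nested lists so the example runs without a tensor library, and the function names and the `fill` parameter are hypothetical.

```python
def pad_collate(batch, fill=0):
    """Pad each raster in a batch to the widest raster in that batch.

    Each raster is a list of rows; rasters share a height but not a width.
    Returns the padded rasters plus the original widths, so downstream
    code can ignore the padded columns.
    """
    widths = [len(raster[0]) for raster in batch]
    target = max(widths)
    padded = [
        [row + [fill] * (target - len(row)) for row in raster]
        for raster in batch
    ]
    return padded, widths


def padding_overhead(widths, batch_size):
    """Fraction of stored columns that are padding, batching in list order."""
    real = sum(widths)
    stored = 0
    for start in range(0, len(widths), batch_size):
        group = widths[start:start + batch_size]
        stored += max(group) * len(group)
    return 1 - real / stored


mixed = [3200, 12800, 3200, 12800]
print(padding_overhead(mixed, 2))          # 0.375
print(padding_overhead(sorted(mixed), 2))  # 0.0
print(padding_overhead(mixed, 1))          # 0.0
```

With the narrowest and widest logs alternating, pairing them wastes 37.5 percent of the stored columns, while grouping similar widths or falling back to a batch of 1 wastes none. In a PyTorch pipeline a function of this shape would be passed through the `collate_fn` argument of `torch.utils.data.DataLoader`, operating on tensors instead of lists.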

What we actually adopted

This left us with a short and unglamorous set of decisions. We took dataset and checkpoint versioning wholesale, and reproducible training wholesale with the seed, split, and epoch schedule pinned, valuing it more than a cheaper-to-run team would because our runs are too expensive to repeat casually. We kept an iteration loop but budgeted it against run cost per workload rather than running experiments on reflex, tightening the count on the 550-minute multiclass work and loosening it on the 110-minute binary work. And we treated the always-on serving machinery the production-readiness rubric describes [3] as out of scope for a model that digitises a fixed archive rather than serving live traffic: not every practice on a good checklist is one you need. The mistake is not importing MLOps, which is worth importing; it is importing all of it on the assumption that a subsurface workload behaves like the web workload the habits were tuned on. Version freely, reproduce religiously, and budget iteration against the clock the GPU actually runs on.

Limitations

This is a sort of practices against one engagement's numbers, not a general theory of MLOps in energy. The run costs of 550 and 110 minutes, the batch of 16 on the multiclass run and 1 on the binary run, the 3,200-to-12,800-pixel width range, the 80/20 split, the fifty-epoch schedule, and the 750-to-1,800 EUR monthly GPU tiers are the real archive figures, but the transfer scores the instrument plots are a hand-scored judgement of how cleanly each practice copies, not a measured quantity, and the iterations-per-month band is an illustrative sweep, not a logged experiment count. A different operator with fixed-size logs, a fatter affordable batch, or cheaper compute would find the loop transfers more cleanly, which is the point: the sort is a function of run cost, and run cost is a function of your data. The note also says nothing about the serving and monitoring end of MLOps, because our model digitises a fixed archive rather than serving live traffic. And a practice transferring cleanly is not the same as a team executing it well: versioning that no one checks and reproducibility that no one reruns are borrowed in name only.

The habit this left us with

What we kept is a single question we now ask of any practice the software world offers us: does its value depend on runs being cheap. If not, we borrow it without ceremony, because versioning and reproducibility are strictly better than the alternative and our constraints do not touch them. If so, we re-cut it around the run costs we actually pay, because a loop tuned for 110-minute runs will run out of month on 550-minute ones, and no amount of good tooling fixes a cadence the data will not support. MLOps for geoscientists is not a different discipline from MLOps for software. It is the same discipline with one extra step: before you adopt a practice, check whether the subsurface made your runs too expensive for it, and if it did, keep the practice and change the plan.

References

[1] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28 (NIPS 2015). The account of the systems machinery around a model and how small a fraction of a real ML system the model itself is. https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[2] Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., and Zimmermann, T. Software Engineering for Machine Learning: A Case Study. ICSE-SEIP 2019. The study of ML teams that treats data as a first-class dependency and lays out where ML engineering departs from prior software engineering. https://www.microsoft.com/en-us/research/publication/software-engineering-for-machine-learning-a-case-study/

[3] Breck, E., Cai, S., Nielsen, E., Salib, M., and Sculley, D. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE International Conference on Big Data (2017). The rubric that separates reproducibility and versioning, which any team should adopt, from the production-serving practices that only matter with live traffic. https://research.google/pubs/pub46555/
