Research

Air-Gapped and On-Prem Model Serving: A Survey for Regulated and Sensitive Domains

Managed cloud serving is the default the machine-learning-operations literature was written for, from the technical-debt accounting of Sculley and colleagues through the prediction-serving systems of Crankshaw and colleagues and Olston and colleagues to the deployment-case survey of Paleyes and colleagues. Regulated subsurface data breaks that default: when an operator's raster archive cannot leave a boundary it controls, the model has to be served where the data already sits. This survey reads the serving-posture question as a matrix of four postures, public serverless, private virtual private cloud, on-prem, and fully air-gapped, scored on data residency, latency, and operations burden, and grounds it against a real reference footprint from our raster well-log digitiser, VeerNet: a batch-of-one, single-channel model with GroupNorm groups of 16 and five encoder plus five decoder stages. The central finding is that moving off the public cloud recovers residency control at a real cost in operational convenience, and that on-prem and air-gapped postures buy most of that convenience back only when the served model is small enough for one card in a locked room, which is exactly what a batch-of-one dense-prediction network is. The public works and systems belong to their authors; the reference footprint is ours and is used as a worked example, not as a benchmark.

By The EarthScan Team · 10 min read
Abstract

Almost every account of how to serve a machine-learning model assumes the model can go to the cloud. The literature that defined machine-learning operations was written on that assumption, from the technical-debt accounting that named the maintenance costs of such a system [2], through the low-latency prediction-serving layers built to sit between applications and model frameworks [3] [4], to the deployment-case survey that catalogued where real projects stumble [5]. Regulated subsurface data does not honour it. When an operator's raster well-log archive cannot leave a boundary the operator controls, the model has to be served where the data already sits, and the convenient default is gone before the first design decision. This survey treats serving as a posture matrix: four postures, public serverless, a private virtual private cloud, an on-prem tier, and a fully air-gapped enclave, scored against the three axes a regulated project trades between, data residency, latency, and operations burden. We ground the reading against one real reference footprint from VeerNet, our raster well-log digitiser: a memory-bound model that serves at batch size one, takes a single grayscale channel, uses GroupNorm with a minimum group of sixteen, and runs five encoder and five decoder stages. The finding is that moving off public cloud recovers residency control at a real cost in convenience, and that on-prem and air-gapped postures give most of that convenience back only when the served model is small enough for one card in one locked room, which is precisely the shape a batch-of-one dense-prediction network already has.

Why the default posture fails here

The public prediction-serving systems the field standardised on are excellent, and none of them is the problem. Clipper interposes a general low-latency layer between an application and whatever framework trained the model, adding caching and adaptive batching so a request stream meets a latency target [3]. TensorFlow-Serving hardens the model-lookup and inference paths for throughput and handles the version churn of models moving from training to production [4]. Both assume that the served model and the requests it answers live in an environment the serving team administers, usually a managed cloud, and both are worth reaching for when that assumption holds.

For a regulated subsurface archive the assumption is exactly what does not hold. The raster logs an operator wants digitised are frequently subject to residency and transfer constraints that treat sending the data to a third-party region as the regulated event, independent of any contract [1]. The data cannot cross the boundary, so the model has to be brought to the data rather than the other way around. That single inversion is what turns serving from a solved problem into a posture decision, and it is the decision this survey maps.

The three axes a regulated project trades between

A serving posture is not scored on one number. Three quantities move against each other as a project walks off the public cloud, and naming them is what keeps the comparison honest. The first is data residency: how tightly the operator's archive stays inside a boundary the operator controls, which rises monotonically as serving moves from a shared public region toward a locked room with no external route [1]. The second is latency, which for a batch-of-one digitiser is mostly about proximity: a model sitting on the same network as the archive answers without a round trip to a distant region. The third, and the one the field has documented most carefully, is operations burden: the on-call, the patching, the retraining pipeline, and the quiet maintenance debt that a serving system accrues whether or not anyone budgets for it [2] [5].

The public-cloud posture optimises the third axis and ignores the first. Someone else carries the on-call and the elastic scale; the residency question is answered by trusting a region. Every step off that posture trades operations convenience back for residency control. The survey's whole argument is about the shape of that trade, and specifically about when it can be made cheap.

The reference footprint that changes the trade

The reason the trade is not fixed is that its cost depends on the model being served. A posture that would be ruinous for a model needing a cluster of accelerators can be almost free for a model that fits on a single card, because the operational apparatus that makes the cloud convenient, elastic scale-out most of all, is what a small model does not need.

Our reference footprint is that kind of model. VeerNet's digitiser serves at a physical batch size of one, memory-bound by log images that are variable and very wide. It takes a single grayscale channel, because a scanned log is a grayscale raster. It normalises with GroupNorm at a minimum group of sixteen rather than batch normalisation, the choice that lets a batch of one train stably at all, since group statistics do not collapse when the batch has one member the way batch statistics do. And it is five encoder stages down and five decoder stages back up, plain enough that one commodity card holds it comfortably. Each is a design decision forced by the data, and together they describe a model a locked room can host without a cluster.
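The batch-of-one argument for GroupNorm can be checked with a few lines of arithmetic. The sketch below is a minimal illustration in plain Python, not VeerNet's code. It makes two assumptions that are not in the source: it reads "a minimum group of sixteen" as sixteen channels per group, and it reduces each sample to a single spatial position so that the batch is the only axis batch statistics can average over. The channel count of 32 and the synthetic activations are hypothetical.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant to keep the division stable


def batch_norm_train(batch):
    """Normalise each channel with statistics taken across the batch axis only.

    `batch` is a list of samples; each sample is a list of per-channel values.
    """
    n_channels = len(batch[0])
    out = [[0.0] * n_channels for _ in batch]
    for c in range(n_channels):
        column = [sample[c] for sample in batch]
        mean, var = fmean(column), pvariance(column)
        for i, value in enumerate(column):
            out[i][c] = (value - mean) / (var + EPS) ** 0.5
    return out


def group_norm(sample, group_size):
    """Normalise one sample with statistics taken inside each channel group."""
    out = []
    for start in range(0, len(sample), group_size):
        group = sample[start:start + group_size]
        mean, var = fmean(group), pvariance(group)
        out.extend((v - mean) / (var + EPS) ** 0.5 for v in group)
    return out


# Hypothetical activations: one sample, 32 channels, i.e. a batch of one.
sample = [float(c % 7) for c in range(32)]

bn = batch_norm_train([sample])[0]       # statistics over a batch of one
gn = group_norm(sample, group_size=16)   # statistics over 16-channel groups

print(max(abs(v) for v in bn))        # 0.0: every activation collapses
print(max(abs(v) for v in gn) > 1.0)  # True: the group keeps a usable spread
```

With one sample, each channel's batch mean equals the value itself and the variance is zero, so the normalised output carries no signal. The group statistics never touch the batch axis, which is why the batch size does not enter the GroupNorm result at all.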

[Interactive exhibit: Serving posture survey, regulated subsurface data. Headline: 66% of cloud convenience recovered when air-gapped; residency rules push serving off public cloud, and only a small model buys the convenience back. Reference footprint panel: batch 1 (memory bound), 1 channel (grayscale log), GroupNorm 16 (min group), 5+5 encoder and decoder stages. X-axis: data-residency control, from less to more, across public serverless, private VPC, on-prem, and fully air-gapped. Y-axis: convenience recovered (%), 0 to 100. Footprint lever: grows the served model from the reference footprint through 2x, 4x, 6x, and 8x, with a reset to reference. Sourced: batch 1, 1 channel, GroupNorm 16, 5+5 stages; posture axis scores are an illustrative survey rubric.]
A survey of four serving postures for a regulated subsurface archive, scored on two axes. The x-axis is data-residency control, which rises as serving moves off the public cloud: public serverless keeps the least, a fully air-gapped enclave the most. The y-axis is how much managed-cloud operational convenience (elastic scale, someone else's on-call, no image to ship) survives the move. The teal frontier connects the postures at the current footprint; its downward slope from left to right is the tension a regulated operator lives with. The footprint lever grows the served model from the project's real reference footprint (batch size 1, single grayscale channel, GroupNorm groups of 16, 5 encoder and 5 decoder stages) up to a heavy hypothetical needing a cluster. At the reference footprint the on-prem and air-gapped postures lift back toward the top, because one small model fits one locked room; drag the footprint up and they collapse, because a locked room cannot elastically grow a cluster. The single orange element, the one that carries the argument, is the recovered-convenience marker on the fully air-gapped posture, the hardest residency posture, which stays high only while the model stays small. The reference-footprint numbers are sourced from the engagement archive; the posture scores on both axes are an illustrative survey rubric, and this is a decision aid, not a benchmark.

The exhibit above reads the posture matrix against that footprint. The horizontal axis is residency control, rising as serving moves off the public cloud through a private virtual private cloud and an on-prem tier to a fully air-gapped enclave. The vertical axis is how much managed-cloud convenience survives the move. The teal frontier connects the four postures at the current footprint, and its downward slope is the tension a regulated operator lives with: more control costs convenience. The lever grows the served model from our real reference footprint up to a heavy hypothetical that would need a cluster, and dragging it makes the argument. At the reference footprint the on-prem and air-gapped postures lift back toward the top, because one small model fits one locked room and the lost elastic scale is scale a batch-of-one model was not using. Grow the footprint and those postures collapse, because a locked room cannot add cards the way a cloud region can. The single orange element is the recovered-convenience marker on the fully air-gapped posture, the hardest residency posture, which stays high only while the model stays small. The reference-footprint numbers are sourced from the engagement archive; the posture scores are an illustrative survey rubric, flagged as such on the canvas.

Reading the postures against the axes

The four postures fall in a clear order on residency and a reverse order on convenience, and the batch-of-one footprint compresses the reverse order. Public serverless tops the convenience axis and bottoms the residency axis: it is the posture the serving systems were built for [3] [4], but it answers residency by trusting a region rather than controlling it, which a regulated archive cannot accept [1]. A private virtual private cloud recovers some residency by pinning the data to a controlled network segment while keeping much of the managed operations model, and it is the step most projects reach for first. On-prem moves serving onto hardware the operator owns, answers residency and latency well, and hands the operator the full operations burden the field warned about [2] [5]. Fully air-gapped severs the external route entirely: the strongest residency posture and, for most models, the harshest convenience penalty.
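The ordering described above can be written down as a small data structure. The sketch below is illustrative only: the ranks are ordinal positions read off the prose (0 is lowest, 3 is highest), not measured scores, and the `Posture` class and the `admissible` helper are hypothetical names introduced for this example. It encodes the general-case ordering, before the small footprint compresses the convenience gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posture:
    name: str
    residency_rank: int    # 0 = least residency control, 3 = most
    convenience_rank: int  # 0 = least managed-cloud convenience, 3 = most


# Ordinal ranks only, following the order given in the text.
POSTURES = [
    Posture("public serverless", 0, 3),
    Posture("private VPC", 1, 2),
    Posture("on-prem", 2, 1),
    Posture("fully air-gapped", 3, 0),
]


def is_reverse_ordered(postures):
    """True if convenience strictly falls as residency control rises."""
    by_residency = sorted(postures, key=lambda p: p.residency_rank)
    conv = [p.convenience_rank for p in by_residency]
    return all(a > b for a, b in zip(conv, conv[1:]))


def admissible(postures, min_residency_rank):
    """Postures that meet a residency floor, most convenient first."""
    ok = [p for p in postures if p.residency_rank >= min_residency_rank]
    return sorted(ok, key=lambda p: p.convenience_rank, reverse=True)


print(is_reverse_ordered(POSTURES))               # True
print([p.name for p in admissible(POSTURES, 2)])  # ['on-prem', 'fully air-gapped']
```

Treating the residency requirement as a floor and convenience as the tie-breaker mirrors how the text frames the choice: the regulator fixes the minimum, and the project picks the least painful posture that clears it.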

The finding is that for the reference footprint the penalty at the two hardest postures is far smaller than the general case suggests. The operations burden of on-prem and air-gapped serving scales with what the model needs to run: a cluster demands orchestration, failover, and a scale-out story, while a single card demands a box, a power supply, and a retraining cadence that runs on a schedule rather than on demand. A batch-of-one, single-channel, GroupNorm-sixteen model with five stages each way is the second kind, and for it the air-gapped posture is not a sacrifice but a reasonable default, because the convenience it gives up is convenience it was never spending.
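The scaling claim above can be restated as a small decision rule. The sketch below is a hedged illustration rather than a procedure taken from the engagement: the function names, the `cards_needed` parameter, and the single-card threshold are assumptions made for the example, and the checklist items are simply the ones named in the paragraph.

```python
def operations_checklist(cards_needed: int) -> list[str]:
    """Return the operations items the text associates with a footprint."""
    if cards_needed < 1:
        raise ValueError("a served model needs at least one card")
    if cards_needed == 1:
        # Single-card case: the items named for a small model.
        return ["box", "power supply", "scheduled retraining cadence"]
    # Cluster case: the items named for a model that needs scale-out.
    return ["orchestration", "failover", "scale-out story"]


def air_gapped_is_reasonable_default(cards_needed: int,
                                     residency_forbids_cloud: bool) -> bool:
    """Assumed rule: air-gapped is a sensible default only when residency
    forbids the cloud and the model fits on one card."""
    return residency_forbids_cloud and cards_needed == 1


# Reference footprint: one card, residency constraint in force.
print(operations_checklist(1))
print(air_gapped_is_reasonable_default(1, residency_forbids_cloud=True))   # True
# Hypothetical heavy model needing eight cards under the same constraint.
print(air_gapped_is_reasonable_default(8, residency_forbids_cloud=True))   # False
```

The rule is deliberately coarse. A real project would replace the single-card test with its own measured memory and throughput needs before committing to a posture.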

What the survey does not claim

It would be dishonest to read this as a case against the cloud in general. The public prediction-serving work is the right tool when residency permits it, and its caching, batching, and version-management machinery solve real problems that an on-prem deployment then has to solve again by hand [3] [4]. The point is narrower and, we think, sturdier: when residency forbids the cloud, the cost of obeying that constraint is not fixed, and a project can make it small by keeping the served model small. The reference footprint is our evidence that the small-model regime is reachable for a genuinely useful dense-prediction task, not a toy. The posture axes are a decision rubric for thinking about the trade, not a scoreboard, and the exhibit says so on its face.

Limitations

This is a survey and a decision aid, and it inherits the limits of both. It reads the public machine-learning-operations and serving literature and does not re-implement or re-measure any of the systems it discusses; where it names caching, batching, or version management, those are the reported properties of the cited systems [3] [4], not results we reproduced. The posture matrix's scores on residency, latency, and operations burden are an illustrative rubric for ordering the trade, not measured quantities, and they are flagged as illustrative on the exhibit; only the reference-footprint facts are sourced. Those facts, batch size one, a single grayscale channel, GroupNorm with a minimum group of sixteen, and five encoder plus five decoder stages, are the recorded design of one architecture on one engagement and describe a model that happens to be small, not a proof that every regulated task can be made small. We did not benchmark serving latency or throughput across the four postures on real hardware, so the survey makes no timing claim; the latency axis is a proximity argument, not a measurement. The residency constraint we treat as decisive varies by jurisdiction and operator [1], and a project in a permissive regime may find the cloud posture entirely acceptable. Take this as a map of when moving serving off the public cloud is cheap and when it is not, and run your own numbers on your own model and your own regulator before committing to a posture.

References

[1] European Parliament and Council. Regulation (EU) 2016/679 (General Data Protection Regulation). Official Journal of the European Union, 2016. Establishes the residency, transfer, and processing constraints that push regulated data toward serving inside a controlled boundary. https://eur-lex.europa.eu/eli/reg/2016/679/oj

[2] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., and Dennison, D. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems (NeurIPS), 2015. Names the ongoing maintenance costs specific to machine-learning systems, the operations burden a serving posture inherits. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems

[3] Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A Low-Latency Online Prediction Serving System. 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017. A general low-latency prediction-serving layer with caching and adaptive batching between applications and model frameworks. https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshaw

[4] Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, F., Rajashekhar, V., Ramesh, S., and Soyke, J. TensorFlow-Serving: Flexible, High-Performance ML Serving. arXiv:1712.06139, 2017. A production model-serving system whose model-lookup and inference paths are optimised for throughput and version management. https://arxiv.org/abs/1712.06139

[5] Paleyes, A., Urma, R.-G., and Lawrence, N. D. Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Computing Surveys, 55(6), Article 114, 2022. Reviews reported deployments across industries and extracts the practical obstacles that arise at each stage of the machine-learning deployment workflow. https://dl.acm.org/doi/10.1145/3533378

© 2026 Copyright. Earthscan