Every conversation about deploying subsurface AI eventually collides with a slide that is not about models at all. It is about where the GPUs live. For a national oil company, that question is not a line item in an infrastructure budget; it is a data-sovereignty decision, a security-perimeter decision, and a long-run cost decision braided into one. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, where we built and trained a Detection-Transformer pipeline to pick fractures and beddings from borehole image logs, the infrastructure question was never academic. The training corpus was the operator's competitive edge: years of proprietary image logs that, by contract, could not leave the operator's network. That single constraint quietly settled an argument that vendors usually like to keep open. This piece is about why, for confidential subsurface AI, on-prem DGX-class compute tends to beat public cloud on both IP protection and total cost, and where hybrid earns its place.
The data is the whole argument
Start with the raw object, because its physics drives the entire infrastructure design. A single processed high-resolution borehole image log in this dataset is a strip roughly 690,000 pixels tall and 360 pixels wide — the unrolled borehole wall, hundreds of metres of logged section — weighing in at about 1.5 GB per image before a single augmentation. The vertical resolution is unforgiving: one pixel of binary wireline-log depth corresponds to roughly 3 cm of true depth, so the logs are large precisely because they are precise. Multiply ~1.5 GB across a corpus of a dozen-plus wells, then multiply that by the overlapping-patch expansion and 5x–10x augmentation a Detection Transformer needs to train, and the working set is not a file you casually sync to a bucket. It is a multi-terabyte, IP-laden raster that the model has to read, re-read, and re-read across a hundred training epochs.
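To make the overlapping-patch expansion concrete, here is a minimal Python sketch of how a tall image-log strip might be cut into overlapping vertical windows before training. The patch height and stride are illustrative assumptions, not the values used in the engagement, and the helper names (`patch_starts`, `expansion_factor`) are hypothetical; only the 690,000-row strip height comes from the figures above.

```python
def patch_starts(height: int, patch_h: int, stride: int) -> list[int]:
    """Top-row offsets of vertical windows that cover a strip of `height` rows."""
    if patch_h <= 0 or stride <= 0:
        raise ValueError("patch_h and stride must be positive")
    if height <= patch_h:
        return [0]  # strip is no taller than one patch: a single window
    starts = list(range(0, height - patch_h + 1, stride))
    # Add one last window flush with the bottom so no rows are dropped.
    if starts[-1] + patch_h < height:
        starts.append(height - patch_h)
    return starts


def expansion_factor(height: int, patch_h: int, stride: int) -> float:
    """Rows stored after patching, divided by rows in the original strip."""
    n_patches = len(patch_starts(height, patch_h, stride))
    return n_patches * patch_h / height


STRIP_HEIGHT = 690_000        # rows per image log, from the text
PATCH_H, STRIDE = 1024, 512   # hypothetical: 50% vertical overlap

n = len(patch_starts(STRIP_HEIGHT, PATCH_H, STRIDE))
factor = expansion_factor(STRIP_HEIGHT, PATCH_H, STRIDE)
print(f"{n} patches, about {factor:.2f}x the original rows")
```

With a stride of half the patch height, every row lands in roughly two patches, so the patched set is about twice the size of the source strip before any augmentation is applied. Tighter strides push that factor up quickly, which is one reason the working set outgrows the raw logs.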
Two properties of this corpus dictate the deployment decision more than any cloud-versus-on-prem checklist ever could.
First, gravity. When the training data is terabytes of high-resolution rasters and the compute has to stream them repeatedly, you move the compute to the data, not the data to the compute. Ingress and egress of a corpus this size are not a one-time copy; they are a recurring tax every time you retrain.
Second, confidentiality. These image logs are the operator's geological memory of its own reservoirs. They are exactly the asset that cannot be replicated, and exactly the asset that must never traverse a network the operator does not control. That is not a preference — for a Middle East NOC operating in a competitive basin, it is policy.
What "public cloud" actually costs here
The public-cloud pitch is elastic GPUs and zero capex. For a transient, bursty workload that pitch is genuinely good. Confidential subsurface model training is neither transient nor bursty — it is a sustained, multi-month, data-resident workload, and that is where the economics invert.
The bill-shock comes from three places that rarely show up in the headline GPU-hour rate:
- Egress and storage of a crown-jewel corpus. Every retrain re-reads the full augmented set. On metered storage and egress, a multi-terabyte image-log corpus that is read hundreds of times becomes a recurring line item that grows with every well you add — and unlike a fixed asset, it never amortizes.
- Sustained GPU time. A run that occupies high-end accelerators continuously for weeks is the precise opposite of the spiky workload that cloud spot pricing rewards. Pay on-demand and the meter never stops; reserve capacity, and you have effectively bought the hardware at a premium without owning it.
- The exfiltration surface itself. Putting the corpus in a multi-tenant environment is not free even when nothing goes wrong — it imports a compliance and audit burden that, for an NOC, can be the most expensive line of all.
None of this means cloud is bad engineering. It means cloud is priced for the wrong shape of workload when the workload is a confidential corpus trained in place over months.
The on-prem stack we actually ran on
The counter-argument to cloud is not "buy a server." It is a tiered on-prem fabric sized to the workload, and in this engagement it spanned three deliberate tiers — training, data management, and an economy bench — each with a job.
At the top sits DGX-class training compute. An NVIDIA DGX A100 node carries 4 to 8 A100 GPUs, up to 640 GB of total GPU memory, 6 NVSwitches, and delivers 2.5–5 petaFLOPS of AI throughput (5–10 petaOPS at INT8). Stitch 5–10 of those nodes into a SuperPOD (25–50 PFLOPS of AI compute, 3–6 TB of aggregate GPU memory, knit together over 200 Gb/s HDR InfiniBand) and you have a sovereign training cluster that never touches a third party. For this dataset, even a single well-configured node was decisive: a 7.7 TB SSD / 512 GB RAM / 320 GB GPU-RAM DGX server chewed through the augmented image-log corpus comfortably, because the data sat on local NVMe rather than at the far end of an object store.
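The cluster figures above are simple multiples of the per-node peaks, which a few lines of Python can confirm. This is back-of-envelope arithmetic using the upper per-node figures quoted in the text (5 petaFLOPS and 640 GB per node); `cluster_totals` is a hypothetical helper, and peak aggregates say nothing about sustained training throughput.

```python
NODE_PFLOPS = 5.0        # upper per-node AI throughput quoted in the text
NODE_GPU_MEM_GB = 640    # upper per-node GPU memory quoted in the text


def cluster_totals(nodes: int,
                   pflops_per_node: float = NODE_PFLOPS,
                   mem_gb_per_node: float = NODE_GPU_MEM_GB) -> tuple[float, float]:
    """Return (peak PFLOPS, aggregate GPU memory in TB) for `nodes` identical nodes."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes * pflops_per_node, nodes * mem_gb_per_node / 1000


for n in (5, 10):
    pflops, mem_tb = cluster_totals(n)
    print(f"{n} nodes: {pflops:.0f} PFLOPS peak, {mem_tb:.1f} TB GPU memory")
```

The 5-node and 10-node cases reproduce the 25–50 PFLOPS range and, after rounding, the 3–6 TB memory range.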
Below the training tier is the data-management tier — a server with 4 TB SSD and 128 GB RAM fronting a self-hosted storage backbone of 1 TB networked capacity over 4 TB of redundant flash SSD. This is the layer that makes gravity work for you instead of against you: the corpus lives next to the GPUs, versioned and backed up, never leaving the building.
And critically, there was an economy tier. Not every experiment needs a SuperPod. A bench of consumer-grade boxes — at the floor, a GTX 1080 Ti stack at 8 GB per machine, alongside custom 2 TB SSD / 64 GB RAM / 11 GB GPU-RAM and 2 TB SSD / 128 GB RAM / 64 GB GPU-RAM nodes — handled rapid iteration, debugging, and small ablations. The lesson there is one most cloud-first teams learn late: a great deal of model-development work is cheap, and you do not want to be paying premium GPU-hour rates to babysit a dataloader bug. Owning the floor of your compute pyramid is what keeps the experimentation loop fast and free.
This is the heart of the cost case. With on-prem, the marginal cost of one more retrain is electricity and an engineer's afternoon. With on-demand cloud, it is another invoice. Over a twenty-month programme with many retrains across a growing well count, that difference is not a rounding error — it is the budget.
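One way to put numbers behind that claim is a break-even sketch: how many retrains does it take before owned hardware is no more expensive than rented hardware? The model below is deliberately crude and every price in it is a placeholder, not a quote from any provider or a figure from the engagement. The class and function names are hypothetical, and egress charges are left out for brevity (they could be folded into the per-retrain cloud cost).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloudRates:
    gpu_hour: float           # price per GPU-hour (placeholder)
    storage_tb_month: float   # price per TB-month of storage (placeholder)


@dataclass(frozen=True)
class OnPremCosts:
    capex: float              # up-front hardware spend (placeholder)
    per_retrain: float        # electricity plus engineer time per retrain


def cloud_total(rates: CloudRates, retrains: int, gpu_hours_per_retrain: float,
                corpus_tb: float, months: float) -> float:
    compute = retrains * gpu_hours_per_retrain * rates.gpu_hour
    storage = corpus_tb * months * rates.storage_tb_month
    return compute + storage


def onprem_total(costs: OnPremCosts, retrains: int) -> float:
    return costs.capex + retrains * costs.per_retrain


def breakeven_retrains(rates: CloudRates, costs: OnPremCosts,
                       gpu_hours_per_retrain: float, corpus_tb: float,
                       months: float) -> Optional[int]:
    """Smallest retrain count at which on-prem costs no more than cloud.

    Returns None if a cloud retrain is not more expensive than an on-prem one,
    in which case on-prem never catches up under this model.
    """
    saving = gpu_hours_per_retrain * rates.gpu_hour - costs.per_retrain
    if saving <= 0:
        return None
    fixed_gap = costs.capex - corpus_tb * months * rates.storage_tb_month
    return max(0, math.ceil(fixed_gap / saving))


# Placeholder numbers chosen only to exercise the functions.
rates = CloudRates(gpu_hour=2.0, storage_tb_month=10.0)
costs = OnPremCosts(capex=1000.0, per_retrain=20.0)
print(breakeven_retrains(rates, costs, gpu_hours_per_retrain=100,
                         corpus_tb=2, months=10))
```

The useful output is less the number itself than the shape: the longer the programme and the more often the corpus is retrained, the further past break-even the on-prem option sits.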
Inference is where governance becomes architecture
Training is the expensive half; inference is the half that touches the operator's daily workflow, and it is where the security perimeter stops being a slide and becomes a port number.
The interpretation tools we shipped — automated fracture and bedding picking, plus a well-to-well correlation engine — had to be consumed inside the operator's environment, by their geoscientists, against their live data. So the production layer was a containerized Streamlit application, packaged with Docker and exposed on port 8501, served over the LAN and wired directly into the operator's own earth/geoscience interpretation platform. No model artifact, no inference request, and no result ever crossed the network boundary. The application ran where the data was governed.
That architecture is the literal embodiment of the governance requirement. A model behind a LAN-only Docker endpoint cannot leak a corpus it never sends anywhere. The deployment topology is the data-protection policy, expressed in infrastructure rather than in a contract clause. And the payoff was operational, not merely defensive: the automated picking tools ran interpretation roughly 5x faster than the manual baseline, and the well-to-well correlation tooling lifted interpreter productivity by about 60% and interpretation accuracy by about 75% — gains realised by the operator's own staff, on the operator's own machines, behind the operator's own firewall.
The MLOps discipline that makes this work is worth naming explicitly, because it is the part that separates a research demo from a production system. Containerizing the inference service pins the runtime so the model behaves identically on the bench and in production. Versioning the corpus on the data-management tier makes every retrain reproducible. Exposing a single, audited LAN port shrinks the attack surface to one well-understood interface. This is applied-AI systems engineering — computer-vision pipelines, a Detection-Transformer training stack, and geomatics-platform integration — as much as it is research, and it is the layer where cloud-versus-on-prem stops being a cost debate and becomes a security architecture.
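As a rough illustration of the "single LAN port" idea, the Python sketch below builds a `docker run` command that publishes the container's port 8501 on one specific private host address instead of on every interface. The image name and the address are hypothetical, `docker_run_args` is an illustrative helper rather than the tooling used in the engagement, and IPv4 is assumed.

```python
import ipaddress


def docker_run_args(image: str, lan_ip: str, port: int = 8501) -> list[str]:
    """Build a `docker run` command that publishes `port` only on `lan_ip`."""
    addr = ipaddress.IPv4Address(lan_ip)  # raises ValueError on malformed input
    # Refuse 0.0.0.0 (all interfaces) and any publicly routable address.
    if addr.is_unspecified or not addr.is_private:
        raise ValueError(f"{lan_ip} is not a specific private address")
    return [
        "docker", "run", "--detach",
        # host_ip:host_port:container_port
        "--publish", f"{lan_ip}:{port}:{port}",
        image,
    ]


# Hypothetical image tag and LAN address, for illustration only.
print(" ".join(docker_run_args("image-log-picker:1.0", "10.0.0.5")))
```

Binding the published port to a named private interface is only one layer; it narrows where the service listens, but firewall rules and network segmentation still have to enforce the perimeter.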
So where does hybrid fit?
Hybrid is not a compromise to default into; it is a scalpel for specific, governance-approved tasks. The honest answer from this engagement is that hybrid earns its place exactly where the public component touches no confidential data.
Three legitimate hybrid patterns:
- Pre-training and architecture research on public or synthetic data. Backbone sweeps and loss-function experiments that never see the operator's image logs can run anywhere. Borrowed time on a vendor's accelerators — short-loan GPU access for a burst experiment — is a perfectly sound hybrid move when the input is public.
- Non-sensitive tooling and CI. Build pipelines, dependency management, and documentation can live in the cloud without ever co-locating with the corpus.
- Disaster-recovery posture, encrypted and operator-keyed. If governance explicitly permits an encrypted off-site backup under the operator's own keys, hybrid storage can harden resilience — but this is a decision the operator's security office makes, not the ML team.
The deciding question for every workload is blunt: does this component need to see confidential subsurface data? If yes, it stays on-prem, full stop. If no, the cloud is fair game. Hybrid done well is just that question, answered honestly, workload by workload — not a blanket architecture and certainly not a way to smuggle the corpus past a policy that exists for good reason.
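That test is simple enough to write down as code, which is a useful way to keep it from drifting in practice. The sketch below is a minimal Python rendering of the rule; the `Workload` and `Placement` types and the example workload names are hypothetical. The encrypted disaster-recovery case is deliberately left out, since the text assigns that decision to the operator's security office rather than to a rule the ML team applies.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    ON_PREM = "on-prem"
    CLOUD_ALLOWED = "cloud-allowed"


@dataclass(frozen=True)
class Workload:
    name: str
    sees_confidential_data: bool


def place(workload: Workload) -> Placement:
    """Apply the single governing test: confidential data pins a workload on-prem."""
    if workload.sees_confidential_data:
        return Placement.ON_PREM
    return Placement.CLOUD_ALLOWED


# Illustrative workloads, classified the way the patterns above describe them.
workloads = [
    Workload("retrain on operator image logs", sees_confidential_data=True),
    Workload("backbone sweep on synthetic data", sees_confidential_data=False),
    Workload("CI build and docs", sees_confidential_data=False),
]
for w in workloads:
    print(f"{w.name}: {place(w).value}")
```

Note that `CLOUD_ALLOWED` means permitted, not required: a workload that passes the test can still run on-prem if that is cheaper or simpler.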
The pattern, stated plainly
The build-versus-buy and cloud-versus-on-prem decision in subsurface AI is not really about infrastructure fashion. It is sorted by what an operator owns: the deeper and more confidential the proprietary subsurface corpus, the stronger the case to build on sovereign, on-prem compute and keep it there. An NOC sitting on years of irreplaceable image logs is on the build-and-host end of that spectrum almost by definition. A mid-tier independent with no comparable corpus can reasonably license and consume off-the-shelf tooling. The corpus is the moat, and the infrastructure should be shaped to defend it.
We have built and deployed this pattern with operators across the Middle East and the United States, and the same physics keeps producing the same answer: when the data is heavy and confidential and the training is sustained, you move the compute to the data, you keep the perimeter inside the LAN, and you reserve the cloud for the work that genuinely has nothing to hide.
Key takeaways
- The deciding constraint for confidential subsurface AI is the data, not the model. At ~1.5 GB per image log, with a multi-terabyte augmented corpus that is re-read hundreds of times across retrains, data gravity and IP confidentiality push compute on-prem before any cost spreadsheet is opened.
- Public cloud is priced for transient, bursty workloads. Confidential model training is sustained, data-resident, and multi-month — so cloud bill-shock comes from corpus egress/storage, weeks of reserved GPU time, and the compliance burden of putting crown-jewel logs in a multi-tenant environment.
- On-prem DGX-class compute wins on both axes. A DGX A100 node (4–8 A100s, up to 640 GB GPU memory, 2.5–5 petaFLOPS AI) or a 5–10 node SuperPOD (25–50 PFLOPS, 3–6 TB GPU memory, 200 Gb/s InfiniBand) keeps the corpus sovereign, while an economy tier (1080 Ti / custom nodes) keeps experimentation cheap and fast.
- Inference governance is architecture: a containerized Streamlit app, packaged with Docker, exposed on port 8501, and served LAN-only into the operator's own interpretation platform, means the corpus is protected by topology, not just by contract. It still delivered ~5x faster interpretation and ~60% productivity / ~75% accuracy gains for the operator's own staff.
- Hybrid is a scalpel, not a default. It is sound only where the public component touches no confidential data — public/synthetic pre-training, CI, or operator-keyed encrypted DR. The governing test for every workload: does it need to see confidential subsurface data? If yes, it stays on-prem.