Skip to main content

Blog

Productizing a Paper: Turning a Borehole Transformer Into AutoFrac, AutoVug, and Well-to-Well

A peer-reviewed borehole transformer with a 65% F1 abstract is not a product. The gap from that number to AutoFrac, AutoVug, and a well-to-well tool running on-prem — 5x faster picking, +60% productivity, 95% target precision — is packaging, serving, and CI/CD, not a new model architecture.

Tarry Singhby Tarry Singh10 min read
EarthScan insight

There is a quiet, expensive lie that applied-research teams tell themselves: that once the paper is accepted, the hard part is over. We learned the opposite in a roughly twenty-month subsurface-AI engagement with a mid-sized Middle East carbonate operator we partnered with. The model that anchored that work — a Detection Transformer that picks fractures, beds, and vugs off the image logs of two different microresistivity imaging tools as a set of sinusoids — earned its peer-reviewed metrics honestly: an F1 around 65% at a 3 cm depth threshold for fractures, dip and azimuth accuracy in the high eighties and nineties at sensible tolerances. Respectable research numbers — and completely useless to a petrophysicist facing a backlog of more than eighty wells, because a model is not a tool. The distance between a published architecture and a thing an interpreter opens on a Monday morning is almost entirely an engineering distance — packaging, serving, evaluation harnesses, and CI/CD — and it is the distance most teams catastrophically underestimate.

This piece is about how that one transformer became three deployed products — AutoFrac, AutoVug, and a well-to-well correlation tool we called W2W — shipped as a v0.8 stack onto the operator's own on-premise hardware. The headline outcomes are the kind that survive a procurement review: roughly 5x faster interpretation on the fracture and vug tools, and on W2W a +60% lift in interpreter productivity, +75% improvement in interpretation accuracy, 95% precision against the target horizons, and 90% stratigraphic correlation success. None of those numbers came from a better loss function. They came from the unglamorous layer above the model.

A model is a function; a product is a contract

The cleanest way to see the gap is to look at what the research artefact actually is. The model is a function: image patch in, a set of typed sinusoids out, each with a regressed depth, dip, and azimuth and a class probability you threshold at 0.8 at inference. That function is the entire content of the paper — and, on its own, unusable. It assumes a clean, dynamically-scaled patch of known pixel height. It emits tensors, not the eight-column interpretation-software-ready output — predicted bedding boundaries, sinusoids, fractures, and vugs — an interpreter loads beside the raw log. It has no notion of a whole well, only a 1.8-metre patch. And it runs in a notebook on a research GPU, not as a service a non-coder can invoke.

A product, by contrast, is a contract. AutoFrac promises: hand me the binary wireline log file for a well, and I will hand you back depth-registered fracture picks, in your interpretation software's format, in minutes, repeatably, on your hardware, behind your firewall. Honouring that contract is what consumed the back half of the engagement, and almost none of it touched the transformer's weights.

THE 85% UNDER THE MODEL · 6-LAYER STACK~50%of pilots never ship3 / 6 layers load-bearingBuild the stack up — the model is only the capA model is only as production-ready as the weakest layer below it.production ceilingModel — ~15% of the journey⤓ detached — POC purgatoryHPCbuilt · load-bearingData engineeringbuilt · load-bearingData unificationbuilt · load-bearingAI / MLdrift watch is decorativeAgentsunauditablePlatform & deploymentoutside the perimeterbuild linedrag the build line ↑ · column sizing schematicWHY THE PILOT STALLS3 layers missing below the model.Lowest gap — AI / ML:drift watch is decorative.The model can't reach production overan incomplete stack. It joins the ~50%that never ship — a failure of plumbing.The working model is ~15% of the journey.The other ~85% is the six-layer stack —and pilots die where the stack has seams.Own the stack: data + weights stay in your perimeter.~15% model / ~85% stack, the six named layers & the ~50%-never-ship figure are the whitepaper's own · column sizing schematic
Pilots don't stall because the model is weak. The working model is only ~15% of the journey; the other ~85% is a six-layer engineering stack (HPC → Data engineering → Data unification → AI/ML → Agents → Platform/deployment), and a project ships only when every layer below the model is built to production grade. Drag the build line up the load-bearing column: with all six built the model reaches the production ceiling; with any gap below it the model detaches into POC purgatory — the ~50% that never ship. The ~15%/~85% split, the six layers and the ~50% figure are the whitepaper's own; the equal-sixths column sizing is schematic.

The five layers between the abstract and AutoFrac

It helps to name the layers explicitly, because each one is a place where a "finished" model quietly fails to become a tool.

1. The data-engineering layer — the model's actual API. The transformer was trained on a very specific input distribution: dynamically-normalised image-log patches, 1.8 m tall, image-log values scaled to the unit interval, augmented 5x to 10x to inflate an original set of roughly 900 image–ground-truth pairs into more than 55,000 — a 65x expansion. Every one of those preprocessing decisions is now a runtime contract. If the ingestion pipeline that turns a raw image-log file into model-ready patches diverges from the training transform by even the depth-registration offset, the model degrades silently and the interpreter blames the model. So the first product we built was not a model server at all; it was a deterministic ingestion-and-tiling pipeline that reproduces the training transform exactly — depth indexing, dynamic scaling, overlapping-patch tiling — under test. The preprocessing is the API.

2. The post-processing layer — from tensors to geology. The decoder emits a fixed-size set of queries per patch; most are the no-object class. Turning that into the eight-column geological output a petrophysicist trusts means stitching overlapping-patch predictions back into a continuous wellbore, de-duplicating sinusoids that span a patch boundary, converting normalised parameters back into physical depth/dip/azimuth, and writing it all out in an interpretation-software-importable layout. This is undifferentiated plumbing, and it is exactly where research code is thinnest — and where a product lives or dies.

3. The serving layer — packaging for someone else's machine. The research model trained on a stack of NVIDIA 1080Ti cards with only 8 GB of memory per machine; production inference had to run on the operator's own tiered hardware, from those economy GPU boxes up to a DGX-class server, entirely on-premise. There is no cloud endpoint in this story — the operator's data is confidential field imagery and never leaves their walls. That single constraint dictates everything: the model is containerised with a pinned CUDA and dependency set, batch sizes are tuned to the 8 GB floor, and the whole thing is delivered as an artefact that installs behind the firewall. "It runs on my laptop" is not a deployment; "it runs on their DGX with no internet" is.

4. The evaluation layer — the metric the interpreter actually believes. A published F1 at a 3 cm threshold is a validation-set number. A product needs a standing evaluation harness: blind-zone tests, repeatability checks across re-runs, and a depth-thresholded scorecard the operator's own geoscientists can re-run on new wells. This harness is what converts "the paper says 65%" into "on your last well, against your interpreter's picks, here is the agreement." It is also the guardrail that catches the day the input distribution drifts and the model needs retraining.

5. The orchestration layer — packaging the three tools into a workflow. AutoFrac and AutoVug are sibling inference paths over the same backbone. W2W is the integration product: it consumes their per-well outputs and correlates stratigraphy across wells, which is where the +60% productivity and 95% target precision live. None of that exists at the model level — it is an application built on top of three model contracts.

Where the speed actually comes from

The 5x speedup is the number people remember, so its provenance matters — because it is an engineering result, not a modelling one. End to end, the productized pipeline interprets roughly a metre of image log in about 30 seconds, against the four-to-five minutes a manual or classical-CV pass takes per metre — closer to an eightfold edge once you account for the interpreter never leaving their software. The model's forward pass is a small fraction of that 30 seconds; the rest is the ingestion, tiling, stitching, and export we just walked through. Throughput is a systems property, not a weights property: you cannot buy this speedup with a bigger model, you earn it by making every layer above the model fast and deterministic.

And here is the part product leaders consistently get backwards. A 5x throughput dividend reads, to a CFO, as "cut four of every five interpreters." That is the wrong move, and it throws the dividend away. In a region carrying a backlog of more than eighty wells, the value of the dividend is coverage: the same scarce interpreters now clear five times the acreage, attack the backlog instead of perpetually triaging it, and spend their judgement on the 5% of picks the model flags as low-confidence rather than on the 95% it gets right. Hold the headcount and the dividend becomes capacity. Spend it on cuts and you are back to a backlog, just with a smaller team.

THE 10× DIVIDEND · 6–18 WEEKS → HOURS10× acreageinterpreted at 100% of the teamARTICLE’S THESISSpend the 10× dividend — on a smaller team, or on more acreage?Cutting to hold output flat throws the dividend away; holding the team converts every point of it into coverage.← spend on headcount cutsspend on more acreage →team → 10% · acreage → 1×team → 100% · acreage → 10×← drag to allocate · ←/→ keysTeam retainedAcreage covered100%10×of 10× ceilingsame headcount → ten times the interpreted acreage~10× throughput (6–18 weeks → hours) is the article's own · acreage = team × throughput.Headcount-retained axis (10–100%) is the arithmetic inverse of 10× — illustrative; no headcount number is sourced.
The naive read of a 10× productivity jump is 'cut 90% of the team.' The article argues the opposite: at the same headcount, the dividend buys ten times the interpreted acreage. Drag the allocator — spend the dividend on headcount cuts and acreage stays at 1×; hold the team and acreage climbs toward 10×. The single orange marker is the article's own position. The ~10× throughput multiple (6–18 weeks → hours) is sourced; the headcount-retained axis is the arithmetic inverse of 10× and is labelled illustrative, since the article states no specific headcount figure.

CI/CD is the product, not a chore

If there is one discipline that separates a deployed v0.8 from a clever notebook, it is treating the whole stack — preprocessing transform, model weights, post-processor, and evaluation harness — as a single versioned, continuously-integrated artefact. When the operator's geoscientists report a new failure mode on a fresh well, the response is not "retrain and email a new checkpoint." It is: add that well to the regression set, reproduce the failure in the evaluation harness, fix the layer responsible, and let CI prove that the eight-column output is unchanged on every prior well before the new build ships. The transformer's weights are one versioned component among several. The augmentation regime, the 0.8 inference threshold, the patch geometry, the export schema — all of them are configuration that must move together or the contract breaks.

This is the mindset shift the title is really about. The paper produced a model. Productizing it meant building everything the paper assumed away — and then wrapping that everything in the same release engineering you would give any other piece of production software. Across our engagements with operators in the Middle East and the United States, the pattern holds without exception: the teams that ship are the ones who treat the model as the easy part.

What the gap actually costs — and why it is worth it

The uncomfortable lesson is that the second half of an applied-AI programme — the productization half — is often larger than the first, and it decides whether anyone outside the research team ever benefits. The model got the operator a 65% F1 on a slide. The stack got them AutoFrac, AutoVug, and W2W on their own DGX, behind their own firewall, clearing their own backlog, with +60% productivity and 95% target precision. The architecture was the price of admission; the packaging, serving, evaluation, and CI/CD were the product.

Key takeaways

  1. A peer-reviewed model is a function, not a product. The borehole transformer's ~65% F1 abstract was honest research; turning it into AutoFrac, AutoVug, and W2W — 5x faster picking, +60% productivity, +75% accuracy, 95% target precision, 90% stratigraphic success — came from the engineering layers above the weights, not a new architecture.
  2. The model's training-time transform is a runtime contract: deterministic ingestion that reproduces the dynamic-scaling, 1.8 m patch tiling, and 5x–10x augmentation regime exactly is the model's real API. Diverge from it and the model degrades silently while users blame the model.
  3. The 5x/~8x speedup (≈30 s per metre vs 4–5 min) is a systems property, not a weights property. The forward pass is a small fraction of end-to-end time; ingestion, stitching, and eight-column interpretation-software export dominate — so you earn throughput by making every layer fast and deterministic.
  4. Serving is dictated by the operator's constraints: on-prem only (confidential field imagery never leaves the firewall), containerised with pinned CUDA, batch sizes tuned to an 8 GB GPU floor and scaling up to a DGX-class server. 'Runs on my laptop' is not a deployment.
  5. Treat the throughput dividend as coverage, not headcount cuts: with an 80+ well backlog, the same interpreters clear 5x the acreage and focus judgement on the low-confidence minority. And ship the whole stack — transform, weights, post-processor, eval harness — as one CI/CD-versioned artefact, because that is the actual product.
Go to Top

© 2026 Copyright. Earthscan