There is a moment, on every applied-AI engagement that lasts longer than a quarter, when someone asks the question that quietly decides whether the work was science or theatre: can you reproduce the number on slide 14? In our work with a mid-sized Middle East NOC carbonate operator we partnered with — a roughly twenty-month programme to detect fractures and beddings on borehole image logs with a Detection-Transformer-derived model — that number was a 14-well ablation table. It said classification error fell from 93.1% on three wells to 2.5% on fourteen, that dynamic borehole-image imagery crushed static, that a from-scratch ResNet-10 beat every deeper backbone. Those numbers carried the entire scientific argument. And they were only worth anything because, by the time we ran them, every dataset and every training run that produced them was versioned, named, and recoverable. This piece is about the boring infrastructure that made that possible: data versioning with DVC and experiment tracking with Weights & Biases. It is the least glamorous part of a subsurface-AI build, and it is the part that decides whether anyone can trust the rest.
The failure mode you don't see coming
The trap is not that researchers are careless. It is that the natural unit of a research codebase — a folder of data and a script that trains on it — has no memory. Six months into a programme you have a model checkpoint that performs well, a directory of preprocessed image patches, and a results figure. Then a reviewer asks a sharp question, you re-run the experiment to answer it, and the number comes out different. Now you are debugging a ghost. Was it a different split? A different augmentation seed? Did the patch-extraction stride change? Did someone re-export the ground-truth picks? You cannot tell, because none of those three things — the data, the code, the run — was pinned to the others.
This is acute in subsurface work for a reason that has nothing to do with deep learning. The data itself is enormous, slow-moving, and constantly re-derived. A single borehole-image log is around 1.5 GB. Raw downhole-log files are converted to image strips, normalised, sliced into overlapping patches, and matched against expert dip picks exported from interpretation software. Every one of those steps is a transformation with parameters, and every parameter is a place where two "identical" datasets can silently diverge. When we grew the working set from roughly 900 image–ground-truth pairs to over 55,000 — a 65× expansion driven by overlapping patches and augmentation — the dataset stopped being something a human could eyeball and started being something only a content hash could verify.
Versioning the data, not just the code
Git is excellent at versioning code and useless at versioning a 55,000-image dataset. DVC (Data Version Control) exists to close exactly that gap. It keeps a small text pointer in Git — a .dvc file containing a content hash and a remote path — while the heavy bytes live in object storage. Checking out an old commit checks out the pointer; dvc pull then materialises the exact dataset that commit was trained on. The effect is that a dataset acquires a commit history the same way code does, and a training run can be tied to a precise data state rather than to "whatever was in the folder that week."
That binding is the whole point. Reproducibility is not a property of data or of code in isolation; it is a property of the pair. DVC makes the pair addressable: one Git SHA now resolves to one code state and one data state, so "re-run experiment X" stops being an act of faith.
The discipline it enforced on us was as valuable as the tooling. Early on, datasets were named like research output usually is — by whatever was salient that day, sometimes by an opaque hash. That works until you have a dozen of them and need to explain to a client geologist which dataset produced which curve. We ended up running a data-management surface through twelve numbered iterations over the programme, and the naming convention mattered as much as the storage: a dataset name had to encode its provenance — which wells, dynamic or static imagery, what patch stride — because a name that cannot be decoded is a name that will eventually be mistrusted. When we cut the patch stride from 40 to 80 pixels across a roughly 92,000-patch overlapped-and-augmented set, that was a new dataset version, not an edit to the old one. The old one still existed, still pulled, still reproduced its own results.
Tracking the experiments, not just the best model
Versioned data answers what did I train on. Experiment tracking answers what happened when I did. Weights & Biases was the system of record for the second half. Every training run logged its hyperparameters, its loss curves, its evaluation metrics, and — critically — a pointer back to the data version and the resulting checkpoint. A run was no longer a transient event in someone's terminal; it was a durable, queryable artifact.
The convention that made this legible was baking the experiment's identity into the checkpoint name itself. A best-model file in this programme looked like 0.0004_250_resnet10_l1_focal_<timestamp> — learning rate 0.0004, 250 epochs, a ResNet-10 backbone, an L1 regression loss on depth/dip/azimuth, and a focal classification loss, with a Unix timestamp pinning the exact run. Read that filename and you have reconstructed the experiment without opening a single config. It is a small thing. It is also the difference between a checkpoint you can defend in a review and one you found in a folder and hope is the right one. Pair it with the W&B run that produced it — same hyperparameters, same logged data version — and the artifact is fully self-describing.
The instrument above is from a different, later engagement, but the mechanism it dramatises is the one at stake here. Its argument is that the expensive thing is rarely the retrain — it is the window in which nobody could tell a model had gone stale, because there was no lineage to compare against. Versioned data and tracked experiments are what collapse that window: when every run is pinned to a data state, you can see, the moment a new well lands, exactly what changed and whether the model genuinely improved or merely moved. Without the lineage, "the model is better now" is an assertion. With it, it is a diff.
What lineage let us actually prove
The payoff was not abstract. The model's quality came in observable steps, and we could attribute each step to a specific change because the data and runs were versioned. The W&B record traced a clean evolution across three milestones — call them the July, September, and November checkpoints — as the training set moved from an early three-well dynamic-imagery dataset to an eight-well one, and the classification, depth, dip, and azimuth errors fell in lockstep. Each milestone was a tracked run against a pinned dataset, so the improvement was a comparison between two known states, not a vibe.
That lineage is also what let us trust the ablations that carry the scientific claim. The well-count sweep is the cleanest example: classification error of 93.1% at three wells, 18.4% at six, 1.06% at nine, 0.82% at eleven, and 2.54% at the full fourteen-well fractures dataset. A curve like that — steep, non-monotonic at the tail — is exactly the kind of result a sceptical reviewer probes. The only honest way to defend it is to be able to re-run any single point on demand, with the precise data and code that produced it. We could, because each point was a tracked run over a versioned dataset. A subtler win came from a quieter comparison: moving from eight wells to eleven improved the mean absolute error on depth, dip, and azimuth by only about 0.007 in normalised units. That is a small enough delta that, without pinned data states on both sides, you would have no way to know whether it was a real gain or measurement noise. Lineage is what turns a 0.007 into a defensible finding instead of a rounding error.
The engineering reading
It is tempting to file all of this under "good hygiene" and move on. That undersells it. For an applied-AI programme — and this was a deep-learning, computer-vision build with a from-scratch ResNet-10 feature extractor feeding a transformer detector, trained on bespoke geomatics-grade image pipelines — data versioning and experiment tracking are not hygiene; they are the substrate the science runs on. The architecture, the augmentation policy, the loss design, the backbone sweep: every one of those decisions was validated by a comparison, and a comparison is only valid if both sides of it are recoverable. DVC made the data side recoverable. Weights & Biases made the run side recoverable. Together they made the comparison — which is to say, the result — trustworthy.
The practical advice is unglamorous and load-bearing. Version your data with the same seriousness you version your code, and bind the two so one commit resolves to both. Track every run, not just the winner, and make the artifact self-describing — encode the experiment in the checkpoint name and pin it to its data version. Adopt a dataset-naming convention that survives a year and a dozen iterations. Do this from the first week, not the week before the model goes to production, because the lineage you did not capture is gone, and the number on slide 14 is only as good as your ability to reproduce it on demand. We have built these pipelines for operators across the Middle East and the United States; the data, the geology, and the model change from one to the next, but this rule does not.
Key takeaways
- Reproducibility is a property of the (code, data) pair, not of either alone. Git versions code and is useless on a 55,000-image, 65×-augmented dataset; DVC pins the heavy data to a content hash so one commit resolves to one exact code-and-data state.
- In subsurface work the data is the volatile part: a single borehole-image log is ~1.5 GB, and every preprocessing step (log-to-image conversion, normalisation, patch stride, augmentation) is a place two 'identical' datasets silently diverge. Cutting patch stride 40→80 px is a new dataset version, not an edit.
- Track every run, not just the best model. Weights & Biases logs hyperparameters, losses, metrics, and a pointer to the data version; baking the experiment into the checkpoint name (e.g. 0.0004_250_resnet10_l1_focal_<timestamp>) makes the artifact self-describing.
- Lineage is what makes a comparison defensible. The W&B milestone trail (3-well → 8-well dataset across three checkpoints) and the well-count ablation (93.1% → 18.4% → 1.06% → 0.82% → 2.54% class error from 3 to 14 wells) are only trustworthy because every point can be re-run from its pinned data and code.
- Set up versioning and tracking in week one. The 0.007-MAE gain from 8→11 wells is only distinguishable from noise because both data states were pinned. Uncaptured lineage cannot be reconstructed after the fact — and an unreproducible ablation table is just a slide nobody can defend.