The first time a data scientist loads a borehole image log and plots it, the picture has stripes — black vertical bands marching down the length of the image, regularly spaced, cutting clean through every geological feature. The instinct is to assume a corrupted file. It is not corrupted: those bands are the parts of the borehole wall the tool never touched, and learning to see them correctly is the first real skill in subsurface machine learning. In a roughly twenty-month engagement with a mid-sized Middle East carbonate operator, across 14 vertical wells logged with two different microresistivity imaging tools in a fractured carbonate play, the most underestimated step in the whole pipeline was not the model — it was understanding, and then filling, those holes. This piece is about where the holes come from, why they are coded the way they are, and why that understanding forces a specific, slightly counter-intuitive engineering choice.
The tool only touches part of the wall
A formation micro-imager does not photograph the borehole. It drags an array of electrodes, mounted on articulated pads, along the rock and measures micro-resistivity where metal meets wall. Between the pads, the tool measures nothing — there is no sensor in contact with that arc, so there is no reading to record.
How much it misses depends on the hole size and the tool. The headline number for the high-resolution borehole image log is coverage of roughly 80% of the borehole circumference in a typical hole, and only that high because the pads are wide and the hole is not enormous; widen the well and the same pads subtend a smaller fraction, so coverage drops further. The two tools here also differ sharply in how deep they read: the high-resolution borehole image log has a depth of investigation around 30 inches, while the compact microresistivity tool reads only about 0.90 inches into the formation. Same physics, very different footprints — and very different gap structure to reckon with downstream.
So when you unroll the cylindrical wall into the flat strip a model consumes — an image that in this dataset averaged roughly 690,000 × 360 pixels per well — the columns the pads covered are real measured resistivity, and the columns between them are empty. Those empty columns are the inter-pad gaps. They are not noise to be denoised; they are an absence of measurement, structurally periodic, and present in every image log you will ever load.
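The coverage arithmetic above can be sketched in a few lines. The pad count and pad width here are illustrative placeholders, not specifications of either tool in this engagement, and the 8-inch reference hole is an assumption chosen so the toy reproduces the roughly 80% figure. The only point is that, for a fixed pad width, coverage falls as the hole diameter grows.

```python
import math

def pad_coverage(n_pads: int, pad_width_in: float, hole_diameter_in: float) -> float:
    """Fraction of the borehole circumference in contact with electrodes.

    Assumes pads of fixed width pressed against a circular hole, so each pad
    covers an arc roughly equal to its own width.
    """
    circumference = math.pi * hole_diameter_in
    return min(1.0, n_pads * pad_width_in / circumference)

# Hypothetical geometry: 8 pads sized so that an 8-inch hole gives 80% coverage.
N_PADS = 8
PAD_WIDTH_IN = 0.8 * math.pi * 8.0 / N_PADS

for diameter in (8.0, 8.5, 12.25):
    cov = pad_coverage(N_PADS, PAD_WIDTH_IN, diameter)
    print(f"{diameter:5.2f} in hole: {cov:.0%} coverage")
```

Whatever the real pad geometry is, the unmeasured remainder is what shows up as gap columns in the unrolled image.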
The -9999 sentinel, and why it becomes NaN
A logging system cannot leave a pixel blank. The file format needs a number in every cell, including the cells where nothing was measured. The convention the industry settled on is a sentinel value: an impossible, out-of-range number that means "no data here." In this dataset, and in most microresistivity image-log exports you will meet, that sentinel is -9999.
This matters the moment a model touches the data, because -9999 is a perfectly valid float. Resistivity is never negative, so no real reading is -9999 — but a neural network does not know that. To a normalisation routine, -9999 is an extreme value that dominates the mean and crushes the real signal into a sliver of the dynamic range; to a convolution, it is a giant spike that smears across the receptive field. Feed raw image logs with the sentinel intact into a model and you are training on garbage with extreme prejudice.
The first transformation in any honest pipeline, therefore, is to replace every -9999 with NaN: to convert the operator's "impossible number" into the language's explicit "not a number." That is a one-line map and one of the highest-leverage lines in the codebase, because it turns a silent landmine into a loud one. NaN propagates through ordinary arithmetic, so a plain mean comes back NaN, and many estimators refuse to fit on it at all until you have dealt with it deliberately. The hole is now honest. It is also now your problem.
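A minimal sketch of that map, assuming the image has already been loaded into a NumPy array and that the export uses exactly -9999 as its null value (worth confirming against the file header, since other exports use other codes). The function name and the toy array are illustrative, not taken from the project's code.

```python
import numpy as np

SENTINEL = -9999.0  # "no data" code described above

def sentinel_to_nan(image: np.ndarray, sentinel: float = SENTINEL) -> np.ndarray:
    """Return a float copy of `image` with sentinel cells replaced by NaN."""
    out = image.astype(np.float64, copy=True)
    out[out == sentinel] = np.nan
    return out

# Toy 3 x 6 strip: columns 2 and 3 stand in for an inter-pad gap.
raw = np.array([
    [12.0, 15.0, -9999.0, -9999.0, 14.0, 13.0],
    [11.0, 16.0, -9999.0, -9999.0, 15.0, 12.0],
    [10.0, 17.0, -9999.0, -9999.0, 16.0, 11.0],
])

clean = sentinel_to_nan(raw)

naive_mean = raw.mean()          # dragged far negative by the sentinel
honest_mean = np.nanmean(clean)  # mean over measured cells only
print(f"mean with sentinel: {naive_mean:.1f}, mean of measured cells: {honest_mean:.1f}")
```

The input array is left untouched, so the raw export can still be inspected after the conversion.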
Why you cannot just train through the gaps
A reasonable engineer might ask: why not let the model ignore NaN columns and learn around them? Because the geological feature you are trying to detect runs through the gap. Unroll the borehole and a planar fracture or bedding plane projects to a sinusoid — a clean sine wave whose amplitude encodes dip and whose phase encodes azimuth — and that sinusoid crosses every inter-pad gap on its way down the image. A detector that sees a sine wave shattered into disconnected arcs, a vertical NaN band slicing each one, is being asked to learn the geometry of a curve from its fragments. The gaps do not sit politely beside the signal; they amputate it.
This is why imputation is not optional preprocessing in this domain — it is the load-bearing step that makes everything after it (the patching, the augmentation, the Detection Transformer that ultimately picks the sinusoids) even possible. Get the fill wrong and you have taught the model to trace a curve that was never really there.
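The sinusoid geometry can be made concrete with a short sketch. For a plane of dip angle θ and dip azimuth α cutting a hole of diameter d at depth z0, the trace on the unrolled image is z(φ) = z0 + (d/2)·tan θ·cos(φ − α), taking depth as positive downward. The one-column-per-degree layout, the hole diameter, and the gap pattern below are assumptions made for illustration only.

```python
import numpy as np

def fracture_trace(azimuth_deg, z0_m, dip_deg, dip_azimuth_deg, hole_diameter_m):
    """Depth of a planar feature's trace at each azimuth around the hole.

    Depth is positive downward, so the trace is deepest at the dip azimuth.
    """
    radius = hole_diameter_m / 2.0
    amplitude = radius * np.tan(np.radians(dip_deg))
    phase = np.radians(azimuth_deg - dip_azimuth_deg)
    return z0_m + amplitude * np.cos(phase)

azimuth = np.arange(360.0)  # assumed: one image column per degree
trace = fracture_trace(azimuth, z0_m=1500.0, dip_deg=45.0,
                       dip_azimuth_deg=120.0, hole_diameter_m=0.2)

# Hypothetical gap layout: 8 evenly spaced bands, each 9 columns wide (80% coverage).
gap = (azimuth % 45) >= 36
visible = np.where(gap, np.nan, trace)

print(f"peak-to-trough height: {trace.max() - trace.min():.3f} m")
print(f"columns lost to gaps:  {int(np.isnan(visible).sum())} of {azimuth.size}")
```

Plotting `visible` against `azimuth` shows the curve broken into separate arcs, which is what a detector sees if the gaps are left unfilled.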
Three candidate fills, and the property that decides between them
For this engagement the team evaluated three families of NaN-filling method: 1D interpolation, a GAN-based inpainting approach, and KNN imputation. They look interchangeable on a slide. They are not, and the criterion that separates them is one word: continuity. The unrolled feature is a continuous curve, so a good imputation has to hand the detector back a curve that is still continuous where the gap used to be. Score the three methods on that and a clear, slightly surprising ranking emerges.
- 1D linear interpolation is dazzlingly fast, and it cheats. It works row by row, drawing a straight line across each gap between the nearest measured pixels on its left and right. That flattens the curve through any wide gap and leaves vertical-line artifacts. The curve technically connects, but its shape is a lie through the gap.
- GAN inpainting (the team trialled a GAIN-style imputer) produces texture that looks locally realistic — it can hallucinate convincing rock — but is globally wrong. The adversarial objective optimises for plausible local patches; it never constrains the two sides of the gap to belong to the same sinusoid. In the team's own assessment, GAIN imputation was simply "not very good" for these image logs, because continuity was not preserved.
- KNN imputation fills each missing pixel from its 5 nearest neighbours (n_neighbors = 5), applied across both the dynamic and static channels. Because the nearest neighbours of a point on a sinusoid lie along that sinusoid, KNN interpolates with the curve rather than across it, and the recovered sine wave stays continuous through the null band.
The verdict was unambiguous: KNN was selected as the best imputer because it preserves image-log feature continuity while being less computation-intensive than the GAN and iterative alternatives, both of which failed the continuity test outright.
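For readers who want to set up this kind of comparison themselves, here is a hedged sketch that runs a row-wise linear fill and a KNN fill side by side on a synthetic patch. It assumes scikit-learn's `KNNImputer`, whose `n_neighbors` parameter matches the name quoted above; the source does not say which library the team used. That imputer treats each image row as a sample and fills a missing cell with the mean of that column over the 5 rows most similar on their shared measured columns. The synthetic patch, the gap layout, and the drift that mimics tool rotation are all invented for illustration, and the error figures it prints say nothing about real image logs.

```python
import numpy as np
from sklearn.impute import KNNImputer

def interpolate_rows(image: np.ndarray) -> np.ndarray:
    """Fill NaNs in each row by linear interpolation between measured neighbours.

    Does not wrap around the 0/360 seam; edge gaps take the nearest measured value.
    """
    filled = image.copy()
    cols = np.arange(image.shape[1])
    for row in filled:  # each `row` is a view, so writes land in `filled`
        missing = np.isnan(row)
        if missing.any() and not missing.all():
            row[missing] = np.interp(cols[missing], cols[~missing], row[~missing])
    return filled

rng = np.random.default_rng(0)
n_rows, n_cols = 80, 120
rows = np.arange(n_rows)[:, None]
cols = np.arange(n_cols)[None, :]

# Synthetic patch: a dark sinusoidal trace on a brighter, slightly noisy background.
trace_row = 40 + 20 * np.cos(2 * np.pi * cols / n_cols)
patch = 10.0 - 8.0 * np.exp(-((rows - trace_row) ** 2) / 8.0)
patch += rng.normal(0.0, 0.1, size=patch.shape)

# Hypothetical gaps: 6 columns wide every 30 columns, drifting one column every
# 4 rows so that no column is missing in every row of the patch.
gap = ((cols - rows // 4) % 30) < 6
gappy = np.where(gap, np.nan, patch)

linear_filled = interpolate_rows(gappy)
knn_filled = KNNImputer(n_neighbors=5).fit_transform(gappy)

for name, filled in (("linear", linear_filled), ("knn", knn_filled)):
    rmse = np.sqrt(np.mean((filled[gap] - patch[gap]) ** 2))
    print(f"{name}: RMSE inside gaps = {rmse:.3f}")
```

One practical caveat: by default `KNNImputer` drops any column that is NaN in every row it is fitted on, so a patch whose gap columns never drift would come back narrower than it went in. The toy avoids that by construction; on real patches it is worth checking the output shape.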
The counter-intuitive part: the fancy method loses
Here is where new data scientists are most often led astray. A GAN is the more sophisticated, more modern, more impressive tool. Reach for it and you will lose. Imputation quality is not measured by how realistic the fill looks but by whether it preserves the structure the next model depends on. An adversarial network optimising for local realism has no incentive to keep a sinusoid coherent across a gap of tens of pixels, and so it does not. A humble nearest-neighbours imputer, which makes no attempt at realism and simply borrows from the curve's own neighbourhood, wins on the only metric that matters here.
The compute numbers seal it, with a twist worth internalising. On a short 4-metre interval, 1D interpolation ran in about 0.115 seconds on average while the KNN imputer took about 2.625 seconds, so KNN is more than twenty times slower at that scale. Scale up and the gap does not close, it widens: across a whole well, 1D interpolation finished in roughly 11 seconds, while a naive KNN pass over an entire well never finished in the team's runs. The operational answer was not "pick the fastest." It was to run the continuity-preserving KNN imputer where it matters, on the patches the detector will actually consume, rather than brute-forcing it over hundreds of thousands of rows at once. The engineering win was choosing the method by its effect on the downstream model and then scoping where it runs, not chasing raw speed or raw sophistication.
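The scoping idea can be sketched as a thin wrapper that tiles the image along depth and hands each tile to whatever imputer is chosen. The function names, the 512-row patch size, and the row-mean placeholder imputer are all hypothetical; in practice the placeholder would be swapped for the KNN imputer, and the tiling would follow the patches the detector actually consumes.

```python
from typing import Callable, Iterator, Tuple

import numpy as np

def iter_patches(n_rows: int, patch_rows: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) row ranges that tile the image along depth."""
    for start in range(0, n_rows, patch_rows):
        yield start, min(start + patch_rows, n_rows)

def impute_by_patch(image: np.ndarray,
                    impute_fn: Callable[[np.ndarray], np.ndarray],
                    patch_rows: int = 512) -> np.ndarray:
    """Apply `impute_fn` to consecutive depth patches instead of the whole well.

    `impute_fn` must return an array with the same shape as its input.
    """
    out = np.empty_like(image)
    for start, stop in iter_patches(image.shape[0], patch_rows):
        patch = image[start:stop]
        out[start:stop] = impute_fn(patch) if np.isnan(patch).any() else patch
    return out

def fill_with_row_mean(patch: np.ndarray) -> np.ndarray:
    """Placeholder imputer: replace NaNs with the mean of the measured cells in the row."""
    filled = patch.copy()
    row_means = np.nanmean(patch, axis=1, keepdims=True)
    missing = np.isnan(patch)
    filled[missing] = np.broadcast_to(row_means, patch.shape)[missing]
    return filled

# Toy "well": 2,000 rows by 12 columns with one gap band in columns 4 and 5.
well = np.arange(2000 * 12, dtype=float).reshape(2000, 12)
well[:, 4:6] = np.nan

filled = impute_by_patch(well, fill_with_row_mean, patch_rows=512)
print(list(iter_patches(well.shape[0], 512)))
```

The reason this helps is that a brute-force neighbour search compares rows against rows, so its cost grows roughly with the square of the row count; many small patches are far cheaper than one pass over the full well.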
What this means for the pipeline you build
Strip the geology away and the transferable lesson is a data-engineering one. Missing data in subsurface imagery is structured, sentinel-coded, and load-bearing. Structured, because the gaps are periodic, not random — they recur at the pad spacing, so any imputation has to respect a known spatial pattern rather than treat each NaN independently. Sentinel-coded, because the absence arrives disguised as a valid number, -9999, that silently poisons normalisation until you convert it to NaN on purpose. And load-bearing, because the feature you care about runs straight through the holes, so the fill becomes part of the signal the model learns — which makes imputation a modelling decision, not a cleaning chore.
That reframing is the whole point. Across our subsurface engagements — with operators in the Middle East and the United States — the teams that move fastest treat the hole as a first-class object: they characterise its geometry, make its absence explicit, and choose the fill by what it does to the next stage of the pipeline. Understanding the hole is genuinely the prerequisite to filling it well. And once you understand it, the choice between KNN and a GAN stops being a question of which tool is more impressive and becomes a question of which tool keeps a sine wave a sine wave.
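Characterising the hole's geometry, as described above, can start from something as simple as measuring the NaN runs in a single row. The sketch below is one way to do that, assuming the sentinel has already been converted to NaN. The helper name and the toy row are hypothetical, and the helper does not merge a run that wraps around the 0/360 seam.

```python
import numpy as np

def nan_runs(row: np.ndarray):
    """Return (start, length) for each contiguous run of NaNs in a 1D array."""
    missing = np.isnan(row).astype(np.int8)
    # Pad with zeros so runs touching either edge still produce a start and a stop.
    edges = np.diff(np.concatenate(([0], missing, [0])))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), (stops - starts).tolist()))

# Toy 360-column row with a hypothetical layout: 9 missing columns in every 45.
row = np.ones(360)
row[(np.arange(360) % 45) >= 36] = np.nan

runs = nan_runs(row)
starts = np.array([start for start, _ in runs])
widths = np.array([length for _, length in runs])
spacing = np.diff(starts)
coverage = 1.0 - widths.sum() / row.size

print(f"{len(runs)} gaps, widths {sorted(set(widths.tolist()))}, "
      f"spacing {sorted(set(spacing.tolist()))}, coverage {coverage:.0%}")
```

Run per row or per depth window, the same summary shows whether gap width and spacing stay constant down the well or drift as the tool rotates and the hole size changes.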
Key takeaways
- Microresistivity imaging tools measure resistivity only where pads contact the wall: about 80% of the borehole circumference for the high-resolution borehole image log in a typical hole, and less in wider wells. The unmeasured inter-pad arcs appear as periodic vertical gaps in the unrolled image (here averaging ~690,000 × 360 pixels per well).
- The two tools differ in reach: the high-resolution borehole image log's depth of investigation is ~30 inches versus ~0.90 inches for the compact microresistivity tool — same physics, very different footprints and gap structure downstream.
- Unmeasured pixels are written as the sentinel -9999. Because -9999 is a valid float that wrecks normalisation, the first pipeline step is to map -9999 → NaN, turning a silent landmine into an explicit, must-handle hole.
- Geological features are sinusoids that run through the gaps, so the gaps amputate the signal. Imputation is therefore load-bearing preprocessing, not optional cleaning — the fill becomes part of what the detector learns.
- Among 1D interpolation, GAN/GAIN inpainting, and KNN, the deciding property is continuity across the gap. KNN (n_neighbors = 5, applied to the dynamic and static channels) interpolates along the sinusoid and stays continuous; the GAN produced locally realistic but globally broken fills; 1D flattens the curve. KNN was selected as the best imputer.
- Compute is a scoping problem, not a tiebreaker: 1D ran ~0.115 s vs KNN ~2.625 s on a 4 m interval, and 1D finished a whole well in ~11 s while a naive whole-well KNN pass never finished. The answer was to run continuity-preserving KNN on the patches the detector consumes — choose the fill by its effect on the downstream model, then scope where it runs.