Every paper about transformers on borehole imagery shows you the architecture diagram — the ResNet backbone, the encoder-decoder stack, the object queries — and almost none of them show you the thing that actually decides whether the model can learn at all: how a single well's image log becomes a pile of training tensors. In our work with a mid-sized Middle East carbonate operator, that translation was the unglamorous, load-bearing core of the whole supervised pipeline. A Detection Transformer cannot ingest a high-resolution borehole image log the way it sits in the raw binary log file. The log is a strip roughly 690,000 pixels tall and 360 pixels wide — the full unrolled borehole wall, hundreds of metres of section, weighing in at about 1.5 GB per image. No transformer takes a 690,000-row tensor. So before any architecture choice matters, you have to answer a data-design question: how do you cut this monster into patches, and how tall is a patch? This piece is about why that single number — the patch height — is a dataset decision that quietly sets the ceiling on everything the model can do afterward.
The raw object: what comes out of the binary log file
A processed high-resolution borehole image log is a cylinder unrolled into a plane. The 360-pixel width is the borehole circumference mapped to azimuth, 0 to 360 degrees; the height is depth. The vertical resolution is fine and unforgiving: in this dataset one pixel of log depth corresponds to about 3 cm of true depth, which means a single mislabelled or misaligned row carries a built-in ±3 cm depth error before the model does anything. Multiply 3 cm per pixel across a few hundred metres of logged section and you arrive at that ~690,000-pixel height. The file is a dense raster, and the geological features you care about — fractures, beddings — are sinusoids written across that raster, because a planar feature cutting a cylinder traces a sine wave when you flatten the cylinder.
That geometry is the whole reason patching is non-trivial. You are not tiling a photograph of cars where a box can land anywhere. You are slicing a continuous depth axis along which sine waves of known physical extent are drawn, and your slices must not routinely guillotine those sine waves in half.
Why you cannot feed the whole log to the model
Three independent reasons keep the 690,000×360 raster out of the transformer intact, and they all push the design the same way.
The first is brute memory. Self-attention is quadratic in sequence length, so a backbone feature map derived from a 690,000-pixel-tall image yields a token sequence no realistic GPU will hold, let alone attend over — and at ~1.5 GB per raw image, even staging the full log in device memory is a non-starter.
The second is supervision density. The ground-truth picks are sparse and local: a sinusoid lives in a window a couple of metres tall, not across the entire well. Hand the model the whole log per step and a handful of labelled sinusoids drown in a sea of unlabelled rows — the set-prediction objective has nothing crisp to assign its queries to.
The third is statistical: one well is one sample. If the well is the training example, fourteen wells is fourteen examples, and a data-hungry set predictor will never converge. Patching is what turns fourteen wells into thousands of training instances — and on this engagement, coverage of the logged interval ran from roughly 45% to 85% across wells, so the raw material per well is irregular too. Patching is also what lets you balance and augment around the sparse, unevenly distributed sinusoid-bearing windows.
The patch height is a geology decision, not a hyperparameter
Here is the claim this whole piece is built around: the patch height is not a knob you grid-search alongside learning rate. It is fixed by the physical size distribution of the features you are trying to detect, and you read it off the data before you train anything.
We measured the height of every labelled sinusoid in the corpus. The distribution is lopsided in a useful way. The average fracture height is about 2 m, the average bedding height about 0.25 m, and — the number that matters — more than 95% of fracture and bedding sinusoids are under 2 m tall. Convert that to pixels at this log's resolution and you land on 800 pixels ≈ 2.2 m as the patch height. At 800 pixels, a single patch contains a whole sinusoid for more than 95% of features in the dataset. The model gets to see the complete sine wave — full amplitude, full phase — which is exactly what it needs to regress dip and azimuth, rather than a truncated arc from which those parameters cannot be recovered.
The long tail is what forces the "more than 95%" framing rather than "all." The tallest fractures in the set reach ~9 m, about 3,200 pixels — four times the patch height. Some features will always be cut by any fixed window, and chasing them with a 3,200-pixel patch would quadruple every tensor and wreck the memory budget to catch a few percent of objects. So 800 pixels is the principled compromise: tall enough to contain the overwhelming majority of complete sinusoids, short enough to keep the token sequence and dataset size tractable. That is the entire argument — and it is a data argument, settled before the model exists.
Stride: turning a tall image into a balanced dataset
Choosing the patch height fixes what one sample looks like. Stride fixes how many samples you get and how they overlap. We slid the 800-pixel window down the depth axis with a stride of 40 pixels — a deliberately small step, roughly 11 cm, far smaller than the patch itself.
A stride that much smaller than the patch height does two things at once. It guarantees almost every sinusoid appears whole in some patch even when clipped at a neighbour's edge — a feature sliced by one window boundary is comfortably interior to the next. And it densely multiplies the dataset: overlapping windows convert a handful of wells into thousands of patches, the raw count a set-prediction model needs to learn the assignment between queries and ground-truth sinusoids. The final supervised image patches were 800×360 — full azimuthal width preserved, height set by the geology, count set by the stride.
The cost of a small stride is real and worth naming: heavy overlap inflates the dataset and can leak near-duplicate context between train and validation if you split carelessly. The discipline that goes with it is splitting by well or depth interval, never by random patch, so overlapping neighbours never straddle the boundary. Get the split wrong and the small stride flatters your metrics; get it right and it is pure signal.
The rest of the binary-log-to-tensor contract
Patch height and stride are the headline decisions, but a few more transformations turn a patch into a tensor the transformer can train on, and they are easy to underrate.
Normalise the parameters to the unit interval. Each sinusoid's targets — depth, dip, azimuth — live on wildly different scales: depth in metres, dip in 0–90 degrees, azimuth in 0–360 degrees. Feed those raw and the regression gradient is dominated by whichever parameter has the largest numeric range. We scale all three to 0–1 (dip by 90, azimuth by 360, depth within the patch) so no single parameter hijacks the L1 loss. This is a tiny line of code with an outsized effect on convergence.
Pick the right image channel. High-resolution borehole image logs ship as both static and dynamic normalisations, and the choice is not cosmetic — dynamic imagery preserves the local contrast that makes a sinusoid legible, and downstream that single choice is the difference between a model that learns and one that does not. The patching pipeline must match and carry the correct channel through; a mismatched or low-range channel silently degrades every patch it touches.
Guard the input distribution. Real binary wireline log files are not clean. We routinely saw logs whose pixel values arrived in a corrupted 0–15 range instead of the expected 0–255, and a depth axis that disagreed between the log-file header and the operator's interpretation by hundreds of metres in places. None of that is the model's problem to solve — it is the patching pipeline's job to detect, flag, and reconcile before a single tensor is written. The cheapest place to catch a data fault is upstream of the dataset, not in a confusing validation curve three days later.
Why the order of operations matters
Step back and the shape of the argument is the real takeaway. The sequence is: measure the feature-size distribution, derive the patch height from it, choose a stride small enough to keep features whole and multiply the data, preserve full azimuthal width, normalise targets, validate the channel, QC the raster — and only then hand tensors to the transformer. Every step is a data-engineering decision made before the model's first forward pass, and each one caps what the model can achieve no matter how good the architecture is. An 800-pixel patch derived from a real size distribution is not the same artefact as an 800-pixel patch picked because it is a round number: the first contains whole sinusoids by construction, the second only by luck.
This is the part of applied subsurface ML that never makes the architecture diagram, and it is the part that most often decides whether the project works. The transformer is interchangeable. The dataset is not.
Key takeaways
- A processed high-resolution borehole image log is a ~690,000x360-pixel raster weighing about 1.5 GB. No transformer ingests it whole: self-attention is quadratic in sequence length, supervision is sparse and local, and one un-patched well is only one training sample.
- The 800-pixel (~2.2 m) patch height is a DATASET decision read off the data, not a hyperparameter. It is fixed by the feature-size distribution: average fracture ~2 m, average bedding ~0.25 m, and >95% of sinusoids under 2 m fit whole inside 800 pixels — so the model sees complete sine waves to regress dip and azimuth from.
- The long tail forces the compromise: the tallest fractures reach ~9 m (~3,200 px), four patch-heights. Sizing the patch to contain those few percent would quadruple every tensor and break the memory budget, so 800 px is the principled tradeoff.
- A 40-pixel stride (much smaller than the 800-pixel patch) keeps almost every sinusoid whole in some window and multiplies a handful of wells into thousands of patches — the count a data-hungry set predictor needs. The cost is heavy overlap, so split by well/depth interval, never by random patch.
- The rest of the binary-log-to-tensor contract is underrated: normalise depth/dip/azimuth to 0-1 so no parameter dominates the L1 loss, carry the correct (dynamic) image-log channel, and QC the raster for corrupted 0-15 pixel ranges and depth-axis disagreements before writing a single tensor. These upstream choices cap the model's ceiling regardless of architecture.