Most computer-vision tutorials open by resizing every image to a square, usually 224 by 224, and never mention it again. That single line of preprocessing is invisible because for natural photographs it is almost free: a cat is still a cat at half the resolution. It is not free for a scanned well log. A raster log is a tall, thin strip of paper that has been photographed and shipped as a grayscale image, and the information you are after is a curve one to three pixels wide tracing across it. Squash that strip to a square and you have thrown away the very signal you came for. So the first design question on this task was not which backbone to use. It was a more basic one that the tutorials skip: how wide is an input allowed to be, and what does its width cost you?
This post is about that question and nothing else. It is deliberately narrower than the work it comes out of, which also involved a procedural generator that synthesised training logs from scratch. Here I am setting all of that aside to look at one variable, the input width, and trace how it propagated through the data loader, the memory budget, and ultimately the batch size we could afford. The synthetic-data story is its own piece. This one is about dimensions.
The shape of the data is the whole problem
Start with the measurements, because they frame everything that follows. Across the corpus we trained on, the logs ran from about 3200 pixels wide at the narrow end to 12800 pixels wide at the broad end, a factor of four. Their heights, by contrast, sat in a tight band: roughly 480 to 640 pixels, a spread of well under one and a half times. The inputs are not just large, they are large in one direction and almost constant in the other.
That asymmetry is not a quirk of our scans. It is the geometry of the source document. A well log is a depth-ordered record: depth runs down the page, and the few measured curves run across a fixed set of narrow tracks. A deeper interval, or a coarser depth sampling, makes the image taller in display but the digitised raster preserves a near-constant track layout across its short axis. The long axis is where the variability lives. Once you accept that the input is fundamentally a variable-width ribbon, the rest of the engineering follows from it rather than fighting it.
Width is the free variable, height is nearly fixed
The cost of an image to a GPU scales with its pixel area, height times width. When height is effectively constant and width swings by 4x, the per-image memory footprint is a function of width alone. That is unusual. It means you can reason about the entire memory budget by reasoning about one number, and it means the obvious lever for controlling that budget, resizing, attacks the wrong axis if you apply it naively.
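To make the one-number claim concrete, here is a minimal arithmetic sketch. It counts only the bytes of a single-channel float32 input tensor, which is an assumption made for illustration; activations inside the network add to this, but they scale with the same height-times-width area. The function name and the choice of float32 are not from the original pipeline.

```python
BYTES_PER_FLOAT32 = 4


def input_bytes(height: int, width: int, channels: int = 1) -> int:
    """Bytes for one float32 input image; activations come on top of this."""
    return height * width * channels * BYTES_PER_FLOAT32


# Corner cases taken from the measured ranges in the text.
narrow = input_bytes(640, 3200)
wide = input_bytes(640, 12800)
short = input_bytes(480, 12800)

width_ratio = wide / narrow   # driven by the 4x width swing
height_ratio = wide / short   # bounded by the 480 to 640 height band

print(f"width swing:  {width_ratio:.2f}x")
print(f"height swing: {height_ratio:.2f}x")
```

Moving across the full width range multiplies the footprint by four, while moving across the full height range changes it by about a third.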
Why the obvious fix is the wrong fix
The reflexive solution to variable-size inputs is to make them not variable: pick a target size and resize everything to it. It is the line from the tutorials, and on this data it fails in two different ways depending on which way you resize.
Resize down to tame the widest logs and you destroy resolution on a target that has none to spare. The curve you are segmenting is already one to three pixels wide. Downsampling interpolates neighbouring pixels into fewer pixels, and it is lossy: detail finer than the new pixel pitch cannot be recovered. On a one-to-three-pixel curve there is no fine detail to lose without losing the curve itself. A downsample that shrinks a 12800-pixel log toward a manageable width drags the curve below one pixel in places, and a fractional-pixel curve is a broken curve. You would be solving a memory problem by manufacturing a labelling problem.
Resize up, padding or stretching the narrow logs to match the wide ones, and you waste enormous amounts of compute on emptiness. Stretch a 3200-pixel log to 12800 and three quarters of every forward pass is spent convolving over interpolated filler. You have made the cheap images as expensive as the expensive ones, which is exactly backwards.
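The arithmetic behind both failure modes is short enough to write down. This is a rough sketch, not a measurement: the 3200-pixel downsample target is a hypothetical choice (the text only says "a manageable width"), and both helper names are invented for illustration.

```python
def curve_width_after_resize(curve_px: float, src_width: int, dst_width: int) -> float:
    """Nominal curve thickness after a uniform horizontal resize."""
    return curve_px * dst_width / src_width


def filler_fraction(src_width: int, dst_width: int) -> float:
    """Share of the enlarged image that carries no original pixels."""
    return 1.0 - src_width / dst_width


TARGET_DOWN = 3200  # hypothetical target width, chosen to match the narrowest logs

# Resize down: a 12800-pixel log squeezed to the hypothetical target.
thin = curve_width_after_resize(1, 12800, TARGET_DOWN)
thick = curve_width_after_resize(3, 12800, TARGET_DOWN)

# Resize up: a 3200-pixel log enlarged to match the widest logs.
wasted = filler_fraction(3200, 12800)

print(f"1 px curve becomes {thin} px, 3 px curve becomes {thick} px")
print(f"filler share after enlarging: {wasted:.0%}")
```

Under these assumptions even the thickest curve ends up narrower than one pixel after the downsample, and the enlarged narrow log is three quarters filler.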
The honest conclusion is that a single fixed input size is a category error for this data. The point of a fully convolutional segmenter is that it does not need one. Since Long, Shelhamer, and Darrell showed that a convolutional network can consume an input of arbitrary spatial extent and emit a correspondingly sized prediction [1], there has been no architectural reason to force every image into the same box. The reason people still do it is not the model. It is the data loader.
The batch is where variable size actually bites
Here is the part the architecture papers gloss over. A model can accept any single image size, but a training step does not see single images. It sees a batch, and a batch is a tensor, and a tensor is a rectangular block of numbers. You cannot stack a 3200-wide image and a 12800-wide image into one tensor any more than you can stack two sheets of paper of different sizes into a neat rectangular ream without trimming or padding. Something has to give before the batch exists.
This is the precise point at which our two pipelines diverged, and it is worth being concrete about both.
The binary segmentation pipeline, our first and simpler model, never resolved this. Faced with variable widths and a fixed memory budget, it fell back to the only thing that always works: a batch of one. With batch size 1 there is no second image to be incompatible with, so the width can be whatever it is, and the tensor is just that single image. It trains, and it trains correctly, but every optimiser step sees exactly one log. On the 2000-instance binary dataset that was survivable, but it is slow, and the gradient estimate from a single wide image is noisier than you would like.
The multiclass pipeline is where we actually engineered the problem instead of avoiding it. The fix is unglamorous and it lives entirely in the data loader: a custom collate_fn, the function a PyTorch DataLoader uses to assemble a list of individual samples into one batched tensor. The default stacks identically shaped samples; a custom one can pad, sort, or otherwise reconcile samples of different shapes before stacking, which is exactly what variable-width logs require. Ours, instead of demanding that every image already share a size, takes whatever images the sampler happened to draw, finds the widest one in that particular batch, and pads the rest out to match it before stacking. The batch tensor is sized to its widest member, not to a global maximum, so a batch that happens to contain only narrow logs stays cheap. With that one change the multiclass model held a batch size of 16 across most of the width range, sixteen times as many logs behind each gradient step as the binary pipeline ever got.
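A minimal sketch of a pad-to-widest collate function follows. It is not the pipeline's actual code. It assumes each sample is an (image, mask) pair with the image shaped (C, H, W) and the mask shaped (H, W), that every sample in a batch already shares the same height, and that padding goes on the right edge. The names pad_to_widest and IGNORE_INDEX are hypothetical; filling padded mask pixels with -100 is likewise an assumption, chosen because that is the default ignore_index of torch.nn.CrossEntropyLoss.

```python
import torch

IGNORE_INDEX = -100  # assumed label for padded mask pixels, skipped by the loss


def pad_to_widest(samples):
    """Collate (image, mask) pairs of differing width into one batch.

    Assumed shapes: image (C, H, W), mask (H, W), same H for every sample.
    """
    first_image, first_mask = samples[0]
    channels, height = first_image.shape[0], first_image.shape[1]
    widest = max(image.shape[-1] for image, _ in samples)

    # Allocate the batch at the width of its widest member, not a global maximum.
    images = first_image.new_zeros((len(samples), channels, height, widest))
    masks = first_mask.new_full((len(samples), height, widest), IGNORE_INDEX)

    for i, (image, mask) in enumerate(samples):
        width = image.shape[-1]
        # Copy the real pixels into the left part; the right part stays padding.
        images[i, :, :, :width] = image
        masks[i, :, :width] = mask
    return images, masks
```

It would be wired in with something like DataLoader(dataset, batch_size=16, collate_fn=pad_to_widest), where dataset stands in for whatever dataset object yields the pairs.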
“The model was never the thing that could not handle variable width. The batch was. Move the reconciliation into a custom collate_fn and the same architecture goes from batch 1 to batch 16 without touching a single layer.”
The 32-pixel rule nobody chooses on purpose
There is a catch in padding-to-widest, and it is the kind of detail that produces a cryptic shape-mismatch error at three in the morning rather than a clean exception. You cannot pad to any width. You have to pad to a width the network can cleanly fold in half and unfold again.
The encoder in our segmenter has five stride-2 stages. Each one halves the spatial resolution of the feature map, so by the bottleneck the width has been divided by two five times over, a 32x downsample. The decoder then runs the same five steps in reverse, doubling back up. If the input width is not an exact multiple of 32, one of those halving steps lands on an odd number, the integer division rounds, and when the decoder doubles back up the upsampled feature map no longer matches the resolution of the encoder feature it is supposed to be concatenated with. The skip connection, the defining feature of an encoder-decoder segmenter since U-Net [2], silently fails to line up. The cleanest way to avoid the whole class of bug is to never let an unaligned width into the network: pad every batch up to the next multiple of 32.
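The misalignment can be traced with integer arithmetic alone. The sketch below assumes each stride-2 stage floors the width (some convolution padding schemes round up instead, which moves the mismatch but does not remove it) and that the decoder doubles the width at each step. The function names are invented for illustration.

```python
from typing import List, Tuple


def encoder_widths(width: int, stages: int = 5) -> List[int]:
    """Feature-map width at the input and after each stride-2 stage."""
    widths = [width]
    for _ in range(stages):
        widths.append(widths[-1] // 2)  # assumed floor division per stage
    return widths


def skip_mismatches(width: int, stages: int = 5) -> List[Tuple[int, int, int]]:
    """List (level, upsampled_width, encoder_width) wherever a skip fails to line up."""
    enc = encoder_widths(width, stages)
    mismatches = []
    for level in range(stages):
        upsampled = 2 * enc[level + 1]  # decoder doubles the next-deeper level
        if upsampled != enc[level]:
            mismatches.append((level, upsampled, enc[level]))
    return mismatches


print(skip_mismatches(3200))  # multiple of 32: every level lines up
print(skip_mismatches(3210))  # even, but not a multiple of 32
```

A width of 3210 is even, yet it still breaks: it halves to 1605, which is odd, so the decoder arrives at 1604 where the encoder feature is 1605 wide, and the same thing happens again deeper down at 400 against 401.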
This is not a number we invented. It is the same constraint that segmentation frameworks expose as an output-stride parameter, the factor by which the network downsamples before it upsamples, which in DeepLabv3+ and its relatives is set explicitly so that the input dimensions and the output dimensions stay commensurable [3]. The downsample factor is a property of the encoder you chose. Five stride-2 stages buy you a large receptive field and a compact bottleneck, and they charge you a 32-pixel alignment rule in return. Pick a different depth and you get a different alignment number. There is no free lunch, only a divisibility rule you can either respect on purpose or trip over by accident.
So the real padding rule the collate_fn enforces is two rules at once: pad every image up to the widest member of its batch, and then round that target up to the next multiple of 32. The consequence shows across the width range: the binary pipeline sits flat at batch 1 from one end to the other, while the multiclass pipeline holds batch 16 until the very widest logs make a padded sixteen-image step too large to fit and even the collate_fn has to give ground.
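The two rules compose into a single target-width calculation. This is a sketch of the rule as stated rather than the pipeline's code; padded_width is a hypothetical name, and the example widths are made up.

```python
from typing import Sequence


def padded_width(widths: Sequence[int], multiple: int = 32) -> int:
    """Widest member of the batch, rounded up to the next multiple."""
    widest = max(widths)
    return ((widest + multiple - 1) // multiple) * multiple


print(padded_width([3200, 4100, 5000]))  # 5000 is not aligned, so it rounds up
print(padded_width([3200, 3200]))        # already aligned, stays as it is
```

Rounding always goes up, never to the closest multiple: rounding 5000 down to 4992 would mean cropping real pixels from the widest log.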
What a four-times width swing actually costs
It is tempting to think the collate_fn solves the problem outright, and across most of the range it does. But padding-to-widest has a failure mode that the binary pipeline, with its batch of one, is curiously immune to. When a single very wide log lands in a batch, every other image in that batch gets padded up to its width. A batch of sixteen logs that happens to include one 12800-pixel monster now costs as much as sixteen monsters, because they were all padded to match it. The widest member sets the price for everyone.
That is why the ceiling exists at all. For most batches, drawn from the bulk of the distribution, padding to the widest member is cheap and batch 16 fits comfortably. It is only when the draw is dominated by the broad end of the width range that the padded tensor swells past the memory budget and the effective batch has to shrink. The lever to manage this, had we needed to push it harder, is to sort the dataset by width and form batches from nearby widths so that the padding within each batch is small. We did not have to, but the principle is the same one the whole post turns on: the cost of a batch is set by its widest member, so the cheapest batches are the ones whose members are closest in size.
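The effect of grouping by width can be estimated without a GPU. The sketch below uses total padded width as a proxy for cost, which assumes roughly constant height; the width list is invented for illustration, and since we never needed width-sorted batching, this shows the idea rather than anything we ran.

```python
from typing import List


def round_up(width: int, multiple: int = 32) -> int:
    """Round a width up to the next multiple (32 for five stride-2 stages)."""
    return -(-width // multiple) * multiple


def padded_cost(widths: List[int], batch_size: int) -> int:
    """Total padded width over all batches, a proxy for memory and compute."""
    total = 0
    for start in range(0, len(widths), batch_size):
        batch = widths[start:start + batch_size]
        total += len(batch) * round_up(max(batch))
    return total


# Illustrative widths only, alternating narrow and wide as a shuffle might.
widths = [3200, 12800, 3520, 12000, 4000, 11200, 4800, 9600]

unpadded = sum(widths)
as_drawn = padded_cost(widths, batch_size=4)
by_width = padded_cost(sorted(widths), batch_size=4)

# Fifteen narrow logs plus one 12800-pixel log in a single batch of sixteen.
one_monster = padded_cost([3200] * 15 + [12800], batch_size=16)

print(unpadded, as_drawn, by_width, one_monster)
```

On this toy list, sorting first cuts the padded total from 96000 to 70400 against an unpadded 61120. A fully sorted order also removes the randomness of shuffling, so in practice one would more likely shuffle within width buckets.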
The narrow lesson, stated narrowly
Measure the shape of your inputs before you reach for a resize: that is the whole operating rule a piece this specific has to offer. We almost did not. The instinct to normalise everything to a square is so ingrained that it took looking at the raw width histogram, a four-to-one spread on one axis and almost nothing on the other, to see that the conventional move would have quietly ruined the data. Once that was clear the engineering was small: a custom collate_fn, a 32-pixel alignment rule inherited from the encoder depth, and a batch size that went from 1 to 16 for the cost of about thirty lines in the data loader.
None of this is a claim that variable-size training is always worth the trouble. For many tasks a fixed resize is genuinely free and the data loader should stay boring. The point is narrower than that: when your input is variable in a way that carries the signal, as a one-pixel curve on a four-times-variable-width ribbon is, the resize is not preprocessing, it is data destruction, and the right place to absorb the variability is the batch, not the image.
Key takeaways
- A raster well log is a variable-width ribbon: width runs from about 3200 to 12800 pixels (a 4x spread) while height stays in a tight 480 to 640 pixel band, so per-image GPU memory is essentially a function of width alone.
- Fixed resizing is the wrong reflex on this data. Downsizing pushes a one-to-three-pixel curve below one pixel and breaks it; upsizing wastes most of every forward pass convolving over padding. A fully convolutional segmenter needs no fixed input size; the constraint lives in the data loader, not the model.
- Variable size bites at the batch, not the image. The binary pipeline could not stack variable widths and fell back to batch size 1; the multiclass pipeline used a custom collate_fn that pads each batch to its widest member and held batch 16, sixteen times the gradient signal per step.
- The five stride-2 encoder stages impose a 32x downsample, so input width must be a multiple of 32 or the decoder's skip connections fail to line up. The collate_fn pads to the widest member and then rounds that width up to the next multiple of 32.
- Padding-to-widest makes the widest log in a batch set the cost for every member, which is why a ceiling appears at the broad end of the width range. The general move is to batch images of similar width so the padding stays small.
References
[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The result that a convolutional segmenter accepts an input of arbitrary spatial size and emits a correspondingly sized prediction. https://arxiv.org/abs/1411.4038
[2] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The symmetric encoder-decoder whose downsample-then-upsample structure imposes the input-size divisibility rule. https://arxiv.org/abs/1505.04597
[3] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV (2018). The output-stride parameter is the downsample-factor knob that fixes input-output commensurability. https://arxiv.org/abs/1802.02611
[4] Yuan, B., and Yang, Q. Digitization of Well-Logging Parameter Graphs Based on Gridlines-Elimination Approach. Journal of Petroleum Exploration and Production Technology (2019). A classical, non-learning baseline for extracting curves from raster log graphs. https://doi.org/10.1007/s13202-019-0625-x