Case Study

Padding to 32-Pixel Alignment to Run One Model Across Every Log Size

The raster archive we had to read was 136,771 scanned logs that almost never shared a shape, with widths from 3,200 to 12,800 pixels and heights from 480 to 640. The naive answer is a separate run per size, or one image at a time. We wrote a single collate-and-pad pipeline that rounds every scan up to 32-pixel alignment, stacks them into one batch, and lets one trained model serve the whole archive. This is the build story.

The archive we were handed did not have a shape. It had 136,771 shapes. The operator's legacy raster logs, pulled from the Texas Railroad Commission collection, were scans of paper made over decades by different vendors at different resolutions, and the one property they shared was that almost no two of them were the same size. Widths ran from 3,200 pixels on a narrow single-track strip to 12,800 pixels on a wide composite. Heights drifted between 480 and 640. We had a model that could read a well-log scan and pull the curves out of it. What we did not have, on day one, was a way to feed it the archive without either resizing every scan into a lie or running them one at a time forever. This is the story of the small piece of plumbing that turned out to be the whole game: a custom collate function that pads every scan to 32-pixel alignment so a single trained model can read all of them.

The shape of the problem was that there was no shape

A deep network does not, by itself, care that images differ in size. The fully convolutional design that segmentation inherited [1] slides the same filters across whatever spatial extent you give it, so a convolutional stack will happily process a 3,200-pixel scan and a 12,800-pixel scan with the identical weights. The trouble is not the math of one image. The trouble is the tensor of many.

To train at any reasonable speed you batch. A batch is a single rectangular tensor of shape batch by channels by height by width, and a rectangle has exactly one height and one width. The moment you try to stack a 3,200 by 480 scan on top of a 12,800 by 640 scan, the stack fails, because there is no single rectangle that holds both. So the default loader does the only thing it can: it gives up on batching and hands the model one image at a time. That is precisely where our first segmentation stage had been living. The binary segmenter trained at a batch size of 1, not by choice but because the variable image dimensions left it no other option. One scan in, one gradient step, repeat 2,000 times per epoch. It worked, and it was slow, and it did not generalize to the volume of the real archive.

What we actually needed to build

The goal was blunt: one model, every size, no per-size special cases, batched hard enough to train the larger multiclass stage in hours rather than days. The constraint that made it interesting was that we refused to resize. A well-log curve is often one pixel wide. Squash a 12,800-pixel scan down to a common width and you have smeared the very feature you are trying to detect into the gutter between two pixels. Resizing to a fixed grid is the obvious fix and it is the wrong one, because it destroys the signal. Whatever we did to make the scans batchable had to be non-destructive: the original pixels had to survive untouched, and the model had to be able to tell the real scan from whatever we added to make the rectangle close.

That left padding. Pad each scan out to a common size with neutral filler, batch the padded tensors, and let the network learn that the filler is not a curve. Padding keeps every original pixel exactly where it was. But padding to what size? Pad to the largest scan in the batch and a batch of mostly-small scans wastes enormous memory on emptiness. And there was a second, sharper constraint hiding underneath: the network itself has an opinion about acceptable sizes.
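The memory cost of padding to the largest scan can be put in numbers. The sketch below is illustrative only: the function name is hypothetical, and the example batch (fifteen narrow scans plus one wide composite) is made up from the extremes of the size range quoted above, not taken from the archive.

```python
def padding_waste(sizes):
    """Fraction of a pad-to-largest batch that is filler.

    `sizes` is a list of (height, width) pairs, one per scan.
    """
    batch_h = max(h for h, _ in sizes)
    batch_w = max(w for _, w in sizes)
    real_pixels = sum(h * w for h, w in sizes)
    padded_pixels = len(sizes) * batch_h * batch_w
    return 1 - real_pixels / padded_pixels


# Hypothetical batch: fifteen 480 x 3,200 scans and one 640 x 12,800 scan.
batch = [(480, 3200)] * 15 + [(640, 12800)]
waste = padding_waste(batch)  # about 0.76: three quarters of the tensor is filler
```

A batch of same-size scans wastes nothing, which is why the mix of sizes in a batch matters as much as the padding rule itself.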

The 32 came from the architecture, not from us

Our segmentation backbone is an encoder-decoder. The encoder runs five stride-2 stages, each halving the spatial resolution, and the decoder runs five matching upsampling stages back up. This is the standard contracting-then-expanding shape that U-Net made the default for dense prediction [2], and that the encoder-decoder segmentation lineage carried forward [3]. Five halvings mean the encoder divides each input dimension by two, five times over, which is a division by 32. If an input dimension is not a clean multiple of 32, the halvings do not land on integers, the decoder's upsampled feature map comes back a pixel or two off from the encoder's skip-connection feature map, and the skip concatenation either crashes or, worse, silently misaligns.
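The size arithmetic behind that mismatch can be traced by hand. This is a sketch under two assumptions that the text does not confirm: each stride-2 stage floors an odd size, and each decoder stage exactly doubles the map it receives. Real layers may round differently depending on kernel size and padding, and the function name is hypothetical.

```python
def skip_mismatches(size: int, stages: int = 5):
    """List the stages where the upsampled map misses the skip map.

    Each entry is (stage index, encoder size, upsampled size) for one
    spatial dimension. An empty list means every skip connection lines up.
    """
    mismatches = []
    for stage in range(stages):
        down = size // 2  # encoder stride-2 step (assumed to floor)
        up = down * 2     # decoder doubling of the deeper map
        if up != size:
            mismatches.append((stage, size, up))
        size = down
    return mismatches


skip_mismatches(512)  # [] : 512 -> 256 -> 128 -> 64 -> 32 -> 16, all even
skip_mismatches(481)  # [(0, 481, 480)] : one pixel off at the first skip
skip_mismatches(500)  # [(2, 125, 124), (4, 31, 30)] : off at two deeper skips
```

Any multiple of 32 stays even through all five halvings, so the list comes back empty.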

So the alignment number was not a hyperparameter we tuned. It was dictated by the depth of the network: five stride-2 stages demand inputs that are multiples of 32. That single fact reframed the padding question. The target was not an arbitrary common size. We needed to round every scan up to its own next multiple of 32, and then pad to the per-batch maximum of those aligned sizes. A 3,200-pixel width is already 100 times 32, so it aligns to itself. A 481-pixel height rounds up to 512. The alignment filler is small, never more than 31 pixels per dimension, the alignment is exact, and the encoder's halvings land cleanly every time.
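The rounding rule itself is one line of integer arithmetic. A minimal sketch, with a hypothetical helper name:

```python
def align_up(size: int, multiple: int = 32) -> int:
    """Round `size` up to the nearest multiple of `multiple`."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Ceiling division written with integer ops, then scaled back up.
    return -(-size // multiple) * multiple


align_up(3200)  # 3200, already 100 * 32
align_up(481)   # 512
align_up(640)   # 640, already 20 * 32
```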

Writing the collate function that did the stacking

In the data loader, the function that turns a list of individual samples into one batch tensor is the collate function, and the default one assumes every sample is already the same size. We replaced it. Our custom collate function takes the list of variable-size scans in a batch, computes the next multiple of 32 at or above each scan's height and width, takes the maximum aligned height and the maximum aligned width across the batch, allocates one zero tensor at that common aligned size, and copies each real scan into the top-left corner of its slot. Alongside the image tensor it carries the original unpadded dimensions, so that after inference we crop every prediction back to the true scan extent and never report a curve sitting out in the padded margin.
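The steps in that paragraph can be sketched as follows. This is not the production function: it uses NumPy arrays to stay framework-neutral, assumes each scan is a (channels, height, width) array with the same channel count, and the names `pad_collate` and `crop_predictions` are hypothetical. In a PyTorch-style loader the same logic would build a framework tensor and be passed as the loader's `collate_fn`.

```python
import numpy as np

ALIGN = 32  # five stride-2 stages: 2 ** 5


def align_up(size, multiple=ALIGN):
    """Round `size` up to the nearest multiple of `multiple`."""
    return -(-size // multiple) * multiple


def pad_collate(scans):
    """Stack variable-size (C, H, W) arrays into one zero-padded batch.

    Returns the batch and each scan's original (height, width), so
    predictions can be cropped back to the true extent afterwards.
    """
    sizes = [scan.shape[-2:] for scan in scans]
    batch_h = max(align_up(h) for h, _ in sizes)
    batch_w = max(align_up(w) for _, w in sizes)
    channels = scans[0].shape[0]
    batch = np.zeros((len(scans), channels, batch_h, batch_w),
                     dtype=scans[0].dtype)
    for i, (scan, (h, w)) in enumerate(zip(scans, sizes)):
        batch[i, :, :h, :w] = scan  # real pixels go in the top-left corner
    return batch, sizes


def crop_predictions(preds, sizes):
    """Drop the padded margin from each per-scan prediction."""
    return [pred[..., :h, :w] for pred, (h, w) in zip(preds, sizes)]
```

Carrying `sizes` alongside the batch is what makes the padding reversible: the model only ever sees aligned rectangles, and the crop restores each prediction to its scan's native dimensions.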

The effect was immediate and, frankly, a little anticlimactic for how much it changed. With the collate function in place, the multiclass stage trained at a batch size of 16 instead of 1. The same scans that the binary stage had been forced to dribble through one at a time now stacked sixteen-high into a single padded tensor and went through the model in one pass. We trained the multiclass model on 15,000 synthetic instances at this batch size over 50 epochs in about 10 hours, against the roughly 2 hours the smaller 2,000-instance binary stage took at batch 1. The throughput came from batching, and the batching came from one function that did nothing more clever than round up to 32 and pad with zeros.
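A rough back-of-envelope on those figures, with heavy caveats: the hours are approximate, the two stages are different models with different per-instance cost, and the calculation assumes every instance is used for training in every epoch, which ignores any validation holdout. The helper names are hypothetical.

```python
import math


def steps_per_epoch(instances: int, batch_size: int) -> int:
    """Optimizer steps needed to see every instance once."""
    return math.ceil(instances / batch_size)


def instance_passes_per_hour(instances: int, epochs: int, hours: float) -> float:
    """Instances pushed through the model per wall-clock hour."""
    return instances * epochs / hours


multiclass_steps = steps_per_epoch(15_000, 16)              # 938 steps per epoch
binary_steps = steps_per_epoch(2_000, 1)                    # 2,000 steps per epoch
multiclass_rate = instance_passes_per_hour(15_000, 50, 10)  # 75,000 per hour
binary_rate = instance_passes_per_hour(2_000, 50, 2)        # 50,000 per hour
```

On these assumptions the batched stage covers 7.5 times the data in fewer than half the steps per epoch, at about 1.5 times the instances per hour.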

The exhibit below is that pipeline, animated. Scrub the lever or press play and watch a row of variable-size scans move through the four lanes: intake at their native dimensions, padding to the next multiple of 32 with the orange alignment gutters appearing, collation into a single batch tensor, and inference by one model. Flip between the two production lines to see the contrast that made the case for the rebuild.

[Interactive exhibit: "One collate_fn, one model, every scan size." Four lanes: 1 · Intake (raw scan, native dims); 2 · Pad to 32 (gutters to the next multiple of 32); 3 · Collate (stack into one batch); 4 · Infer (one model, one pass). Two production lines: multiclass (batch 16, collate_fn) and binary (batch 1, no padding).]
The padding pipeline that lets one trained model serve every scan size in the archive. The raster collection holds scans whose width ranges from 3200 to 12800 pixels and whose height ranges from 480 to 640 pixels, so two scans rarely share a shape and a naive loader cannot stack them into one tensor. The custom collate_fn pads every scan up to the next multiple of 32, which is the alignment the five stride-2 encoder stages need, then assembles a single batch that one model infers in one pass. Scrub the lever or press play to push a row of variable-size scans through the four lanes: intake, pad to 32, collate, and infer. Flip between the two production lines to see the contrast the exhibit is built around: the binary stage was forced to batch 1 because nothing padded its variable dimensions, while the multiclass stage runs at batch 16 on the collate_fn. Sourced: 136,771 TIF files and 7,781 LAS files in the Texas Railroad Commission archive; synthetic widths 3200 to 12800 pixels and heights 480 to 640 pixels; batch 16 multiclass versus batch 1 binary; an 80 percent train and 20 percent validation split; ten-hour multiclass and two-hour binary training over 50 epochs. The per-tile pixel dimensions drawn on each scan are illustrative samples from the sourced ranges; the alignment, batch sizes, ranges, and counts are not.

The turns we did not expect

It would be a tidy story if the collate function were the only thing we touched. It was not, and the messy parts are the parts worth keeping. The first surprise was normalization across a batch of padded scans. Any operation that computes a statistic over the spatial extent, the obvious one being a normalization layer that averages over height and width, will happily average the zeros we padded in alongside the real pixels, which drags the statistic toward the filler and makes it vary with how much padding a given batch happened to need. We moved the model onto group normalization, which computes its statistics separately for each scan, over groups of channels, rather than pooling them across the whole batch, with a group size that floors at 16 so the grouping stays valid no matter the channel count. A scan's statistics still include its own padded margin, but no other scan's pixels or padding enter them. That decoupled each scan's normalization from its batch neighbours, and the per-batch jitter went away.
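The drag toward the filler is plain arithmetic. A toy sketch with made-up pixel values and a hypothetical function name, showing how a mean over a padded extent moves with the amount of padding rather than with the scan:

```python
from statistics import fmean


def mean_over_padded_extent(pixels, pad_count):
    """Mean of the real pixels plus `pad_count` zero-valued pad pixels."""
    return fmean(list(pixels) + [0.0] * pad_count)


real = [0.5] * 100  # illustrative values, not archive data

mean_over_padded_extent(real, 0)    # 0.5, the true mean
mean_over_padded_extent(real, 100)  # 0.25, half the extent is padding
mean_over_padded_extent(real, 300)  # 0.125, three quarters is padding
```

The same scan yields three different statistics depending only on how large its neighbours forced the batch to be.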

The second surprise was the train and validation split, which sounds trivial and was not. With an 80 percent training and 20 percent validation split over a synthetic corpus whose widths span 3,200 to 12,800 pixels, a random split can quietly load all the very wide scans into one side and starve the other. If the model never sees a 12,800-pixel scan in training, the pipeline will still pad and batch one correctly at validation, and the model will still segment it badly, because width that extreme carries aspect ratios the training set never showed it. We had to make sure both sides of the 80/20 split covered the full size range, not just the right count of scans. Padding made every size batchable; it did not make every size familiar, and those are different problems.
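One way to get that coverage is to stratify the split by width. The sketch below is an assumption about how such a split could be done, not a description of the split we shipped: the function name, the 1,600-pixel bin width, and the seed are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(widths, train_frac=0.8, bin_width=1600, seed=0):
    """Split sample indices so every width bin lands on both sides.

    Returns (train indices, validation indices).
    """
    bins = defaultdict(list)
    for idx, width in enumerate(widths):
        bins[width // bin_width].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for key in sorted(bins):
        members = bins[key]
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        if len(members) > 1:
            # Keep at least one sample of this bin on each side.
            cut = min(max(cut, 1), len(members) - 1)
        train.extend(members[:cut])
        val.extend(members[cut:])
    return train, val
```

A bin with a single scan cannot appear on both sides, so very rare sizes still need a manual decision about which side they belong to.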

What the 32 taught us about the rest of the build

The lesson we carried out of this was not about padding specifically. It was about where to look when a model refuses to scale. We had spent real time treating the batch-size-1 ceiling as a memory problem, reaching for gradient accumulation and smaller models and a bigger GPU, when it was never a memory problem at all. It was a shape problem, and the shape constraint was written into the architecture we had already chosen. The five stride-2 stages were telling us, the whole time, that they wanted multiples of 32, and once we listened to that and built the collate function around it, the thing we had been treating as a scaling wall turned into a thirty-line function.

There is a particular satisfaction in a fix that comes from the system telling you what it already needs rather than from forcing it to accept what you want. The archive had 136,771 shapes and we flattened none of them into a common size. We rounded each one up to the alignment the network was built to expect, padded the small difference with zeros, cropped the predictions back to the truth, and let one model read all of them. The collate function is not the part of VeerNet anyone demos. It is the part that let everything downstream be a single model instead of a fleet of size-specific ones, which on an archive this varied was the difference between a pipeline and a pile of one-offs.

References

  1. Long, J., Shelhamer, E., and Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR 2015. https://arxiv.org/abs/1411.4038

  2. Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

  3. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018. https://arxiv.org/abs/1802.02611

© 2026 Copyright. Earthscan