Why Document Images Are Different From Photographs for Computer Vision

The quiet assumption behind most computer vision is that an image looks like a photograph. Three colour channels, a frame that is roughly as tall as it is wide, a subject that occupies a good share of the pixels. That assumption is not written down anywhere, but it is baked into the datasets the field trained on, the default input sizes, the normalisation layers, and the batch sizes that make training efficient. ImageNet made the natural photograph the reference object for a decade of architecture design [1], and almost everything downstream inherited its shape. A scanned well log violates every part of that assumption, not as a matter of taste or domain flavour but as a matter of measurable pixel statistics. This note is about those statistics, and about why they, rather than any narrative about oil and gas being special, force the architecture choices behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned raster logs.

Insisting on statistics rather than domain talk is practical. Call a log "a hard image" and you learn nothing actionable, because hardness does not tell you which layer to change. Measure in numbers the three ways a log differs from a photograph, and each number points at a specific default that is now wrong and a specific replacement that is now right. The differences also chain: the shape forces the batch size, the batch size forces the normalisation, and the emptiness forces the loss.

One channel, not three

Start with the simplest difference. A natural photograph has three colour channels because colour carries information about the world it depicts. A scanned well log is a single grayscale channel, because the ink on paper is monochrome and adding colour would invent signal that is not there. At the input layer this is nearly trivial: the first convolution takes one channel instead of three. It is worth stating because it is the first place a stock pipeline silently does the wrong thing. A model that expects three channels will either refuse the input or accept a grayscale image triplicated into three identical channels, spending two thirds of its first-layer weights on a colour structure that does not exist. The correct default for a scanned log is one input channel, and you only know to set it if you have looked at what the image actually is.
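The two-thirds figure can be checked with a few lines of arithmetic. The sketch below counts the weights of a single convolution for one and for three input channels; the 64 filters and the 3 by 3 kernel are assumptions chosen for illustration and are not taken from the network described here.

```python
# Hypothetical first layer: 64 filters with 3x3 kernels (assumed, not from the source).
def conv_params(in_channels, out_channels=64, kernel=3):
    """Return (weights, biases) for one 2D convolution layer."""
    weights = out_channels * in_channels * kernel * kernel
    biases = out_channels
    return weights, biases

gray_weights, _ = conv_params(in_channels=1)
rgb_weights, _ = conv_params(in_channels=3)

# Triplicating a grayscale image feeds three identical planes to the RGB layer,
# so two of every three weight planes see redundant input.
redundant_fraction = (rgb_weights - gray_weights) / rgb_weights

print(f"1-channel weights: {gray_weights}")
print(f"3-channel weights: {rgb_weights}")
print(f"redundant share with triplicated input: {redundant_fraction:.3f}")
```

The ratio does not depend on the assumed filter count or kernel size, since both cancel; only the channel counts matter.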

The shape that photographs never take

The second difference does the most damage if you ignore it, and it is the reason the instrument below exists. A photograph, by convention and by the ergonomics of a viewfinder, is close to square. The canonical ImageNet crop is 224 pixels by 224 pixels, an aspect ratio of one to one. A synthetic log in our training set is between 3,200 and 12,800 pixels wide and only 480 to 640 pixels tall. At the extreme that is an aspect ratio of roughly twenty-seven to one. No natural photograph is shaped like that, and no square resize preserves a log without either crushing the depth axis into unreadability or throwing away most of the width. The strip is the image; its elongation is not an inconvenience to normalise away, it is the defining property of the object.
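The aspect ratios follow directly from the quoted dimensions, and the same arithmetic shows what a square resize would cost. This is a minimal sketch using only the pixel ranges stated above and the conventional 224 pixel target; the helper names are invented for illustration.

```python
# Dimension ranges quoted in the text; 224 is the conventional ImageNet crop size.
WIDTHS = (3200, 12800)
HEIGHTS = (480, 640)
TARGET = 224

def aspect_ratio(width, height):
    """Width divided by height, so a square image scores 1.0."""
    return width / height

def source_pixels_per_output_pixel(extent, target=TARGET):
    """How many source pixels along one axis collapse into one output pixel."""
    return extent / target

mildest = aspect_ratio(WIDTHS[0], HEIGHTS[1])    # shortest strip, tallest height
extreme = aspect_ratio(WIDTHS[1], HEIGHTS[0])    # longest strip, shortest height

print(f"aspect ratio range: {mildest:.1f}:1 to {extreme:.1f}:1")
print(f"columns merged per output pixel at 12,800 wide: "
      f"{source_pixels_per_output_pixel(12800):.1f}")
print(f"rows merged per output pixel at 480 tall: "
      f"{source_pixels_per_output_pixel(480):.1f}")
```

At the widest setting a square resize merges dozens of columns into each output column while merging only about two rows, which is the lopsided compression the paragraph above describes.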

That shape reaches straight into the training loop. When a single image can be 12,800 pixels wide, you cannot stack many of them into a batch and still fit in GPU memory, and because the widths vary you cannot even pad them into a common tensor without waste. The practical answer for the binary segmentation runs was to set the batch size to one. That is a decision the aspect ratio makes for you, and it is the hinge the rest of the architecture turns on.
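Both pressures on the batch size can be put in numbers. The sketch below estimates the share of a width-padded batch that is pure padding, and the memory taken by one full-resolution feature map. The mix of widths, the 64 channels, and the float32 storage are all assumptions for illustration, not measurements from the training runs.

```python
def padding_waste(widths):
    """Fraction of a width-padded batch that is padding rather than image."""
    padded = len(widths) * max(widths)
    return 1 - sum(widths) / padded

def activation_mib(width, height, channels, bytes_per_value=4):
    """Memory for one feature map at full resolution, in MiB (float32 by default)."""
    return width * height * channels * bytes_per_value / 2**20

# Hypothetical mix of widths inside the quoted 3,200 to 12,800 pixel range.
batch_widths = [3200, 6400, 12800]
print(f"padding waste: {padding_waste(batch_widths):.1%}")

# Assumed 64-channel feature map at input resolution for the largest strip.
print(f"one feature map: {activation_mib(12800, 640, 64):.0f} MiB")
```

A real network keeps many such maps alive for the backward pass, so the single-map figure is a lower bound on what one wide image costs.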

The same scan read as a pixel object next to a natural photograph, on the three axes that decide the architecture. A photo is three colour channels in a near-square 224 by 224 crop with a subject that fills much of the frame; a scanned well log is a single grayscale channel, 3,200 to 12,800 pixels wide against 480 to 640 pixels tall, with roughly 97 percent of its pixels being empty background and only 3 output classes, background plus two thin curves. Drag the width lever across the sourced 3,200 to 12,800 pixel range and watch the log's aspect ratio blow out while the photo card stays square. The orange marker is the only element that argues: the aspect ratio verdict that tracks the lever, the number that shows a log is not a crop of a photograph but a different kind of image. That extreme shape is also why the batch size is pinned at 1, which in turn forces GroupNorm with a group size of 16 in place of the BatchNorm an ImageNet-era pipeline assumes. The channel count, pixel dimensions, background fraction, class count, batch size, and GroupNorm group size are sourced from the engagement archive; the photograph figures are the conventional ImageNet-era reference, shown to name the assumption being broken, not measured from our data.

A batch of one breaks the layer you were counting on

Here is where the chain of consequences bites. Batch normalisation, the default normalisation layer in most convolutional networks since 2015, works by computing the mean and variance of activations across the images in a batch and using those statistics to standardise the layer's output [2]. The method is fast and it is a genuine part of why deep networks became trainable, but it has a dependency that is invisible until you hit it: it needs a batch large enough for the per-batch statistics to be a decent estimate. With a batch of one, the "statistics over the batch" are the statistics of a single image, so they swing with whichever image happens to be loaded at that step. That is closer to noise than to an estimate, and the layer that was supposed to stabilise training instead destabilises it.
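How quickly the estimate degrades as the batch shrinks can be seen in a toy simulation. The sketch below is a deliberate simplification: each image is reduced to one per-channel activation summary drawn from a standard normal population, which is an assumption for illustration and not a model of real log activations.

```python
import random
import statistics

random.seed(0)

# Toy stand-in: one activation summary per image for a single channel,
# drawn from a population with mean 0 and standard deviation 1 (assumed).
population = [random.gauss(0.0, 1.0) for _ in range(4096)]

def spread_of_batch_means(values, batch_size):
    """Standard deviation of the per-batch mean across consecutive batches."""
    means = [
        statistics.fmean(values[i:i + batch_size])
        for i in range(0, len(values) - batch_size + 1, batch_size)
    ]
    return statistics.pstdev(means)

# A smaller batch gives a noisier estimate of the same underlying mean.
for batch_size in (1, 8, 32):
    spread = spread_of_batch_means(population, batch_size)
    print(f"batch size {batch_size:>2}: spread of batch mean = {spread:.3f}")
```

With a batch of one the batch mean is as variable as the individual images themselves, which is the sense in which the layer has nothing stable to standardise against.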

The fix is to normalise over something other than the batch. Group normalisation computes its statistics over groups of channels within each individual image, so its behaviour does not depend on the batch size at all [3]. That independence is the entire reason it is the right choice here: a batch of one is fatal to BatchNorm and a non-event for GroupNorm. In our network the group size is fixed at 16, a value chosen so that the channels divide evenly into groups regardless of the channel count at a given depth. Notice the path we just walked. The aspect ratio was extreme, so the batch became one, so BatchNorm became unusable, so the normalisation became GroupNorm at group 16. Not one of those steps was a preference. Each was forced by the number before it, and the first number was a property of the image, not of the model.
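The batch independence is easiest to see in a from-scratch version that only ever touches one image. This is a minimal sketch of the normalisation step in group normalisation, without the learnable scale and shift a real layer carries. The text does not say whether "16" counts groups or channels per group; the example reads it as 16 groups, which is an assumption, and the 32-channel toy image is hypothetical.

```python
import math

def group_norm(image, num_groups, eps=1e-5):
    """Normalise one image, given as a list of channels (each a flat list of pixels).

    Statistics are computed per group of channels inside this single image,
    so no other image in the batch is ever consulted.
    """
    num_channels = len(image)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must divide evenly into groups")
    per_group = num_channels // num_groups

    output = []
    for start in range(0, num_channels, per_group):
        group = image[start:start + per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        output.extend([[(v - mean) * scale for v in channel] for channel in group])
    return output

# Hypothetical toy image: 32 channels of 4 pixels each, split into 16 groups.
image = [[float(c * 4 + p) for p in range(4)] for c in range(32)]
normalised = group_norm(image, num_groups=16)

# Each group (here, two channels) ends up centred on zero.
first_group = normalised[0] + normalised[1]
print(abs(sum(first_group) / len(first_group)) < 1e-9)
```

Because the function takes a single image, there is no batch dimension for its result to depend on, and the divisibility check mirrors the requirement that channels split evenly into groups.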

Almost all background, and three classes to say so

The third difference is emptiness. In a photograph the subject fills a meaningful share of the frame, which is why a classifier can lean on large regions of relevant pixels. In a well log the two curves we care about are thin lines drawn across an otherwise blank strip, and roughly 97 percent of the pixels are background carrying no curve at all. The segmentation task has three output classes, background plus two curves, and two of the three are vanishingly rare compared to the first. This is the extreme foreground-background imbalance that dense-prediction work has to confront head on [4], and it is why the loss function cannot be a naive per-pixel average that a model can minimise almost perfectly by predicting background everywhere and being right 97 percent of the time.
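The trap in a naive per-pixel average shows up as soon as the degenerate predictor is scored two ways. In the sketch below the 97 percent background share comes from the text, while the even split of the remaining 3 percent between the two curves is an assumption. The inverse-frequency weights are one common remedy, shown for illustration, and not necessarily the loss used for VeerNet.

```python
# Class shares: 97% background from the text; the even split of the remaining
# 3% between the two curves is an assumption for illustration.
shares = {"background": 0.97, "curve_a": 0.015, "curve_b": 0.015}

# A degenerate model that labels every pixel as background.
pixel_accuracy = shares["background"]
recall = {name: (1.0 if name == "background" else 0.0) for name in shares}
mean_recall = sum(recall.values()) / len(recall)

# Inverse-frequency class weights, rescaled so they average to 1.
raw = {name: 1.0 / share for name, share in shares.items()}
norm = sum(raw.values()) / len(raw)
weights = {name: w / norm for name, w in raw.items()}

print(f"pixel accuracy: {pixel_accuracy:.2f}")
print(f"mean per-class recall: {mean_recall:.2f}")
print({name: round(w, 3) for name, w in weights.items()})
```

The same predictor that looks excellent by pixel accuracy recovers only one class out of three, and the weights show how far a curve pixel has to be scaled up before it counts as much as the background.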

The emptiness is not a defect of the data to be cleaned up. It is a true property of the object, and treating it that way, with background as an explicit class of its own rather than an inconvenient absence, is what lets the model reason about where a curve is and is not. The encoder-decoder shape, with its skip connections carrying fine spatial detail past the bottleneck, is well suited to recovering thin structures on a mostly empty field precisely because it was designed for dense labelling of non-photographic imagery from limited data [5]. The architecture family fits the statistics, not the other way around.
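A shape trace makes the role of the skip connections concrete. The sketch below follows feature-map sizes down a generic encoder and back up the decoder for one of the quoted strip sizes. The depth of four and the halving at each stage are assumptions borrowed from the usual U-Net layout, not confirmed settings of VeerNet.

```python
def unet_shapes(width, height, depth=4):
    """Trace feature-map sizes down the encoder and back up the decoder.

    Each encoder stage halves both axes; each decoder stage doubles them and
    pairs the result with the skip connection saved at the matching level.
    """
    factor = 2 ** depth
    if width % factor or height % factor:
        raise ValueError(f"pad width and height to multiples of {factor}")

    skips = []
    w, h = width, height
    for _ in range(depth):
        skips.append((w, h))       # saved before downsampling
        w, h = w // 2, h // 2
    bottleneck = (w, h)

    decoder = []
    for skip in reversed(skips):
        w, h = w * 2, h * 2
        assert (w, h) == skip      # upsampled map must align with its skip
        decoder.append((w, h))
    return skips, bottleneck, decoder

skips, bottleneck, decoder = unet_shapes(12800, 480)
print("bottleneck:", bottleneck)
print("restored:", decoder[-1])
```

At the bottleneck a curve a few pixels thick has been averaged into a much coarser grid, so the full-resolution maps carried by the skips are what allow the decoder to place it precisely again.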

Why the framing matters

Put the three differences together and the point is not that a log is a difficult photograph. It is that a log is not a photograph. It is a different pixel object: one channel where a photo has three, an aspect ratio no viewfinder produces, and a foreground budget so small that predicting nothing scores well. Each is measurable, and each points at a default ImageNet-era vision would set wrongly and a replacement the number itself justifies. The single channel sets the input channel count. The aspect ratio forces the batch to one. The batch of one forces GroupNorm at group 16 in place of BatchNorm. The 97 percent background forces a loss and a class structure that take the imbalance seriously. Read the image as statistics first and reach for the toolkit second, because the toolkit's defaults were fitted to a distribution this image does not belong to.

Limitations

The numbers here are the real dimensions of our synthetic training logs and the real settings of the network that consumed them, but they describe one task, not a law about document images in general. The 3,200 to 12,800 pixel widths, the 480 to 640 pixel heights, the single channel, the batch size of one, the GroupNorm group size of 16, the three classes, and the roughly 97 percent background are all sourced from the engagement. The photograph comparison (the three channels, the square 224 by 224 crop, the subject-filled frame) is the conventional ImageNet-era reference, included to name the assumption being broken and not measured from our data. A different document type, a form or a map or a printed page, would have its own statistics, and its aspect ratio and foreground budget would point at different defaults; nothing here transfers as a constant. The chain from aspect ratio to batch to normalisation is the one we actually walked, but it is not the only response to the same constraints, and on the multiclass runs with a custom collation the batch could be larger, which weakens the batch-of-one link there. And measurable difference in the input says nothing on its own about whether the resulting model is any good; that is settled by held-out performance on real scans, which these statistics motivate but do not establish.

The image tells you the architecture

The habit worth keeping is to distrust inherited defaults the moment the input stops looking like a photograph. Every convenient assumption in a vision stack, three channels, near-square crops, batches big enough for BatchNorm, a foreground worth averaging over, is a bet that the image is drawn from roughly the same distribution as the photographs the tools were built on. A scanned log is a standing counterexample to all four bets at once, and you find that out not by intuition but by measuring the channels, the aspect ratio, and the foreground fraction, and letting each number retire the default it contradicts. The architecture is not something you impose on the image. If you read the image honestly, it is something the image hands you.

References

[1] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. The dataset whose natural-photograph distribution became the implicit shape assumption for a decade of vision models. https://ieeexplore.ieee.org/document/5206848

[2] Ioffe, S., and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning (ICML), 2015. Normalization over the batch dimension, whose statistics become unreliable as the batch shrinks. https://arxiv.org/abs/1502.03167

[3] Wu, Y., and He, K. Group Normalization. European Conference on Computer Vision (ECCV), 2018. Normalization over channel groups rather than the batch, with behaviour independent of batch size. https://arxiv.org/abs/1803.08494

[4] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal Loss for Dense Object Detection. IEEE International Conference on Computer Vision (ICCV), 2017. The treatment of extreme foreground-background imbalance in dense prediction. https://arxiv.org/abs/1708.02002

[5] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. The encoder-decoder with skip connections for dense per-pixel prediction on non-photographic imagery from few labels. https://arxiv.org/abs/1505.04597
