Class Imbalance Beyond Loss Functions: Sampling and Tiling

Most write-ups of class imbalance arrive at the loss function and stop there. Swap cross-entropy for a weighted version, attach a multiplier to the rare class, perhaps reach for focal or Dice, and the problem is declared handled. That advice is correct as far as it goes, and it does not go far enough. The loss is the last thing in the pipeline to touch the data. By the time it runs, the batch has already been assembled, and if that batch is a faithful copy of the raw frame then the loss is being asked to repair a distribution that sampling could have softened upstream for free. This post is about the upstream half: how you draw and crop the data before the optimiser ever computes a gradient, and why on a scanned well log that decision moves more than any class weight we tried.

Where the imbalance is actually born

A digitised raster log is a tall narrow grayscale strip, and the curves a petrophysicist wants are thin ink traces a few pixels wide. Everything else is paper, grid, and margin. Set up pixel segmentation over three classes (background plus two curves) and the label tensor is overwhelmingly one value: background runs on the order of 97 percent of the pixels, and the two curve classes share the thin remainder. That ratio is roughly thirty-to-one against the thing you came to find.

The standard reading is that this is a loss problem, and the standard fix is to reweight. There is a quieter reading that sits one step earlier. The imbalance the loss sees is not a property of the dataset; it is a property of how you sampled the dataset into a batch. If you feed whole frames, the batch inherits the 97 percent figure exactly. If you crop, oversample, or stratify, the batch can carry a very different ratio while still being drawn from the same images. The loss never sees the archive. It sees whatever your data loader handed it, and that hand-off is yours to shape.
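One way to make that hand-off concrete is to count labels as the loader yields them. The sketch below is a minimal audit, assuming NumPy and integer label arrays with 0 for background and 1 and 2 for the two curves; `served_class_shares` and `toy_frame` are hypothetical names, and the synthetic frames merely stand in for a real whole-frame loader.

```python
import numpy as np

def served_class_shares(batches, num_classes=3):
    """Fraction of label pixels per class across everything the loader yields."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for labels in batches:
        # Flatten each batch of label maps and tally how often each class id occurs.
        counts += np.bincount(np.asarray(labels).ravel(), minlength=num_classes)
    return counts / counts.sum()

def toy_frame(height=400, width=100):
    """Synthetic label frame (an assumption for illustration, not real log data)."""
    frame = np.zeros((height, width), dtype=np.int64)
    frame[:, 30:32] = 1   # curve 1: a 2-pixel vertical trace
    frame[:, 70:71] = 2   # curve 2: a 1-pixel vertical trace
    return frame

# Four batches of two whole frames each, standing in for a whole-frame loader.
whole_frame_batches = [np.stack([toy_frame(), toy_frame()]) for _ in range(4)]
shares = served_class_shares(whole_frame_batches)
print(dict(zip(["background", "curve1", "curve2"], shares.round(3).tolist())))
```

Pointing the same function at a cropping or stratified loader would report whatever ratio that loader serves instead, which is the number the loss actually trains against.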

The literature has known this for a long time without always saying it loudly. Buda and colleagues ran a systematic study of imbalance in convolutional networks and found that simple oversampling of the minority class was, across their settings, the most reliable single remedy, often matching or beating loss-level corrections [1]. Focal loss, the canonical loss-side answer, was itself motivated by the observation that the easy abundant background dominates the gradient in dense detection [2]; the data-side reading of that same observation is that you can starve the easy class at the source instead of down-weighting it after the fact. The two halves are not rivals. They compose.

Two ways to redraw the batch

There are two levers worth separating because they act on different things.

The first is the crop, or tile. Instead of presenting a whole 12,800-pixel strip, you train on smaller windows cut from it. This is not only a memory convenience, though on logs that span thousands of pixels it is also that; the overlap-tile strategy was part of the original U-Net design precisely because biomedical images were too large to feed whole [3]. The relevant effect for imbalance is geometric. A curve passing through a small tile occupies a larger fraction of that tile than it does of the full frame, because the background it is competing against has been cropped away. Shrink the window and the same ink becomes a bigger share of the pixels. You have not changed a single label; you have changed the denominator.
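The geometry is easy to check on a toy frame. The sketch below assumes NumPy and a synthetic 512 by 256 label array with a single 2-pixel vertical trace; the helper names `curve_share` and `tiles` are hypothetical, and the shares are measured only over tiles that actually contain ink.

```python
import numpy as np

def curve_share(labels):
    """Fraction of pixels in a label array that are not background (class 0)."""
    return float(np.count_nonzero(labels)) / labels.size

def tiles(labels, tile_h, tile_w):
    """Yield non-overlapping tiles; edge remainders are dropped for simplicity."""
    h, w = labels.shape
    for top in range(0, h - tile_h + 1, tile_h):
        for left in range(0, w - tile_w + 1, tile_w):
            yield labels[top:top + tile_h, left:left + tile_w]

# Toy frame: 512 rows by 256 columns with one 2-pixel vertical trace.
frame = np.zeros((512, 256), dtype=np.uint8)
frame[:, 100:102] = 1

full_share = curve_share(frame)  # 2 / 256 of the whole frame is ink
for size in (128, 64, 32):
    inked = [t for t in tiles(frame, size, size) if t.any()]
    best = max(curve_share(t) for t in inked)
    print(f"tile {size:>3}: whole frame {full_share:.4f}, inked tile {best:.4f}")
```

Each halving of the tile width doubles the curve share inside the tiles the trace passes through, with no label touched.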

The second is stratified sampling, of which oversampling is the bluntest form. Rather than drawing tiles uniformly at random, you bias the draw toward tiles that contain curve pixels. A tile that is pure background teaches the model almost nothing it does not already know, so you draw it less often; a tile straddling a curve carries the gradient you actually want, so you draw it more. Over an epoch this raises the share of curve pixels the optimiser sees without altering any individual example. Oversampling sits inside the broader toolkit of data-level interventions catalogued in the augmentation literature, where resampling and synthesis are treated as first-class alternatives to loss surgery rather than afterthoughts [4].
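A biased draw of this kind can be sketched in a few lines. The example assumes NumPy, a precomputed boolean flag per tile saying whether it contains curve pixels, and a hypothetical `oversample` factor of 4; none of these values come from the pipeline described here.

```python
import numpy as np

def tile_sampling_probs(tile_has_curve, oversample=4.0):
    """Draw probability per tile: curve-bearing tiles get `oversample` times the weight."""
    weights = np.where(np.asarray(tile_has_curve, dtype=bool), oversample, 1.0)
    return weights / weights.sum()

rng = np.random.default_rng(0)

# Eight tiles, only the first of which contains curve pixels.
has_curve = np.array([True, False, False, False, False, False, False, False])
probs = tile_sampling_probs(has_curve, oversample=4.0)

# One "epoch" of 10,000 draws with replacement, biased toward the inked tile.
epoch = rng.choice(len(has_curve), size=10_000, replace=True, p=probs)
print("probability of the inked tile:", probs[0])
print("observed frequency in the epoch:", (epoch == 0).mean())
```

In a PyTorch pipeline the same per-tile weights could be handed to `torch.utils.data.WeightedRandomSampler`, which draws indices in proportion to the weights it is given.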

What the levers do to the ratio

The point of separating the two levers is that you can watch each one move the imbalance independently, and then stack them. The board below makes the raw whole-frame split visible as a reference and lets you drag the tile size and the oversampling factor to see the rebalanced curve share the model would actually be trained on. The orange tick is the raw share the loss would inherit if you fed whole frames; the teal extent is what sampling hands it instead.

Interactive board: the data-side half of the class-imbalance fix on scanned well logs. Controls are a tile-size lever, which crops a smaller window and raises the per-tile curve share, and an oversampling lever, which draws curve-bearing tiles more often and raises the share of curve pixels the optimiser sees each epoch. The teal stacked bar shows the rebalanced class share (background, curve1, curve2) fed to the model, and the orange dashed tick marks the raw whole-frame curve share, so the gap between them is what sampling buys before the loss is ever reweighted. Reference values: a whole-frame split of roughly 97 percent background, a peak Intersection over Union of 0.51 for a model trained on that raw split, and a positive-class weight of 42 for weighted binary cross-entropy alone. The four anchor figures are sourced from the engagement archive; the rebalance response surface is an illustrative model of how tiling and oversampling move the sampled foreground fraction.

Two things are worth reading off it. First, tiling and oversampling both help, and they help most when combined, because they attack different denominators: tiling shrinks the per-example background, oversampling shrinks the per-epoch background. Second, neither lever pretends to make the problem disappear. Even pushed hard, a scanned log stays a background-leaning input, because the curves are genuinely thin and no honest crop turns a one-pixel trace into half the frame. What sampling does is hand the loss a thirty-to-one problem already softened to something far more tractable, so the class weight that follows has less work to do and less room to overshoot.
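A toy closed form shows how the two levers interact. It assumes a deliberately simple geometry that is not the board's model: one vertical trace of width `trace_w` in a frame of width `frame_w`, cut into equal column tiles with the trace falling inside exactly one of them. The function name and all numbers are hypothetical.

```python
def sampled_curve_share(frame_w, trace_w, tile_w, oversample=1.0):
    """Expected curve-pixel share per drawn tile under the toy geometry above.

    `oversample` is how many times more likely the inked tile is to be drawn
    than any single background-only tile.
    """
    n_tiles = frame_w // tile_w
    # Probability that a single draw lands on the one tile containing the trace.
    p_inked = oversample / (oversample + (n_tiles - 1))
    # Background-only tiles contribute zero curve pixels to the expectation.
    return p_inked * (trace_w / tile_w)

for tile_w in (256, 64, 32):
    for k in (1.0, 4.0, 16.0):
        share = sampled_curve_share(256, 2, tile_w, oversample=k)
        print(f"tile {tile_w:>3}, oversample {k:>4}: {share:.4f}")
```

In this toy the tile size sets the ceiling, since the share can never exceed `trace_w / tile_w`, and the oversampling factor decides how close the epoch gets to that ceiling. With a uniform draw over all tiles the expected share stays at the whole-frame value, which is one way to see why the levers pay off most in combination.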

Why doing this first changes the weight you need

On our own VeerNet pipeline for a Texas onshore operator, the do-nothing baseline is the honest anchor: a model trained on the raw split peaked at an Intersection-over-Union of 0.51, and the weighted binary cross-entropy we reached for had to push the positive-class weight all the way to 42 to stop the model collapsing to all-background. Forty-two is a large weight. It is roughly the inverse class frequency, which is the natural starting point, but a weight that large is also a sharp instrument: it lifts recall fast and tends to drag precision down with it, because every curve pixel now screams and the model fires generously to avoid the penalty.
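The inverse-frequency arithmetic can be written out directly. This is a sketch of the common negatives-over-positives heuristic, not a record of how the weight of 42 was actually chosen; both function names are hypothetical.

```python
def inverse_frequency_weight(positive_share):
    """Negative-to-positive pixel ratio, a common starting value for the positive-class weight."""
    if not 0.0 < positive_share < 1.0:
        raise ValueError("positive_share must be strictly between 0 and 1")
    return (1.0 - positive_share) / positive_share

def implied_positive_share(weight):
    """Positive share at which the inverse-frequency weight equals `weight`."""
    return 1.0 / (1.0 + weight)

print(round(inverse_frequency_weight(0.03), 1))  # 3 percent curve pixels -> 32.3
print(round(implied_positive_share(42), 4))      # weight of 42 -> 0.0233
```

Read backwards, a weight of 42 corresponds to a curve share of about 2.3 percent, in the same range as the thin remainder left by a background of roughly 97 percent.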

That is the connection between the two halves. A weight is a correction applied to a ratio. If sampling has already reduced the ratio, the correction you need is smaller, and a smaller correction has gentler side effects. The data-side levers do not replace the loss-side ones; they change the operating point the loss-side ones start from. Reweighting from a five-to-one batch is a different, safer act than reweighting from a thirty-to-one frame, even though the loss code is identical. The order of operations is the whole argument: shape the batch, then weight what is left.
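To see why the size of the weight matters, it helps to price a missed curve pixel against a false alarm. The sketch below is a per-pixel weighted binary cross-entropy on a raw logit, written with the standard library only; the `pos_weight` name is modeled on the argument of the same name in PyTorch's `BCEWithLogitsLoss`, and the logits are arbitrary illustrative values.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def weighted_bce(logit, target, pos_weight=1.0):
    """Binary cross-entropy on a logit, with the positive term scaled by pos_weight."""
    log_p = -softplus(-logit)      # log(sigmoid(logit))
    log_not_p = -softplus(logit)   # log(1 - sigmoid(logit))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)

false_alarm = weighted_bce(2.0, 0.0)  # confident "curve" on a background pixel
for w in (5.0, 30.0):
    miss = weighted_bce(-2.0, 1.0, pos_weight=w)  # confident "background" on a curve pixel
    print(f"pos_weight {w:>4}: a miss costs {miss / false_alarm:.0f}x a false alarm")
```

The loss code is the same in both rows; only the exchange rate between the two kinds of error changes, and that exchange rate is what pulls precision down as the weight grows.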

A note on what tiling costs

None of this is free, and pretending otherwise is how teams get surprised in validation. Cropping to small tiles severs long-range context, and on a log the depth axis carries real structure: a curve's value at one depth constrains its plausible value at the next. Tile too small and you hand the model windows so local it cannot use that continuity, which can hurt the very thin-structure recovery you were trying to help. Oversampling has its own trap: lean on it too hard and the model overfits the handful of curve-rich tiles you keep redrawing, learning their idiosyncrasies rather than the general shape of a curve. The overlap-tile strategy exists partly to buy back context at the seams [3], and a sensible oversampling factor is a tuned hyperparameter, not a dial turned to maximum. The levers have a sweet spot, and the board is meant to build intuition for where it sits, not to argue that more is always better.
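One cheap way to keep some context at the seams is to let neighbouring tiles overlap along the depth axis. The sketch below computes start offsets for a plain sliding-window layout; it is a simplification rather than the mirrored-border scheme described in [3], and the function name and sizes are hypothetical.

```python
def overlapping_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering [0, length), sharing at least `overlap` pixels."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single tile; the caller would need to pad it
    stride = tile - overlap
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        # Add a final tile flush with the end so no depth samples are dropped.
        starts.append(length - tile)
    return starts

print(overlapping_starts(1000, 256, 64))  # [0, 192, 384, 576, 744]
```

A larger overlap buys more shared context per seam at the price of more tiles per log, so it is tuned alongside the tile size, not fixed once.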

Key takeaways

Imbalance is born in the data loader, not the loss. A scanned log is roughly 97% background (about thirty-to-one against the two curve classes), and a batch of whole frames inherits that ratio exactly; the loss only ever sees what sampling handed it.
Tiling shrinks the per-example denominator. A curve fills a larger fraction of a small crop than of a full 12,800-pixel strip, so cropping raises the foreground share with no relabeling. The overlap-tile idea traces to the original U-Net.
Oversampling shrinks the per-epoch denominator. Biasing the draw toward curve-bearing tiles raises the share of curve pixels seen each epoch; Buda et al. found simple minority oversampling among the most reliable single remedies for CNN imbalance.
The two levers stack and they bound the work the loss must do. On our VeerNet pipeline the raw baseline peaked at IoU 0.51 and demanded a weighted-BCE positive-class weight of 42; reweighting from a pre-softened batch needs a smaller, safer correction.
Tiling has a cost. Crops too small sever the depth-axis continuity thin curves rely on, and aggressive oversampling overfits the redrawn tiles. Both levers have a sweet spot rather than a maximum.

The reflex this should install is a question, asked before you ever open the loss file: what ratio is my data loader actually serving the model? Answer that honestly and a surprising amount of what gets blamed on the loss turns out to be a sampling decision in disguise. The loss is the last lever in the chain, and the worst one to overload when two cheaper ones sit upstream of it, untouched, waiting to be turned.

References

[1] Buda, M., Maki, A., and Mazurowski, M. A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106 (2018), 249-259. Finds simple minority oversampling among the most consistent remedies across imbalance settings. https://arxiv.org/abs/1710.05381

[2] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal Loss for Dense Object Detection. ICCV (2017). Motivates the easy-background-dominates-the-gradient view that the data side answers by starving the easy class at the source. https://arxiv.org/abs/1708.02002

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). Introduces the overlap-tile strategy for images too large to feed whole. https://arxiv.org/abs/1505.04597

[4] Shorten, C., and Khoshgoftaar, T. M. A survey on Image Data Augmentation for Deep Learning. Journal of Big Data 6, 60 (2019). Treats resampling and synthesis as first-class data-level interventions alongside augmentation. https://doi.org/10.1186/s40537-019-0197-0
