The Tilt of a Real Scan: Perspective Augmentation for Field Photographs

A synthetic well log knows exactly where its depth axis is. The renderer that draws it places every ruled line on an integer pixel row, runs each curve along a column it chose, and hands the segmentation network a canvas where the grid is square to the frame because nothing else was ever possible. That is the quiet luxury of training on simulation: the geometry is not estimated, it is authored. A photograph of the same log, taped to a desk and captured on a field engineer's phone, has no such luxury. The page lifts at one corner, the lens sits a little off centre, and the rectangle of the page projects onto the sensor as a trapezoid. The depth grid the model learned to anchor on is suddenly leaning, and the further down the page you read, the more the lean has displaced the curve from where a square render would have put it.

This is one small, concrete instance of the simulator-to-field divide, and it is a useful one precisely because it is so narrow. The whole reality gap is a sprawling problem; this slice of it is a single geometric transform with a name. A camera viewing a planar surface from an arbitrary angle produces a perspective warp, a homography, and the cheapest way to make a model robust to a homography it will meet at test time is to apply that homography to the training data it sees at train time. This piece walks through why axis-aligned synthetic logs are exactly the wrong training distribution for keystoned field photographs, where this fits in the literature on closing simulation gaps, and how we use perspective-warp augmentation in our raster well-log digitisation network, VeerNet.

What a phone photo does to a square page

Start from the physics, because it dictates the transform. A flatbed scanner presses the page flat and captures it from directly above, so a scanned log is close to axis aligned with only a mild rotational skew from a crooked feed. A phone photo is a different acquisition channel entirely. The camera is handheld, the page is rarely flat, and the optical axis is almost never perpendicular to the paper. The result is a projective transformation of the page plane onto the image plane, and the visible signature of that transformation is keystoning: parallel edges of the page converge, the ruled depth lines that were horizontal fan out, and a curve that ran straight down a column on the synthetic canvas now drifts across columns as it descends.
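
For readers who want the transform written down, the standard form of a planar projective map is given here. The symbols (x, y for page coordinates, u, v for image coordinates, h_ij for the matrix entries) are notation chosen for this note, not taken from our pipeline.

```latex
\[
\begin{pmatrix} \tilde{u} \\ \tilde{v} \\ \tilde{w} \end{pmatrix}
=
\begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad
u = \frac{\tilde{u}}{\tilde{w}}, \quad v = \frac{\tilde{v}}{\tilde{w}}.
\]
```

When h_31 and h_32 are both zero the divisor is constant, the map is affine, and parallel lines stay parallel. Keystoning is what a non-zero h_31 or h_32 looks like: the divisor changes across the page, so parallel edges converge.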

The reason this matters more for a well log than for, say, a photographed business card is that the model is not classifying the page, it is reconstructing a thin curve as a numeric trace, and it does so by reading the curve against the depth grid. Keystoning attacks exactly the structure the network relies on. A degree or two of tilt near the top of the page is almost nothing; the same tilt accumulated over a tall canvas can push a curve several pixels off its true column by the bottom, and on a one-pixel-wide curve a several-pixel displacement is the difference between a clean trace and a broken one. The canvas dimensions make this concrete. Our synthetic logs are rendered at widths from 3,200 to 12,800 pixels and heights from 480 to 640 pixels, with two constant curves per log against a background class in the multiclass setting. A wide canvas gives a keystone more horizontal room to drag a curve across before the model loses it, which is why the same tilt angle is not equally forgiving across the width range.
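
A back-of-envelope sketch of the accumulation argument, under a deliberately crude assumption: the local effect of the keystone is treated as a uniform lean of the grid by a fixed angle, so a nominally vertical line drifts by rows × tan(angle). A real homography varies across the page, and the function name and the tilt values below are illustrative rather than measured.

```python
import math

def column_drift_px(rows_down: float, tilt_deg: float) -> float:
    """Horizontal drift of a nominally vertical line after `rows_down` rows,
    assuming the grid leans uniformly by `tilt_deg` degrees (a simplification)."""
    return rows_down * math.tan(math.radians(tilt_deg))

# Canvas heights from the text; tilt angles are illustrative assumptions.
for height in (480, 640):
    for tilt in (0.5, 1.0, 2.0):
        drift = column_drift_px(height, tilt)
        print(f"height={height}px tilt={tilt}deg drift={drift:.1f}px")
```

Under this approximation a one-degree lean moves a curve well under a pixel over the first ten rows but about 11 pixels over 640 rows, which is the sense in which a small tilt is harmless at the top of the page and fatal to a one-pixel curve at the bottom.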

The gap is not noise, it is geometry

It is tempting to lump every sim-to-real discrepancy together as domain shift and reach for the same generic remedies. That instinct undersells what is specific here. The dominant failure of an axis-aligned-only training set on field photographs is not that the photo is grainier or dimmer than the render, though it is both. It is that the photo carries a structured geometric corruption the render categorically lacks. You can add all the Gaussian blur and paper noise you like and never manufacture a keystone, because blur and noise are pixel-local and a homography is global: it moves the whole grid coherently. A model can be perfectly robust to ink fade and still fall over on a five-degree tilt, because the two corruptions live on different axes.

This distinction is exactly the one the corruption-robustness literature draws when it separates families of perturbations and measures a model against each in turn, rather than collapsing them into one accuracy number (Hendrycks and Dietterich, 2019). Geometric corruptions are their own family, and the architecture community recognised early that a network can be built to undo them. Spatial Transformer Networks made the case directly: a module can learn the affine or projective parameters that straighten an input before the rest of the network reads it, which means invariance to a warp is learnable, but only if the warp is in the training distribution to begin with (Jaderberg et al., 2015). The same period saw the homography itself treated as a quantity a network could regress from a pair of images, which underscored that the keystone is not mysterious distortion but a parametric transform with eight degrees of freedom you can sample, apply, and invert at will (DeTone et al., 2016).
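
The "sample, apply, and invert" claim can be sketched in a few lines of NumPy. This is a minimal illustration, not our pipeline code: the function names, the corner-jitter sampling scheme, and the `max_jitter` value are all assumptions, and production code would typically normalise coordinates before solving.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 8 free parameters of H (h33 fixed to 1) from four point pairs."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def sample_keystone(width, height, max_jitter, rng):
    """Sample a homography by nudging each page corner by up to `max_jitter` px."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [width - 1, height - 1], [0, height - 1]], dtype=float)
    jitter = rng.uniform(-max_jitter, max_jitter, size=(4, 2))
    return homography_from_points(corners, corners + jitter), corners

def apply_homography(H, pts):
    """Map an (N, 2) array of points through H, including the perspective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

rng = np.random.default_rng(0)
H, corners = sample_keystone(3200, 480, max_jitter=12.0, rng=rng)  # illustrative values
warped = apply_homography(H, corners)                    # apply
restored = apply_homography(np.linalg.inv(H), warped)    # invert
print(np.abs(restored - corners).max())                  # round-trip error, near zero
```

Four corner correspondences give eight equations, which is exactly enough to pin down the eight degrees of freedom once the overall scale is fixed.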

Where perspective warp sits among sim-to-real recipes

The broadest answer the field developed for training on simulation and deploying on reality is domain randomisation: rather than painstakingly matching the simulator to the real world, you randomise the simulator so widely that the real world looks like just another sample from the training distribution. Tobin and colleagues showed that randomising textures, lighting, and camera pose in a renderer was enough to transfer an object detector from simulation to a physical robot with no real images at all (Tobin et al., 2017). Tremblay and colleagues extended the same idea to richer scenes and made the point that flooding the model with implausible variation can beat carefully photorealistic rendering, because variety, not fidelity, is what forces invariance (Tremblay et al., 2018).

Perspective warp is the well-log-specific instance of camera-pose randomisation from that lineage, narrowed to the one source of variation that actually moves on our problem: the pose of the camera relative to a flat page. We are not randomising a 3D scene; there is no scene, only a planar page viewed off square. So the randomisation collapses to sampling a homography that tilts the rendered page within the range a handheld phone realistically produces, then re-rendering the synthetic log under that warp before it enters training. The pixel-perfect mask the renderer produced for free is warped by the identical homography, so the supervision stays exact through the transform, which is the part that makes synthetic data so well suited to this: there is never any question of where the curve went, because we moved it ourselves. This is also why the encoder-decoder shape we inherit lends itself to the approach. The original U-Net result leaned on elastic and geometric deformation rather than colour tricks precisely because structured imagery degrades geometrically, and a warp on the input paired with the same warp on the label is the cleanest deformation you can give it (Ronneberger et al., 2015).
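
The idea of carrying the mask through the same warp can be sketched as follows. This is a toy NumPy illustration, not VeerNet's data pipeline: the canvas is a small stand-in, the matrix `H` is a hand-picked example, `warp_nearest` is a hypothetical helper, and nearest-neighbour resampling is chosen only because it keeps class ids intact. Resampling can drop pixels of a one-pixel curve, which is one reason to re-render under the warp instead.

```python
import numpy as np

def warp_nearest(img, H, fill=0):
    """Warp a 2-D array by homography H (source -> destination) using
    inverse mapping with nearest-neighbour sampling."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ dst                 # where each output pixel came from
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full(h * w, fill, dtype=img.dtype)
    out[valid] = img[sy[valid], sx[valid]]
    return out.reshape(h, w)

# Toy stand-in for a rendered log: white page, two one-pixel-wide curves.
height, width = 64, 256
image = np.full((height, width), 255, dtype=np.uint8)
mask = np.zeros((height, width), dtype=np.uint8)   # 0 background, 1 and 2 curves
for cls, col in ((1, 80), (2, 160)):
    image[:, col] = 0
    mask[:, col] = cls

# Hand-picked example homography: a slight lean plus a small perspective term.
H = np.array([[1.0, 0.05, 0.0],
              [0.0, 1.0,  0.0],
              [0.0, 2e-4, 1.0]])

warped_image = warp_nearest(image, H, fill=255)    # pad with blank page
warped_mask = warp_nearest(mask, H, fill=0)        # pad with background class

# Ink and label still coincide pixel for pixel after the warp.
print(np.array_equal(warped_image == 0, warped_mask > 0))
```

Because both arrays are sampled through the identical inverse map, the label cannot drift relative to the image, which is the property the paragraph above relies on.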

Reading the robustness fan

The instrument below makes the tradeoff tactile. It sweeps the keystone tilt angle along the horizontal axis and reads off how curve-1 reconstruction quality, measured as R-squared between the recovered trace and ground truth, holds up as the page leans further off square. The two anchors are real. The upper anchor is the best-case curve-1 result under our Tversky-loss multiclass model on a clean, axis-aligned render, an R-squared of 0.9891, and the lower anchor is the harder mid case at 0.8126, both from the engagement's results. Those two numbers describe the same network on logs that differ only in how favourable the curve geometry is; they are the ceiling and a realistic middle, not a clean-versus-degraded pair on their own.
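
For concreteness, here is one common way to compute an R-squared between a recovered trace and its ground truth. The text does not say which definition the engagement used, so this sketch assumes the coefficient of determination. The function name and the toy traces are illustrative.

```python
import numpy as np

def r_squared(truth, recovered):
    """Coefficient of determination of `recovered` against `truth`."""
    truth = np.asarray(truth, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    ss_res = np.sum((truth - recovered) ** 2)        # unexplained variation
    ss_tot = np.sum((truth - truth.mean()) ** 2)     # total variation in the truth
    return 1.0 - ss_res / ss_tot

# Toy traces: the recovered curve drifts further from the truth with depth.
depth = np.linspace(0.0, 1.0, 200)
truth = np.sin(6.0 * depth)
recovered = truth + 0.2 * depth
print(r_squared(truth, truth), r_squared(truth, recovered))
```

A perfect recovery scores exactly 1, and a drift that grows with depth, the signature of an uncorrected lean, pulls the score below it.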

What the fan adds is the second axis nobody sees in a single accuracy figure: how that quality decays as a real-world keystone is introduced, and how far perspective-warp augmentation flattens the decay. Drag the tilt lever and the unaugmented curves bend downward as the angle grows; flip the augmentation toggle and the falloff lifts, because a model that trained on tilted pages meets a tilted page as a familiar input rather than a novel one. The shaded recovery window between the two is the part of the score the warp buys back. The canvas-width stepper closes the loop with the dimensions above: widen the synthetic canvas toward 12,800 pixels and the same tilt costs more, because a wider page gives the keystone more horizontal travel to displace a one-pixel curve.

Synthetic training logs are rendered perfectly axis aligned, but a phone photo of a paper log is keystoned, so the ruled depth grid the network anchors on tilts a few degrees off square. Drag the keystone-tilt lever and the curve-1 reconstruction-quality curves fan out. The top anchor is the best-case Tversky multiclass result on a clean render (R-squared 0.9891) and the lower anchor is the harder mid case (0.8126); the shaded window is what perspective-warp augmentation buys back, and the toggle turns that augmentation on and off. A wider synthetic canvas (widths run 3,200 to 12,800 px, heights 480 to 640 px) gives a keystone more pixels to drag a curve across, so the same angle bites harder. The two anchor R-squared values and the canvas dimensions are sourced from the engagement archive; the per-angle robustness falloff, the tilt axis, and the recovery band are an illustrative robustness model.

Two honesties about the exhibit. The two R-squared anchors and the 3,200-to-12,800-pixel canvas range are measured figures from the work; the per-angle falloff curves, the tilt axis itself, and the recovery band are an illustrative robustness model, flagged in the instrument's method line, meant to show the shape of the argument rather than to report a controlled tilt sweep. And the augmentation toggle is a teaching device: it contrasts a model that saw warps in training against one that did not, which is the comparison the whole piece is about, not a second measured ablation table.

Why this is the cheap bridge, not the only one

There are heavier ways to handle keystoned inputs. You could detect the page corners and rectify every photo before inference, undoing the homography with a measured one. You could bolt a spatial transformer onto the front of the network and let it learn to straighten inputs end to end. Both work, and both add a moving part that can fail on its own: corner detection breaks on a torn or partially photographed page, and a learned rectifier is one more thing to train and debug. Perspective-warp augmentation asks for none of that at inference. It is a transform applied once, in the data pipeline, on a synthetic canvas where the ground-truth mask can ride along through the same warp at zero labelling cost. The model that comes out the other side simply does not find a tilted page surprising, and there is nothing extra to run when a field photo arrives.

That cheapness is the entire reason it earns priority among the transforms worth adding. The loss it feeds into has to be able to see the thin curve the warp is meant to protect, which is why we pair it with a precision-recall tunable objective, Tversky, for the heavily imbalanced thin-structure segmentation a one-pixel curve demands (Salehi et al., 2017). Augmentation and loss are a matched pair here: there is no point manufacturing a realistic keystone if the loss cannot register the displaced curve the keystone was supposed to make the model robust to.
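
A minimal sketch of the Tversky objective for a single class, following the form in Salehi et al. (2017): true positives divided by true positives plus weighted false positives and false negatives. The binary NumPy formulation, the function name, and the `alpha`/`beta` values are assumptions for illustration. They are not the settings used in our multiclass model.

```python
import numpy as np

def tversky_loss(prob, target, alpha=0.3, beta=0.7, eps=1e-7):
    """1 - Tversky index. `alpha` weights false positives, `beta` false negatives;
    alpha = beta = 0.5 reduces to the Dice loss."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    tp = np.sum(prob * target)
    fp = np.sum(prob * (1.0 - target))
    fn = np.sum((1.0 - prob) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# Toy thin structure: 10 curve pixels in 1,000, and a prediction that misses half.
target = np.zeros(1000)
target[:10] = 1.0
missed_half = target.copy()
missed_half[:5] = 0.0

print(tversky_loss(missed_half, target, alpha=0.5, beta=0.5))  # Dice-equivalent
print(tversky_loss(missed_half, target, alpha=0.3, beta=0.7))  # misses cost more
```

Raising `beta` makes a missed curve pixel more expensive than a spurious one, which is the lever that matters when the curve occupies a tiny fraction of the canvas.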

The one-line recipe

Before reaching for page-corner detection or a learned rectifier, sample a realistic homography, apply it to the synthetic canvas and its free pixel-perfect mask together, and train on the result. The model meets keystoning as a known input, and inference stays a single forward pass with no extra stage to break.

Synthetic data gives you a perfect world and the field gives you a leaning one, and the distance between them, on this particular problem, is mostly a few degrees of tilt you can author back into training for almost nothing. Match the transform to the geometry the camera actually imposes, carry the label through the warp so the supervision never drifts, and the squareness the renderer gave you for free stops being a liability the moment a phone gets involved.

References

[1] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu. Spatial Transformer Networks. NeurIPS 2015. https://arxiv.org/abs/1506.02025

[2] D. DeTone, T. Malisiewicz, A. Rabinovich. Deep Image Homography Estimation. RSS Workshop 2016. https://arxiv.org/abs/1606.03798

[3] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IROS 2017. https://arxiv.org/abs/1703.06907

[4] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, S. Birchfield. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. CVPR Workshops 2018. https://arxiv.org/abs/1804.06516

[5] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. https://arxiv.org/abs/1505.04597

[6] D. Hendrycks, T. Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR 2019. https://arxiv.org/abs/1903.12261

[7] S. S. M. Salehi, D. Erdogmus, A. Gholipour. Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. MLMI Workshop, MICCAI 2017. https://arxiv.org/abs/1706.05721
