The Coffee Ring and the Crease: Recovering Folded, Stained, Skewed Field Scans

The architecture diagrams we drew for VeerNet, our raster-log digitization model, all started at the same place: an image goes in, a segmentation mask comes out. That box on the left, the one labelled "input image," is a lie of omission. It quietly assumes the input is an image in the sense the segmenter expects, a flat, square, evenly lit picture of a well log. The scans the operator actually handed us were nothing of the kind. They were scans of paper that had spent three decades folded in a drawer, laid on a flatbed at whatever angle the technician happened to drop them, and in more than one case stamped with the perfect brown circle of a coffee cup. Before any of the segmentation work that fills the rest of these write-ups could mean anything, that physical damage had to be undone. This is the part of the pipeline that never makes it into the diagram, and the part that decided whether everything after it had a chance.

The sheet, not the model, was the first thing broken

It is worth being concrete about what arrived. The full archive we were working against held 136,771 raster TIF images and 7,781 LAS files, the latter being the small minority of logs that had already been digitized into curves at some point in their life. The vast majority were pictures, and they had the full catalogue of paper-document afflictions. Skew was nearly universal, because a human placing a long fold-out log on a scanner glass does not align it to the pixel. Fold lines ran across most of the older sheets, sometimes as a bright crease where the paper had cracked and the scanner light blew through, sometimes as a dark valley of shadow, and in both cases the crease landed straight across the curve traces and severed them. And then there were the stains: coffee rings, water marks, the brown gradient of a sheet that had yellowed unevenly, and the soft diagonal shadow a flatbed lid leaves when it does not close flush.

What made these afflictions dangerous was not that they looked bad. It was that they looked like signal. A binariser trying to separate ink from paper has no concept of "coffee" or "shadow"; it sees darker pixels and calls them ink. A fold that brightens a band of the page erases the curve there as surely as if someone had taken an eraser to it. We were not dealing with cosmetic noise that a robust model would shrug off. We were dealing with damage that actively impersonated the thing we were trying to detect, or actively destroyed it, and it did so before the network ever ran.
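To make the impersonation concrete, here is a minimal sketch on a synthetic page. The page size, the paper and ink intensities, the stain and crease positions, and the 0.5 threshold are all made-up illustration values, not anything measured on the archive, and the fixed global threshold is only a stand-in for whatever binariser a pipeline might use.

```python
import numpy as np

# Synthetic 64x64 page: bright paper (0.9) with a dark two-pixel-wide trace (0.1).
# Every intensity and position here is a hypothetical illustration value.
page = np.full((64, 64), 0.9)
trace = np.zeros((64, 64), dtype=bool)
trace[:, 30:32] = True
page[trace] = 0.1

# Coffee stain: a disc that halves the brightness of the paper it covers.
yy, xx = np.mgrid[0:64, 0:64]
stain = (yy - 20) ** 2 + (xx - 45) ** 2 <= 8 ** 2
page[stain] *= 0.5

# Bright crease: three rows where the scanner light blew through cracked paper.
page[40:43, :] = 1.0

# A naive global binariser: anything darker than the threshold is called ink.
ink = page < 0.5

false_ink = ink & ~trace   # stained paper mistaken for curve
missed = trace & ~ink      # real curve erased by the crease
print("false ink pixels:", int(false_ink.sum()))
print("missed trace pixels:", int(missed.sum()))
```

Both failure modes show up at once: the stained paper falls below the threshold and is counted as ink, while the trace pixels under the crease vanish from the mask, and neither error involves the segmenter at all.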

Why we refused to fake the damage and move on

The tempting shortcut, and the one a lot of the literature reaches for, is to leave the real scans alone and model the degradation synthetically instead: take clean images, apply random rotations and simulated creases and procedural stains, and trust the model to generalise back to the real mess. We did build a large synthetic pipeline, and it earned its place elsewhere in this project for manufacturing ground truth at a scale the real archive never could. But we drew a hard line: synthetic degradation teaches a segmenter robustness; it does not repair the actual sheet in front of you. A synthetic coffee ring is one you chose to add and therefore know how to ignore. A real one is geometry you have never seen, on paper that aged in its own way, and no amount of procedural noise covers the long tail of how a physical object can be wrecked.

So we set the goal narrowly and operationally. Our job in this stage was not to improve the model. It was to hand the model a sheet that was at least as readable as the clean scans it had been validated on, by repairing the specific, physical, real damage on each incoming page. The success criterion was equally narrow: on a repaired scan, the downstream segmenter should score as close as possible to what it reached on clean inputs, which on our best runs peaked at an intersection-over-union of 0.51, a recall of 0.97, and an F1 of 0.55. Those numbers are a clean-image ceiling. The entire preprocessing stage exists to get a damaged sheet back up to it.
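For readers who want the three scores pinned down, the sketch below computes them for a single pair of binary masks using the standard pixel-wise definitions. The function name `mask_scores` and the toy masks are hypothetical. How the engagement aggregated its scores across images and runs is not stated here, so treating each score as a per-mask quantity is an assumption, and the reported peaks may come from different runs.

```python
import numpy as np

def mask_scores(pred, truth):
    """Pixel-wise IoU, recall, precision and F1 for two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # curve pixels correctly predicted
    fp = int(np.sum(pred & ~truth))   # background predicted as curve
    fn = int(np.sum(~pred & truth))   # curve pixels that were missed

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero on empty masks.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Toy example: three true curve pixels, two found, one false alarm.
truth = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
pred = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
scores = mask_scores(pred, truth)
print(scores)
```

A high recall paired with a much lower IoU, as in the ceiling quoted above, is the signature of a mask that finds nearly all of the curve but also marks a good deal of background as curve.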

Three repairs, in the only order that worked

We settled on three stages, and the order between them was not a matter of taste. It fell out of how each kind of damage interacts with the others.

Deskew came first, always. A rotated sheet poisons everything downstream, because every later step that reasons about rows, columns, or straight reference lines assumes the page is square. We estimated the dominant skew angle from the long oblique structures on the page, the tracks and gridlines a log is full of, in the spirit of the classic projection and oblique-structure methods for document skew [1], and rotated the page back to square before touching anything else. The decision to deskew first is the single most consequential one in the stage, because a fold-line repair or a stain estimate computed on a tilted page is computed in the wrong coordinate frame and has to be thrown away once you straighten it.
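One way to picture the skew estimate is a projection-profile search: for each candidate angle, project the ink pixels perpendicular to that angle and keep the angle whose profile is most sharply peaked, since gridlines collapse into single bins only when the candidate matches their tilt. The sketch below is an illustration in that spirit, not the estimator used in the engagement. The names `estimate_skew_deg` and `tilted_gridlines`, the ±10 degree search range, the 0.1 degree step, and the sum-of-squares score are all assumptions.

```python
import numpy as np

def estimate_skew_deg(ink, max_deg=10.0, step_deg=0.1):
    """Estimate the tilt of near-horizontal lines in a boolean ink mask.

    Returns the angle in degrees, in image coordinates (y grows downward),
    such that lines follow y = y0 + x * tan(angle).
    """
    ys, xs = np.nonzero(ink)
    if ys.size == 0:
        raise ValueError("no ink pixels to estimate skew from")
    h, w = ink.shape
    diag = int(np.ceil(np.hypot(h, w)))
    best_deg, best_score = 0.0, -1.0
    for deg in np.arange(-max_deg, max_deg + step_deg / 2, step_deg):
        t = np.deg2rad(deg)
        # Offset of each ink pixel measured perpendicular to the candidate direction.
        r = ys * np.cos(t) - xs * np.sin(t)
        hist, _ = np.histogram(r, bins=2 * diag, range=(-diag, diag))
        # A sharply peaked profile has a large sum of squared bin counts.
        score = float(np.sum(hist.astype(np.float64) ** 2))
        if score > best_score:
            best_deg, best_score = float(deg), score
    return best_deg

def tilted_gridlines(h=400, w=600, spacing=40, tilt_deg=3.0):
    """Synthetic test page: evenly spaced gridlines tilted by a known angle."""
    ink = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for y0 in range(spacing, h - spacing, spacing):
        ys = np.round(y0 + xs * np.tan(np.deg2rad(tilt_deg))).astype(int)
        ok = (ys >= 0) & (ys < h)
        ink[ys[ok], xs[ok]] = True
    return ink

page = tilted_gridlines(tilt_deg=3.0)
print("estimated skew (deg):", round(estimate_skew_deg(page), 1))
```

Rotating the page by the negative of the estimated angle would then square it up; the rotation itself is left to whichever image library the pipeline already uses. On a sparse or warped sheet the profile has no single sharp peak, and the search returns an angle that should not be trusted.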

Fold-line repair came second. With the page square, a crease is a roughly horizontal or vertical band where the curve has been severed, and the task is to reconnect the trace across it without inventing a curve that was never there. We treated the crease as a region to inpaint and bridged the gap from the surrounding ink, drawing on the fast-marching inpainting of Telea [2] for the thin, well-bounded creases and the exemplar-based region-filling of Criminisi and colleagues [3] where the damaged band was wide enough that simple diffusion would have smeared it. The honest constraint we held throughout was that inpainting fills a gap from its neighbourhood; it does not resurrect information that the fold destroyed. Where a crease had obliterated a long stretch of a curve, we marked it as a genuine gap and let the later spline-bridging stage carry it, rather than fabricating a confident line through a region where we had nothing.
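The routing rule in that paragraph, fill narrow creases and flag wide ones as gaps, can be sketched as follows. To keep the example dependency-free, the fill is a plain four-neighbour diffusion, which is a much simpler stand-in and is neither Telea's fast-marching method [2] nor Criminisi's exemplar-based filling [3]. The function names, the assumption that the crease rows are already known, and the `max_band_rows` cutoff are hypothetical.

```python
import numpy as np

def diffuse_fill(img, mask, iters=300):
    """Fill masked pixels by repeatedly averaging their four neighbours."""
    out = img.astype(np.float64).copy()
    out[mask] = out[~mask].mean()  # neutral starting guess inside the hole
    for _ in range(iters):
        p = np.pad(out, 1, mode="edge")
        avg = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        out[mask] = avg[mask]      # pixels outside the mask are never modified
    return out

def repair_crease(img, row_start, row_stop, max_band_rows=6):
    """Bridge a horizontal crease, or flag it as a gap when it is too wide.

    Returns (image, gap_mask). gap_mask is all False when the crease was
    filled, and marks the whole band when it was left for a later stage.
    """
    band = np.zeros(img.shape, dtype=bool)
    band[row_start:row_stop, :] = True
    if row_stop - row_start > max_band_rows:
        # Too wide for the neighbourhood to constrain the fill: do not invent ink.
        return img.astype(np.float64).copy(), band
    return diffuse_fill(img, band), np.zeros(img.shape, dtype=bool)

# Toy page: paper at 0.9, a vertical trace at 0.1, and a bright 3-row crease.
page = np.full((40, 40), 0.9)
page[:, 19:22] = 0.1
page[18:21, :] = 1.0

repaired, gap = repair_crease(page, 18, 21)
print("trace inside crease after repair:", round(float(repaired[19, 20]), 2))
wide, wide_gap = repair_crease(page, 10, 30)
print("wide crease flagged rows:", int(wide_gap.any(axis=1).sum()))
```

Even on the narrow crease, the diffusion fill returns a trace that is fainter than the ink on either side. The fill only averages what its neighbourhood offers, which is why a wide band is handed on as a marked gap instead of being filled.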

Stain and shadow normalisation came last. With the page square and the creases bridged, the remaining problem was illumination and contamination: the coffee ring, the water mark, the diagonal lid-shadow, the uneven yellowing. We estimated the slowly varying background (the shading and the stain gradients) and divided it out, so that what remained was the high-frequency ink on a flat field. This is the same move that adaptive document binarisation makes when it thresholds against a local rather than a global background [4], and that document-shadow-removal work makes when it models and removes the illumination surface before reading the page [5]. Doing this after deskew and fold-repair, rather than before, meant the background estimate was computed on a page whose geometry was already correct, so the stain model was not fighting a tilt or a severed curve at the same time.
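The estimate-and-divide step can be sketched in a few lines. Here the background is approximated by a high percentile of each tile, on the assumption that most pixels in a tile are paper, and then interpolated back to full resolution. This is a crude stand-in for illustration and is not the method of [4] or [5]. The function names, the tile size, the percentile, and the synthetic lid-shadow page are all assumptions.

```python
import numpy as np

def estimate_background(gray, tile=16, pct=90):
    """Coarse paper-brightness surface: per-tile percentile, bilinearly upsampled."""
    h, w = gray.shape
    ny, nx = h // tile, w // tile
    if ny == 0 or nx == 0:
        raise ValueError("image is smaller than one tile")
    tiles = gray[: ny * tile, : nx * tile].reshape(ny, tile, nx, tile)
    # A high percentile skips the dark ink and keeps the local paper level.
    coarse = np.percentile(tiles, pct, axis=(1, 3))
    cy = (np.arange(ny) + 0.5) * tile   # tile centres along y
    cx = (np.arange(nx) + 0.5) * tile   # tile centres along x
    rows, cols = np.arange(h), np.arange(w)
    along_x = np.stack([np.interp(cols, cx, coarse[i]) for i in range(ny)])
    return np.stack([np.interp(rows, cy, along_x[:, j]) for j in range(w)], axis=1)

def flatten_field(gray, eps=1e-6):
    """Divide out the estimated background so paper sits near 1.0 everywhere."""
    bg = estimate_background(gray)
    return np.clip(gray / np.maximum(bg, eps), 0.0, 1.0)

# Synthetic page: a diagonal shadow from 0.9 down to 0.5, plus a vertical trace
# whose ink is 20% of the local paper brightness.
yy, xx = np.mgrid[0:128, 0:128]
shade = 0.9 - 0.4 * (yy + xx) / 254.0
trace = np.zeros((128, 128), dtype=bool)
trace[:, 64:66] = True
page = np.where(trace, 0.2 * shade, shade)

raw_false = int(((page < 0.6) & ~trace).sum())
flat = flatten_field(page)
flat_false = int(((flat < 0.6) & ~trace).sum())
print("false ink before:", raw_false, "after:", flat_false)
```

The sketch works only because the shadow changes slowly relative to the tile size. A sharp-edged stain narrower than a tile would survive the division, which is the same separability assumption the stage itself relies on.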

Reading the recovery climb

The exhibit below is the bench we used to reason about this stage, and to explain it to a petrophysicist who, reasonably, did not care about inpainting algorithms and only wanted to know whether the curve would come back. It draws a single damaged field scan on the left: skewed on the platen, creased by a fold that cuts the trace, and marked by a coffee ring that bleeds across the page. You set how badly the incoming sheet is damaged with the lever at the bottom, and then you switch the three repairs on one at a time. The scan visibly squares up, the severed trace gets bridged, and the stain fades out of the way. The ladder in the centre tracks the recovered-curve intersection-over-union as it climbs from the raw-scan floor, where nothing has been repaired, toward the clean-image ceiling, the real 0.51 our best runs reached on undamaged inputs.

The physical-damage recovery stage that runs before any segmentation. A field scan arrives rotated on the platen (skew), creased along an archival fold that severs a curve, and marked by a coffee ring or a shadow gradient that the binariser would otherwise read as ink. Drag the sheet-damage lever to set how badly the incoming sheet is degraded, then switch the three repair stages on one at a time: deskew rotation squares the sheet, fold-line repair bridges the severed run, and stain normalisation flattens the shadow so it stops becoming false ink. The left panel redraws the scan as each stage acts on it; the centre ladder shows the recovered-curve IoU climbing from the raw-scan floor toward the clean-image ceiling. The clean-image ceiling of 0.51 IoU is the engagement's real peak curve-mask score; the per-stage increments, the raw-scan floor, the severity scaling, and the scan geometry are an illustrative teaching model of which damage costs what, not measured per-stage scores.

Read the bench with the damage lever pushed high, the regime where these scans actually lived, and the shape of the stage is plain. With nothing repaired, the recovered curve sits well below the ceiling, because the segmenter is being asked to read a tilted page with a cut curve under a brown blot. Switch on deskew and the largest single jump happens, because straightening the page is what lets every later assumption hold. Add fold-repair and the severed run reconnects, closing the gap that no amount of clean-image accuracy could have papered over. Add stain normalisation and the last increment lands, as the false ink stops competing with the real curve. The bench will not reach the ceiling exactly, and that is deliberate: no repair is perfect, and a sheet that arrived ruined does not come back pristine. The point the climb makes is the one we needed the operator to believe, which is that the gap between a raw field scan and a usable one is closed in preprocessing, not in the model.

What thirty-year-old paper taught us about pipelines

The lesson we carried out of this stage was about where to look when a model underperforms on real inputs. For weeks we had been tuning the segmenter, the loss function, the synthetic data mix, treating a disappointing curve as evidence that the network needed to be better. It did not. The network was fine. It was receiving a curve that a fold had cut and a page that a scanner had tilted, and it was failing at exactly the task we had unwittingly set it: to read damage we should have repaired upstream. The fix was never in the architecture. It was in admitting that the box labelled "input image" held a physical object with a thirty-year history of being folded, spilled on, and badly photographed, and that the object had to be restored before it could be interpreted.

That reframing changed how we built the rest of VeerNet. We stopped drawing the pipeline as a model with some cleanup bolted on the front, and started drawing it as a restoration stage that happens to feed a model. Deskew, fold-repair, and stain normalisation are not preprocessing in the dismissive sense of the word, the unglamorous chores you rush through to get to the real work. On a real archive of damaged paper, they are the work that decides whether the real work ever gets a fair input, and we have never since trusted a curve score from a sheet we did not first make readable.

Limitations

The repairs in this stage restore readability; they do not restore information that the damage destroyed. Inpainting bridges a fold from its neighbourhood and is honest only over creases narrow enough that the surrounding ink genuinely constrains the fill; across a wide crease that obliterated a long stretch of a curve, no inpainting method recovers what is not there, and we deliberately routed those cases to a marked gap rather than a fabricated line. Skew estimation from oblique page structure degrades on sheets that are sparse, heavily stained, or warped non-rigidly, where there is no clean dominant angle to recover, and a non-planar fold introduces local distortion that a single global rotation cannot correct. Stain normalisation assumes the contamination is a slowly varying background that can be separated from high-frequency ink; a dark, sharp-edged stain that overlaps a curve at the curve's own spatial frequency cannot be divided out without taking the curve with it. The clean-image ceiling we measure against, an intersection-over-union of 0.51 with recall 0.97 and F1 0.55, is itself a bound from the segmentation regime of this engagement, so the recovery climb shown here is a path toward that ceiling and not toward perfect reconstruction. Finally, the per-stage increments, the raw-scan floor, and the scan geometry in the exhibit are an illustrative teaching model of which damage costs what and in what order it is best repaired; they communicate the structure of the stage rather than reporting measured per-stage scores.

References

[1] Postl, W. (1986). Detection of Linear Oblique Structures and Skew Scan in Digitized Documents. ICPR 1986. https://ieeexplore.ieee.org/document/PostL1986
[2] Telea, A. (2004). An Image Inpainting Technique Based on the Fast Marching Method. Journal of Graphics Tools, 9(1). https://doi.org/10.1080/10867651.2004.10487596
[3] Criminisi, A., Perez, P., and Toyama, K. (2004). Region Filling and Object Removal by Exemplar-Based Image Inpainting. IEEE Transactions on Image Processing, 13(9). https://doi.org/10.1109/TIP.2004.833105
[4] Sauvola, J., and Pietikainen, M. (2000). Adaptive Document Image Binarization. Pattern Recognition, 33(2). https://doi.org/10.1016/S0031-3203(99)00055-2
[5] Bako, S., Darabi, S., Shechtman, E., Wang, J., Sunkavalli, K., and Sen, P. (2016). Removing Shadows from Images of Documents. ACCV 2016. https://doi.org/10.1007/978-3-319-54187-7_12
[6] Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. (2015). Spatial Transformer Networks. NeurIPS 2015. https://arxiv.org/abs/1506.02025

