Document AI for Engineering Archives: Methods Across Scanned Charts, Maps, and Logs

Abstract

A scanned chart with a plotted line, a paper map with inked features over a base sheet, and a strip well log with two curves down a depth axis each have their own research community and vocabulary, and each community solved its problem without much reference to the other two. This survey reads the three as one problem. We credit the lineages each domain built, trace the four-stage recipe they share (segmentation, then centreline extraction, then calibration, then vectorisation), and argue that the shared spine, rather than any shared network, is what transfers between them. The finding to carry away is that only the first stage is really a learning problem; the three after it are geometry, and geometry is the same in every domain.

The oldest ancestor these domains share is not a network but a voting scheme. Hough patented a method for detecting lines by having each edge pixel vote for the parameters of every line that could pass through it [1], and Duda and Hart gave it the polar rho-theta parameterisation that made it practical, writing the first general recipe for turning scattered raster ink into geometric primitives [2]. A chart line, a map road, and a log curve are all thin marks a reader wants back as geometry rather than as pixels.
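The voting idea is compact enough to sketch. What follows is a minimal illustration in plain Python, not a reconstruction of either cited formulation in detail: every foreground pixel adds one vote to each (rho, theta) bin consistent with it, and peaks in the accumulator are candidate lines. The function name `hough_votes`, the one-degree angular step, and the unit-width rho bins are assumptions made for the example.

```python
import math
from collections import Counter

def hough_votes(points, n_theta=180, rho_step=1.0):
    """Accumulate votes in (rho, theta) space for a list of (x, y) ink pixels.

    Each pixel votes once per theta bin, for the line
    rho = x*cos(theta) + y*sin(theta) that passes through it.
    """
    acc = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta          # theta in [0, pi)
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(round(rho / rho_step), t)] += 1   # quantise rho into a bin
    return acc

# Hypothetical input: twenty ink pixels on the horizontal line y = 5.
points = [(x, 5) for x in range(20)]
acc = hough_votes(points)
print(acc[(5, 90)])  # 20: every pixel agrees on rho = 5 at theta = 90 degrees
```

For a short segment like this one, neighbouring angle bins can collect as many votes as the true bin, which is why practical implementations follow the vote with some form of peak selection.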

The route from a filled region to a line then ran through morphology: thinning algorithms eroded a foreground mask to a single-pixel skeleton that preserved its topology, catalogued in Lam, Lee, and Suen's survey [3]. Segmentation was rewritten in 2015, when Long, Shelhamer, and Darrell classified every pixel in one convolutional forward pass [4] and Ronneberger, Fischer, and Brox added the encoder-decoder with skip connections that became the default backbone for thin, sparse structures [5]. Both changed the first stage of every domain's pipeline at once, which is itself evidence that the domains share that stage. The chart and map communities specialised the rest without renaming it: Scatteract detects plot objects, reads tick labels with optical character recognition, and recovers data coordinates [6], while Chiang, Leyk, and Knoblock's cartographic survey runs the same stages under other names: separate ink from base sheet, thin to skeletons, georeference, emit vectors [7].

Method

This is a structured reading of published document-image work across three archive domains, not a new experiment. We fixed the unit of comparison as the pipeline stage rather than the network, because networks differ across domains for incidental reasons (image size, colour, foreground density) while the stages recur. For each domain we found where the raster enters and where structured data leaves, then decomposed the path between into the smallest stages common to every domain's published pipelines: segmentation (pixel to class), centreline extraction (mask to path), calibration (pixel to real units), and vectorisation (path to resampled geometry). Each has a named counterpart in the chart-extraction [6], cartographic [7], and segmentation [4] [5] literatures, and the spine traces to its Hough-transform and thinning roots [1] [2] [3]. We ground the crosswalk on one instance from our own archive, VeerNet: a 136,771-TIF raster set, a three-class output of background plus two curves, an encoder and decoder of five stages each, with two attention layers, and a curve resampled to 300 interpolated depth points. Those figures are sourced; the per-domain wording of each stage is a qualitative account of published practice and is captioned as illustrative.
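The four-stage decomposition can be written down as a composition of four functions, which makes the unit of comparison concrete. The sketch below is illustrative only: the name `run_spine`, the type aliases, and all four toy stages are hypothetical stand-ins and do not come from any of the surveyed pipelines.

```python
from typing import Callable, List, Sequence, Tuple

Raster = Sequence[Sequence[float]]   # grey level per pixel, row-major
Mask = List[List[int]]               # class id per pixel
Path = List[Tuple[float, float]]     # (row, col) pairs, pixel or real units

def run_spine(
    raster: Raster,
    segment: Callable[[Raster], Mask],   # pixel -> class (the learned stage)
    centreline: Callable[[Mask], Path],  # mask -> path
    calibrate: Callable[[Path], Path],   # pixel -> real units
    vectorise: Callable[[Path], Path],   # path -> resampled geometry
) -> Path:
    """Run the four stages in order; only `segment` would be trained."""
    return vectorise(calibrate(centreline(segment(raster))))

# Toy stand-ins, assumed for illustration only.
def toy_segment(raster):
    return [[1 if v < 0.5 else 0 for v in row] for row in raster]  # dark ink = 1

def toy_centreline(mask):
    path = []
    for r, row in enumerate(mask):
        cols = [c for c, label in enumerate(row) if label == 1]
        if cols:
            path.append((float(r), sum(cols) / len(cols)))  # mean ink column
    return path

def toy_calibrate(path):
    return [(100.0 + 0.5 * r, 2.0 * c) for r, c in path]   # made-up scales

def toy_vectorise(path):
    return path                                             # identity for the toy

raster = [[1.0, 0.0, 0.0, 1.0],
          [1.0, 1.0, 0.0, 1.0]]
result = run_spine(raster, toy_segment, toy_centreline, toy_calibrate, toy_vectorise)
print(result)  # [(100.0, 3.0), (100.5, 4.0)]
```

Swapping domains in this framing means replacing the `segment` argument and the calibration constants while the other callables stay put.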

The shared spine

The claim the survey turns on is that the four stages are the same operation in each domain, wearing three sets of clothes. Segmentation asks everywhere which pixels are the mark and which are the background or the base sheet. Centreline extraction is the same wherever it appears, because a filled band, a plotted series, an inked road, or a logged curve has to collapse to a single path before it becomes geometry [3]. Calibration reads the frame (tick labels on a chart, a projection on a map, a depth scale on a log) and converts pixel coordinates into real units [6] [7]. Vectorisation resamples the calibrated path onto a regular grid. Naming the stages once and showing that each domain fills all four is the whole argument.
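For a linear axis, the calibration stage reduces to fitting a straight line through the located tick labels. The following is a minimal sketch under that assumption; a logarithmic scale or a map projection would need a different transform. The function name `fit_axis` and the tick positions and labels are hypothetical.

```python
def fit_axis(pixels, values):
    """Least-squares fit of value = scale * pixel + offset from tick labels."""
    n = len(pixels)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (pixel, value) tick pairs")
    mean_p = sum(pixels) / n
    mean_v = sum(values) / n
    var_p = sum((p - mean_p) ** 2 for p in pixels)
    if var_p == 0:
        raise ValueError("ticks must sit at distinct pixel positions")
    cov = sum((p - mean_p) * (v - mean_v) for p, v in zip(pixels, values))
    scale = cov / var_p
    offset = mean_v - scale * mean_p
    return scale, offset

# Hypothetical depth ticks: pixel rows 100, 600, 1100 labelled 1000, 1050, 1100.
scale, offset = fit_axis([100, 600, 1100], [1000.0, 1050.0, 1100.0])

def depth_at(row):
    """Convert a pixel row to depth units with the fitted transform."""
    return scale * row + offset

print(depth_at(350))  # approximately 1025.0
```

Fitting over all located ticks rather than just two lets a single misread label show up as a residual instead of silently skewing the scale.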

A crosswalk arguing that scanned charts, cartographic maps, and strip well logs are three faces of one raster-to-structure problem. The three rows are the archive domains; the four columns are the shared method stages every one of them runs, segmentation, centreline extraction, calibration, and vectorisation. Selecting a stage lights that single column in orange across all three rows, the visual assertion that the same step recurs in each domain under a different name. The right-hand inset grounds the recipe on the sourced well-log instance: a 136,771-TIF raster archive resolved to a 3-class output, background plus two curves, resampled to 300 interpolated depth points, where an encoder of 5 stages and a decoder of 5 stages with 2 attention layers realise the segmentation stage and the later stages are pure geometry. The output classes, depth-point count, encoder and decoder stage counts, attention-layer count, and archive scale are sourced from the engagement; the per-domain wording of each stage is an illustrative crosswalk of published practice, and the inset curve shape is decorative rather than a measured trace.

The exhibit lays the three domains against the four stages and lets a reader light one stage at a time; its column brightens across all three rows at once, the visual form of the claim that the same step recurs in every domain under a different name. The per-domain cell wording is the illustrative crosswalk, and only the grounded well-log figures on the right are sourced.

Only the first stage is learning

Reading the spine this way exposes a distinction that is easy to lose when a whole pipeline is called a model. Only segmentation is genuinely a learning problem, and it is where the domains differ most, because a chart's clean rendering and a field log's stained scan pose very different difficulties. The three stages after it are geometry: thinning a mask to a centreline is a deterministic operation on a binary image [3], reading a depth scale or tick labels into a coordinate transform is arithmetic once the labels are located [6], and resampling a path onto a fixed grid is interpolation. None requires training, and a centreline extractor written for a map skeleton will thin a log curve without modification.
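The last of those geometry stages is small enough to show in full. This sketch linearly interpolates a calibrated path onto an evenly spaced depth grid; it assumes strictly increasing depths, and the function name `resample` is hypothetical. The default of 300 points mirrors the depth-point count quoted for the well-log instance.

```python
def resample(depths, values, n_points=300):
    """Linearly interpolate (depth, value) samples onto n_points even depths."""
    if len(depths) != len(values) or len(depths) < 2:
        raise ValueError("need at least two matching samples")
    if n_points < 2:
        raise ValueError("need at least two output points")
    if any(b <= a for a, b in zip(depths, depths[1:])):
        raise ValueError("depths must be strictly increasing")
    lo, hi = depths[0], depths[-1]
    step = (hi - lo) / (n_points - 1)
    grid, out = [], []
    j = 0  # index of the input segment [depths[j], depths[j + 1]] in use
    for i in range(n_points):
        d = hi if i == n_points - 1 else lo + i * step  # pin the last point
        while j < len(depths) - 2 and depths[j + 1] < d:
            j += 1
        t = (d - depths[j]) / (depths[j + 1] - depths[j])
        grid.append(d)
        out.append(values[j] + t * (values[j + 1] - values[j]))
    return grid, out

grid, out = resample([0.0, 10.0, 20.0], [0.0, 100.0, 0.0], n_points=5)
print(out)  # [0.0, 50.0, 100.0, 50.0, 0.0]
```

Nothing here depends on whether the path came from a chart, a map, or a log, and that independence is what the argument relies on.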

That is why the recipe transfers even though the networks do not. A team that has solved chart extraction cannot hand a well-log team a trained model, because segmentation was learned on the wrong pixels. What it hands over is the spine: the knowledge that the remaining work is centreline extraction, calibration, and vectorisation, three stages that lift almost verbatim. In our grounded instance the encoder and decoder of five stages each, with two attention layers, exist entirely to serve segmentation; everything downstream of the three-class mask, the collapse to a per-depth position and the resampling to 300 depth points, is the same geometry a cartographer would run.
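The collapse from a three-class mask to a per-depth position can be sketched as a row-wise centroid. This is an illustration under stated assumptions rather than the method used in the grounded instance: the class ids (0 for background, 1 and 2 for the two curves) are assumed, each curve is taken to be single-valued at every depth row, and the names `curve_positions` and `fill_gaps` are hypothetical.

```python
def curve_positions(mask, curve_class):
    """Mean column of `curve_class` pixels in each row, or None if absent."""
    positions = []
    for row in mask:
        cols = [c for c, label in enumerate(row) if label == curve_class]
        positions.append(sum(cols) / len(cols) if cols else None)
    return positions

def fill_gaps(positions):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    filled = list(positions)
    known = [i for i, p in enumerate(filled) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = filled[a] + t * (filled[b] - filled[a])
    return filled

# Toy mask: rows are depth samples, columns run across the log strip.
mask = [
    [0, 1, 1, 0, 0, 2],
    [0, 0, 0, 0, 2, 0],   # curve 1 is missing in this row
    [0, 0, 0, 1, 0, 2],
]
curve_a = fill_gaps(curve_positions(mask, 1))
curve_b = fill_gaps(curve_positions(mask, 2))
print(curve_a)  # [1.5, 2.25, 3.0]
print(curve_b)  # [5.0, 4.0, 5.0]
```

A curve that wraps across the track or doubles back on itself breaks the single-valued assumption and would call for a skeleton-based centreline instead.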

Where the domains genuinely diverge

The crosswalk should not flatten real differences, and two are worth marking. A chart figure and a map tile are bounded, but a well log runs to many thousands of pixels down its depth axis, forcing tiling and memory choices in segmentation that the chart and map communities rarely face. And a map sheet can be dense with ink, whereas a well-log curve is a thin thread across a mostly empty strip, so the class imbalance is far more severe for logs and the losses that work for a busy map do not carry over. Both divergences live inside the first stage, consistent with the survey's central reading: the learning stage is where domains differ, and the geometry stages are where they agree.
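Both divergences have simple first-order remedies, sketched below as assumptions rather than as what any surveyed pipeline does: overlapping tiles along the depth axis for the length problem, and inverse-frequency class weights for the imbalance. The function names, tile size, and overlap are hypothetical, and inverse-frequency weighting is only one of several ways to rebalance a loss.

```python
from collections import Counter

def tile_spans(length, tile, overlap):
    """(start, stop) row spans covering [0, length) with overlapping tiles."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [(0, length)]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)          # final tile sits flush with the end
    return [(s, s + tile) for s in starts]

def inverse_frequency_weights(mask, n_classes=3):
    """Per-class weight total / (n_classes * count); absent classes get 0.0."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return [
        total / (n_classes * counts[k]) if counts[k] else 0.0
        for k in range(n_classes)
    ]

spans = tile_spans(length=1000, tile=400, overlap=100)
print(spans)  # [(0, 400), (300, 700), (600, 1000)]

# Toy mask: eight background pixels, one pixel for each of the two curves.
mask = [[0, 0, 0, 0, 1],
        [0, 0, 0, 2, 0]]
weights = inverse_frequency_weights(mask)
print(weights[1] > weights[0])  # True: the rare curve classes weigh more
```

The overlap gives a curve that crosses a tile boundary some context on both sides, at the cost of segmenting the overlapping rows twice.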

Discussion

Reading these archives as one problem tells a team where to spend and where to borrow. Spend on segmentation, the learned, domain-specific, data-hungry stage where a scanned log's stains, skew, and scarcity make the problem genuinely yours. Borrow the rest, because centreline extraction, calibration, and vectorisation are geometry the chart and map literatures worked out long ago and that transfer with little change [3] [6] [7]. The Hough-transform lineage is worth remembering not as a method to deploy but as proof that this was always one problem, split into communities for historical reasons rather than because the task differs [1] [2]. For an operator sitting on a raster archive, adjacent-domain solutions are more reusable than they look, provided the reuse is scoped to the geometry stages and segmentation is trained on the operator's own scans.

Limitations

This is a survey and inherits a survey's limits. It synthesises published document-image work across three domains and does not re-implement or benchmark any method it discusses. The four-stage decomposition is our reading, chosen because it is the coarsest split under which every domain fills every stage; a finer one would expose real per-domain steps (colour separation on maps, legend parsing on charts) that the coarse spine folds away. The only sourced quantities are the well-log figures: the 136,771-TIF archive scale, the three output classes, the five encoder and five decoder stages, the two attention layers, and the 300 interpolated depth points. The per-domain stage wording is a qualitative account of published practice, flagged as illustrative on the canvas and here. We make no claim that a model trained in one domain transfers to another; the claim is narrower, that the geometry stages transfer while the learned segmentation stage does not. Take this as a map of where adjacent archive domains agree, not as a substitute for training a segmenter on your own scans.

References

[1] Hough, P. V. C. Method and Means for Recognizing Complex Patterns. U.S. Patent 3,069,654 (1962). The parameter-space voting scheme for detecting lines in images that opens the raster-to-structure lineage all three domains inherit. https://patents.google.com/patent/US3069654A/en

[2] Duda, R. O., and Hart, P. E. Use of the Hough Transformation to Detect Lines and Curves in Pictures. Communications of the ACM, 15(1), 11-15 (1972). The rho-theta polar parameterisation that made line and curve extraction practical. https://dl.acm.org/doi/10.1145/361237.361242

[3] Lam, L., Lee, S.-W., and Suen, C. Y. Thinning Methodologies: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9), 869-885 (1992). The reference survey of skeletonisation, the classical route from a filled mask to a single-pixel centreline. https://doi.org/10.1109/34.161346

[4] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015). The pixel-to-pixel classification model that replaced hand-built detectors as the segmentation stage across document-image tasks. https://arxiv.org/abs/1411.4038

[5] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015). The encoder-decoder with skip connections that became the default segmentation backbone for thin-structure imagery. https://arxiv.org/abs/1505.04597

[6] Cliche, M., Rosenberg, D., Madeka, D., and Yee, C. Scatteract: Automated Extraction of Data from Scatter Plots. ECML PKDD (2017). A chart-image pipeline that detects plot objects, reads tick labels with optical character recognition, and recovers data coordinates, the chart-domain instance of the same recipe. https://arxiv.org/abs/1704.06687

[7] Chiang, Y.-Y., Leyk, S., and Knoblock, C. A. A Survey of Digital Map Processing Techniques. ACM Computing Surveys, 47(1), 1-44 (2014). A survey of extracting geographic features from scanned maps, the cartographic instance of segmentation to vectorisation. https://doi.org/10.1145/2557423
