The Geoscience-to-CSV Pipeline as a Reusable Pattern

We built a system to read scanned well logs, and somewhere in the middle we stopped believing it was about well logs. The tell was how little of the code knew what a well log is. The classifier that separated a curve trace from the paper had never heard of a gamma-ray sonde. The routine that thinned a prediction to a single line did not know depth from temperature. The step that wrote the table out did not care whether the x-axis was depth or wavelength or years. What we had was not a petroleum tool with some vision inside it. It was a general image-to-tabular flow with a petroleum costume on, and the costume came off easily. This note reads that flow as what it is: a reusable four-stage pattern for turning a picture of a chart into the numbers behind the chart.

Being precise about the claim matters, because it is easy to confuse with the adjacent work. This is not a re-tour of the VeerNet architecture, the encoder-decoder with a transformer bottleneck we published for this task [5], nor of the serving economics of running that model, nor of the synthetic-data generator we built to feed it. Those all sit on top of a shape, and the shape is the subject here: four stages. A grayscale raster goes in, a per-pixel classifier assigns every pixel to a class, a centreline extraction thins each class region to a one-pixel line, and a resample writes that line out as a fixed number of table rows. Each stage is a stock computer-vision operation with its own established literature, and the argument is that wiring those four together is a template you can carry to any chart-digitisation problem, not a one-off we assembled for logs.

The four stages, stripped of petroleum

Start at the input, where the generality is most obvious. The model takes a single grayscale channel. Not three colour channels, not a stack of geophysical inputs: one channel of intensity. A scanned log strip, a scanned exchange-rate chart, and a scanned ECG trace are the same object at this stage: a 2D array of brightness where dark pixels are ink and light pixels are paper. Nothing about a single input channel is petroleum-specific, and choosing it meant treating the problem as pure image, which is what let everything downstream stay domain-blind.

Stage two is the per-pixel classifier, the stage with the most pedigree. Assigning a class to every pixel is semantic segmentation, the operation Long, Shelhamer, and Darrell framed as a general dense-prediction task rather than a domain-specific trick [1], and the encoder-decoder we run is the U-Net template Ronneberger and colleagues published for cell microscopy and the field then reused nearly everywhere [2]. Our classifier emits 3 output classes: background, the first curve, the second curve. The count is a property of the chart, not of geology: a single-trace chart is 2 classes, a five-trace chart is 6, and the machinery is identical. What the stage produces is a set of masks, one filled region per trace, and a mask is not yet a number. It is a blob of pixels shaped like a line.
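The class-count arithmetic and the hand-off from a per-pixel prediction to one mask per trace can be sketched in a few lines. This is a minimal illustration in plain Python, not the VeerNet code: the function names, the nested-list layout, and the convention that class 0 is background are all assumptions made for the example, and a real implementation would use an array library.

```python
from typing import Dict, List

LabelMap = List[List[int]]   # per-pixel class ids, row-major
Mask = List[List[bool]]      # True where a pixel belongs to one trace


def num_classes(num_traces: int) -> int:
    """Background plus one class per trace on the chart."""
    return num_traces + 1


def argmax_labels(scores: List[List[List[float]]]) -> LabelMap:
    """scores[row][col] holds one score per class; keep the winning class id."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]


def split_masks(labels: LabelMap, n_classes: int) -> Dict[int, Mask]:
    """One boolean mask per trace class; class 0 is assumed to be background."""
    return {c: [[v == c for v in row] for row in labels] for c in range(1, n_classes)}


# Hypothetical 2x2 score grid for a two-trace chart (3 classes).
scores = [
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]
labels = argmax_labels(scores)
masks = split_masks(labels, num_classes(2))
```

Changing the chart changes only the argument to `num_classes`; the splitting logic does not know what the traces represent.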

Stage three turns the blob back into a line. A predicted mask is several pixels thick, and a table needs a single value per position, so the region has to be reduced to a one-pixel centreline without breaking into fragments. That is thinning, and the algorithm underneath it is Zhang and Suen's parallel skeletonization from 1984 [3], old enough to be furniture and general enough to have no idea what it is thinning. Feed it a curve from a well log or a boundary from a floor plan and it does the same thing: peel pixels off the edges until one connected line remains. This stage exposes the pattern most cleanly, because it is pure geometry with zero domain content.
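To make the peeling concrete, here is a compact sketch of the two-subiteration thinning rule from Zhang and Suen [3], written in plain Python over a 0/1 grid. It is an illustration of the published rule, not the code we ran in production; the helper names are made up for the example, and in practice one would reach for an optimised library routine.

```python
from typing import List

Grid = List[List[int]]  # 1 = trace pixel, 0 = background


def _neighbours(img: Grid, r: int, c: int) -> List[int]:
    """P2..P9: the eight neighbours, clockwise, starting directly above."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def _transitions(n: List[int]) -> int:
    """Count 0 -> 1 steps around the circular sequence P2, ..., P9, P2."""
    return sum(1 for a, b in zip(n, n[1:] + n[:1]) if a == 0 and b == 1)


def zhang_suen(mask: Grid) -> Grid:
    """Thin a binary mask; the input is copied, not modified."""
    h, w = len(mask), len(mask[0])
    # Pad with a zero border so every pixel has eight neighbours.
    img = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in mask] + [[0] * (w + 2)]
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, h + 1):
                for c in range(1, w + 1):
                    if img[r][c] != 1:
                        continue
                    n = _neighbours(img, r, c)
                    p2, _, p4, _, p6, _, p8, _ = n
                    if not 2 <= sum(n) <= 6 or _transitions(n) != 1:
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Deletions are applied together after the scan (the "parallel" part).
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return [row[1:-1] for row in img[1:-1]]


# A hypothetical three-pixel-thick horizontal stroke.
bar = [[1] * 7 for _ in range(3)]
skeleton = zhang_suen(bar)
```

Nothing in the rule refers to what the stroke represents. Note that thinning also nibbles the ends of a stroke, so the skeleton of a thick bar is slightly shorter than the bar itself.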

Stage four is the resample, the step that makes the output a table instead of a picture. The centreline is a sequence of pixel coordinates at whatever spacing the image gave; a consumer wants values on a regular grid. So we interpolate the line onto a fixed axis, in our case 300 resampled depth points per curve in the validation notebooks, and write one row per point. Swap depth for the horizontal axis of any other chart and the operation is unchanged: read the calibrated axis, interpolate the centreline onto a regular sampling of it, emit rows. The word depth is the only petroleum in the stage, and it is a label, not logic.
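A minimal sketch of that resample, assuming the centreline has already been converted from pixel coordinates to calibrated axis values and is sorted along the axis. The function names, the clamping at the ends, and the column headers are illustrative choices for this example, not details taken from the validation notebooks; only the default of 300 points comes from them.

```python
import csv
import io
from bisect import bisect_right
from typing import List, Sequence, Tuple


def interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linear interpolation at x; xs must be strictly increasing. Clamps at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def resample(axis_vals: Sequence[float], curve_vals: Sequence[float],
             n_points: int = 300) -> List[Tuple[float, float]]:
    """Interpolate an irregular centreline onto n_points evenly spaced axis values."""
    lo, hi = axis_vals[0], axis_vals[-1]
    step = (hi - lo) / (n_points - 1)  # assumes n_points >= 2
    grid = [lo + k * step for k in range(n_points)]
    return [(g, interp(g, axis_vals, curve_vals)) for g in grid]


def to_csv(rows: Sequence[Tuple[float, float]], axis_name: str = "depth",
           value_name: str = "value") -> str:
    """Write one row per resampled point; the axis name is only a header label."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([axis_name, value_name])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical centreline sampled at uneven axis positions.
rows = resample([0.0, 1.0, 2.5, 4.0], [0.0, 2.0, 5.0, 8.0], n_points=5)
table = to_csv(rows)
```

Passing `axis_name="wavelength"` or `axis_name="year"` changes the header and nothing else, which is the point of the stage.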

Why the seams are the reusable part

The stages are stock, but stock parts do not make a reusable system on their own. What makes this a pattern rather than four separate tricks is that the interfaces between the stages are clean and domain-free. Stage one hands stage two a grayscale array. Stage two hands stage three a set of masks. Stage three hands stage four a set of centrelines. Stage four hands the caller a table. No interface carries a well name, a curve mnemonic, or a unit; each is a plain image-or-geometry object. That is what lets you re-skin the whole flow by changing labels and class counts while leaving the pipeline structure, the channel count, and the resample step exactly where they are.
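One way to see how little the seams carry is to write them down as types. The sketch below is a hypothetical skeleton, not our codebase: the type aliases, `ChartConfig`, `Pipeline`, and the two example configurations are names invented for illustration. The stages are passed in as plain callables, so the only chart-specific inputs are a trace count and an axis label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Raster = List[List[float]]                          # stage 1 -> 2: one grayscale channel
Masks = Dict[int, List[List[bool]]]                 # stage 2 -> 3: one mask per trace class
Centrelines = Dict[int, List[Tuple[float, float]]]  # stage 3 -> 4: points along each trace
Table = List[Dict[str, float]]                      # stage 4 -> caller: rows


@dataclass(frozen=True)
class ChartConfig:
    """Everything that changes between chart types in this sketch."""
    num_traces: int
    axis_name: str
    n_points: int = 300

    @property
    def num_classes(self) -> int:
        return self.num_traces + 1  # background plus one class per trace


@dataclass(frozen=True)
class Pipeline:
    """Four stages wired through domain-free interfaces."""
    classify: Callable[[Raster, int], Masks]
    thin: Callable[[Masks], Centrelines]
    resample: Callable[[Centrelines, ChartConfig], Table]

    def run(self, raster: Raster, cfg: ChartConfig) -> Table:
        masks = self.classify(raster, cfg.num_classes)
        lines = self.thin(masks)
        return self.resample(lines, cfg)


# Re-skinning is a configuration change, not a rewrite.
WELL_LOG = ChartConfig(num_traces=2, axis_name="depth")
FX_CHART = ChartConfig(num_traces=1, axis_name="date")
```

No type in the chain mentions a well name, a mnemonic, or a unit, so swapping `WELL_LOG` for `FX_CHART` leaves `Pipeline.run` untouched.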

The chart-extraction literature had already pointed at this. Cliche and colleagues' Scatteract treated a rendered scatter plot as a recoverable coordinate system and showed that reading data back out of a plot is a learnable task, not a bespoke parser per chart type [4]. Our contribution is the specific four-stage decomposition that carries a continuous ruled line chart, not a scatter of marks, from raster to table, and the observation that each seam is domain-agnostic by construction rather than by luck. When the interfaces are clean, porting the flow to a new chart type is a configuration change, how many classes and what the axis means, rather than a rewrite.

Figure: The image-to-tabular flow that turned raster well logs into CSV, drawn as four fixed stages: a grayscale raster in on 1 input channel, per-pixel classification into 3 classes (background plus 2 curves), centreline extraction that thins each mask to a one-pixel trace, and a tabular resample onto 300 depth-indexed points. The toggle re-skins the same four stages for any ruled line chart without touching the numeric spine, which is the argument: the stages, the channel count, the class count, and the resample step are structural, not petroleum-specific. The only orange element is the output-quality readout at the tail, peak R-squared 0.9891 and mean CSV MAE 0.0277 on curve1 under Tversky loss, the proof that the reusable flow lands a usable table rather than a plausible mask. The stage figures, the output metrics, and the 136,771 source TIFs are sourced from the engagement archive; the generic-chart re-skin is a template view, not a second measured run.

The abstraction has to survive contact with a metric

A pattern that produces plausible masks but not usable numbers is not worth reusing, so the honest test of the abstraction is whether the tabular output is any good. Here the sourced numbers do the arguing. On the multiclass runs under Tversky loss, the best coefficient of determination between the digitised curve and the reference was 0.9891, and the mean absolute error of the resampled CSV on the first curve was 0.0277. Those are not mask-quality numbers; they are measured against the table, at the far end of all four stages, after the classifier and the thinning and the resample have each had a chance to introduce error. The pattern is judged where it delivers, which is the CSV, not the pixels.
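For readers who want the two numbers pinned down, the sketch below computes them in their textbook form on a pair of resampled columns. It uses the standard definitions of mean absolute error and the coefficient of determination; how the archive normalised the curve values before scoring is not stated here, so the function names and the toy inputs are purely illustrative.

```python
from typing import Sequence


def mae(reference: Sequence[float], digitised: Sequence[float]) -> float:
    """Mean absolute error between two equal-length columns."""
    return sum(abs(r - d) for r, d in zip(reference, digitised)) / len(reference)


def r_squared(reference: Sequence[float], digitised: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - d) ** 2 for r, d in zip(reference, digitised))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot  # undefined for a constant reference column


# Toy columns, not archive data.
reference = [0.0, 1.0, 2.0, 3.0]
digitised = [0.0, 1.0, 2.0, 4.0]
print(mae(reference, digitised))        # 0.25
print(r_squared(reference, digitised))  # 0.8
```

Both functions take the table columns as input, never the mask, which is what it means to score the far end of the pipeline.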

That framing also explains why intersection-over-union on the raw masks, the natural segmentation metric, is the wrong headline. A mask can be fat or ragged and still thin to a centreline that lands the right value, so mask overlap systematically understates how good the final table is. Any team porting the flow should carry that discipline: evaluate the CSV, not the intermediate blob, because the blob is not what anyone downstream consumes.
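A toy case shows the gap between the two views. Below, a predicted mask five pixels thick is compared with a one-pixel reference trace: the overlap score is poor, yet the centre of each column lands on exactly the same row. The per-column mean used here is a deliberately simple stand-in for the thinning stage, chosen to keep the example short, and the masks are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # (row, col) coordinates of mask pixels


def iou(a: Pixels, b: Pixels) -> float:
    """Intersection over union of two pixel sets."""
    return len(a & b) / len(a | b)


def column_centres(mask: Pixels) -> Dict[int, float]:
    """Mean row per column: a simplified stand-in for centreline extraction."""
    rows_by_col: Dict[int, List[int]] = {}
    for r, c in mask:
        rows_by_col.setdefault(c, []).append(r)
    return {c: sum(rs) / len(rs) for c, rs in sorted(rows_by_col.items())}


# Reference: a one-pixel trace on row 5. Prediction: a fat band, rows 3 to 7.
reference = {(5, c) for c in range(10)}
predicted = {(r, c) for r in range(3, 8) for c in range(10)}

overlap = iou(reference, predicted)  # 10 shared pixels out of 50
ref_line = column_centres(reference)
pred_line = column_centres(predicted)
line_error = sum(abs(ref_line[c] - pred_line[c]) for c in ref_line) / len(ref_line)
```

Here `overlap` is 0.2 while `line_error` is 0.0: the mask metric reports a bad prediction and the table would be exact.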

What the pattern was built against

The reason we trust the abstraction is the scale it was pressure-tested at. The engagement's raster corpus held 136,771 source TIFs, a spread of scan qualities, vintages, and layouts wide enough that a flow which only worked on clean inputs would have fallen over early. It did not, and its failures were the generic ones the pattern predicts rather than anything petroleum-flavoured: touching traces that the classifier merged, faint curves that the segmentation missed against a busy background, and thinning artefacts where two lines crossed. Every one is a chart-digitisation failure mode, not a well-log one. The pattern's weak points are as domain-agnostic as its strengths.

The practical payoff of naming the pattern is reuse without relearning. Once you see the flow as grayscale-in, classify-pixels, thin-to-centreline, resample-to-table, a new chart-digitisation ask is not a new research project. It is the same four stages with a different class count and axis meaning, evaluated against the same output contract, inheriting the same known failure modes to test for first. That is what a reusable pattern buys: not that the code transfers unchanged, but that the shape of the solution, the questions to ask, and the metric to trust all transfer, so the next chart starts most of the way home instead of at zero.

Limitations

This is a claim about shape, and it should be read within its limits. The metrics are real archive numbers; the input channel, class count, and 300-point resample are the sourced configuration; and the 136,771 TIFs are the measured corpus. But the generic-chart reading in the instrument is a template view, not a second measured run: we have digitised well logs at scale, not exchange-rate charts at scale, and asserting that the flow ports is an argument from the domain-agnosticism of each stage, not a benchmark on a second domain. The four stages also assume a chart that is a set of ruled continuous lines on a calibrated pair of axes; a pie chart, a stacked area, or a plot where traces routinely occlude one another breaks the one-region-per-trace assumption stage two leans on. The peak R-squared of 0.9891 is a best case on selected examples under one loss, not a corpus-wide expectation, and the mean MAE of 0.0277 is for the first curve specifically; a second, harder curve scores worse, as the archive records elsewhere. Finally, this note stops at the tabular output and says nothing about whether the axis calibration feeding the resample was correct, a separate error source that a faithful centreline cannot fix on its own.

The costume, not the body

The habit this left us with is to distrust the domain label on a vision problem until we have checked how much of the solution actually depends on it. For raster log digitisation the answer was almost none: four stock stages with clean seams, a costume of petroleum vocabulary over a body of general computer vision. Naming that body is not a modesty exercise about how little we invented. It is the useful part, because a pattern you can name is a pattern you can reuse, and the next chart that needs reading is not a new problem so much as the same four stages waiting for a different set of labels.

References

[1] Long, J., Shelhamer, E., and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015). The framing of dense per-pixel classification as a general operation independent of image domain. https://openaccess.thecvf.com/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html

[2] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015, Lecture Notes in Computer Science 9351, Springer, pp. 234-241. The encoder-decoder template the stage-two classifier follows, published for cell images and reused far beyond them. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28

[3] Zhang, T. Y., and Suen, C. Y. A Fast Parallel Algorithm for Thinning Digital Patterns. Communications of the ACM 27(3), 1984, pp. 236-239. The classical thinning algorithm behind centreline extraction. https://dl.acm.org/doi/10.1145/357994.358023

[4] Cliche, M., Rosenberg, D., Madeka, D., and Yee, C. Scatteract: Automated Extraction of Data from Scatter Plots. ECML PKDD 2017, Lecture Notes in Computer Science 10534, Springer. The prior that reading data back out of a rendered plot is a learnable, general task. https://arxiv.org/abs/1704.06687

[5] Maiti, T., et al. VeerNet: A Neural Network for Detection and Digitization of Raster Well Log Curves. MDPI Journal of Imaging 9(7), 136, 2023. The architecture this pattern was instantiated as, and the source of the sourced ablation metrics. https://www.mdpi.com/2313-433X/9/7/136
