Abstract
A digitised well-log curve has two coordinates, a value and a depth, and the depth is the one the scanned image does not store in its pixels. The pixels record only row positions; the mapping from a row to a true measured depth lives in the header scale, where two printed depth ticks anchor the vertical axis. This paper isolates that mapping as its own evaluation problem, deliberately separate from the curve interpolation that consumes it. We treat calibration as a two-point affine fit, a depth-per-pixel slope and an offset recovered from the upper and lower header anchors, and we ask a single question of it: when an anchor is read a few pixels off true, how does that error reach the final CSV? We derive the propagation analytically, show that it is not a constant offset but a slope tilt whose residual is zero at the top anchor and grows linearly to the deepest sample, and validate the model against the 300 interpolated depth points the engagement scores each curve on. We then compare the residual a header-read slip injects against the model's own reported error floors of 0.0132 mean absolute error and 0.0004 mean squared error, so the two error sources are read on the same axis. The finding is that depth calibration is a silent multiplier on curve accuracy: a small anchor mis-read can place more error at the bottom of a log than the segmentation model contributes anywhere, and it does so precisely at the depths an interpreter cares about most. The instrument that argues this is ours; the chart-extraction and segmentation methods it builds on are credited to their authors.
Where this sits next to the public literature
Reading numbers back out of a plotted image is a studied problem, and the part of it this paper isolates, recovering the axis itself rather than the trace, has a clear lineage we want to credit before describing our own work. The earliest fully automated pipeline in that lineage we lean on is Scatteract, which made the coordinate transform explicit: detect the axis ticks, read their labels with optical character recognition, and fit the pixel-to-data mapping with a regression robust to the OCR mistakes [1]. That is exactly the shape of depth calibration. The header depth ticks are the labels, the detected tick rows are the positions, and the fit is the affine map we evaluate here. Scatteract's lesson, that the accuracy of the whole extraction is won or lost in the coordinate transform rather than in the detection, is the premise this study takes literally and tests.
The modern recipe sharpens the division of labour. ChartOCR detects structural key points with a deep network and then hands the recovered positions to explicit geometry that turns them into values [2], and a depth-indexed log reader is built the same way: a segmentation network finds where the curve and the header ticks are, and a deterministic affine fit converts those rows into depth. LineEX brings the case closest to a log by detecting data points and axis ticks jointly on continuous line charts with a transformer key-point head [3], which is the public work whose problem shape, a trace plus an axis to calibrate it against, most resembles ours. The ICPR 2020 CHART competition is the release that legitimises treating the axis on its own: among its named sub-tasks it lists axis analysis as separate from data extraction [4], which is the precedent for this paper studying depth calibration in isolation rather than folding it into the trace recovery. None of these public efforts targets a borehole log, and we claim none of them; the curve rows we calibrate come from VeerNet, our segmentation model in the U-Net encoder-decoder lineage [5], trained on a synthetic corpus we built.
Method
We model depth calibration as the two-point affine map a petrophysicist would draw by hand. Let the upper header anchor sit at image row r_top with printed depth d_top, and the lower anchor at row r_bottom with printed depth d_bottom. Any detected curve row r is mapped to a measured depth by
$$ d(r) = d_{\mathrm{top}} + \frac{d_{\mathrm{bottom}} - d_{\mathrm{top}}}{r_{\mathrm{bottom}} - r_{\mathrm{top}}} \, (r - r_{\mathrm{top}}). $$
The middle factor is the depth-per-pixel slope, the engine of the whole conversion. Its denominator is the pixel span between the two anchors, which on a correctly detected strip equals the image height. In the engagement the synthetic logs were rendered across a height band of 480 to 640 pixels and a width band of 3,200 to 12,800 pixels, so the same physical depth interval is carried by a different number of rows on every log, and the slope has to be re-fitted per image rather than assumed.
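A minimal Python sketch of the two-point fit follows. The names `DepthMap`, `fit_depth_map` and `row_to_depth`, and the anchor depths in the usage lines, are illustrative assumptions and not the engagement's code. It shows why the slope has to be fitted per image: the same depth interval on a 480-row and a 640-row strip gives two different depth-per-pixel values.

```python
from typing import NamedTuple


class DepthMap(NamedTuple):
    """Affine row-to-depth map: depth = offset + slope * row."""
    slope: float   # depth units per pixel row
    offset: float  # depth at image row 0


def fit_depth_map(r_top: float, d_top: float,
                  r_bottom: float, d_bottom: float) -> DepthMap:
    """Fit the affine map through the upper and lower header anchors."""
    span = r_bottom - r_top
    if span == 0:
        raise ValueError("anchors must sit on different rows")
    slope = (d_bottom - d_top) / span
    return DepthMap(slope=slope, offset=d_top - slope * r_top)


def row_to_depth(m: DepthMap, row: float) -> float:
    """Map a detected curve row to a measured depth."""
    return m.offset + m.slope * row


# Hypothetical anchors: the same 1000-to-1500 interval on two image heights.
short_log = fit_depth_map(0, 1000.0, 480, 1500.0)
tall_log = fit_depth_map(0, 1000.0, 640, 1500.0)
```

Here `short_log.slope` is larger than `tall_log.slope`, so a slope carried over from one image to the other would misplace every row below the top anchor.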
To study error propagation we perturb a single anchor. We hold the top anchor exact, which is the realistic case because the top of scale is usually the cleanest tick, and we displace the lower anchor by a mis-read of e pixels, the kind of slip a thick printed tick mark or a slightly skewed scan produces. The fit then divides the true depth interval by an apparent span that is e pixels short, so the fitted slope is steeper than the true one. The residual depth error at a curve row that truly sits a fraction t of the way down the log, with t running from 0 at the top anchor to 1 at the bottom, works out to
$$ \Delta d(t) = (d_{\mathrm{bottom}} - d_{\mathrm{top}}) \, t \, \frac{e}{(r_{\mathrm{bottom}} - r_{\mathrm{top}}) - e}, $$
which has two properties worth stating plainly. It is linear in t, so the error is zero at the top anchor and largest at the deepest sample, and it is linear in the mis-read e for small e, so the propagation has no threshold below which a slip is free. This is the analytical core of the paper, and everything the instrument shows is this expression evaluated live.
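The propagation can be checked numerically. The sketch below assumes the setup described above: the top anchor is exact and the lower anchor is read so that the apparent span is e pixels short. Under that assumption the residual at fraction t is the depth interval times t times e / (span - e). The function names are hypothetical, and the depth interval defaults to 1 to match a normalised depth band.

```python
def residual_closed_form(t, e, span_px, depth_interval=1.0):
    """Depth error at fraction t down the log for a lower-anchor mis-read of e pixels."""
    return depth_interval * t * e / (span_px - e)


def residual_by_refit(t, e, span_px, depth_interval=1.0):
    """The same quantity, obtained by refitting the slope on the mis-read anchor."""
    true_slope = depth_interval / span_px
    fitted_slope = depth_interval / (span_px - e)  # apparent span is e pixels short
    row = t * span_px                               # row offset from the top anchor
    return (fitted_slope - true_slope) * row


# A 5-pixel slip on a 480-pixel span: zero at the top, about 0.0105 at the bottom.
for t in (0.0, 0.5, 1.0):
    print(t, residual_closed_form(t, e=5, span_px=480))
```

The two functions agree to floating-point precision, which is a quick sanity check that the closed form is the refit written out.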
We evaluate the residual on the same grid the engagement validates curves on: 300 interpolated depth points per curve. Because the residual is a linear ramp from zero, its mean absolute error across those 300 points, taken as evenly spaced from the top anchor to the deepest sample, is half the worst-case residual at the deepest point, and its mean squared error is the worst-case squared over three in the continuum limit, a value the 300-point grid matches to within a fraction of a percent. Reporting calibration error as an MAE and an MSE over the 300 points, rather than as a single worst number, lets us read it against the model's curve-recovery errors in the same units, which is the comparison the results section turns on. We express depth in the normalised unit band the validation CSV uses, where the full logged interval spans one unit, so a residual of 0.01 means one percent of the logged interval.
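The ramp statistics can be reproduced in a few lines. This sketch assumes the 300 points are evenly spaced from t = 0 to t = 1 inclusive, which the text does not state explicitly. The function name `ramp_errors` and the worst-case value of 0.01 are illustrative.

```python
N_POINTS = 300  # interpolated depth points per curve in the validation protocol


def ramp_errors(worst_case, n=N_POINTS):
    """MAE and MSE of a linear residual ramp sampled at n evenly spaced points
    from the top anchor (t = 0) to the deepest sample (t = 1), inclusive."""
    ts = [i / (n - 1) for i in range(n)]
    residuals = [worst_case * t for t in ts]
    mae = sum(abs(x) for x in residuals) / n
    mse = sum(x * x for x in residuals) / n
    return mae, mse


# Assumed worst-case residual of 0.01, i.e. one percent of the logged interval.
mae, mse = ramp_errors(worst_case=0.01)
print(mae, mse, 0.01 ** 2 / 3)
```

On this grid the MAE is half the worst case, and the MSE sits slightly above the continuum value of worst-case squared over three, because the discrete grid includes both endpoints.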
Results
The instrument evaluates the propagation expression directly. The left panel is the calibration ladder: the teal rungs are the true depth ladder fixed by the two correctly placed anchors, and the orange dashed rungs are the ladder a mis-read lower anchor fits, tilting away from the truth and opening fastest near the bottom. The right panel is the consequence on the 300 interpolated depth points: a residual fan that starts at zero under the top anchor and ramps to its worst value at the deepest sample, with the model's own lowest reported MAE drawn across it as a reference line.
Three readings come straight off it. First, the shape of a calibration error is a tilt, not a shift. A reader expecting a constant bias to subtract out will find none; the residual is a fan, near zero where the top tick anchors it and accumulating with every pixel of depth, which is why a depth-axis slip is so easy to miss in a spot-check near the top of a log and so damaging deep in it. Second, the propagation has real teeth at small mis-reads. Because the residual scales with the mis-read divided by the anchor span, a few pixels off a 480 to 640 pixel span is a percent-level tilt, and a percent of a logged interval is a depth error of meters on a typical log, already the same order as the 0.0132 MAE the curve model reports at its best. Third, and this is the comparison the instrument is built to force, the residual MAE crosses the model's 0.0132 MAE floor at a small mis-read. Below that crossing the depth axis is not the bottleneck and curve accuracy dominates the CSV; above it the header read is the dominant error source and no improvement to the segmentation model can recover what the axis lost.
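The crossing can be located in closed form under two assumptions that are ours for illustration: the anchors span the full normalised logged interval, so the depth interval is 1, and the residual MAE is half of e / (span - e), as for a linear ramp. Setting that MAE equal to the floor and solving for e gives the helper below. The function names are hypothetical.

```python
MODEL_MAE_FLOOR = 0.0132  # lowest curve-recovery MAE reported in the text


def calibration_mae(e, span_px):
    """MAE of the residual ramp: half the worst-case residual e / (span_px - e)."""
    return 0.5 * e / (span_px - e)


def crossing_misread(span_px, floor=MODEL_MAE_FLOOR):
    """Mis-read in pixels at which calibration MAE equals the model's MAE floor."""
    return 2.0 * floor * span_px / (1.0 + 2.0 * floor)


for span in (480, 640):
    e_star = crossing_misread(span)
    print(f"span {span} px: crossing at about {e_star:.1f} px")
```

Under these assumptions the crossing falls at roughly 12 pixels on a 480-pixel span and 16 pixels on a 640-pixel span. Anchors that cover only part of the logged interval would move the figure, so it is best read as an order of magnitude.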
That crossing reframes where engineering effort belongs. The loss sweep that produced the 0.0132 MAE and 0.0004 MSE floors, and the curve-1 mean MAE of 0.0277 under Tversky loss that sets the scale of value-side error, were hard-won on the curve-recovery side of the pipeline. The instrument shows that all of that work can be undone downstream by an axis the same pipeline never scored, because depth calibration ran as an un-evaluated deterministic step between the mask and the CSV. The residual fan makes the un-evaluated step legible, and legible in the same MAE units the rest of the pipeline already reports.
Discussion
The practical message is that depth-axis calibration deserves an acceptance metric of its own, computed and gated like any other stage, rather than trusted as plumbing. The chart-extraction field already separates axis analysis from data extraction as a named task [4], and the coordinate-transform-first stance that Scatteract established [1] says the transform is where accuracy is decided; our results give the borehole-log version of that claim a number. Because the propagation is analytic, the metric is cheap to compute: a calibration residual MAE over the 300 validation points, derived from the fitted slope and the anchor uncertainty, sits beside the curve MAE in the same report, and the crossing point where one overtakes the other tells an operator whether to invest in a better segmentation model or in a more reliable header-tick detector.
It also reframes what the detect-then-transform recipe owes the transform half. ChartOCR's design, a network for where and explicit geometry for how much [2], and LineEX's joint detection of points and ticks [3], both put real modelling effort into locating the axis, not just the trace. Our propagation result is the justification for that effort stated in error terms: the geometry half is not bookkeeping after the interesting part, it is a multiplier on everything the network produced, and on a tall log it can be the largest single term in the final error. The mask from a U-Net-style segmenter [5] is necessary but not sufficient; the depth it is plotted against is a second, independent thing to get right and to measure.
Two further consequences follow. One is that anchor placement should prefer the widest available span: the residual scales with the mis-read over the anchor span, so calibrating from two ticks far apart on the header tolerates a mis-read far better than calibrating from two close ones, even when the close pair is read more confidently. The other is that the top-anchor-exact assumption we made is a best case; when both anchors carry uncertainty the fan acquires an offset as well as a tilt, and the residual no longer passes through zero, which only strengthens the argument that the axis is a measured quantity and not a given.
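Both consequences can be illustrated with a slightly more general residual. The sketch assumes each anchor may be read some pixels below its true row (a negative value means above), and it evaluates the fitted map against the true one at a row a fraction t of the way down the true anchor span. The function name, the sign convention and the pixel values are assumptions made for illustration.

```python
def residual(t, span_px, e_top=0.0, e_bottom=0.0, depth_interval=1.0):
    """Depth error at fraction t of the true anchor span when the top and bottom
    anchors are read e_top and e_bottom pixels below their true rows."""
    apparent_span = span_px + e_bottom - e_top
    fitted = depth_interval * (t * span_px - e_top) / apparent_span
    return fitted - depth_interval * t


# Top anchor exact, lower anchor read 5 px short: a pure tilt through zero.
tilt_only = [residual(t, 480, e_bottom=-5) for t in (0.0, 1.0)]

# Both anchors uncertain: the fan no longer passes through zero at the top.
tilt_and_offset = [residual(t, 480, e_top=3, e_bottom=-5) for t in (0.0, 1.0)]

# Same 5 px slip, same deepest row (480 px below the top anchor), two anchor pairs:
# a wide pair spanning 480 px, and a narrow pair spanning 120 px that covers a
# quarter of the depth interval, so the deepest row lies at t = 4 (extrapolated).
wide = residual(1.0, 480, e_bottom=-5)
narrow = residual(4.0, 120, e_bottom=-5, depth_interval=0.25)
print(tilt_only, tilt_and_offset, wide, narrow)
```

In this example the narrow pair leaves about four times the error at the deepest row that the wide pair does, which is the span argument in numbers.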
Limitations
This study has clear edges. The propagation model is a two-anchor affine fit, which is the right model for a linear header depth scale but not for logs whose depth axis is non-linear or piecewise, where a two-point fit would carry a structural bias the residual expression does not capture. We perturb a single anchor and hold the other exact to isolate the slope-tilt behaviour; real scans carry uncertainty on both anchors and on the detected curve rows themselves, and the combined error fan would be wider and would generally not pass through zero, so the residuals here are a lower bound on what a noisy header produces rather than a full error budget. The header tick geometry drawn in the instrument is illustrative; only the propagation arithmetic and the engagement's sourced figures, the 300 interpolated depth points, the 480 to 640 pixel height band, the 3,200 to 12,800 pixel width range, the 0.0132 MAE and 0.0004 MSE floors, and the 0.0277 curve-1 Tversky MAE, are real, and the depths are expressed in the validation's normalised unit band rather than in physical meters, which vary per log. The crossing point at which calibration error overtakes the model's MAE floor depends on the depth interval each header spans, so the specific mis-read that triggers it is log-dependent and the instrument's value is the relationship it exposes, not a universal threshold. Finally, we validate the propagation against our own synthetic corpus and its 300-point protocol; a controlled study that perturbs anchors on a set of physical scanned headers and measures the realised CSV error against ground-truth depth would test the model more rigorously than this analytical and synthetic evaluation does.
References
[1] Cliche, M., Rosenberg, D., Madeka, D., and Yee, C. Scatteract: automated extraction of data from scatter plots. Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2017. Fits the pixel-to-data coordinate transform from OCR-read tick labels with robust regression, the canonical step that turns detected positions into real-world values. https://arxiv.org/abs/1704.06687
[2] Luo, J., Li, Z., Wang, J., and Lin, C.-Y. ChartOCR: data extraction from charts images via a deep hybrid framework. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021. Detects structural key points with a network and then applies explicit geometry to recover values, the division of labour a depth-indexed log reader follows. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html
[3] Shivasankaran, V. P., Hassan, M. Y., and Singh, M. LineEX: data extraction from scientific line charts. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023. A transformer key-point detector that locates data points and axis ticks together, the closest public analogue to recovering a depth axis from a continuous trace. https://openaccess.thecvf.com/content/WACV2023/html/P._LineEX_Data_Extraction_From_Scientific_Line_Charts_WACV_2023_paper.html
[4] Davila, K., Tensmeyer, C., Shekhar, S., Singh, H., Setlur, S., and Govindaraju, V. ICPR 2020 competition on harvesting raw tables from infographics (CHART-Infographics). International Conference on Pattern Recognition (ICPR), 2021. Names axis analysis as a distinct chart-recognition sub-task, which is the framing this study adopts in treating depth calibration on its own. https://link.springer.com/chapter/10.1007/978-3-030-68793-9_27
[5] Ronneberger, O., Fischer, P., and Brox, T. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. The encoder-decoder segmentation backbone whose pixel masks supply the detected curve rows that calibration then maps to depth. https://arxiv.org/abs/1505.04597