Chart Data Extraction: Datasets, Tasks, and Methods in the Literature

Abstract

Recovering the numbers that a chart image was drawn from is a mature computer-vision subfield with its own public datasets, a stable task taxonomy, and methods that have evolved from hand-tuned pipelines to end-to-end neural systems. This paper surveys that subfield and credits the groups who built it. We organise the public corpora by the task they evaluate and the chart modality they target: Scatteract for scatter plots [1], PlotQA for reasoning over bar, line, and scatter plots [2], ChartOCR for hybrid data extraction across chart types [3], the ICPR 2020 CHART competition corpus for multi-task chart parsing [4], and LineEX for line charts [5]. We then position raster well-log digitisation inside the same map. Digitising a scanned well log is the same fundamental problem the chart community works on, turning a raster picture of plotted curves back into the numbers behind them, yet it is a hard and under-benchmarked instance of that problem. The label space is unusually narrow, only three output classes (background plus two curves), while the input is far wider than the figures in the public chart releases, spanning 3,200 to 12,800 pixels, and our validation scores recovery at 300 interpolated depth points. Our finding is that the chart-extraction literature supplies the right problem framing and the right methods for logs, but no public benchmark that matches a log's geometry or its scarce-class structure, which is why VeerNet, our encoder-decoder for raster-log digitisation, had to be trained on a corpus we synthesised rather than on any chart release.

What the chart-extraction subfield contains

A chart is a lossy rendering of a table. Somewhere upstream there were numbers, and a plotting library turned them into pixels by mapping data coordinates through axes, ticks, and a colour or marker convention. Chart data extraction is the inverse problem: take only the pixels and recover the table. The inversion is hard for several reasons: the rendering throws information away, axes may be linear or logarithmic, tick labels arrive through noisy optical character recognition, and overlapping marks have to be disentangled. The field has treated this as a research target for years, and we survey it here as outsiders to the chart community, crediting each release to its authors and claiming none of these datasets as ours.
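The axis part of that inversion can be made concrete. The following is a minimal sketch, assuming two reference ticks whose pixel positions and label values are already known; the function name `pixel_to_value` and its parameters are hypothetical and come from no surveyed release. It shows why the linear-versus-logarithmic question matters: the same pixel maps to different values under the two conventions.

```python
import math


def pixel_to_value(pixel, p0, p1, v0, v1, log_scale=False):
    """Map a pixel coordinate to a data value using two reference ticks.

    (p0, v0) and (p1, v1) are tick positions in pixels paired with the
    values read from their labels. On a log axis the interpolation is
    linear in log10(value), not in the value itself.
    """
    if p0 == p1:
        raise ValueError("reference ticks must sit at different pixels")
    t = (pixel - p0) / (p1 - p0)  # fractional position between the ticks
    if log_scale:
        if v0 <= 0 or v1 <= 0:
            raise ValueError("log axes need positive tick values")
        log_v = math.log10(v0) + t * (math.log10(v1) - math.log10(v0))
        return 10 ** log_v
    return v0 + t * (v1 - v0)


# Halfway between two ticks, read under each axis convention.
linear = pixel_to_value(150, p0=100, p1=200, v0=0.0, v1=10.0)
logarithmic = pixel_to_value(150, p0=100, p1=200, v0=1.0, v1=100.0,
                             log_scale=True)
```

Halfway between the ticks, the linear axis gives the arithmetic midpoint of the tick values and the logarithmic axis gives their geometric midpoint, so a wrong guess about the axis type corrupts every recovered value.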

The cleanest way to read the literature is along two axes that the field itself uses. The first is the extraction task, which runs from coarse to fine: classifying what kind of chart an image shows, detecting its structural elements such as axes and legends, extracting the individual data points, and finally recovering the full set of values end to end. The second axis is the chart modality the dataset targets, most commonly bar, scatter, line, or a mixture. Every public release sits somewhere on this task-by-modality grid, and a well-log digitiser sits on it too, which is the comparison this paper is built around.

Scatter plots

The earliest fully automated system in this lineage that we consider is Scatteract, which extracts the numerical coordinates of points from scatter-plot images [1]. Its pipeline is instructive because it lays out the canonical stages the whole subfield inherited: detect the chart objects, read the tick labels with OCR, fit the pixel-to-data coordinate transform with robust regression that tolerates OCR mistakes, and report the recovered points. Scatter plots are a forgiving modality for this because the marks are discrete and sparse, so the authors could report fully automatic recovery on the large majority of their test plots. The lesson we took from it is not the dataset, which is scatter-specific, but the architecture of the problem: extraction is detection plus a fitted coordinate transform, and the coordinate transform is where the accuracy is won or lost.
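The robust-fit stage can be illustrated with a small consensus estimator. This is a sketch under stated assumptions, not the Scatteract authors' code: it uses an exhaustive pairwise consensus fit in the spirit of RANSAC as one simple robust estimator, the function name `robust_axis_fit` is hypothetical, and the tolerance `tol` is an illustrative value.

```python
from itertools import combinations


def robust_axis_fit(ticks, tol=0.5):
    """Fit value = a * pixel + b to (pixel, value) ticks, ignoring outliers.

    Every pair of ticks proposes a line. The line that the most ticks agree
    with (within `tol` data units) wins and is refit by least squares on
    its inliers. `tol` is an assumed threshold for this sketch.
    """
    best_inliers = []
    for (p0, v0), (p1, v1) in combinations(ticks, 2):
        if p0 == p1:
            continue  # two ticks at the same pixel cannot define a line
        a = (v1 - v0) / (p1 - p0)
        b = v0 - a * p0
        inliers = [(p, v) for p, v in ticks if abs(a * p + b - v) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("need at least two consistent ticks")

    # Ordinary least squares on the consensus set only.
    n = len(best_inliers)
    mean_p = sum(p for p, _ in best_inliers) / n
    mean_v = sum(v for _, v in best_inliers) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in best_inliers)
    cov = sum((p - mean_p) * (v - mean_v) for p, v in best_inliers)
    a = cov / var_p
    return a, mean_v - a * mean_p, best_inliers


# Five ticks on an axis where value = 0.1 * pixel - 10; the tick at
# pixel 400 simulates an OCR misread (30 read as 80).
ticks = [(100, 0.0), (200, 10.0), (300, 20.0), (400, 80.0), (500, 40.0)]
slope, intercept, inliers = robust_axis_fit(ticks)
```

An ordinary least-squares fit over all five ticks would be pulled toward the misread label; the consensus fit discards it and recovers the transform from the four ticks that agree.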

Mixed modalities and reasoning

Two releases widen the modality coverage and raise the difficulty. PlotQA assembled a large corpus of bar, line, and scatter plots rendered from real-world data sources, and framed the task as answering questions whose answers require recovering and reasoning over the plotted values rather than merely classifying the image [2]. It set out to fix a weakness in earlier synthetic plot datasets, which lacked variability in axis labels and real-valued data, so models trained on them never had to read the numbers. ChartOCR took the extraction goal head on with a deep hybrid framework that detects chart type and structural key points with a neural network and then applies rule-based reasoning to rebuild the values across bar, line, and pie charts [3]. ChartOCR matters to our reading because it states the modern recipe most clearly: let a network find where the marks and axes are, and let explicit geometry turn those locations back into numbers. That division of labour is exactly the one a log digitiser needs.

Structured multi-task parsing

The most comprehensive public effort we surveyed is the ICPR 2020 competition on harvesting raw tables from infographics, the CHART corpus [4]. Rather than treating extraction as a single step, it decomposes chart recognition into named tasks: image classification, text detection and recognition, text-role classification, axis analysis, legend analysis, plot-element detection, data extraction, and end-to-end recovery. It ships two training sets, a synthetic set generated from real data with a plotting library and a manually annotated set drawn from the open-access section of PubMed Central. The CHART corpus is the closest thing the field has to a shared, multi-task benchmark, and its decomposition is the taxonomy we adopt for the survey grid because it is the one the community converged on.

Line charts

The release that comes nearest to a well log in modality is LineEX, a transformer key-point detector built specifically for line charts that locates data points, ticks, and legend mappings, trained on a large synthetic line-chart corpus [5]. A line chart and a well-log track are close cousins: both are continuous traces threading across an image, both demand that the model follow a curve rather than count discrete marks, and both reduce, in the end, to reading a value off the trace at each horizontal position. LineEX is the public dataset whose problem shape is most like ours. It is still a chart, though, drawn at figure proportions, with many possible curve colours and styles and a label space defined by however many series the figure happens to plot.

How we placed well logs on the map

Our method here is a positioning exercise rather than a training run. For each public release we recorded the extraction task it evaluates, the chart modality it targets, and the typical pixel width of its figures, then placed it as a marker on the task-by-modality grid. Against that same grid we placed the well-log digitisation setting using the engagement's own numbers, so the comparison is concrete rather than rhetorical.

Three properties define where a well log lands. First, the task: log digitisation is data-point extraction in the chart taxonomy, recovering a value at every depth, and at the limit it is end-to-end value recovery, so it occupies the fine end of the task axis. Second, the modality: a log track is a continuous trace, which makes it a line-like problem and seats it in the same quadrant as LineEX. Third, and this is where it diverges sharply from every chart release, the label space and the input. A well-log model in our multiclass setting predicts only three output classes, the background plus two curves, a far narrower label space than a general chart parser that must cope with arbitrary series counts, colours, legends, and text. Yet the input is enormous: a scanned log strip runs from 3,200 to 12,800 pixels wide, up to an order of magnitude beyond the page-figure widths the chart datasets are built from. We score recovery at 300 interpolated depth points per validated curve, the resolution at which the extracted trace is compared against ground truth. The combination, scarce classes on a very wide input, is the signature that no chart benchmark carries.
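The scoring step can be sketched as follows. Only the count of 300 interpolated depth points comes from the text; the use of piecewise-linear interpolation, the mean-absolute-error metric, and the names `interp` and `score_curve` are assumptions made for illustration.

```python
from bisect import bisect_right

N_POINTS = 300  # from the text: curves are compared at 300 interpolated depths


def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x; xs must be increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1  # index of the knot at or just left of x
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


def score_curve(pred, truth, n_points=N_POINTS):
    """Mean absolute error between two curves on a shared depth grid.

    `pred` and `truth` are (depths, values) pairs that may be sampled at
    different depths. MAE is an assumed metric for this sketch.
    """
    lo = max(pred[0][0], truth[0][0])
    hi = min(pred[0][-1], truth[0][-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in depth")
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    errors = [abs(interp(d, *pred) - interp(d, *truth)) for d in grid]
    return grid, sum(errors) / n_points


# A ground-truth ramp and a prediction that sits one unit too high,
# sampled at different depths.
truth = ([0.0, 50.0, 100.0], [0.0, 50.0, 100.0])
pred = ([0.0, 25.0, 75.0, 100.0], [1.0, 26.0, 76.0, 101.0])
grid, mae = score_curve(pred, truth)
```

Resampling both curves onto one grid is what makes the comparison independent of how densely either trace happened to be sampled.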

Results

The taxonomy below draws this directly. Each public chart-extraction dataset is a teal marker in the task-by-modality cell it addresses, sized by how many releases share the cell, and the orange diamond is where the well-log setting falls.

A taxonomy matrix of the public chart-extraction datasets that existed when we surveyed the field, arranged as a grid of extraction task (rows: chart-type classification, structural element detection, data-point extraction, end-to-end value recovery) against the chart modality each release primarily targets (columns: bar, scatter, line, mixed). Each teal marker is a public release placed in the cell it addresses, sized by how many releases share the cell: Scatteract for scatter-point extraction, ChartOCR and PlotQA across mixed modalities, the ICPR 2020 CHART corpus for structural elements, and LineEX for line-chart points. The single orange diamond is where well-log digitisation falls. It is a data-point extraction task on a line-like modality, the same quadrant the chart community works in, but with a much narrower label space of only 3 output classes (background plus 2 curves) and a far wider input, 3,200 to 12,800 pixels, validated at 300 interpolated depth points. Click any cell to read which public datasets occupy it; drag the input-width lever to compare figure widths, and the read-out shows the chart datasets clustering at page-figure widths while only the well-log band reaches the wide end of the axis. The well-log figures are the build's own numbers; the dataset cell placements and typical figure-width ranges are an illustrative reading of each public release, not a per-image measurement.

Two readings come straight off the grid. First, the well-log setting is not in some exotic corner of the taxonomy; it sits squarely in the data-point-extraction row on a line-like modality, the row ChartOCR also works in and the cell LineEX already occupies. The chart literature is therefore the right neighbourhood: the task framing, the detect-then-transform recipe, and the line-following machinery all transfer in principle. Second, the moment the input-width lever moves toward the wide end of the axis, the public chart datasets drop away. Their figures live at page widths of a few hundred to a couple of thousand pixels, and only the well-log band reaches into the 3,200-to-12,800 pixel range. A model trained on chart figures has never processed a trace that runs uninterrupted across twelve thousand pixels of width, and the read-out makes that gap explicit: across the whole wide-input band, the count of public chart datasets that reach it stays at zero.
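The width read-out amounts to an interval-overlap count, sketched below. The well-log band is the one stated in the text. The chart band of 300 to 2,000 pixels is an assumed, illustrative reading of "a few hundred to a couple of thousand pixels" applied uniformly to every release, not a per-dataset measurement, and the `WidthBand` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WidthBand:
    name: str
    min_px: int
    max_px: int

    def reaches(self, lo, hi):
        """True if this band overlaps the closed interval [lo, hi]."""
        return self.min_px <= hi and self.max_px >= lo


# Well-log input widths as stated in the text.
WELL_LOG = WidthBand("well log", 3200, 12800)

# ASSUMPTION: one illustrative page-figure band shared by every release.
CHART_BAND = (300, 2000)
CHARTS = [
    WidthBand(name, *CHART_BAND)
    for name in ("Scatteract", "PlotQA", "ChartOCR", "ICPR 2020 CHART", "LineEX")
]

# Releases whose assumed band reaches into the well-log range.
reaching = [c.name for c in CHARTS
            if c.reaches(WELL_LOG.min_px, WELL_LOG.max_px)]
```

Under that assumed band the list of reaching releases is empty, which is the zero count the read-out reports; measured per-dataset widths would replace `CHART_BAND` in a more careful version.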

The narrow-class structure compounds the geometric gap. Because a log has only background and two curves to separate, the hard part is not classifying among many series but holding two faint, often overlapping traces apart across an extreme width while almost every pixel is background. That is a severe class-imbalance regime, and it is precisely the regime that a three-class label on a 12,800-pixel-wide image creates. The chart datasets, with their richer but page-sized figures, never stress a model on that particular axis. So the survey's empirical core is a double mismatch: the chart-extraction field gives logs the right problem and the right methods, but its benchmarks match neither the input geometry nor the scarce-class imbalance that define the log instance.
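The imbalance can be put in numbers on a toy mask. In the sketch below the width of 3,200 pixels is the lower bound from the text, while the height of 64 pixels, the one-pixel trace thickness, and the straight horizontal traces are all assumptions chosen to keep the example small. Inverse-frequency weighting is shown as one common response to such imbalance; it is not presented as the weighting any surveyed system or VeerNet uses.

```python
from collections import Counter

BACKGROUND, CURVE_A, CURVE_B = 0, 1, 2  # the three classes named in the text


def synthetic_mask(width, height, row_a, row_b):
    """A toy label mask: two one-pixel-thick horizontal traces on background."""
    mask = [[BACKGROUND] * width for _ in range(height)]
    mask[row_a] = [CURVE_A] * width
    mask[row_b] = [CURVE_B] * width
    return mask


def class_weights(mask, n_classes=3):
    """Pixel share per class and normalised inverse-frequency weights."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    share = [counts.get(c, 0) / total for c in range(n_classes)]
    inverse = [1.0 / s if s > 0 else 0.0 for s in share]
    norm = sum(inverse)
    return share, [w / norm for w in inverse]


# ASSUMPTION: a 3,200 x 64 strip with traces on rows 20 and 40.
share, weights = class_weights(synthetic_mask(3200, 64, row_a=20, row_b=40))
```

In this toy strip background takes 62 of every 64 pixels, so each curve class receives a weight 62 times larger than background; a loss without some such correction can score well by predicting background everywhere.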

Discussion

The honest summary is that well-log digitisation is a member of the chart-extraction family that the family has not benchmarked. Everything the subfield learned applies. The detect-then-transform decomposition that Scatteract laid out [1] is the right backbone for a log digitiser. PlotQA's insistence that a real benchmark must contain real-valued, readable numbers rather than decorative plots [2] is exactly the discipline a log validation set needs. ChartOCR's hybrid stance, a network for where and explicit geometry for how much [3], is the design a depth-indexed log reader should follow. The ICPR CHART task decomposition [4] gives the vocabulary to say precisely which sub-task a log digitiser is solving. And LineEX [5] shows that the line-following case is tractable with modern key-point and transformer machinery. We credit all of them, and a team building a log digitiser should read this literature first.

What none of them supplies is a corpus in the log's own regime. The releases are page-figure-sized and richly multi-class; a log is up to an order of magnitude wider and starkly few-class. Because that gap is in the data rather than in the ideas, the route forward is to build the data in the right shape and bring the field's methods to it. VeerNet, our encoder-decoder for raster-log digitisation, is that route. It is a segmentation model in the U-Net lineage [6], trained to separate the three classes a log presents, on a synthetic corpus we rendered at the full 3,200-to-12,800 pixel width range so the extreme geometry is present from the first epoch, and validated by tracing each recovered curve and comparing it against ground truth at 300 interpolated depth points. VeerNet is ours; the chart datasets and methods that frame the problem are not, and we make no claim on them.
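The tracing step that follows segmentation can be sketched in a few lines. This is an illustrative reading of "tracing each recovered curve", not VeerNet's actual post-processing: it assumes the model output has already been reduced to an integer class mask, takes the mean row of each class per column as the trace position, and uses the hypothetical name `trace_curve`.

```python
def trace_curve(mask, label):
    """Return (column, mean_row) pairs for one class in a label mask.

    `mask` is a rows x columns grid of integer class ids. Columns where the
    class is absent are skipped, leaving gaps for later interpolation.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    trace = []
    for col in range(n_cols):
        rows = [r for r in range(n_rows) if mask[r][col] == label]
        if rows:
            trace.append((col, sum(rows) / len(rows)))
    return trace


# A toy 5 x 4 mask: 0 is background, 1 and 2 are the two curves.
mask = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [2, 0, 0, 1],
    [2, 2, 2, 0],
]
curve_one = trace_curve(mask, 1)
curve_two = trace_curve(mask, 2)
```

Each trace is a list of pixel positions; converting it to depth and log values still requires the axis transform, and the skipped columns are where interpolation onto the validation grid has to bridge a gap.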

The relationship this survey establishes is easy to overstate in either direction, so we state it plainly. We are not contributing a chart benchmark, and a single-purpose synthetic well-log corpus is not a general chart-extraction dataset, so we make no such claim. What we contribute is a placement: a demonstration, using the chart field's own taxonomy and its own public releases as the coordinate system, that a real and recurring class of raster-to-data images, well logs, falls inside the field's task structure but outside the geometry and class balance its datasets cover. For anyone digitising logs, seismic strips, or other very wide scientific traces, the chart-extraction literature is the correct source of method and the wrong source of data.

Limitations

This survey has explicit edges. We treat five public releases in depth, chosen because they span the task taxonomy and the modalities most relevant to a continuous-trace digitiser, but the subfield is larger than five datasets, and our claim is about the dominant page-figure, multi-class character of the public landscape rather than an exhaustive census. The task-by-modality placement of each dataset reads off what each release primarily targets; several of them, the ICPR CHART corpus especially, span multiple tasks at once, and the single cell we draw is a simplification of a release that legitimately occupies more. The figure-width ranges are illustrative of common chart-image sizes rather than per-image measurements, so individual figures may sit outside the band we drew; the separation from a log's width is wide enough that this does not change the conclusion. The well-log figures we plot, three output classes, the 3,200-to-12,800 pixel width range, and validation at 300 interpolated depth points, are the engagement's own numbers describing our synthetic corpus and validation protocol, not the universe of physical logs, some of which are scanned at other settings. Finally, our expectation that chart-trained models would not transfer to log geometry and class balance rests on the structure of the problem and on our own results rather than on a controlled cross-dataset transfer study; a systematic transfer benchmark from the chart releases named here to real logs would test it more rigorously than this positioning does.

References

[1] Cliche, M., Rosenberg, D., Madeka, D., and Yee, C. Scatteract: automated extraction of data from scatter plots. Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2017. A fully automated pipeline that detects chart objects, reads tick labels with OCR, and recovers data point coordinates from scatter plot images via robust regression. https://arxiv.org/abs/1704.06687

[2] Methani, N., Ganguly, P., Khapra, M. M., and Kumar, P. PlotQA: reasoning over scientific plots. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020. A large dataset of bar, line, and scatter plots built from real-world data sources, with question-answer pairs that require recovering and reasoning over the underlying values. https://arxiv.org/abs/1909.00997

[3] Luo, J., Li, Z., Wang, J., and Lin, C.-Y. ChartOCR: data extraction from charts images via a deep hybrid framework. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021. A hybrid method that detects chart type and structural key points with a deep network, then applies rule-based reasoning to reconstruct data values across bar, line, and pie charts. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html

[4] Davila, K., Kota, B. U., Setlur, S., Govindaraju, V., Tensmeyer, C., Shekhar, S., and Chaudhry, R. ICPR 2020 competition on harvesting raw tables from infographics (CHART-Infographics). International Conference on Pattern Recognition (ICPR), 2021. A multi-task chart-recognition benchmark with synthetic and manually annotated PubMed Central chart images spanning classification, text and axis analysis, element detection, and data extraction. https://link.springer.com/chapter/10.1007/978-3-030-68793-9_27

[5] Shivasankaran, V. P., Hassan, M. Y., and Singh, M. LineEX: data extraction from scientific line charts. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023. A transformer key-point detector for line charts that locates data points, ticks, and legend mappings, trained on a large synthetic line-chart corpus. https://openaccess.thecvf.com/content/WACV2023/html/P._LineEX_Data_Extraction_From_Scientific_Line_Charts_WACV_2023_paper.html

[6] Ronneberger, O., Fischer, P., and Brox, T. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015. The symmetric encoder-decoder with skip connections that learns dense pixel masks from limited labelled data, the segmentation backbone the raster-to-data field reuses. https://arxiv.org/abs/1505.04597