Synthetic Data for Chart and Plot Understanding: Datasets, Generators, and Gaps

Abstract

Almost every model that reads numbers back out of a chart image is trained on charts that were never drawn by a person. The reason is not convenience but mechanics. A plotting routine that renders a figure already holds the coordinates it is plotting, so it can write out a label of arbitrary precision in the same pass that paints the pixels. That property, and not any dataset in particular, is what makes synthetic data the default fuel of chart understanding. This survey is about the generators, not the extractors. A companion survey maps the methods that consume these corpora and credits the groups who built them; here we look one step upstream, at the engines that emit the training pairs, and at the exact places where the label a generator writes stops describing what a real scanned figure would look like. We read the synthetic splits behind ChartOCR [1], LineEX [2], and the ICPR 2020 chart competition [3] as programmatic engines, and we place our own render-and-emit-mask generator beside them as a worked parallel drawn from a raster well-log digitisation engagement: 15,000 synthetic curves at pixel-perfect mask fidelity, spanning 3,200 to 12,800 pixels wide and 480 to 640 tall. The central finding is a fidelity cliff that every one of these engines shares. They are exact on geometry, because a renderer cannot be wrong about the coordinates it drew, and they thin out on style, occlusion, and annotation realism, because none of those properties is something a renderer must reconcile against a scanner. A synthetic corpus does not have a fidelity; it has a fidelity profile, and reading that profile is what separates a generator that transfers from one that teaches a model an invariance the field never had.

Why chart corpora are synthetic in the first place

The inverse task, taking pixels and recovering the table behind them, has a peculiar and convenient forward direction. To make a supervised pair for most vision tasks you need a human to draw a box or paint a mask, and the label is an approximation of a truth nobody recorded exactly. Charts invert that. The truth was recorded exactly, upstream, as the numbers the plotting library was handed, and the image is the lossy thing. So the cheapest way to build a labelled chart is to run the pipeline forwards on purpose: sample some numbers, render them, and keep both ends. The label is not estimated; it is the input to the render.
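The forward trick fits in a few lines. What follows is a minimal sketch and not the pipeline of any generator surveyed here: the function name `render_pair`, the canvas size, and the value range are assumptions made for the example. It samples values, maps each one to a pixel, paints that pixel, and returns the values and their pixel coordinates alongside the raster.

```python
import random

def render_pair(n_points=8, width=64, height=32, seed=0):
    """Sample values, render them, and keep both ends (hypothetical sketch)."""
    rng = random.Random(seed)
    values = [rng.uniform(0.0, 1.0) for _ in range(n_points)]  # the recorded truth
    image = [[0] * width for _ in range(height)]               # 0 = blank paper
    keypoints = []
    for i, v in enumerate(values):
        # Spread points evenly along x; flip y so larger values sit higher.
        x = round(i * (width - 1) / (n_points - 1))
        y = round((1.0 - v) * (height - 1))
        image[y][x] = 1              # paint the pixel
        keypoints.append((x, y))     # label written in the same pass
    return image, values, keypoints

image, values, keypoints = render_pair()
```

Nothing in the loop estimates a label. The key point is appended by the same iteration that paints the pixel, which is the sense in which the label is the input to the render.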

This is why the synthetic corpora exist and why they are large. ChartOCR trains its type and key-point detectors on programmatically produced charts because a generator can stamp out structural key points at the pixel a bar edge or a line vertex actually lands on, with none of the annotator disagreement a human labelling job would carry [1]. LineEX generates line charts with the point, tick, and legend-mapping labels varied by construction, so its transformer key-point model sees a controlled distribution of exactly-located targets rather than a scraped one [2]. The ICPR 2020 competition pairs a synthetically generated split, whose structural labels are exact by the same argument, with a smaller manually annotated real split precisely so that a method can be trained on the abundant exact data and then measured against the scarce real data [3]. In each case the generator is doing the labelling, and it is doing it perfectly along one axis. The question this survey presses is which axis, and what happens on the others.

What a generator actually emits

It helps to be concrete about the object a chart generator produces, because the fidelity argument lives in its parts. A single training pair is a rendered raster and a bundle of labels, and the labels split cleanly into two kinds. The first kind is geometric: where the axes sit, where each tick lands, the pixel coordinates of every plotted point or bar edge, and, for a segmentation-style target, the mask of which pixels belong to which series. The second kind is everything about how the figure looks and behaves as an artifact: the line style and colour palette, whether two curves cross and which one is drawn on top, the fonts and rotation of the tick text, the compression noise and the scanner streaks a real reproduction would carry.

The generator knows the first kind by construction and can only guess at the second. It placed the tick, so the tick label is exact to the pixel. It chose the palette from a list it wrote, so the palette is real but it is the generator's list, not the world's. It decided which curve occludes which, so the occlusion label is internally consistent, but the way real overlapping ink bleeds and anti-aliases at a crossing is not something the draw call ever had to model. And the annotations, the small human residue of a real figure, hand-added arrows, smudged legends, a caption bleeding into the plot, are simply absent unless the generator was told to fake them, in which case they are as real as the faking. Geometry is emitted; the rest is invented, and invented well or badly.
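One way to keep the two kinds of label apart is to give them separate types. The sketch below is illustrative only: the class and field names are hypothetical and do not come from ChartOCR, LineEX, the ICPR split, or our own generator. It records the geometric labels the renderer placed separately from the appearance choices it drew from its own lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GeometricLabels:
    """Known by construction: the generator placed every one of these."""
    axis_box: Tuple[int, int, int, int]              # x0, y0, x1, y1 in pixels
    tick_pixels: List[Tuple[int, int]]
    point_pixels: Dict[str, List[Tuple[int, int]]]   # series name -> vertices

@dataclass
class AppearanceLabels:
    """Chosen from the generator's own lists: real, but not the world's."""
    palette: Dict[str, str]                          # series name -> colour
    draw_order: List[str]                            # later entries occlude earlier
    tick_font: str
    faked_annotations: List[str] = field(default_factory=list)

@dataclass
class TrainingPair:
    raster: List[List[int]]
    geometry: GeometricLabels
    appearance: AppearanceLabels

pair = TrainingPair(
    raster=[[0] * 4 for _ in range(3)],
    geometry=GeometricLabels((0, 0, 3, 2), [(0, 2), (3, 2)], {"a": [(0, 1), (3, 0)]}),
    appearance=AppearanceLabels({"a": "#1f77b4"}, ["a"], "DejaVu Sans"),
)
```

The empty default for `faked_annotations` mirrors the point made above: annotation residue is absent unless someone writes it in.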

The fidelity cliff

Put the four generators next to one another and rate each on how faithfully its emitted labels stand in for a real scanned figure along each of those dimensions, and the same shape appears every time. Fidelity is near its ceiling on geometry and falls off a cliff into the columns to its right. This is the argument the exhibit below makes, and it is worth stating why the shape is structural rather than a property of any one engine's care.

A generator-versus-gap ledger over the synthetic chart and plot corpora that extraction models train on. Four programmatic engines sit as rows: our own render-and-emit-mask procedural-log generator, and the synthetic splits behind ChartOCR, LineEX, and the ICPR 2020 chart competition. Four label-fidelity dimensions sit as columns, ordered from the one a renderer knows exactly because it drew it (geometry) to the ones it never has to reconcile with a scanner (style, occlusion, annotation realism). Each cell rates how faithfully that engine's emitted labels stand in for what a real scanned figure would demand along that dimension, and the teal fill deepens with fidelity. The orange line is the only element that argues: the fidelity cliff between the geometry column, where every generator holds, and the leak columns, where they all give way. The vantage lever reads the same cells first as labels written to disk, then as their distance from a real scan, and the leak columns widen sharply while geometry barely moves. Only our generator row is sourced (15,000 synthetic curves at pixel-perfect mask fidelity, 3,200-12,800 by 480-640 pixels); the public-corpus per-dimension gap ratings are illustrative and flagged as such, and this cross-references the chart-extraction survey rather than re-surveying its methods.

Geometry is high because it is the only dimension where the generator's label and the world's truth are the same object. When a renderer records that a line vertex sits at a given pixel, there is no gap between the label and reality, because reality here just is the render. Our own generator makes this its sharpest point: because the segmentation mask is emitted from the same draw call that lays down the curve, the mask geometry is not merely accurate, it is exact, and there is no downstream re-annotation that could improve it. That is the property we lean on for 15,000 curves at pixel-perfect mask fidelity, and it is the same property that lets the public engines stamp out exact key points at scale.
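The render-and-emit-mask idea can be shown with a toy rasteriser. This is a sketch under stated assumptions and not our production draw call: it uses Bresenham's line algorithm on a binary canvas, and the name `draw_polyline`, the canvas size, and the vertex lists are invented for the example. The only point it makes is that the image write and the mask write sit in the same loop body.

```python
def draw_polyline(image, mask, vertices, series_id):
    """Paint a polyline and its mask label in the same loop (Bresenham lines)."""
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        x, y = x0, y0
        while True:
            image[y][x] = 1          # the ink
            mask[y][x] = series_id   # the label, written by the same step
            if (x, y) == (x1, y1):
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x += sx
            if e2 <= dx:
                err += dx
                y += sy

width, height = 40, 12
image = [[0] * width for _ in range(height)]
mask = [[0] * width for _ in range(height)]
draw_polyline(image, mask, [(0, 10), (15, 2), (39, 8)], series_id=1)
# Drawn second, so it overwrites series 1 wherever the two share a pixel.
draw_polyline(image, mask, [(0, 1), (39, 11)], series_id=2)
```

Because every inked pixel is labelled by the statement next to the one that inked it, the set of non-zero mask pixels and the set of ink pixels cannot differ, which is what exact means here.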

The leak columns are where every one of these engines gives way, and they give way because the properties there are not things a renderer is forced to get right about its own output. Style fidelity depends on the generator's palette and stroke library matching the wild distribution of real charts, which no fixed list does. Occlusion fidelity depends on modelling how real overlapping ink behaves at a crossing, which a clean vector draw skips. Annotation realism depends on reproducing the human mess of a real figure, which a generator only has if someone wrote the mess in. The cliff is not a quality gap between careful and careless generators; it is the boundary between the one thing a renderer cannot fake, its own geometry, and the several things it can only approximate, the appearance of a scan it never made.

Reading our own generator against the public ones

The parallel with our raster-log work is exact enough to be useful. A scanned well log is a chart in every mechanical sense, curves plotted against a depth axis, and the same forward trick applies: we generate a synthetic log by sampling curve values, rendering them onto a canvas, and emitting the per-pixel mask in the same step, which is where the pixel-perfect fidelity comes from. The synthetic set that trained the digitiser was 15,000 such curves, and the images were deliberately wide and short, 3,200 to 12,800 pixels across and 480 to 640 tall, because that is the aspect a real log strip takes and because it stresses the geometry the generator is best at emitting.
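As a sketch of the sampling step only, the snippet below draws a canvas size inside the ranges quoted above and lays one curve along the long axis. Several things here are assumptions made for illustration and are not taken from our generator: the uniform sampling of sizes, the bounded random walk standing in for a log reading, the choice to run the depth axis along the width, and the name `sample_log_curve`.

```python
import random

WIDTH_RANGE = (3200, 12800)   # pixels across, from the text
HEIGHT_RANGE = (480, 640)     # pixels tall, from the text

def sample_log_curve(rng, n_samples=200):
    """Sample a canvas size and one curve's vertices along the long axis."""
    width = rng.randint(*WIDTH_RANGE)
    height = rng.randint(*HEIGHT_RANGE)
    # Bounded random walk as a stand-in for a log reading (an assumption).
    value, values = 0.5, []
    for _ in range(n_samples):
        value = min(1.0, max(0.0, value + rng.gauss(0.0, 0.03)))
        values.append(value)
    # Map sample index to x (assumed depth axis) and value to y.
    vertices = [
        (round(i * (width - 1) / (n_samples - 1)), round(v * (height - 1)))
        for i, v in enumerate(values)
    ]
    return width, height, vertices

width, height, vertices = sample_log_curve(random.Random(7))
```

The returned vertices are already in pixel coordinates, so they can be handed to whatever routine paints the curve and its mask together.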

But the leak columns bite our generator exactly as they bite the public ones, and pretending otherwise is how a synthetic-trained model surprises its owner in the field. Our curves are drawn from a style library that is ours, so a real log printed on a plotter we never sampled sits outside it. Our two-curve crossings are clean vector overlaps, so the smeared ink of a genuine crossing on aged paper is under-represented. Our annotations are the ones we chose to add, so a real log's hand-scrawled corrections and stamps are thin in the training set. The geometry we emit is unimprovable and the appearance we emit is a hypothesis, and the honest reading of the ledger is that our generator sits in the same profile as ChartOCR, LineEX, and the ICPR split rather than above them: exact where a renderer must be, approximate where it may be.

What the cliff should change in practice

The practical consequence is that the fidelity number a corpus advertises is the wrong summary. A generator that reports high overall label accuracy is almost certainly reporting its geometry, which is the easy column, and saying nothing about the leak columns where transfer actually fails. The useful thing to demand of a synthetic corpus, ours included, is the profile: not how faithful are the labels, but which labels, on which dimension, and how far from a scan. A model trained on a corpus that is exact on geometry and thin on style will read a clean synthetic chart perfectly and then mistake a scanner streak for a curve, because it was never shown that the streak is not signal.
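The difference between a headline number and a profile is easy to make concrete. In the sketch below the four dimension names follow the exhibit, but the ratings are placeholders invented for this example and do not describe any real corpus, and the helper name `summarize` is hypothetical. It reports the mean next to the weakest dimension that the mean hides.

```python
from statistics import mean

DIMENSIONS = ("geometry", "style", "occlusion", "annotation")

def summarize(profile):
    """Return the headline mean alongside the per-dimension view it hides."""
    missing = [d for d in DIMENSIONS if d not in profile]
    if missing:
        raise ValueError(f"profile is missing dimensions: {missing}")
    # On ties, min() keeps the first dimension in DIMENSIONS order.
    weakest = min(DIMENSIONS, key=lambda d: profile[d])
    return {
        "headline": mean(profile[d] for d in DIMENSIONS),
        "weakest_dimension": weakest,
        "weakest_score": profile[weakest],
    }

# Placeholder ratings on a 0-1 scale, invented for this example only.
example_profile = {"geometry": 1.0, "style": 0.5, "occlusion": 0.25, "annotation": 0.25}
report = summarize(example_profile)
```

Refusing to summarise a profile with a missing dimension is a deliberate choice in this sketch: a corpus that rates only its geometry would otherwise pass through with a flattering mean.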

That reframing also tells a generator's author where to spend. Improving geometry is wasted motion, since it is already exact by construction. The returns are entirely in the leak columns: broadening the style library toward the real distribution, modelling occlusion the way overlapping ink actually behaves, and injecting the annotation mess a real figure carries. The vantage lever in the exhibit makes the stakes visible by reading the same cells first as labels on disk and then as their distance from a real scan; geometry barely moves between the two readings, and the leak columns widen sharply, which is the whole reason a synthetic pipeline earns its next hour of engineering on the right of the cliff rather than the left.
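Spending on the right of the cliff usually means degrading the raster after the render while leaving the emitted labels alone. The sketch below is a deliberately crude illustration and not a model of real scanner behaviour: the two artifacts (full-height streak columns and a one-pixel downward ink bleed), the probabilities, and the name `add_scan_artifacts` are all assumptions.

```python
import random

def add_scan_artifacts(image, rng, n_streaks=2, bleed_prob=0.3):
    """Return a degraded copy of a binary raster; the input is not modified."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    # Scanner streaks: full-height columns of spurious ink.
    streak_columns = rng.sample(range(width), n_streaks)
    for x in streak_columns:
        for y in range(height):
            out[y][x] = 1
    # Ink bleed: each original ink pixel may spread to the pixel below it.
    for y in range(height - 1):
        for x in range(width):
            if image[y][x] == 1 and rng.random() < bleed_prob:
                out[y + 1][x] = 1
    return out, streak_columns

clean = [[0] * 20 for _ in range(8)]
for x in range(20):
    clean[3][x] = 1                      # one horizontal curve
mask = [row[:] for row in clean]         # exact label from the render
noisy, streaks = add_scan_artifacts(clean, random.Random(3))
```

The mask still marks only the curve, so a model trained on `noisy` against `mask` is shown explicitly that the streak columns are not signal.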

Discussion

The clean way to hold this survey next to the method survey it complements is by direction. That survey looks downstream, at the extractors that consume chart corpora and at where a well log sits on the field's task-by-modality map. This one looks upstream, at the generators that emit the corpora and at where their emitted labels stop matching a scan. The two meet at the corpus, and the meeting point is the fidelity profile: an extractor is only ever as honest as the leak columns of the generator that trained it, so a method that scores well on a synthetic test split may be scoring on the exact geometry the generator could never get wrong, while inheriting the generator's blind spots on style, occlusion, and annotation as its own.

Where our own work sits is as a worked instance rather than a new benchmark. The 15,000-curve, render-and-emit-mask generator behind the well-log digitiser is a chart generator by another name, and it shows the same profile as the public engines because the profile is a property of the render-and-emit trick, not of the domain. That is the useful transfer of this reading: the reason we can put an oil-and-gas log generator on the same ledger as ChartOCR, LineEX, and the ICPR split is that all four are exact for the same mechanical reason and leak for the same mechanical reason, and a team building any of them should budget its effort by the cliff rather than by an overall fidelity number that flatters the easy column.

Limitations

This is a survey of generators and it inherits a survey's limits. It reads the public synthetic splits behind ChartOCR, LineEX, and the ICPR 2020 chart competition as programmatic engines from their published descriptions, and it does not re-run those generators or re-measure any extraction method against them; the per-dimension gap ratings in the exhibit are illustrative and are flagged as such on the canvas, chosen to argue the structural shape of the cliff rather than to report a measured audit of each corpus. The only sourced numbers are our own generator's footprint, 15,000 synthetic curves at pixel-perfect mask fidelity spanning 3,200 to 12,800 pixels wide and 480 to 640 tall, and even there the pixel-perfect claim is a statement about mask geometry emitted in the draw call, not about the leak columns, where our generator is a hypothesis like any other. We did not conduct a controlled transfer study measuring how far each leak dimension degrades a downstream model, so the ordering of the leak columns, style before occlusion before annotation, is an argued profile and not a ranked experimental result. The survey also scopes itself to three public synthetic splits and one of our own and stops at the close of its quarter, so later chart-generation work and the newer real-chart corpora that the field has since released are out of frame. A reader should take this as a map of why synthetic chart labels are exact on geometry and soft everywhere else, and as a prompt to demand a corpus's fidelity profile rather than its headline number, not as a substitute for auditing the leak columns on their own generator and their own task.

References

[1] Luo, J., Li, Z., Wang, J., and Lin, C.-Y. ChartOCR: data extraction from charts images via a deep hybrid framework. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021. A hybrid chart-extraction system whose training leans on programmatically generated charts carrying type and structural key-point labels. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html

[2] Shivasankaran, V. P., Hassan, M. Y., and Singh, M. LineEX: data extraction from scientific line charts. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023. A transformer key-point detector trained on a large synthetic line-chart corpus generated with programmatically varied point, tick, and legend labels. https://openaccess.thecvf.com/content/WACV2023/html/P._LineEX_Data_Extraction_From_Scientific_Line_Charts_WACV_2023_paper.html

[3] Davila, K., Kota, B. U., Setlur, S., Govindaraju, V., Tensmeyer, C., Shekhar, S., and Chaudhry, R. ICPR 2020 competition on harvesting raw tables from infographics (CHART-Infographics). International Conference on Pattern Recognition (ICPR), 2021. A multi-task chart-recognition benchmark that pairs a synthetically generated split carrying exact structural labels with a smaller manually annotated real split. https://link.springer.com/chapter/10.1007/978-3-030-68793-9_27